Locating hardware faults on Linux servers


My friend Mike sent me a link to a Linux predictive failure post, which describes using the mcelog utility to check for machine check exceptions (MCEs), which are hardware faults detected and reported by the CPU. mcelog is pretty sweet, and provides a portion of the capabilities currently available in the Solaris Fault Management Architecture (FMA). It ships with several distributions, and can also be installed from various network repositories:

$ yum install mcelog

$ rpm -q -a | grep mcelog
mcelog-0.7-1.22.fc6

The mcelog package adds an hourly cron job to /etc/cron.hourly that checks for new MCEs. If mcelog finds an MCE, an entry similar to the following is written to /var/log/mcelog:

$ less /var/log/mcelog

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC 1157b0af355f7d
MISC c008064f00000000 ADDR 40db12ae0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 7273
bit46 = corrected ecc error
bit59 = misc error valid
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9c39c00072080a13 MCGSTATUS 0
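
Once a few of these entries pile up, a quick tally can be handy. Here is a minimal sketch, assuming each record starts with an "MCE <n>" line and carries a "CPU <n> ..." line like the entry above. The function names count_mces and mces_per_cpu are my own and are not part of mcelog, and the log layout may differ between mcelog versions:

```shell
#!/bin/sh
# Hypothetical helpers for summarizing an mcelog log file.
# Assumes the record layout shown in the sample entry above.

# count_mces FILE: print the number of "MCE <n>" record headers in FILE.
count_mces() {
    grep -c '^MCE [0-9]' "$1"
}

# mces_per_cpu FILE: print "<cpu> <count>" for each CPU that logged an event,
# sorted by CPU number. The CPU number is the second field of the "CPU" line.
mces_per_cpu() {
    awk '$1 == "CPU" { seen[$2]++ }
         END { for (c in seen) print c, seen[c] }' "$1" | sort -n
}
```

After sourcing the functions, running count_mces /var/log/mcelog prints the total number of records, and mces_per_cpu /var/log/mcelog shows which CPUs are reporting them. A single CPU that keeps showing up is a good hint about where to start looking.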

If you would prefer to route fault messages to a central location for processing, you can add the "--syslog" option to the mcelog cron job, which sends the decoded errors to syslog. This is an awesome utility, and should simplify locating hardware errors on my various Linux hosts (especially when combined with memtest86+).
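
For reference, here is a rough sketch of what the cron job might look like with syslog logging enabled. The script location (something like /etc/cron.hourly/mcelog.cron), the /usr/sbin/mcelog path, and the --ignorenodev and --filter flags are assumptions based on how some packages lay things out, so check the file your package actually installed before editing it:

```shell
#!/bin/sh
# Sketch of an hourly mcelog cron job that logs to syslog.
# Assumed: binary path and the --ignorenodev/--filter flags; only --syslog
# comes from the text above. Adjust to match your installed cron script.
if [ -x /usr/sbin/mcelog ]; then
    /usr/sbin/mcelog --ignorenodev --filter --syslog
fi
```

From there, the usual syslog forwarding setup can ship the entries to a central log host.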

This article was posted by Matty on 2009-06-11 08:56:00 -0400