Locating hardware faults on Linux servers

My friend Mike sent me a link to a Linux predictive failure post, which describes using the mcelog utility to check for machine check exceptions (MCEs), which are hardware faults detected and reported by the CPU. mcelog is pretty sweet, and provides a portion of the capabilities that are currently available in the Solaris Fault Management Architecture (FMA). The utility ships with several distributions, and can also be installed from various network repositories:

$ yum install mcelog

$ rpm -q -a | grep mcelog
mcelog-0.7-1.22.fc6

The mcelog package adds an hourly cron job to /etc/cron.hourly to check for new MCEs. If mcelog finds an MCE, an entry similar to the following is written to /var/log/mcelog:

$ less /var/log/mcelog

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC 1157b0af355f7d
MISC c008064f00000000 ADDR 40db12ae0
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 7273
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9c39c00072080a13 MCGSTATUS 0
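
Each record in the sample above carries a line that begins with "CPU" followed by the CPU number, so a quick tally per CPU can show whether the errors cluster on one processor. The function below is a minimal sketch under that assumption; count_mce_by_cpu is a name made up for this example, and the record layout may differ between mcelog versions.

```shell
#!/bin/sh
# count_mce_by_cpu FILE
# Print "cpu count" for every CPU that appears in an mcelog-style log.
# Assumes each record has a line starting with "CPU <number>", as in the
# sample entry above. The function name is hypothetical.
count_mce_by_cpu() {
    # Count lines whose first field is "CPU" and whose second field is numeric,
    # then sort by CPU number because awk's for-in order is unspecified.
    awk '$1 == "CPU" && $2 ~ /^[0-9]+$/ { n[$2]++ }
         END { for (cpu in n) print cpu, n[cpu] }' "$1" | sort -n
}
```

Running count_mce_by_cpu /var/log/mcelog prints one "cpu count" pair per line, lowest CPU number first.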



If you would prefer to route fault messages to a central location for processing, you can add the “--syslog” option to the mcelog cron job, which sends the decoded output to syslog. This is an awesome utility, and should simplify locating hardware errors on my various Linux hosts (especially when combined with memtest86+).
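
As a rough sketch, an hourly script with syslog output enabled might look like the following. The binary path and the --ignorenodev and --filter flags are assumptions about what the package installs, so check the actual file under /etc/cron.hourly on your own host before editing it; only the syslog option comes from the text above.

```shell
#!/bin/sh
# Hypothetical /etc/cron.hourly script for mcelog.
# The binary path and the first two flags are assumptions, not taken from the post.
MCELOG=/usr/sbin/mcelog

# Skip quietly if mcelog is not installed at the assumed path.
[ -x "$MCELOG" ] || exit 0

# --ignorenodev: exit silently when the MCE device cannot be opened
# --filter:      drop events mcelog knows to be bogus
# --syslog:      send decoded output to syslog
"$MCELOG" --ignorenodev --filter --syslog
```

From there, the usual syslog forwarding on the host can carry the messages to a central log server.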

4 Comments

daybringer  on June 11th, 2009

Thanks, this rocks. I have one server that is acting up, but I can’t find any problems with it. It reports that a fan is bad; I tried swapping some out and nothing changed. This should help a lot.

Scott Davenport  on June 19th, 2009

Sun also puts out a Hardware Error Report & Decode (HERD) tool that does some additional processing on the mcelog. I’m not overly familiar with the tool, and I expect any thresholds it uses are tuned to Sun systems, but figured it’s worth noting on this thread.

https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=HERD-2.0-M-G-F@CDS-CDS_SMI

dumbilom  on July 28th, 2009

/clapping

Well done, Sun. It doesn’t support Intel, so why even bother? Their latest release, herd-3.0-1.blahblah.x86_64.rpm, doesn’t even handle it.

pfff

dumbilom  on July 29th, 2009

Actually… research would show it’s developed purely for AMD, so I guess my bad, not Sun’s fault. Although they should at least indicate what architecture it works on. Still clowns, I guess.
