I recently debugged an issue where a host panicked with the following message:
Apr 3 04:41:44 pluto.prefetch.com genunix: [ID 663943 kern.notice] Unrecoverable Machine-Check Exception
These errors are typically caused by CPU or memory faults, but on this specific machine nothing was displayed when I checked the fault and error logs. Upon closer inspection, it turned out that the fault manager wasn’t running and had transitioned into the maintenance state:
$ svcs -a | grep fmd
maintenance 4:46:25 svc:/system/fmd:default
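When a host runs many services, the same check can be done for everything at once. Here is a minimal sketch, assuming `svcs -a` prints the state in the first column and the FMRI in the third, as in the output above. The function name `maintenance_fmris` is made up for illustration and is not part of Solaris:

```shell
# Hypothetical helper: read `svcs -a`-style output on stdin and print the
# FMRI of every service whose STATE column is "maintenance".
# Assumes the three-column layout shown above: STATE, STIME, FMRI.
maintenance_fmris() {
    awk '$1 == "maintenance" { print $3 }'
}

# On a live host this would be used as:
#   svcs -a | maintenance_fmris
```

For a single service, `svcs -x fmd` should also explain why the service is not running and point at its log file.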
After poking around Sunsolve, I found that a number of issues can cause fmd to enter the maintenance state. In this specific case, the daemon was core dumping at startup, so I had a core file readily available to help debug the source of the problem. Upon inspecting the panic string, it turned out that we were bumping into bug #6672662. This bug occurs because one of the libtopo maps is incorrectly named. Working around the issue is as simple as renaming the “storage-hc-topology.xml” file to “storage-hc-topology.xml.org”:
$ mv /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml.org
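If you would rather not clobber an earlier backup by accident, the rename can be wrapped in a small guard. This is only a sketch: `set_aside_map` is a hypothetical helper name, and the `.org` suffix simply mirrors the command above.

```shell
# Hypothetical helper (not part of Solaris): move a topology map aside,
# keeping the original under a ".org" suffix so it can be restored later.
set_aside_map() {
    map=$1
    # Refuse to run if the map is missing.
    if [ ! -f "$map" ]; then
        echo "set_aside_map: $map not found" >&2
        return 1
    fi
    # Refuse to overwrite a backup that already exists.
    if [ -e "$map.org" ]; then
        echo "set_aside_map: $map.org already exists" >&2
        return 1
    fi
    mv "$map" "$map.org"
}

# Usage with the path from this post:
#   set_aside_map /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml
```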
Once the topology map was renamed, fmd started up fine:
$ svcadm clear fmd
This also allowed the fault manager to kick into action and make a diagnosis of the faulty hardware:
$ fmadm faulty -a
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
faulted cpu:///cpuid=10
f202538b-9dbe-6f59-bd04-87f9a0954ce8
-------- ----------------------------------------------------------------------
$ fmdump
TIME UUID SUNW-MSG-ID
Apr 03 16:50:16.0797 f202538b-9dbe-6f59-bd04-87f9a0954ce8 AMD-8000-67
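To dig into a particular diagnosis, the UUID column can be pulled out of the fmdump listing and fed back to `fmdump -v -u`. A rough sketch, assuming the column layout shown above (three fields of timestamp, then the UUID, then the message ID); `fmdump_uuids` is a made-up name for illustration:

```shell
# Hypothetical helper: read fmdump output on stdin and print one UUID per
# diagnosis. Assumes the layout shown above, where the timestamp takes up
# the first three whitespace-separated fields and the UUID is the fourth.
fmdump_uuids() {
    awk '$1 != "TIME" && NF >= 5 { print $4 }'
}

# On a live host, verbose detail for each diagnosis could be shown with:
#   fmdump | fmdump_uuids | while read uuid; do fmdump -v -u "$uuid"; done
```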
The fmdump data revealed that we had a faulty DIMM, and provided the bank and DIMM number of the defective hardware. Debugging issues can be a lot of fun, especially when you get to the bottom of a system panic! :)