I recently debugged an issue where a host panicked with the following message:
Apr 3 04:41:44 pluto.prefetch.com genunix: [ID 663943 kern.notice] Unrecoverable Machine-Check Exception
These errors are typically caused by CPU or memory faults, but on this machine the fault and error logs showed nothing when I checked them. Upon closer inspection, it turned out that the fault manager wasn't running and had transitioned into the maintenance state:
$ svcs -a | grep fmd
maintenance 4:46:25 svc:/system/fmd:default
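When a service lands in maintenance, `svcs -x` will explain why it isn't running. The following is a minimal sketch, assuming a host with SMF (Solaris 10 or later). The helper name `needs_attention` is my own invention for illustration, and treating `offline` as a problem state is a judgment call:

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if an SMF state string suggests a problem.
needs_attention() {
    case "$1" in
        maintenance|degraded|offline) return 0 ;;
        *) return 1 ;;
    esac
}

# Only query SMF when svcs exists, so the script is harmless elsewhere.
if command -v svcs >/dev/null 2>&1; then
    # -H drops the header line, -o state prints only the state column.
    state=$(svcs -H -o state svc:/system/fmd:default)
    if needs_attention "$state"; then
        # -x explains why the service is not running; -v adds more detail.
        svcs -xv svc:/system/fmd:default
    fi
fi
```

On the host above, the state would have come back as `maintenance`, so the explanation step would run.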
After poking around Sunsolve, I found that a number of issues can cause fmd to enter the maintenance state. In this case the daemon was core dumping at startup, so I had a core file readily available to help debug the source of the problem. Upon inspecting the panic string, it turned out that we were bumping into bug 6672662. This bug occurs because one of the libtopo maps is incorrectly named. Working around the issue is as simple as renaming the "storage-hc-topology.xml" file to "storage-hc-topology.xml.org":
$ mv /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml.org
Once the topology map was renamed, I cleared the maintenance state and fmd started up fine:
$ svcadm clear fmd
This also allowed the fault manager to kick into action and make a diagnosis of the faulty hardware:
$ fmadm faulty -a
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
faulted cpu:///cpuid=10
f202538b-9dbe-6f59-bd04-87f9a0954ce8
-------- ----------------------------------------------------------------------
$ fmdump
TIME UUID SUNW-MSG-ID
Apr 03 16:50:16.0797 f202538b-9dbe-6f59-bd04-87f9a0954ce8 AMD-8000-67
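To see more detail for a single diagnosis, fmdump can be limited to one UUID and asked for verbose output. A small sketch, assuming the UUID from the output above; the `uuid` variable name is just for illustration:

```shell
#!/bin/sh
# UUID copied from the fmadm faulty / fmdump output above.
uuid=f202538b-9dbe-6f59-bd04-87f9a0954ce8

# Only run when fmdump is present (i.e. on a host with the fault manager).
if command -v fmdump >/dev/null 2>&1; then
    # -u selects the diagnosis with this UUID; -v is verbose (-V is more so).
    fmdump -v -u "$uuid"
fi
```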
The fmdump data revealed that we had a faulty DIMM, and provided the bank and DIMM number of the defective hardware. Debugging issues can be a lot of fun, especially when you get to the bottom of a system panic! :)