Debugging a Solaris fault manager fault

I recently debugged an issue where a host panicked with the following message:

Apr 3 04:41:44 genunix: [ID 663943 kern.notice] Unrecoverable Machine-Check Exception

These errors are typically caused by CPU or memory faults, but on this specific machine nothing was displayed when I checked the fault and error logs. Upon closer inspection, it looked like the fault manager wasn’t running and had transitioned into the maintenance state:

$ svcs -a | grep fmd

maintenance 4:46:25 svc:/system/fmd:default
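The check above can be scripted. The following is a minimal sketch, not part of the original debugging session: `is_fmd_in_maintenance` is a hypothetical helper name that reads `svcs`-style output on stdin, and the `svcs -xv` call is only attempted when the `svcs` command is present.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the svcs-style output on stdin
# lists svc:/system/fmd:default in the maintenance state.
is_fmd_in_maintenance() {
    awk '$1 == "maintenance" && $NF == "svc:/system/fmd:default" { found = 1 }
         END { exit !found }'
}

# Only query SMF when svcs is available (that is, on a Solaris host).
if command -v svcs >/dev/null 2>&1; then
    if svcs -a | is_fmd_in_maintenance; then
        # -x explains why the service is not running; -v adds detail
        # such as the location of the service log file.
        svcs -xv svc:/system/fmd:default
    fi
fi
```

The explanation printed by `svcs -xv` is usually the quickest pointer to the service log, which is where a daemon that dies at startup tends to leave its last words.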

After poking around Sunsolve, I noticed that there are a number of issues that can cause fmd to enter the maintenance state. In this specific case, the daemon was core dumping at startup, so I had a core file readily available to help debug the source of the problem. Upon inspecting the panic string, it turned out that we were bumping into bug #6672662. This bug occurs because one of the libtopo maps is incorrectly named. Fixing this issue is as simple as renaming the “storage-hc-topology.xml” file to “”:

$ mv /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml /usr/platform/i86pc/lib/fm/topo/maps/
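For reference, a core file like the one fmd left behind can be examined with the standard Solaris tools `pstack` and `mdb`. This is a rough sketch under assumptions: `show_core_summary` is a hypothetical helper name, and the core file path is whatever your system's core settings produced, so it is passed in as an argument rather than guessed here.

```shell
#!/bin/sh
# Hypothetical helper: print a quick summary of a process core file.
# Usage: show_core_summary <path-to-core>
show_core_summary() {
    core=$1
    if [ ! -f "$core" ]; then
        echo "no core file at: $core" >&2
        return 1
    fi
    # pstack prints the stack of every thread captured in the core.
    pstack "$core"
    # ::status summarizes the core, including how the process terminated.
    echo '::status' | mdb "$core"
}
```

The stack and status output are what you would compare against the symptoms described in a bug report.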

Once the topology map was renamed, fmd started up fine:

$ svcadm clear fmd

This also allowed the fault manager to kick into action and make a diagnosis of the faulty hardware:

$ fmadm faulty -a

-------- ----------------------------------------------------------------------
faulted cpu:///cpuid=10
-------- ----------------------------------------------------------------------

$ fmdump

Apr 03 16:50:16.0797 f202538b-9dbe-6f59-bd04-87f9a0954ce8 AMD-8000-67
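To dig into a specific diagnosis, the UUID from the fmdump listing can be fed back to fmdump. Here is a small sketch, assuming the default fmdump layout shown above (date, time, UUID, message ID); `extract_uuids` is a hypothetical helper name, and the fmdump calls only run where the command exists.

```shell
#!/bin/sh
# Hypothetical helper: print the UUID column from default fmdump output
# on stdin. Data lines have five fields (month, day, time, UUID, message
# ID); the three-field header line is skipped.
extract_uuids() {
    awk 'NF >= 5 && length($4) == 36 && $4 ~ /^[0-9a-f-]+$/ { print $4 }'
}

if command -v fmdump >/dev/null 2>&1; then
    fmdump | extract_uuids | sort -u | while read -r uuid; do
        # -u selects the diagnosis with this UUID; -v prints more detail.
        fmdump -v -u "$uuid"
    done
fi
```

Running `fmdump -e` alongside this lists the underlying error telemetry that fed the diagnosis.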

The fmdump data revealed that we had a faulty DIMM, and provided the bank and DIMM number of the defective hardware. Debugging issues can be a lot of fun, especially when you get to the bottom of a system panic! :)

This article was posted by Matty on 2009-04-04 09:59:00 -0400 EDT