Debugging a Solaris fault manager fault

I recently debugged an issue where a host panicked with the following message:

Apr 3 04:41:44 pluto.prefetch.com genunix: [ID 663943 kern.notice] Unrecoverable Machine-Check Exception

These errors are typically generated due to CPU or memory faults, but on this specific machine nothing showed up when I checked the fault and error logs. Upon closer inspection, it looked like the fault manager wasn’t running and had transitioned into the maintenance state:

$ svcs -a | grep fmd
maintenance 4:46:25 svc:/system/fmd:default

After poking around SunSolve, I noticed that there are a number of issues that can cause fmd to enter the maintenance state. In this specific case, the daemon was core dumping at startup, so I had a core file readily available to help debug the source of the problem. Inspecting the panic string revealed that we were bumping into bug #6672662, which occurs because one of the libtopo maps is incorrectly named. Fixing the issue is as simple as renaming the “storage-hc-topology.xml” file to “storage-hc-topology.xml.org”:

$ mv /usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml \
/usr/platform/i86pc/lib/fm/topo/maps/storage-hc-topology.xml.org
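
For the curious, the panic string is easy to pull out of the core file with mdb, whose ::status dcmd summarizes how the process died. A quick sketch (the core file path below is hypothetical and depends on your coreadm settings):

$ mdb /var/core/core.fmd.0
> ::status
> $q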

Once the topology map was renamed, I cleared the maintenance state and fmd started up fine:

$ svcadm clear fmd
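
A quick check with svcs confirms the service transitioned back into the online state (the timestamp below is illustrative):

$ svcs fmd
STATE          STIME    FMRI
online         16:48:10 svc:/system/fmd:default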

This also allowed the fault manager to kick into action and make a diagnosis of the faulty hardware:

$ fmadm faulty -a

   STATE RESOURCE / UUID 
-------- ---------------------------------------------------------------------- 
 faulted cpu:///cpuid=10 
         f202538b-9dbe-6f59-bd04-87f9a0954ce8 
-------- ---------------------------------------------------------------------- 

$ fmdump
TIME UUID SUNW-MSG-ID
Apr 03 16:50:16.0797 f202538b-9dbe-6f59-bd04-87f9a0954ce8 AMD-8000-67

The fmdump data revealed that we had a faulty DIMM, and provided the bank and DIMM number of the defective hardware. Debugging issues can be a lot of fun, especially when you get to the bottom of a system panic! :)
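
If you need more detail on a diagnosis, fmdump can be run in verbose mode with the fault’s UUID, and the SUNW-MSG-ID (AMD-8000-67 above) can be plugged into http://www.sun.com/msg/ to retrieve the knowledge article with the recommended course of action. A sketch using the UUID from this incident:

$ fmdump -v -u f202538b-9dbe-6f59-bd04-87f9a0954ce8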

SCSI Enclosure Services

Eric Schrock has done some really cool work integrating disk (SMART) and platform (IPMI) monitoring information into OpenSolaris. Just recently, he extended FMA with a new technology called SES (SCSI Enclosure Services), which was integrated into build 93 of OpenSolaris.

This looks like some really cool stuff. The following examples, taken directly from his blog, use the new fmtopo utility to map out an external storage array.

# /usr/lib/fm/fmd/fmtopo
...

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:serial=2029QTF0000000002:part=Storage-J4400:revision=3R13/ses-enclosure=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:part=123-4567-01/ses-enclosure=0/psu=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=2

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/fan=3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0386:part=375-3584-01/ses-enclosure=0/controller=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=2029QTF0811RM0074:part=375-3584-01/ses-enclosure=0/controller=1

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=/ses-enclosure=0/bay=1

...

# fmtopo -V '*/ses-enclosure=0/bay=0/disk=0'
TIME                 UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
    ASRU              fmri      dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    label             string    SCSI Device  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0809QCK012
    server-id         string
  group: io                             version: 1   stability: Private/Private
    devfs-path        string    /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    devid             string    id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
    phys-path         string[]  [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
  group: storage                        version: 1   stability: Private/Private
    logical-disk      string    c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
    manufacturer      string    SEAGATE
    model             string    ST37500NSSUN750G 0720A0PC3X
    serial-number     string    5QD0PC3X
    firmware-revision string       3.AZK
    capacity-in-bytes string    750156374016

Monitoring the IPMI system event log

If you have a relatively recent server, your machine most likely supports IPMI. One technology that makes IPMI extremely useful is the baseboard management controller (BMC), which is an out-of-band controller that monitors the health of your server platform. Health monitoring is accomplished by distributing sensors throughout the server, and feeding the data these sensors collect back to the BMC. If the BMC detects a fault condition, it can log an error to the system event log.
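
If the ipmitool package described below is installed, you can eyeball the sensor readings the BMC is collecting with the sdr subcommand (on Solaris you may need to specify the bmc interface with -I bmc):

$ ipmitool sdr list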

The event log can be monitored by a number of IPMI software packages. One such package is ipmitool, which provides the ipmievd daemon just for this purpose. If you are running a recent version of Solaris 10*, you probably already have the IPMI software and the ipmievd daemon installed. You can use the following commands to check:

$ pkginfo | grep ipmi
system SUNWipmi ipmitool, (usr)
system SUNWipmir ipmitool, (root)

If the software is installed, you can use the svcadm utility to enable the ipmievd daemon:

$ svcadm enable svc:/network/ipmievd:default

Once the ipmievd service is enabled, you can use the ps and svcs commands to verify that the daemon is running:

$ svcs -a | grep ipmi
online 0:25:52 svc:/network/ipmievd:default

$ ps -ef | grep ipmi
root 328 1 0 00:25:53 ? 0:01 /usr/lib/ipmievd sel


Once the daemon starts up, it will periodically poll the BMC system event log. If ipmievd detects an error condition, it will log a message to syslog with details on the fault, which can be used to help determine that a server is sick. Since FMA currently doesn’t do platform health monitoring (the sensor project will fix this), ipmievd is able to step in and fill that role for the time being. Nice!
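
You can also dump the system event log yourself with the ipmitool sel subcommand, which is handy for spot checking a box (again, -I bmc may be needed on Solaris):

$ ipmitool sel list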

* This blog post assumes you are running Solaris 10 update 4 with patch 119765-06.

Getting notified when hardware breaks

With the introduction of Solaris 10, the Solaris kernel was modified and userland tools were added to detect and report on hardware faults. The fault analysis is handled by the Solaris fault manager, which currently detects and responds (the kernel can retire memory pages, CPUs, etc. when it detects faulty hardware) to failures in AMD and SPARC CPU modules, PCI and PCIe buses, memory modules, and disk drives, with support for Intel Xeon processors and system sensors (e.g., fan speed, thermal sensors, etc.) on the way.
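
To see which fault manager modules (diagnosis engines, transports and response agents) are active on a host, the fmadm utility can be run with the “config” option; the module list will vary by platform and release:

$ fmadm config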

To see if the fault manager has diagnosed a faulty component on a Solaris 10 or Nevada host, the fmadm utility can be run with the “faulty” option:

$ fmadm faulty

   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------

The fmadm output includes the suspect component, the state of the component and a unique identifier for the fault. Since hardware faults can lead to system and application outages, I like to configure the FMA SNMP agent to send an SNMP trap with the hardware fault details to my NMS station, and my FMA notifier script to send the same details to my BlackBerry. The emails that are generated by the fmanotifier script look similar to the following:

From matty@lucky Sat Aug 18 14:58:29 2007
Date: Sat, 18 Aug 2007 14:58:29 -0400 (EDT)
From: matty@lucky
To: root@lucky
Subject: Hardware fault on lucky

The fault manager detected a problem with the system hardware.
The fmadm and fmdump utilities can be run to retrieve additional
details on the faults and recommended next course of action. 

Fmadm faulty output:

   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
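
If you don’t want to grab the full script, something quick and dirty along the following lines can be run out of cron. This is a minimal sketch that assumes fmadm faulty produces no output when there are no open faults, and that mailx is available (my fmanotifier script does considerably more):

#!/bin/sh
# Mail the fmadm faulty output to root if any faults are present.
FAULTS=`/usr/sbin/fmadm faulty 2>/dev/null`
if [ -n "$FAULTS" ]; then
        echo "$FAULTS" | mailx -s "Hardware fault on `hostname`" root
fi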

I absolutely adore FMA, and wish there were something similar for Linux. Hopefully the Linux kernel engineers will provide similar functionality in the future, since this greatly simplifies the process of identifying broken hardware.

Solaris SMART support is finally becoming a reality!!

A while back I wrote a blog entry about the lack of SMART support in Solaris. Just recently, Eric Schrock added an FMA disk-transport diagnosis engine, which provides generic SMART monitoring as part of the base operating system. The disk-transport diagnosis engine currently only supports SATA disk drives, but SCSI support is right around the corner! This is exciting news, and I am stoked that SMART support is finally becoming a reality!!