Getting notified when hardware breaks


Solaris 10 introduced kernel changes and userland tools that detect and report hardware faults. Fault analysis is handled by the Solaris fault manager, which currently detects and responds to failures in AMD and SPARC CPU modules, PCI and PCIe buses, memory modules and disk drives. When it detects faulty hardware, the kernel can retire memory pages, CPUs, etc. Support for Intel Xeon processors and system sensors (e.g., fan speed and thermal sensors) is expected eventually.

To see if the fault manager has diagnosed a faulty component on a Solaris 10 or Nevada host, the fmadm utility can be run with the “faulty” option:

$ fmadm faulty

STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
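
If you want to feed this listing to another tool, it can be split into records. The following is only a rough sketch in Python, assuming the layout shown above (a state and resource on one line, followed by the UUID on the next line, with dashed separators in between). The function name parse_fmadm_faulty is my own invention, and other Solaris releases may format the output differently:

```python
def parse_fmadm_faulty(text):
    """Turn 'fmadm faulty' style text into a list of dicts.

    Assumes the two-lines-per-entry layout shown in this post.
    """
    faults = []
    pending = None  # (state, resource) still waiting for its UUID line
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the column header and the dashed separators.
        if not line or line.startswith("STATE") or set(line) <= {"-", " "}:
            pending = None
            continue
        if pending is None:
            parts = line.split(None, 1)
            if len(parts) == 2:
                pending = (parts[0], parts[1])
        else:
            faults.append(
                {"state": pending[0], "resource": pending[1], "uuid": line}
            )
            pending = None
    return faults


sample = """STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
"""

for fault in parse_fmadm_faulty(sample):
    print(fault["state"], fault["resource"], fault["uuid"])
```

Keeping the parser separate from whatever runs fmadm makes it easy to try against saved output before pointing it at a live host.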

The fmadm output includes the suspect component, the state of the component and a unique identifier (UUID) for the fault. Since hardware faults can lead to system and application outages, I like to configure the FMA SNMP agent to send an SNMP trap with the hardware fault details to my NMS station, and I also configure my FMA notifier script to send the hardware fault details to my BlackBerry. The emails generated by the fmanotifier script look similar to the following:

From matty@lucky Sat Aug 18 14:58:29 2007
Date: Sat, 18 Aug 2007 14:58:29 -0400 (EDT)
From: matty@lucky
To: root@lucky
Subject: Hardware fault on lucky

The fault manager detected a problem with the system hardware.
The fmadm and fmdump utilities can be run to retrieve additional
details on the faults and recommended next course of action.

Fmadm faulty output:

STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
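
For readers who want to roll something similar, here is a minimal sketch of how a notifier along these lines could work. This is not the fmanotifier script itself. The function names, the use of a local SMTP server on localhost, and the rule for deciding whether any faults are listed (any line that is not blank, the header, or a dashed separator) are all assumptions of mine:

```python
import smtplib
import subprocess
from email.message import EmailMessage


def has_faults(fmadm_output):
    """Return True if the output lists at least one fault entry (assumed rule)."""
    for raw in fmadm_output.splitlines():
        line = raw.strip()
        if line and not line.startswith("STATE") and not set(line) <= {"-", " "}:
            return True
    return False


def build_fault_email(hostname, fmadm_output, sender, recipient):
    """Build the notification message, or return None when nothing is faulty."""
    if not has_faults(fmadm_output):
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Hardware fault on " + hostname
    msg.set_content(
        "The fault manager detected a problem with the system hardware.\n"
        "The fmadm and fmdump utilities can be run to retrieve additional\n"
        "details on the faults and recommended next course of action.\n\n"
        "Fmadm faulty output:\n\n" + fmadm_output
    )
    return msg


def check_and_notify(hostname, sender, recipient):
    """Run 'fmadm faulty' and mail the result via an assumed local SMTP server."""
    result = subprocess.run(
        ["fmadm", "faulty"], capture_output=True, text=True, check=False
    )
    msg = build_fault_email(hostname, result.stdout, sender, recipient)
    if msg is not None:
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
    return msg
```

Something like check_and_notify("lucky", "matty@lucky", "root@lucky") could then be run periodically from cron. Note that fmadm typically needs elevated privileges, so the job would have to run as a user that is allowed to query the fault manager.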

I absolutely adore FMA, and wish there were something similar for Linux. Hopefully the Linux kernel engineers will provide similar functionality in the future, since it greatly simplifies the process of identifying broken hardware.

This article was posted by Matty on 2007-09-03 08:49:00 -0400