Getting To Know The Solaris Fault Management Architecture
Nov 14, 2007 Atlanta Solaris Users Group
Overview
- Tonight we are going to discuss the Solaris fault management architecture, as
well as several tools that can be used to notify you when hardware fails
- I plan to split my 60 minutes into two parts. The first part will provide an
overview of the software, and the second part will show FMA in action
The imperfect world
- Diagnosing hardware faults has historically been a royal PITA, and extremely
prone to error (anyone ever replaced a DIMM when the CPU or power supply was
actually faulty?)
- With advances in hardware error correction and reporting, you would think
there would be an automated way to diagnose hardware faults, isolate faulty
components, and send alerts to let operational staff know that a problem
exists
Welcome to FMA
- The Fault Management Architecture (FMA) and Service Management Facility (SMF)
were introduced in Solaris 10 to allow systems to heal themselves when
hardware and software fail
- FMA provides automated diagnosis of faulty hardware, and can take proactive
measures to correct (e.g., offline a CPU) hardware-related faults
- SMF provides automated diagnosis of software faults, and can take proactive
measures to correct (e.g., restart a process) software-related faults
- The remainder of this talk will focus on FMA
How does FMA work?
- Kernel sends error events to the fault manager daemon (fmd), which routes the
events to modules based on subscriptions
- Two main types of modules:
- Diagnosis engines
- take the raw error telemetry events and provide automated problem diagnosis
based on the symptoms
- Agents
- respond to a given diagnosis by taking one or more actions (e.g., offline a
faulty CPU or memory page)
- When problems are diagnosed, the fault manager will log a fault diagnosis
message that contains a case id (represented by a UUID) which references the
problem, a description of the problem, and a link to a knowledge base
article that describes the problem and the set of actions that will be
required to fix the problem
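To make the routing idea concrete, here is a minimal Python sketch of subscription-based dispatch. It is a toy model, not how fmd is implemented: the class names (ToyFaultManager, CpuDiagnosisEngine, CpuRetireAgent), the event class strings, and the error threshold of 3 are all made up for illustration.

```python
from fnmatch import fnmatchcase


class ToyFaultManager:
    """Toy stand-in for fmd's routing step; not the real implementation."""

    def __init__(self):
        self.subscriptions = []  # (class pattern, module callback) pairs

    def subscribe(self, pattern, module):
        self.subscriptions.append((pattern, module))

    def post(self, event_class, payload):
        # Deliver the event to every module whose pattern matches its class.
        for pattern, module in self.subscriptions:
            if fnmatchcase(event_class, pattern):
                module(self, event_class, payload)


class CpuDiagnosisEngine:
    """Hypothetical engine: diagnoses a fault after `threshold` errors."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.errors = {}  # cpu id -> number of error events seen

    def __call__(self, fm, event_class, payload):
        cpu = payload["cpu"]
        self.errors[cpu] = self.errors.get(cpu, 0) + 1
        if self.errors[cpu] == self.threshold:
            # The diagnosis is itself an event, routed like any other.
            fm.post("fault.cpu.example", payload)


class CpuRetireAgent:
    """Hypothetical agent: records which CPUs it would take offline."""

    def __init__(self):
        self.retired = []

    def __call__(self, fm, event_class, payload):
        self.retired.append(payload["cpu"])


def demo():
    fm = ToyFaultManager()
    agent = CpuRetireAgent()
    fm.subscribe("ereport.cpu.*", CpuDiagnosisEngine(threshold=3))
    fm.subscribe("fault.cpu.*", agent)
    for _ in range(3):
        fm.post("ereport.cpu.example.ce", {"cpu": 1})
    fm.post("ereport.cpu.example.ce", {"cpu": 2})
    return agent.retired  # only CPU 1 crossed the threshold
```

The point of the sketch is the separation of concerns: the diagnosis engine only counts symptoms and emits a diagnosis, and the agent only reacts to diagnoses, so neither needs to know about the other.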
How does FMA work? (cont.)
- The example below shows what a typical fault diagnosis message looks like:
SUNW-MSG-ID: SUN4U-8000-AC, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Feb 26 18:08:26 PST 2004
PLATFORM: SUNW,Sun-Fire-V440, CSN: -, HOSTNAME: boz
SOURCE: cpumem-diagnosis, REV: 0.1
EVENT-ID: 322fe6d5-fe14-6a73-b802-cc6c30b2afcd
DESC: The number of errors associated with this CPU has exceeded acceptable
levels. Refer to http://sun.com/msg/SUN4U-8000-AC for more information.
AUTO-RESPONSE: An attempt will be made to remove the affected CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
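If you want to pull fields such as the message ID or the case UUID out of a message like this (say, from a syslog file), a small parser is enough. The sketch below is Python and assumes the message is laid out as KEY: value pairs, with several comma-separated pairs on the header lines. The function name parse_fault_message is my own invention, and the regular expression has only been tried against the sample in this talk.

```python
import re

# A key is an upper-case token at the start of a line or right after ", ".
_KEY = re.compile(r"(?:^|,\s)([A-Z][A-Z-]*):\s", re.MULTILINE)

SAMPLE = """\
SUNW-MSG-ID: SUN4U-8000-AC, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Feb 26 18:08:26 PST 2004
PLATFORM: SUNW,Sun-Fire-V440, CSN: -, HOSTNAME: boz
SOURCE: cpumem-diagnosis, REV: 0.1
EVENT-ID: 322fe6d5-fe14-6a73-b802-cc6c30b2afcd
DESC: The number of errors associated with this CPU has exceeded acceptable
levels. Refer to http://sun.com/msg/SUN4U-8000-AC for more information.
AUTO-RESPONSE: An attempt will be made to remove the affected CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
"""


def parse_fault_message(text):
    """Return the KEY: value pairs of a fault diagnosis message as a dict."""
    matches = list(_KEY.finditer(text))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # A value runs from the end of its key to the start of the next key.
        end = following.start() if following else len(text)
        value = text[current.end():end]
        # Collapse wrapped lines and drop the comma that separates pairs.
        fields[current.group(1)] = " ".join(value.split()).rstrip(",")
    return fields


fields = parse_fault_message(SAMPLE)
print(fields["SUNW-MSG-ID"], fields["EVENT-ID"])
```

The EVENT-ID value is the case UUID, which is the handle used by the fmdump and fmadm commands later in the talk.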
What diagnosis engines are currently available?
- There are diagnosis engines available for a number of CPU architectures:
- UltraSPARC III and above
- UltraSPARC T1 and T2
- Intel Xeons
- AMD Opterons
- AMD Athlons
- Diagnosis engines are also available for PCI and PCI Express buses, as well
as a number of leaf node drivers (e.g., Ethernet adapters, HBA drivers,
etc.)
- A disk-transport diagnosis engine was recently added to Nevada to diagnose
SATA and SCSI disk drive errors using SMART data
Which agents are currently available?
- AMD, Intel and SPARC agents are available to retire CPUs and memory pages
- Disk transport and I/O agents are available to retire disk drives and faulty
I/O devices
- The ZFS agent allows the ZFS file system to enable hot spares in response to
disk failures
- More agents to come …
What diagnosis engines and agents are coming to an OpenSolaris near you?
- Sensor project will provide fault diagnosis based on sensor data (e.g.,
increase fan speeds in response to excessive heat, offline disk drives due to
an unacceptable number of ECC errors, etc.)
- More leaf drivers will be hardened to send error telemetry data when they
detect faults
- Software diagnosis engines will be developed to diagnose software faults, and
take appropriate actions (e.g., restart a process that is using a page of
memory that had uncorrectable ECC errors)
- And potentially a lot more …
Viewing the active diagnosis engines and agents
- The fmadm utility’s “config” option can be used to view the list of
diagnosis engines and agents that are active on a system:
$ fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent
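If you manage a number of machines, it can be handy to check from a script that the modules you depend on are actually active. Here is a rough Python sketch that parses captured "fmadm config" output. It assumes the four-column layout shown above, and the function names (parse_fmadm_config, missing_modules) are hypothetical helpers, not part of any Solaris tool.

```python
SAMPLE_CONFIG = """\
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
eft                      1.16    active  eft diagnosis engine
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
"""


def parse_fmadm_config(text):
    """Parse `fmadm config` output into a list of module dicts."""
    modules = []
    for line in text.splitlines():
        # Name, version and status are single words; the rest is description.
        parts = line.split(None, 3)
        # Skip short or blank lines and the column header.
        if len(parts) < 4 or parts[0] == "MODULE":
            continue
        name, version, status, description = parts
        modules.append({"module": name, "version": version,
                        "status": status, "description": description})
    return modules


def missing_modules(modules, required):
    """Return the required module names that are not listed as active."""
    active = {m["module"] for m in modules if m["status"] == "active"}
    return sorted(set(required) - active)


modules = parse_fmadm_config(SAMPLE_CONFIG)
print(missing_modules(modules, ["snmp-trapgen", "syslog-msgs"]))
```

With the sample above the list printed is empty, meaning both notification agents are loaded and active.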
Fault manager logs
- The fault manager maintains two log files:
- The error log contains a list of error events that have been sent to the
fault manager daemon
- The fault log contains a list of problems that have been diagnosed and
repaired
- The fault log can be viewed by running fmdump:
$ fmdump
- The error log can be viewed with fmdump’s “-e” option:
$ fmdump -e
- fmdump also has a “-u” option to limit the output to a specific UUID, “-t”
and “-T” options to limit the output to events that occurred at or after and
at or before a given time, and “-v” and “-V” options to display verbose output
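If you end up calling fmdump from a monitoring script, it helps to build the command line in one place. The Python sketch below only assembles the argument list from the “-e”, “-u”, “-v” and “-V” options; the helper name fmdump_argv and its parameters are my own, and it does not run anything by itself.

```python
def fmdump_argv(errors=False, uuid=None, verbosity=0):
    """Build an fmdump command line as a list of arguments.

    errors    -- True to read the error log (-e) instead of the fault log
    uuid      -- limit the output to one case UUID (-u)
    verbosity -- 0 for the default output, 1 for -v, 2 or more for -V
    """
    argv = ["fmdump"]
    if errors:
        argv.append("-e")
    if verbosity == 1:
        argv.append("-v")
    elif verbosity >= 2:
        argv.append("-V")
    if uuid:
        argv += ["-u", uuid]
    return argv


# To actually run it on a Solaris host, something along these lines:
#   import subprocess
#   out = subprocess.run(fmdump_argv(uuid=case_id, verbosity=2),
#                        capture_output=True, text=True).stdout
print(fmdump_argv(uuid="a0461e5e-4356-ca7b-ee83-c66816b9caba", verbosity=2))
```

Passing a list rather than a single string avoids shell quoting problems if a value ever contains unexpected characters.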
Viewing faulty components
- To view the faulted resources on a system, the fmadm utility can be run with
the “faulty” option:
$ fmadm faulty
STATE RESOURCE / UUID
-------- --------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
a0461e5e-4356-ca7b-ee83-c66816b9caba
degraded dev:////pci@8,700000/pci@3
a0461e5e-4356-ca7b-ee83-c66816b9caba
- The output above contains the faulted component, the unique case identifier,
and the state (i.e., ok, unknown, degraded, faulted) the component is
currently in.
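For scripting, the two-line-per-resource layout is easy to turn into records. This is a Python sketch that assumes the format shown above (a state and resource on one line, the case UUID on the next); the name parse_fmadm_faulty is hypothetical, and other Solaris releases may print a different layout.

```python
import re

UUID_RE = re.compile(r"^[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}$")
STATES = {"ok", "unknown", "degraded", "faulted"}

SAMPLE_FAULTY = """\
STATE RESOURCE / UUID
-------- --------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
a0461e5e-4356-ca7b-ee83-c66816b9caba
degraded dev:////pci@8,700000/pci@3
a0461e5e-4356-ca7b-ee83-c66816b9caba
"""


def parse_fmadm_faulty(text):
    """Return a list of {state, resource, uuid} dicts from `fmadm faulty`."""
    records = []
    for line in text.splitlines():
        words = line.split()
        if len(words) == 2 and words[0] in STATES:
            # First line of a pair: the state and the resource name.
            records.append({"state": words[0], "resource": words[1],
                            "uuid": None})
        elif len(words) == 1 and UUID_RE.match(words[0]) and records:
            # Second line of a pair: the case UUID for the previous resource.
            records[-1]["uuid"] = words[0]
    return records


for record in parse_fmadm_faulty(SAMPLE_FAULTY):
    print(record["state"], record["resource"], record["uuid"])
```

Note that both resources in the sample share one UUID: a single diagnosed case can implicate more than one component.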
Viewing all faults
- To view all of the faults (which includes silent faults) that have occurred on
a system, you can run fmadm with the “faulty” and “-a” options:
$ fmadm faulty -a
- This will display all cached faults, including faults that don’t necessarily
indicate a component is faulty (e.g., a single page being retired)
Repairing faults
- Once a problem has been resolved (e.g., a CPU has been replaced), the fmadm
utility can be run with the “repair” option and the UUID of the case to mark
it as repaired:
$ fmadm repair a0461e5e-4356-ca7b-ee83-c66816b9caba
- This will update the fault manager’s resource cache to indicate that no
problems are present with the components associated with the UUID
Getting notified when things break
- The fault management architecture wouldn’t be all that useful if it didn’t
provide methods to alert people when problems occur
- The fault manager logs diagnosis messages to syslog and the system console
each time a fault is diagnosed, and can be configured to generate SNMPv1 traps
or SNMPv2 notifications
- FMA currently doesn’t have built-in support for email notifications, but third
party tools are available to send email when faults occur
Enabling SNMP support
- To configure FMA to send SNMPv1 traps, you can add one or more trapsink
directives to the SNMP daemon’s snmpd.conf configuration file:
trapsink 192.168.1.100 public 162
trapsink 192.168.1.101 public 162
- To configure FMA to send SNMPv2 notifications, you can add one or more
informsink directives to the SNMP daemon’s snmpd.conf configuration file:
informsink 192.168.1.100 public 162
informsink 192.168.1.101 public 162
Getting email when hardware faults occur
- Since FMA doesn't contain built-in support for email notifications, I
developed a shell script to send email when the fault manager diagnoses a
fault
- The script is designed to be run from cron and can be downloaded from my
website:
- fmadmnotifier script
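To give a feel for the approach, here is a rough Python sketch of a cron-driven notifier. This is not the fmadmnotifier script (which is a shell script); it is an illustration only. The function names, the state file, the addresses, and the assumption that a mail server is listening on localhost are all mine.

```python
import re
import smtplib
import subprocess
from email.message import EmailMessage

UUID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b")


def new_case_ids(faulty_output, already_seen):
    """Return case UUIDs in `fmadm faulty` output not reported before."""
    found = set(UUID_RE.findall(faulty_output))
    return sorted(found - set(already_seen))


def build_message(hostname, case_ids, faulty_output, sender, recipient):
    """Compose the notification email for a list of new case UUIDs."""
    msg = EmailMessage()
    msg["Subject"] = "FMA: %d new fault(s) on %s" % (len(case_ids), hostname)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("New case IDs:\n%s\n\nfmadm faulty output:\n%s"
                    % ("\n".join(case_ids), faulty_output))
    return msg


def check_and_notify(state_file, hostname, sender, recipient):
    """Run from cron: mail any case UUIDs not already in state_file."""
    output = subprocess.run(["fmadm", "faulty"], capture_output=True,
                            text=True, check=True).stdout
    try:
        with open(state_file) as fh:
            seen = fh.read().split()
    except FileNotFoundError:
        seen = []  # first run: nothing has been reported yet
    fresh = new_case_ids(output, seen)
    if fresh:
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA
            smtp.send_message(build_message(hostname, fresh, output,
                                            sender, recipient))
        with open(state_file, "a") as fh:
            fh.write("\n".join(fresh) + "\n")
    return fresh
```

Remembering the UUIDs that were already reported is what keeps a job that runs every few minutes from mailing you about the same fault over and over.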
Hardware testing utilities
FMA Resources
- FMA demo kit
- FMA programmer's reference manual
- FMA MIB
- Mike Shapiro’s FMA presentation
Conclusion
- FMA is an incredibly powerful technology, and should make every admin smile
- Long gone are the days of confusing error messages, fruitless hardware
debugging, extended downtimes and the frustration that goes along with them!