Solaris fault manager overview


One of the coolest features in Solaris 10 is the fault management service. Fault management allows system software to send telemetry data to the fmd(1m) daemon, which diagnoses the problem and takes action based on the type of event received (e.g., offlining a faulty component and logging an error with FMRI/UUID information to syslog). The diagnosis phase is handled by a set of diagnosis engines, which can be viewed, along with the other loaded modules such as the retire and syslog agents, with the fmadm(1m) utility’s “config” subcommand:

$ fmadm config

MODULE             VERSION STATUS  DESCRIPTION
USII-io-diagnosis  1.0     active  UltraSPARC-II I/O Diagnosis
cpumem-retire      1.0     active  CPU/Memory Retire Agent
eft                1.12    active  eft diagnosis engine
fmd-self-diagnosis 1.0     active  Fault Manager Self-Diagnosis
io-retire          1.0     active  I/O Retire Agent
syslog-msgs        1.0     active  Syslog Messaging Agent
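
For a scripted health check, it can be handy to flag any module that is not in the active state. Here is a minimal sketch, assuming the column layout shown above (module name, version, and status as the first three whitespace-separated fields). The function name inactive_modules is made up for illustration and is not part of fmadm:

```shell
#!/bin/sh
# Hypothetical helper (not part of fmadm): reads `fmadm config`-style output
# on stdin and prints the name of every module whose STATUS is not "active".
# Assumes MODULE, VERSION and STATUS are the first three whitespace-separated
# fields, as in the listing above.
inactive_modules() {
    awk 'NF >= 3 && $1 != "MODULE" && $3 != "active" { print $1 }'
}

# Typical use on a live system:
#   fmadm config | inactive_modules
```

With the listing above as input, the function prints nothing, since every module is active.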

If the fault manager daemon (fmd) detects a fault, it will log a detailed message to syslog and update the fault manager error and fault logs. The contents of these log files can be viewed with the fmdump(1m) utility, which displays the fault log by default and the error log when run with the “-e” option:

$ fmdump -v

TIME UUID SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty

$ fmdump -e -v

TIME CLASS ENA
fmdump: /var/fm/fmd/errlog is empty
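
To turn this into a quick "is anything logged?" check, the output can be counted in a script. Here is a rough sketch, assuming non-verbose fmdump output with one line per record, a header line starting with TIME, and the "is empty" notice shown above. The function name count_records is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper (not part of fmdump): counts records in non-verbose
# `fmdump` output read from stdin. Assumes one line per record; skips blank
# lines, the header line starting with "TIME", and any "fmdump: ..." notice
# such as the "is empty" message.
count_records() {
    awk 'NF > 0 && $1 != "TIME" && $1 != "fmdump:" { n++ } END { print n + 0 }'
}

# Typical use on a live system (2>&1 in case the notice goes to stderr):
#   fmdump 2>&1 | count_records      # fault log
#   fmdump -e 2>&1 | count_records   # error log
```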

If a device is diagnosed as faulty, this will be indicated in the fmadm(1m) “faulty” output:

$ fmadm faulty

STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------

The fault management daemon (fmd) keeps track of service events and numerous pieces of key statistical data. This information can be accessed and printed with the fmstat(1m) utility:

$ fmstat

module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
USII-io-diagnosis        0       0  0.0    0.0   0   0     0     0      0      0
cpumem-retire            0       0  0.0    0.0   0   0     0     0      0      0
eft                      0       0  0.0    0.0   0   0     0     0   552K      0
fmd-self-diagnosis       0       0  0.0    0.0   0   0     0     0      0      0
io-retire                0       0  0.0    0.0   0   0     0     0      0      0
syslog-msgs              0       0  0.0    0.0   0   0     0     0    32b      0
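
On a busy box, most of these rows will be all zeros, so it can help to filter down to the modules that have actually seen events or have open cases. Here is a minimal sketch, assuming the eleven-column layout shown above, with ev_recv as the second field and open as the eighth. The function name busy_modules is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of fmstat): reads `fmstat`-style output on
# stdin and prints "module ev_recv open" for every module that has received
# at least one event or has at least one open case.
# Assumes the eleven-column layout shown above (ev_recv = field 2,
# open = field 8).
busy_modules() {
    awk 'NF >= 11 && $1 != "module" && ($2 > 0 || $8 > 0) { print $1, $2, $8 }'
}

# Typical use on a live system:
#   fmstat | busy_modules
```

With the idle output above as input, the function prints nothing.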

If you are interested in learning more about this amazingly cool technology, you can check out the following resources:

Mike Shapiro’s ACM Fault Management Article

Mike Shapiro’s Fault Management Presentation

This article was posted by Matty on 2005-09-29 14:06:00 -0400 EDT