Solaris fault manager overview

One of the coolest features in Solaris 10 is the fault management service. Fault management allows system software to send telemetry data to the fmd(1m) daemon, which diagnoses the problem and takes action based on the type of event received (e.g., offlining a faulty component and logging an error with FMRI/UUID information to syslog). The diagnosis phase is controlled by a set of diagnosis engines. These are loaded as fmd modules alongside the agents that act on a diagnosis, and the full list can be viewed with the fmadm(1m) utility's “config” option:

$ fmadm config

MODULE                   VERSION STATUS  DESCRIPTION
USII-io-diagnosis        1.0     active  UltraSPARC-II I/O Diagnosis
cpumem-retire            1.0     active  CPU/Memory Retire Agent
eft                      1.12    active  eft diagnosis engine
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                1.0     active  I/O Retire Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
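
If you want to check the module list from a script, something along these lines might help. It is only a sketch: the function name inactive_modules is made up for illustration, and it assumes the four-column layout shown above (module name first, STATUS third).

```shell
# Hypothetical helper: print the name and status of every module whose
# STATUS column is anything other than "active".
# Reads `fmadm config` output on stdin and assumes the column layout
# shown above (MODULE VERSION STATUS DESCRIPTION).
inactive_modules() {
    awk 'NF >= 3 && $1 != "MODULE" && $3 != "active" { print $1, $3 }'
}

# Intended usage on a Solaris host:
#   fmadm config | inactive_modules
```

With the output shown above, every module is active, so the helper would print nothing.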

If the fault manager daemon (fmd) detects a fault, it will log a detailed message to syslog and update the fault manager's error and fault logs. The contents of these log files can be viewed with the fmdump(1m) utility, which shows the fault log by default and the error log when run with “-e”:

$ fmdump -v

TIME                 UUID                                 SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty

$ fmdump -e -v

TIME                 CLASS                                 ENA
fmdump: /var/fm/fmd/errlog is empty

If a device is diagnosed as faulty, it will be listed in the fmadm(1m) “faulty” output. On a healthy system only the header is printed:

$ fmadm faulty

   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------

The fault management daemon (fmd) tracks the events each module receives, along with a number of per-module statistics such as open and solved cases and memory use. This information can be printed with the fmstat(1m) utility:

$ fmstat

module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
USII-io-diagnosis        0       0  0.0    0.0   0   0     0     0      0      0
cpumem-retire            0       0  0.0    0.0   0   0     0     0      0      0
eft                      0       0  0.0    0.0   0   0     0     0   552K      0
fmd-self-diagnosis       0       0  0.0    0.0   0   0     0     0      0      0
io-retire                0       0  0.0    0.0   0   0     0     0      0      0
syslog-msgs              0       0  0.0    0.0   0   0     0     0    32b      0
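
To pick out the modules that have actually seen some activity, a small filter over the fmstat output is one option. This is a rough sketch rather than anything official: the function name busy_modules is hypothetical, and it assumes the eleven-column layout shown above, with ev_recv in column 2 and open in column 8.

```shell
# Hypothetical helper: list modules that have received at least one event
# or currently have an open case.
# Reads `fmstat` output on stdin and assumes the column order shown above
# (module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz).
busy_modules() {
    awk '$1 != "module" && NF >= 11 && ($2 + 0 > 0 || $8 + 0 > 0) {
        print $1, "ev_recv=" $2, "open=" $8
    }'
}

# Intended usage on a Solaris host:
#   fmstat | busy_modules
```

On the idle system shown above, all of the counters are zero, so nothing would be printed.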

If you are interested in learning more about this amazingly cool technology, you can check out the following resources:

Mike Shapiro’s ACM Fault Management Article

Mike Shapiro’s Fault Management Presentation
