We live in a world where hardware breaks, and when it does, most administrators want to be notified that something failed, along with the specific component that failed. The Sun Galaxy server line contains built-in hardware monitoring, and allows hardware events to be sent to administrators through SNMP, email and syslog. The hardware notification facility in the X2200 servers complements the alerting capabilities built into the fault management architecture (FMA), and when the two are combined, hardware problems should be relatively easy to diagnose and fix.
To configure the X2200 to send email when a hardware event occurs, an email server and recipient email address need to be configured through the ILOM interface. This can be accomplished by logging into the ILOM and using the set command to configure a recipient and an SMTP server:
/SP -> set /SP/AgentInfo/mail/receiver1 EmailAddress=ops@prefetch.net
/SP -> set /SP/AgentInfo/mail SMTPServer=1.2.3.4
The X2200 also provides a facility to send syslog messages when a hardware event occurs. To configure the X2200 to generate a syslog message in response to a hardware event, the set command can be run to enable syslog events, and to configure the destination syslog server:
/SP -> set /SP/AgentInfo/SEL ipaddress=1.2.3.4
/SP -> set /SP/AgentInfo/SEL status=enable
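On the destination server, the incoming events still need to be routed somewhere useful. The fragment below is a sketch of what that might look like in /etc/syslog.conf; the daemon facility and warning level are assumptions (check the messages that actually arrive from the SP to confirm the facility and level it uses), and remember that syslog.conf fields must be separated by tabs:

```
# Route daemon.warning (and higher) messages to a dedicated log file.
# Facility/level are assumptions -- verify against real SP events.
daemon.warning	/var/adm/hardware.log
```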
Once a syslog destination and email server are configured, the set command can be used to enable hardware events. The following set commands will enable hardware events for all sensor types (e.g., fans, CPUs, memory, etc.) and enable email and syslog notifications:
/SP -> set /SP/AgentInfo/PEF/EventFilterTable1 SensorType=all
/SP -> set /SP/AgentInfo/PEF/EventFilterTable1 SendAlert=enable
/SP -> set /SP/AgentInfo/PEF/EventFilterTable1 SendMail=enable
/SP -> set /SP/AgentInfo/PEF/EventFilterTable1 Status=enable
This is good stuff, but there is one downside with the X2200 ILOM (well, actually, there are a lot more, but I will discuss those in a future blog entry). There is currently no way to generate test events from the ILOM CLI or through the web interface. This is a huge issue IMHO, and it’s unfortunate that Sun didn’t include this capability in the X2200 ILOM software (most of the other Galaxy servers support it; I’m not sure why the X2200 had to be different). Hopefully Sun will address this glaring omission in a future code release.
On more than one occasion now, I have run into problems where the Solaris boot archive wasn’t in a consistent format at boot time. This stops the boot process, and the console recommends booting into failsafe mode to fix it. If you want to do this manually, you can run the bootadm utility with the update-archive command, passing the location where the root file system is mounted:
$ bootadm update-archive -v -R /a
I am hopeful that the OpenSolaris community will enhance the boot archive support to make it more fault tolerant. The current code seems somewhat brittle.
I just came across Rick Moen’s Preventing Domain Expiration article. Rick did a great job with the article, and it’s cool to see that they took my domain-check shell script and implemented it in Perl. The Perl version supports more TLDs, and contains a bit more functionality than the bash implementation. If I get some time in the next few months, I will have to update the domain-check bash script to support the same TLDs as the Perl implementation. Great job Rick and Ben!!
With the introduction of Solaris 10, the Solaris kernel was modified and userland tools were added to detect and report on hardware faults. The fault analysis is handled by the Solaris fault manager, which currently detects and responds to failures in AMD and SPARC CPU modules, PCI and PCIe buses, memory modules and disk drives (the kernel can retire memory pages, CPUs, etc. when it detects faulty hardware), with support for Intel Xeon processors and system sensors (e.g., fan speed, thermal sensors, etc.) on the way.
To see if the fault manager has diagnosed a faulty component on a Solaris 10 or Nevada host, the fmadm utility can be run with the “faulty” option:
$ fmadm faulty
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
The fmadm output includes the suspect component, the state of the component and a unique identifier (UUID) for the fault. Since hardware faults can lead to system and application outages, I like to configure the FMA SNMP agent to send an SNMP trap with the hardware fault details to my NMS station, and I also like to configure my FMA notifier script to send the hardware fault details to my BlackBerry. The emails that are generated by the fmanotifier script look similar to the following:
From matty@lucky Sat Aug 18 14:58:29 2007
Date: Sat, 18 Aug 2007 14:58:29 -0400 (EDT)
From: matty@lucky
To: root@lucky
Subject: Hardware fault on lucky
The fault manager detected a problem with the system hardware.
The fmadm and fmdump utilities can be run to retrieve additional
details on the faults and recommended next course of action.
Fmadm faulty output:
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
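The formatting half of a notifier like this is simple enough to sketch in a few lines of shell. The following is a minimal, hypothetical example (it is not the actual fmanotifier script, and the sendmail delivery path shown in the comment is an assumption) that wraps fault details in an email-style message:

```shell
#!/bin/sh
# Minimal sketch of an FMA email notifier (hypothetical -- not the real
# fmanotifier script). Wraps fault details read from stdin in a mail body.

format_fault_mail() {
    # $1 = hostname; fault details are read from stdin
    printf 'Subject: Hardware fault on %s\n\n' "$1"
    printf 'The fault manager detected a problem with the system hardware.\n'
    printf 'Fmadm faulty output:\n\n'
    cat
}

# On a real system this would be piped to a mailer, e.g.:
# fmadm faulty | format_fault_mail "$(hostname)" | /usr/lib/sendmail root
```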
I absolutely adore FMA, and wish there was something similar for Linux. Hopefully the Linux kernel engineers will provide similar functionality in the future, since this greatly simplifies the process of identifying broken hardware.
While reviewing the system logs on one of my SAN attached servers last week, I noticed hundreds of entries similar to the following:
Aug 28 13:10:14 foo scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Aug 28 13:10:14 foo /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21): Command Timeout on path /pci@9,600000/lpfc@1/fp@0,0 (fp3)
Aug 28 13:10:14 foo scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21):
Aug 28 13:10:14 foo SCSI transport failed: reason 'timeout': retrying command
Since the errors were retryable, it looked like MPxIO was doing its job and retrying requests on one of the other paths. To see if all of the paths were up and operational (the host has four paths to disk), I ran the mpathadm utility with the “list” command and the “LU” option:
$ mpathadm list LU
        /dev/rdsk/c2t600A0B80001FCB370000010446D3D0C1d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000C746D3DA48d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000CD46D3DC1Cd0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B80001FCB370000010646D3D207d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000CA46D3DB88d0s2
                Total Path Count: 4
                Operational Path Count: 4
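With four paths per LU, eyeballing the counts gets old quickly. A short awk filter can flag any LU whose operational path count has dropped below its total; this is a sketch that assumes the mpathadm output format shown above:

```shell
#!/bin/sh
# Flag LUs with fewer operational paths than total paths. Assumes the
# "mpathadm list LU" output format shown above (a device line followed
# by Total Path Count and Operational Path Count lines).

check_paths() {
    awk '/\/dev\/rdsk\// { dev = $1 }
         /Total Path Count/ { total = $NF }
         /Operational Path Count/ {
             if ($NF + 0 < total + 0)
                 print dev ": " $NF " of " total " paths operational"
         }'
}

# Typical usage: mpathadm list LU | check_paths
```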
Since all of the paths were available, I started to wonder if a cable was faulty. After running fcinfo on each of the four HBA ports, I came across the following:
$ fcinfo hba-port -l 10000000c94708f2
HBA Port WWN: 10000000c94708f2
        OS Device Name: /dev/cfg/c6
        Manufacturer: Emulex
        Model: LP9002L
        Firmware Version: 3.93a0
        FCode/BIOS Version: 1.41a4
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 20000000c94708f2
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 14
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 198724
                Invalid CRC Count: 63412
Bingo! The CRC errors were continuously increasing, so I knew that either the HBA or the fibre channel cable was faulty (as a side note, I can’t wait for the FMA project to harden the emlxs and qlc drivers!). During one of my storage training courses, the instructor mentioned that CRC errors are typically associated with bad cables. Once I swapped out the cable that was connected to the port with the errors, the CRC error counts no longer increased, and the scsi_vhci errors stopped! Niiiiiiiiiiiiiiiiiiice!
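To catch this kind of problem before the retries start flooding the logs, the link error statistics can be sampled periodically and compared between runs. Here’s a sketch of the parsing half, assuming the fcinfo output format shown above (the loop in the comment is an example, not a tested monitor):

```shell
#!/bin/sh
# Extract the Invalid CRC Count from "fcinfo hba-port -l" style output.
# Assumes the output format shown above; a monitoring job could store the
# count and diff it between runs to detect a steadily climbing error rate.

crc_count() {
    awk -F': ' '/Invalid CRC Count/ { gsub(/ /, "", $2); print $2 }'
}

# Example usage, looping over every HBA port on the system:
# for wwn in $(fcinfo hba-port | awk '/^HBA Port WWN:/ { print $4 }'); do
#     echo "$wwn: $(fcinfo hba-port -l "$wwn" | crc_count) CRC errors"
# done
```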