Brandz support for Solaris 8 and Linux 2.6 kernels

I was pleasantly surprised to find out this week that the brandz framework is being extended to support Linux 2.6 kernels, as well as binaries that were built to run on Solaris 8 hosts! This has lots and lots of potential, and would be a blessing for one of my previous employers (they have a lot of Solaris 8 hosts). I would like to send props out to the folks who are making this happen. ;) Niiiiiiiice!

Updating the Solaris boot archive from single user mode

On more than one occasion now, I have run into problems where the Solaris boot archive wasn't in a consistent state at boot time. This stops the boot process, and the console recommends booting into failsafe mode to fix it. If you want to do this manually, you can run the bootadm utility with the update-archive subcommand and point the -R option at the location where the root file system is mounted:

$ bootadm update-archive -v -R /a
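
For reference, the full failsafe procedure on a SPARC host boils down to something like the following (a rough sketch; the root device is a placeholder, and failsafe mode will usually offer to mount the detected root file system at /a for you):

ok boot -F failsafe
$ mount /dev/dsk/c0t0d0s0 /a
$ bootadm update-archive -v -R /a
$ init 6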

I am hopeful that the OpenSolaris community will enhance the boot archive support to make it more fault tolerant. The current code seems somewhat brittle.

Preventing domain expiration article

I just came across Rick Moen's Preventing Domain Expiration article. Rick did a great job with the article, and it's cool to see that they took my domain-check shell script and implemented it in Perl. The Perl version supports more TLDs and contains a bit more functionality than the bash implementation. If I get some time in the next few months, I will have to update the domain-check bash script to support the same TLDs as the Perl implementation. Great job Rick and Ben!!

Getting notified when hardware breaks

With the introduction of Solaris 10, the Solaris kernel was modified and userland tools were added to detect and report on hardware faults. The fault analysis is handled by the Solaris fault manager, which currently detects and responds to failures in AMD and SPARC CPU modules, PCI and PCIe buses, memory modules and disk drives (when it detects faulty hardware, the kernel can retire memory pages, CPUs, etc.), with support for Intel Xeon processors and system sensors (e.g., fan speed and thermal sensors) on the way.

To see if the fault manager has diagnosed a faulty component on a Solaris 10 or Nevada host, the fmadm utility can be run with the “faulty” option:

$ fmadm faulty

   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------

The fmadm output includes the suspect component, the state of the component and a unique identifier (UUID) for the fault. Since hardware faults can lead to system and application outages, I like to configure the FMA SNMP agent to send an SNMP trap with the hardware fault details to my NMS station, and I also like to configure my FMA notifier script to forward the fault details to my BlackBerry (a rough sketch of the notifier idea appears at the end of this entry). The emails that are generated by the fmanotifier script look similar to the following:

From matty@lucky Sat Aug 18 14:58:29 2007
Date: Sat, 18 Aug 2007 14:58:29 -0400 (EDT)
From: matty@lucky
To: root@lucky
Subject: Hardware fault on lucky

The fault manager detected a problem with the system hardware.
The fmadm and fmdump utilities can be run to retrieve additional
details on the faults and recommended next course of action. 

Fmadm faulty output:

   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@4
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/lpfc@5
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@2
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/pci@3
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded dev:////pci@8,700000/scsi@1
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=emlxs/mod-id=101
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=glm/mod-id=146
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
degraded mod:///mod-name=pci_pci/mod-id=132
         a0461e5e-4356-ca7b-ee83-c66816b9caba
-------- ----------------------------------------------------------------------
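
As the notification text suggests, additional detail on a specific fault can be pulled up by running fmdump with the UUID from the fmadm output. The verbose listing should also include the Sun message identifier (SUNW-MSG-ID), which points at the recommended repair action:

$ fmdump -v -u a0461e5e-4356-ca7b-ee83-c66816b9caba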

I absolutely adore FMA, and wish there were something similar for Linux. Hopefully the Linux kernel engineers will provide similar functionality in the future, since this greatly simplifies the process of identifying broken hardware.
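
As for the notifier script itself, it doesn't need to be anything fancy. Here is a minimal cron-friendly sketch of the idea, not the actual fmanotifier implementation (the script name, the mail recipient and the two-line header check are just placeholders/assumptions):

#!/bin/sh
# fmnotify.sh -- hypothetical sketch: mail the fmadm faulty output when faults exist

RECIPIENT="root"    # placeholder; point this at a pager or BlackBerry address
OUTPUT=`/usr/sbin/fmadm faulty 2>/dev/null`

# fmadm appears to print a two-line header even when nothing is faulty, so
# anything beyond that is treated as an active fault
LINES=`echo "$OUTPUT" | wc -l`

if [ $LINES -gt 2 ]; then
    echo "$OUTPUT" | mailx -s "Hardware fault on `hostname`" "$RECIPIENT"
fi

Running something like this out of root's crontab every few minutes is enough to get the fault details to a pager or BlackBerry shortly after the fault manager flags a problem.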

Debugging fibre channel errors on Solaris hosts

While reviewing the system logs on one of my SAN attached servers last week, I noticed hundreds of entries similar to the following:

Aug 28 13:10:14 foo scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Aug 28 13:10:14 foo        /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21):
 Command Timeout on path /pci@9,600000/lpfc@1/fp@0,0 (fp3)
Aug 28 13:10:14 foo scsi: [ID 107833 kern.warning] 
 WARNING: /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21):
Aug 28 13:10:14 foo        SCSI transport failed: reason 'timeout': retrying command

Since the errors were retryable, it looked like MPxIO was doing its job and retrying requests on one of the other paths. To see if all of the paths were up and operational (the host has four paths to disk), I ran the mpathadm utility with the “list” command and the “LU” option:

$ mpathadm list LU

        /dev/rdsk/c2t600A0B80001FCB370000010446D3D0C1d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000C746D3DA48d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000CD46D3DC1Cd0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B80001FCB370000010646D3D207d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000CA46D3DB88d0s2
                Total Path Count: 4
                Operational Path Count: 4
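
Had any of those counts come back short, the “show” command can be used to display the state of each individual path to a logical unit (something along these lines; the LUN below is the one from the scsi_vhci messages above):

$ mpathadm show lu /dev/rdsk/c2t600A0B80001FCB370000010646D3D207d0s2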

Since all of the paths were available, I started to wonder if a cable was faulty. After running fcinfo on each of the four HBA ports, I came across the following:

$ fcinfo hba-port -l 10000000c94708f2

HBA Port WWN: 10000000c94708f2
        OS Device Name: /dev/cfg/c6
        Manufacturer: Emulex
        Model: LP9002L
        Firmware Version: 3.93a0
        FCode/BIOS Version: 1.41a4
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 
        Current Speed: 2Gb 
        Node WWN: 20000000c94708f2
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 14
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 198724
                Invalid CRC Count: 63412
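
A single snapshot doesn't tell you whether the counters are still climbing; a quick shell loop makes it easy to watch them while you poke at the hardware (the WWN is the port from the output above; adjust the sleep interval to taste):

$ while true; do fcinfo hba-port -l 10000000c94708f2 | grep Invalid; sleep 10; done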

Bingo! The CRC errors were continuously increasing, so I knew that either the HBA or the fibre channel cable was faulty (as a side note, I can't wait for the FMA project to harden the emlxs and qlc drivers!). During one of my storage training courses, the instructor mentioned that CRC errors are typically associated with bad cables. Once I swapped out the cable that was connected to the port with the errors, the CRC error counts no longer increased, and the scsi_vhci errors stopped! Niiiiiiiiiiiiiiiiiiice!