Debugging fibre channel errors on Solaris hosts


While reviewing the system logs on one of my SAN-attached servers last week, I noticed hundreds of entries similar to the following:

Aug 28 13:10:14 foo scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Aug 28 13:10:14 foo     /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21): Command Timeout on path /pci@9,600000/lpfc@1/fp@0,0 (fp3)
Aug 28 13:10:14 foo scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21):
Aug 28 13:10:14 foo     SCSI transport failed: reason 'timeout': retrying command
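With hundreds of these warnings in the log, it helps to know whether the timeouts pile up on a single path. Here is a rough sketch of how that tally could be done in Python. It assumes the warnings look like the excerpt above; the function name count_timeouts_by_path is my own invention, and the sample text is just the entry shown above.

```python
import re
from collections import Counter

# Matches the "Command Timeout on path <device path> (<instance>)" portion
# of a scsi_vhci warning, in the format shown in the log excerpt.
TIMEOUT_RE = re.compile(r"Command Timeout on path (\S+) \((\w+)\)")


def count_timeouts_by_path(log_text):
    """Return a Counter mapping (device path, instance) to timeout count."""
    # findall() yields one (path, instance) tuple per matching warning.
    return Counter(TIMEOUT_RE.findall(log_text))


# Sample input: the log entry from this post.
SAMPLE_LOG = (
    "Aug 28 13:10:14 foo scsi: [ID 243001 kern.warning] WARNING: "
    "/scsi_vhci (scsi_vhci0):\n"
    "Aug 28 13:10:14 foo /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 "
    "(ssd21): Command Timeout on path /pci@9,600000/lpfc@1/fp@0,0 (fp3)\n"
)

# Print the busiest paths first.
for (path, instance), total in count_timeouts_by_path(SAMPLE_LOG).most_common():
    print(f"{total:6d}  {path} ({instance})")
```

In practice you would pass in the contents of the system log (on Solaris that is normally /var/adm/messages, though that location is an assumption about your setup). If nearly all of the timeouts land on one fp instance, that path is the one to look at first.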

Since the errors were retryable, it looked like MPxIO was doing its job and retrying requests on one of the other paths. To see if all of the paths were up and operational (the host has four paths to each disk), I ran the mpathadm utility with the “list” command and the “LU” option:

$ mpathadm list LU

/dev/rdsk/c2t600A0B80001FCB370000010446D3D0C1d0s2
        Total Path Count: 4
        Operational Path Count: 4
/dev/rdsk/c2t600A0B8000216462000000C746D3DA48d0s2
        Total Path Count: 4
        Operational Path Count: 4
/dev/rdsk/c2t600A0B8000216462000000CD46D3DC1Cd0s2
        Total Path Count: 4
        Operational Path Count: 4
/dev/rdsk/c2t600A0B80001FCB370000010646D3D207d0s2
        Total Path Count: 4
        Operational Path Count: 4
/dev/rdsk/c2t600A0B8000216462000000CA46D3DB88d0s2
        Total Path Count: 4
        Operational Path Count: 4
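Eyeballing five LUs is easy, but on a host with many more it is simpler to let a script flag any LU whose operational path count is below its total. The following is a minimal sketch, assuming the output keeps the "Total Path Count" / "Operational Path Count" layout shown above. The function name degraded_lus is hypothetical, and the second sample entry has been altered by hand to show a degraded LU (it is not real output from this host).

```python
import re

# One LU entry: the device name followed by its two path counts.
LU_RE = re.compile(
    r"(/dev/rdsk/\S+)\s+"
    r"Total Path Count:\s*(\d+)\s+"
    r"Operational Path Count:\s*(\d+)"
)


def degraded_lus(mpathadm_output):
    """Return (device, total, operational) for LUs that have lost a path."""
    degraded = []
    for device, total, operational in LU_RE.findall(mpathadm_output):
        if int(operational) < int(total):
            degraded.append((device, int(total), int(operational)))
    return degraded


# First entry is taken from the output above; the second is a made-up
# example with one path down, included only to exercise the check.
SAMPLE_OUTPUT = """\
/dev/rdsk/c2t600A0B80001FCB370000010446D3D0C1d0s2
        Total Path Count: 4
        Operational Path Count: 4
/dev/rdsk/c2t600A0B80001FCB370000010646D3D207d0s2
        Total Path Count: 4
        Operational Path Count: 3
"""

for device, total, operational in degraded_lus(SAMPLE_OUTPUT):
    print(f"{device}: {operational} of {total} paths operational")
```

An empty result means every LU still has all of its paths, which is what the real output above shows.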

Since all of the paths were available, I started to wonder if a cable was faulty. After running fcinfo on each of the four HBA ports, I came across the following:

$ fcinfo hba-port -l 10000000c94708f2

HBA Port WWN: 10000000c94708f2
        OS Device Name: /dev/cfg/c6
        Manufacturer: Emulex
        Model: LP9002L
        Firmware Version: 3.93a0
        FCode/BIOS Version: 1.41a4
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 20000000c94708f2
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 14
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 198724
                Invalid CRC Count: 63412

Bingo! The CRC errors were continuously increasing, so I knew that either the HBA or the fibre channel cable was faulty (as a side note, I can’t wait for the FMA project to harden the emlxs and qlc drivers!). During one of my storage training courses, the instructor mentioned that CRC errors are typically associated with bad cables. Once I swapped out the cable that was connected to the port with the errors, the CRC error counts stopped increasing, and the scsi_vhci errors stopped! Niiiiiiiiiiiiiiiiiiice!
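The key observation was that the counters were still growing, not just that they were non-zero. One way to check this without squinting at numbers is to capture the fcinfo output twice, a few minutes apart, and compare the link error statistics. The sketch below assumes the counter names match the output shown earlier. The helper names (parse_link_errors, growing_counters) are my own, and the second snapshot is invented purely for illustration.

```python
import re

# Counter names as they appear under "Link Error Statistics" in fcinfo output.
COUNTERS = (
    "Link Failure Count",
    "Loss of Sync Count",
    "Loss of Signal Count",
    "Primitive Seq Protocol Error Count",
    "Invalid Tx Word Count",
    "Invalid CRC Count",
)


def parse_link_errors(fcinfo_output):
    """Extract the link error counters from 'fcinfo hba-port -l' output."""
    stats = {}
    for name in COUNTERS:
        match = re.search(re.escape(name) + r":\s*(\d+)", fcinfo_output)
        if match:
            stats[name] = int(match.group(1))
    return stats


def growing_counters(before, after):
    """Return {counter: increase} for counters that rose between snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] > before[name]
    }


# First snapshot: values from the output in this post.
FIRST = "Loss of Sync Count: 14 Invalid Tx Word Count: 198724 Invalid CRC Count: 63412"
# Second snapshot: hypothetical values, made up to show the comparison.
SECOND = "Loss of Sync Count: 14 Invalid Tx Word Count: 198900 Invalid CRC Count: 63500"

for name, delta in growing_counters(
    parse_link_errors(FIRST), parse_link_errors(SECOND)
).items():
    print(f"{name} rose by {delta}")
```

After a cable swap, the same comparison should come back empty for the port in question, which matches what I saw once the bad cable was replaced.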

This article was posted by Matty on 2007-09-01 09:00:00 -0400 EDT