Debugging fibre channel errors on Solaris hosts

While reviewing the system logs on one of my SAN-attached servers last week, I noticed hundreds of entries similar to the following:

Aug 28 13:10:14 foo scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Aug 28 13:10:14 foo        /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21): Command Timeout on path /pci@9,600000/lpfc@1/fp@0,0 (fp3)
Aug 28 13:10:14 foo scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 (ssd21):
Aug 28 13:10:14 foo        SCSI transport failed: reason 'timeout': retrying command
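With hundreds of these entries, it can help to tally the timeouts per path to see whether they all point at the same one. Here is a minimal sketch, assuming the messages follow the format shown above; the function name count_timeouts_by_path and the sample text are my own, for illustration only.

```python
import re
from collections import Counter

# Matches the "Command Timeout on path <device path> (<instance>)" portion
# of a scsi_vhci warning. Assumes the message format shown in the log above.
TIMEOUT_RE = re.compile(r"Command Timeout on path (\S+) \((\w+)\)")


def count_timeouts_by_path(log_text):
    """Return a Counter mapping (device path, instance) to timeout count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample; in practice, read the text from the system log.
    sample = (
        "Aug 28 13:10:14 foo  /scsi_vhci/ssd@g600a0b80001fcb370000010646d3d207 "
        "(ssd21): Command Timeout on path /pci@9,600000/lpfc@1/fp@0,0 (fp3)\n"
    )
    for (path, instance), count in count_timeouts_by_path(sample).most_common():
        print(f"{count:6d}  {path} ({instance})")
```

If nearly all of the timeouts land on a single fp instance, that path is the first place to look.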

Since the errors were retryable, it looked like MPxIO was doing its job and retrying requests on one of the other paths. To see if all of the paths were up and operational (the host has four paths to each disk), I ran the mpathadm utility with the “list” command and the “LU” option:

$ mpathadm list LU

        /dev/rdsk/c2t600A0B80001FCB370000010446D3D0C1d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000C746D3DA48d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000CD46D3DC1Cd0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B80001FCB370000010646D3D207d0s2
                Total Path Count: 4
                Operational Path Count: 4
        /dev/rdsk/c2t600A0B8000216462000000CA46D3DB88d0s2
                Total Path Count: 4
                Operational Path Count: 4
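On a host with many LUs, eyeballing this output gets tedious. The check can be sketched in a few lines of Python, assuming the output layout shown above; the function name degraded_lus is my own, and the script is a rough sketch rather than a polished tool.

```python
def degraded_lus(mpathadm_output):
    """Return (device, total, operational) for every LU that has fewer
    operational paths than total paths.

    Assumes the "mpathadm list LU" layout shown above: a /dev/rdsk line
    followed by "Total Path Count" and "Operational Path Count" lines.
    """
    results = []
    device = None
    total = None
    for raw_line in mpathadm_output.splitlines():
        line = raw_line.strip()
        if line.startswith("/dev/rdsk/"):
            device, total = line, None
        elif line.startswith("Total Path Count:"):
            total = int(line.split(":", 1)[1])
        elif line.startswith("Operational Path Count:"):
            operational = int(line.split(":", 1)[1])
            if device is not None and total is not None and operational < total:
                results.append((device, total, operational))
    return results


if __name__ == "__main__":
    import sys

    # Usage (assumed): mpathadm list LU | python check_paths.py
    for device, total, operational in degraded_lus(sys.stdin.read()):
        print(f"{device}: {operational} of {total} paths operational")
```

An empty result means every LU reported all of its paths as operational, which is what I saw here.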

Since all of the paths were available, I started to wonder if a cable was faulty. After running fcinfo on each of the four HBA ports, I came across the following:

$ fcinfo hba-port -l 10000000c94708f2

HBA Port WWN: 10000000c94708f2
        OS Device Name: /dev/cfg/c6
        Manufacturer: Emulex
        Model: LP9002L
        Firmware Version: 3.93a0
        FCode/BIOS Version: 1.41a4
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 
        Current Speed: 2Gb 
        Node WWN: 20000000c94708f2
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 14
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 198724
                Invalid CRC Count: 63412

Bingo! The CRC errors were continuously increasing, so I knew that either the HBA or the fibre channel cable was faulty (as a side note, I can’t wait for the FMA project to harden the emlxs and qlc drivers!). During one of my storage training courses, the instructor mentioned that CRC errors are typically associated with bad cables. Once I swapped out the cable that was connected to the port with the errors, the CRC error counts no longer increased, and the scsi_vhci errors stopped! Niiiiiiiiiiiiiiiiiiice!
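The counters in the Link Error Statistics section are cumulative, so a single sample only shows that errors occurred at some point. To tell whether they are still climbing, compare two samples taken a few minutes apart. Here is a minimal sketch, assuming the fcinfo output layout shown above; the function names are my own, and the second sample's CRC value is made up for illustration.

```python
# Counter names as they appear under "Link Error Statistics" above.
LINK_ERROR_FIELDS = (
    "Link Failure Count",
    "Loss of Sync Count",
    "Loss of Signal Count",
    "Primitive Seq Protocol Error Count",
    "Invalid Tx Word Count",
    "Invalid CRC Count",
)


def parse_link_errors(fcinfo_output):
    """Extract the link error counters from fcinfo hba-port -l output."""
    stats = {}
    for line in fcinfo_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name in LINK_ERROR_FIELDS:
            stats[name] = int(value)
    return stats


def increasing_counters(before, after):
    """Return {counter name: increase} for counters that grew between samples."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


if __name__ == "__main__":
    first = parse_link_errors("Loss of Sync Count: 14\nInvalid CRC Count: 63412\n")
    # Hypothetical second sample, taken a few minutes later.
    second = parse_link_errors("Loss of Sync Count: 14\nInvalid CRC Count: 63500\n")
    for name, delta in increasing_counters(first, second).items():
        print(f"{name} increased by {delta}")
```

After a cable swap, the same comparison should come back empty if the replacement fixed the problem.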

4 thoughts on “Debugging fibre channel errors on Solaris hosts”

  1. Hi,

    I am playing with Solaris 10 build 70b on x86.

    I have 2 Emulex Fibre cards and can see my EMC SAN disks. However, I am not able to configure multipath.

    I have edited scsi_vhci.conf (correctly, I think) to add our EMC 5671 device and enabled multipathing in fp.conf.

    But I do not see any output from “mpathadm list LU” .

    Any ideas?

  2. Good fcinfo tips. I’m going to start monitoring for Link Error Stats and see what I find.

    “During one of my storage training courses…”

    Can you please tell me about this course? I want to get some SAN and Solaris storage training. Thanks.

  3. Thank you so much!

    After hours of searching for the root cause of my problems, I found it was a problem with the GBIC port.

    Regards
