Dynamically discovering Clariion LUNs with Linux

One of the Red Hat Enterprise Linux 4 Update 3 servers I support had several LUNs added to it this week. The server uses QLogic 2340 HBAs, which allowed me to use the ql-dynamic-tgt-lun-disc.sh script from the QLogic support site to dynamically discover the new LUNs:

$ ql-dynamic-tgt-lun-disc.sh

Scanning HOST: host1
....
Scanning HOST: host2
.............
Scanning HOST: host3
....

Found

 1:0:0:6
 1:0:0:8
 1:0:1:6
 1:0:1:8
 3:0:0:6
 3:0:0:8
 3:0:1:6
 3:0:1:8
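
Each entry in the Found list is a SCSI address in host:channel:target:LUN form. On 2.6 kernels a single address can also be probed by hand by writing "channel target lun" to the host's scan file in sysfs. The sketch below only prints that command for a given address rather than running it. The helper name scan_command is hypothetical, and this is not necessarily what the QLogic script does internally:

```shell
# scan_command is a hypothetical helper, not part of the QLogic script.
# It turns a host:channel:target:lun address into the sysfs write that
# asks the kernel to probe that one address.
scan_command() {
    # Split the address on ":" (IFS is changed only for this read).
    IFS=: read -r host channel target lun <<EOF
$1
EOF
    printf 'echo "%s %s %s" > /sys/class/scsi_host/host%s/scan\n' \
        "$channel" "$target" "$lun" "$host"
}

# Print the command for the first address in the Found list.
scan_command 1:0:0:6
```

Run as root, the printed command would ask host1 to probe channel 0, target 0, LUN 6. Writing "- - -" instead scans every channel, target and LUN on that host.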

After the script completed device discovery, two devices were visible for each LUN in the output of fdisk, since there are multiple paths to the disk storage. To allow us to take advantage of both paths, we needed to create an EMC power device (we are using EMC PowerPath instead of dm-multipath). This was accomplished by running powermt with the “config” option:

$ powermt config

Once the config operation completed, the power device was visible:

$ powermt display dev=emcpoweri

Pseudo name=emcpoweri
CLARiiON ID=APM [Foo]
Logical device ID=0987 [LUN 80 - DS Foo]
state=alive; policy=CLAROpt; priority=0; queued-IOs=0
Owner: default=SP A, current=SP B

< ..... >

The device was then available for general-purpose use. I have bumped into numerous kernel bugs in the past that prevented me from dynamically discovering storage, so this was a welcome change. Having used both QLogic and Emulex adaptors on Solaris and Linux hosts, I think I still prefer Emulex adaptors over QLogic adaptors.
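
One detail worth watching in the powermt output is the Owner line: in the display above, the default owner is SP A but the current owner is SP B. A minimal sketch for flagging that condition, assuming the Owner line always has the format shown above (owner_mismatch is a hypothetical helper name):

```shell
# owner_mismatch is a hypothetical helper. It reads "powermt display"
# output on stdin and prints one line for every device whose current
# storage processor differs from its default one.
owner_mismatch() {
    # Reduce each Owner line to "default|current", then compare the two.
    sed -n 's/^Owner: default=\(.*\), current=\(.*\)$/\1|\2/p' |
    while IFS='|' read -r def cur; do
        if [ "$def" != "$cur" ]; then
            echo "default=$def current=$cur"
        fi
    done
}

# Usage: powermt display dev=emcpoweri | owner_mismatch
```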

Checking devices for bad sectors

I recently had a friend contact me because he was getting an error similar to the following in his Red Hat Linux system log (I didn’t save the error while debugging the problem, so I grabbed this one from the web):

kernel: disk I/O error: dev 08:01, sector 25590410
kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
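
The dev field in the first line is, as far as I know, the device's major and minor number printed in hex. For major 8 (SCSI disks) each disk gets 16 minors, so the minor divided by 16 selects the disk and the remainder is the partition. The sketch below decodes that for the first 16 sd disks only; decode_dev is a hypothetical helper name:

```shell
# decode_dev is a hypothetical helper. It maps a hex "major:minor" pair
# from a kernel message to a device name, for major 8 (sd) only.
decode_dev() {
    major=$((0x${1%%:*}))   # text before the colon, read as hex
    minor=$((0x${1##*:}))   # text after the colon, read as hex
    if [ "$major" -ne 8 ]; then
        echo "unhandled major $major" >&2
        return 1
    fi
    disk=$((minor / 16))    # 16 minors per disk
    part=$((minor % 16))    # 0 means the whole disk
    letter=$(printf '%s' abcdefghijklmnop | cut -c "$((disk + 1))")
    if [ "$part" -eq 0 ]; then
        echo "/dev/sd$letter"
    else
        echo "/dev/sd$letter$part"
    fi
}

decode_dev 08:01
```

For the sample message above this prints /dev/sda1.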

At first glance, I thought the disk drive had failed, and told him to back up all of his data to safe media. Once the data was backed up, I decided to run a full SMART self-test on the disk drive to check the drive’s health:

$ smartctl -t long /dev/hda

smartctl version 5.36 [sparc-sun-solaris2.10] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 84 minutes for test to complete.
Test will complete after Sun Aug 27 19:41:01 2006

Use smartctl -X to abort test.
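
Once the test has finished, its result shows up in the self-test log (smartctl -l selftest), and the reallocated sector count is in the attribute table (smartctl -A). A small sketch for pulling out the raw reallocated count, assuming the usual ATA attribute table layout where the raw value is the tenth column (realloc_count is a hypothetical helper name):

```shell
# realloc_count is a hypothetical helper. It reads "smartctl -A" output
# on stdin and prints the raw value of the Reallocated_Sector_Ct
# attribute (assumed to be the tenth whitespace-separated column).
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage: smartctl -A /dev/hda | realloc_count
```

Recording this number before and after a test or a reboot makes it easy to see whether the drive has remapped anything in between.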

The SMART long test completed successfully, but dd was failing when attempting to read sector 25590410 (we weren’t using the continue-on-error option). Since all modern disk drive controllers contain logic to remap faulty sectors when they are detected, and the number of reallocated sectors reported by smartctl was well below the manufacturer’s failure threshold, I wondered if the sector was “stuck.” To test my theory, I booted from a Linux CD and ran the Linux badblocks utility on the disk partition (I didn’t save the badblocks output from his drive, so the following is a sample from another machine):

$ badblocks -sv /dev/hda

Checking blocks 0 to 8192016
Checking for bad blocks (read-only test):    222400/  8192016

Badblocks completed successfully, and an fsck of the file system reported that the file system was clean. We also used the ext3 file system debugger to see if a file was using the block; it wasn’t, so my theory is that the errors occurred when a new file was being created. Next we rebooted the system, and the number of reallocated sectors reported by smartmontools had increased by one. This completely surprised me, and I am still confused about why the disk controller didn’t remap the sector while we were booted from the disk drive. I had fun debugging this problem and learning how IDE disk drives work.
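
For reference, the block lookup can be sketched as follows. The kernel reports a 512-byte sector, while debugfs works in file system blocks counted from the start of the partition, so the sector has to be converted first. The partition start (63) and block size (4096) below are placeholder values, not taken from the affected machine, /dev/hda1 is an assumed partition name, and the sketch assumes the logged sector is an absolute offset on the disk:

```shell
# sector_to_fs_block is a hypothetical helper. Arguments:
#   $1 = 512-byte sector number on the disk
#   $2 = first sector of the partition (placeholder: 63)
#   $3 = file system block size in bytes (placeholder: 4096)
sector_to_fs_block() {
    sector=$1
    part_start=$2
    block_size=$3
    echo $(( (sector - part_start) * 512 / block_size ))
}

sector_to_fs_block 25590410 63 4096

# Reading just the suspect sector with dd:
#   dd if=/dev/hda of=/dev/null bs=512 skip=25590410 count=1
# Asking debugfs which inode, if any, owns the resulting block:
#   debugfs -R "icheck <block>" /dev/hda1
```

With these placeholder values the function prints 3198793. If icheck reports an inode, debugfs's ncheck command can map that inode back to a file name.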