
Managing Solaris Volume Manager and DiskSuite hot spares


Hot spare devices are common in most server environments, and are used to automatically replace disk devices when they fail. The replacement process is handled by the RAID controller or Logical Volume Manager (LVM), and is for the most part transparent (other than the I/O that occurs while the new device is synchronized from parity or a RAID-1 mirror). The Solaris Volume Manager (SVM) supports hot sparing through hot spare pools, which are collections of devices reserved for use as hot spares. A pool is associated with one or more meta devices, and is created and managed with the metahs(1m) utility:

$ metahs -a hsp001 c1t5d0s0

$ metahs -a hsp001 c1t6d0s0

This example creates a new hot spare pool named hsp001 and assigns two devices to it. The contents of a hot spare pool can be viewed with metahs's "-i" option:

$ metahs -i
hsp001: 2 hot spares
        Device     Status      Length           Reloc
        c1t6d0s0   Available    35523720 blocks Yes
        c1t5d0s0   Available    35523720 blocks Yes

This displays the devices currently assigned to the pool, along with a status field that indicates whether each drive is actively replacing a faulted device. Once a hot spare pool is created, it needs to be attached to a meta device with the metaparam(1m) utility:

$ metaparam -h hsp001 d5

This will attach the hot spare pool hsp001 to the meta device d5. To see which hot spare pool is attached to a meta device, you can run metastat(1m) and look for the "Hot spare pool" attribute:

$ metastat d5
d5: RAID
    State: Okay
    Hot spare pool: hsp001
    Interlace: 128 blocks
    Size: 106085968 blocks (50 GB)
Original device:
    Size: 106086528 blocks (50 GB)
        Device     Start Block  Dbase        State Reloc  Hot Spare
        c1t1d0s0       6002        No         Okay   Yes
        c1t2d0s0       4926        No         Okay   Yes
        c1t3d0s0       4926        No         Okay   Yes
        c1t4d0s0       4926        No         Okay   Yes
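If you just need the pool name, the metastat output is easy to parse. Here is a small sketch; the awk pattern assumes the "Hot spare pool: <name>" line format shown above:

```shell
# Extract the hot spare pool name from metastat-style output on stdin.
# Assumes the "Hot spare pool: <name>" line format shown above.
hsp_of() {
    awk '/Hot spare pool:/ { print $4 }'
}

# On a live system:
# metastat d5 | hsp_of      # prints hsp001
```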

When a disk fails, the kernel will usually log errors similar to the following:

Jul  1 22:42:52 tigger scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@2,0 (sd3):
Jul  1 22:42:52 tigger  Error for Command: read(10)                Error Level: Fatal
Jul  1 22:42:52 tigger scsi: [ID 107833 kern.notice]    Requested Block: 26672702                  Error Block: 26672733
Jul  1 22:42:52 tigger scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: NM020253
Jul  1 22:42:52 tigger scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Jul  1 22:42:52 tigger scsi: [ID 107833 kern.notice]    ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Jul  1 22:42:52 tigger md_raid: [ID 371651 kern.warning] WARNING: md d5: write error on /dev/dsk/c1t2d0s0
Jul  1 22:42:52 tigger md_raid: [ID 104909 kern.warning] WARNING: md: d5: /dev/dsk/c1t2d0s0 needs maintenance
Jul  1 22:42:53 tigger md_raid: [ID 241980 kern.notice] NOTICE: md: d5: hotspared device /dev/dsk/c1t2d0s0 with /dev/dsk/c1t6d0s0

This output indicates that block device c1t2d0s0 failed, and hot spare c1t6d0s0 took over. Since d5 is a RAID5 meta device, recovery is expensive: the data and parity must be recreated on the hot spare from the remaining members. The rebuild can be monitored with the metastat(1m) command:

$ metastat d5
d5: RAID
    State: Resyncing
    Resync in progress:  1.6% done
    Hot spare pool: hsp001
    Interlace: 128 blocks
    Size: 106085968 blocks (50 GB)
Original device:
    Size: 106086528 blocks (50 GB)
        Device     Start Block  Dbase        State Reloc  Hot Spare
        c1t1d0s0       6002        No         Okay   Yes
        c1t2d0s0       4926        No    Resyncing   Yes c1t6d0s0
        c1t3d0s0       4926        No         Okay   Yes
        c1t4d0s0       4926        No         Okay   Yes
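Rather than re-running metastat by hand, the resync percentage can be pulled out and polled in a loop. This is a rough sketch; the awk pattern assumes the "Resync in progress" line format shown above, and the polling loop is commented out so it is not run blindly:

```shell
# Print the resync percentage from metastat-style output on stdin.
# Assumes the "Resync in progress:  N% done" line format shown above.
resync_pct() {
    awk '/Resync in progress/ { print $4 }'
}

# Poll until the resync line disappears (uncomment on a live system):
# while pct=$(metastat d5 | resync_pct); [ -n "$pct" ]; do
#     echo "d5 resync: $pct done"
#     sleep 60
# done
```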

At some point you will most likely want to replace the failed drive and migrate the data from the hot spare back to the original device. This usually involves replacing the physical drive, updating the Solaris device tree with cfgadm(1m) and devfsadm(1m), and running metadevadm(1m) to update the device relocation information in the state database:

$ metadevadm -v -u c1t2d0s0

Updating Solaris Volume Manager device relocation information for c1t2d0
Old device reloc information:
        id1,sd@SSEAGATE_SX318203LC______LR869054____102424W2
New device reloc information:
        id1,sd@SSEAGATE_SX318203LC______LRA45701____10272998
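The physical swap itself happens outside SVM. As a rough sketch (the disk name here is taken from the example above and would differ on your system, and the cfgadm commands are commented out so they are not run blindly), the cfgadm attachment point can be derived from the ctd name:

```shell
# Hypothetical example disk; substitute the failed device's ctd name.
disk=c1t2d0

# Derive the cfgadm attachment point (cN::dsk/cNtXdY) from the ctd name
# by stripping everything from the "t" onward to get the controller.
ap="${disk%%t*}::dsk/${disk}"
echo "$ap"      # c1::dsk/c1t2d0

# On a live system (run each step deliberately):
# cfgadm -c unconfigure "$ap"   # detach the failed disk
#     (physically replace the drive)
# cfgadm -c configure "$ap"     # bring the new disk online
# devfsadm -C                   # prune stale /dev links
```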

Once these activities complete, the metareplace(1m) utility can be used to "swap" the hot spare with the original device:

$ metareplace -e d5 c1t2d0s0
d5: device c1t2d0s0 is enabled

$ metastat d5
d5: RAID
    State: Resyncing
    Resync in progress:  0.1% done
    Hot spare pool: hsp001
    Interlace: 32 blocks
    Size: 106085968 blocks (50 GB)
Original device:
    Size: 106089600 blocks (50 GB)
        Device     Start Block  Dbase        State Reloc  Hot Spare
        c1t1d0s0       5042        No         Okay   Yes
        c1t2d0s0       3966        No    Resyncing   Yes c1t5d0s0
        c1t3d0s0       3966        No         Okay   Yes
        c1t4d0s0       3966        No         Okay   Yes

Once the data is recreated on the original device, the metastat(1m) output returns to normal:

$ metastat d5
d5: RAID
    State: Okay
    Hot spare pool: hsp001
    Interlace: 32 blocks
    Size: 106085968 blocks (50 GB)
Original device:
    Size: 106089600 blocks (50 GB)
        Device     Start Block  Dbase        State Reloc  Hot Spare
        c1t1d0s0       5042        No         Okay   Yes
        c1t2d0s0       3966        No         Okay   Yes
        c1t3d0s0       3966        No         Okay   Yes
        c1t4d0s0       3966        No         Okay   Yes

The SVM administrator's guide has a ton of information on recoverability and hot spares, and is an extremely thorough piece of documentation (it is also valuable when preparing for the SVM certification). If you have questions or comments on the article, please feel free to e-mail the author.
