Hot spare devices are common in most server environments, and are used to automatically replace disk devices when they fail. The replacement process is handled by the RAID controller or Logical Volume Manager (LVM), and is for the most part transparent (other than the I/O that occurs while the new device is synchronized from parity or a RAID-1 mirror). The Solaris Volume Manager (SVM) supports hot sparing through hot spare pools, which are collections of devices dedicated to serving as hot spares. A pool is associated with one or more meta devices, and is populated with the metahs(1m) utility:
$ metahs -a hsp001 c1t5d0s0
$ metahs -a hsp001 c1t6d0s0
This example creates a new hot spare pool named hsp001 and assigns two devices to it. We can view the contents of the hot spare pools with the "-i" option to metahs:
$ metahs -i
hsp001: 2 hot spares
Device     Status      Length            Reloc
c1t6d0s0   Available   35523720 blocks   Yes
c1t5d0s0   Available   35523720 blocks   Yes
This displays both devices that are currently assigned to the pool, and includes a status field to indicate if the drive is actively being used to replace a faulted device. Once a hot spare pool is created, it needs to be attached to a meta device with the metaparam(1m) utility:
$ metaparam -h hsp001 d5
This will attach the hot spare pool hsp001 to the meta device d5. To see which hot spare pool is attached to a meta device, you can run metastat(1m) and look for the "Hot spare pool" attribute:
$ metastat d5
d5: RAID
State: Okay
Hot spare pool: hsp001
Interlace: 128 blocks
Size: 106085968 blocks (50 GB)
Original device:
Size: 106086528 blocks (50 GB)
Device     Start Block  Dbase  State      Reloc  Hot Spare
c1t1d0s0   6002         No     Okay       Yes
c1t2d0s0   4926         No     Okay       Yes
c1t3d0s0   4926         No     Okay       Yes
c1t4d0s0   4926         No     Okay       Yes
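The "Hot spare pool" check described above can also be scripted. The following is a minimal sketch, assuming metastat(1m) output in the format shown above; hot_spare_pool is a hypothetical helper name chosen for illustration and is not part of SVM:

```shell
#!/bin/sh
# hot_spare_pool (hypothetical helper, not an SVM command):
# reads metastat(1m) output on stdin and prints the name that follows
# "Hot spare pool:". Returns non-zero if no such line is present.
hot_spare_pool() {
    awk -F': *' '
        /Hot spare pool:/ { print $2; found = 1; exit }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on a Solaris host (assumes metastat is in PATH):
#   metastat d5 | hot_spare_pool || echo "d5 has no hot spare pool"
```

Reading from stdin rather than calling metastat directly keeps the helper easy to try against saved output.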
When a disk fails, the kernel will usually log errors similar to the following:
Jul 1 22:42:52 tigger scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@2,0 (sd3):
Jul 1 22:42:52 tigger Error for Command: read(10) Error Level: Fatal
Jul 1 22:42:52 tigger scsi: [ID 107833 kern.notice] Requested Block: 26672702 Error Block: 26672733
Jul 1 22:42:52 tigger scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: NM020253
Jul 1 22:42:52 tigger scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jul 1 22:42:52 tigger scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Jul 1 22:42:52 tigger md_raid: [ID 371651 kern.warning] WARNING: md d5: write error on /dev/dsk/c1t2d0s0
Jul 1 22:42:52 tigger md_raid: [ID 104909 kern.warning] WARNING: md: d5: /dev/dsk/c1t2d0s0 needs maintenance
Jul 1 22:42:53 tigger md_raid: [ID 241980 kern.notice] NOTICE: md: d5: hotspared device /dev/dsk/c1t2d0s0 with /dev/dsk/c1t6d0s0
This output indicates that block device c1t2d0s0 failed, and hot spare c1t6d0s0 took over. Because d5 is a RAID-5 meta device, recovery is expensive: the data and parity that lived on the failed component have to be recreated on the hot spare from the remaining members. We can monitor the rebuild process by running the metastat(1m) command:
$ metastat d5
d5: RAID
State: Resyncing
Resync in progress: 1.6% done
Hot spare pool: hsp001
Interlace: 128 blocks
Size: 106085968 blocks (50 GB)
Original device:
Size: 106086528 blocks (50 GB)
Device     Start Block  Dbase  State      Reloc  Hot Spare
c1t1d0s0   6002         No     Okay       Yes
c1t2d0s0   4926         No     Resyncing  Yes    c1t6d0s0
c1t3d0s0   4926         No     Okay       Yes
c1t4d0s0   4926         No     Okay       Yes
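The "Hot Spare" column in this output shows which component is currently covered by a spare. As a rough sketch, the pairing can be pulled out with awk. This assumes the RAID-5 device table layout shown above, where a row gains a sixth field only when a hot spare is in use (other meta device types print different columns); hotspared_components is a hypothetical helper name:

```shell
#!/bin/sh
# hotspared_components (hypothetical helper, not an SVM command):
# reads metastat(1m) output on stdin and prints "component spare" for
# every device row that has a sixth (Hot Spare) field filled in.
# Assumes the RAID-5 table layout shown in this article.
hotspared_components() {
    awk '
        NF == 6 && $1 ~ /^c[0-9]+(t[0-9]+)?d[0-9]+s[0-9]+$/ {
            print $1, $6
        }
    '
}

# Usage on a Solaris host (assumes metastat is in PATH):
#   metastat d5 | hotspared_components
```

For the output above, this would be expected to print the failed slice followed by the spare that replaced it.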
At some point in the future, you will most likely want to replace the drive, and migrate the data from the hot spare back to the original device. This usually requires replacing the physical drive, updating the Solaris device tree with cfgadm(1m) and devfsadm(1m), and running metadevadm(1m) to update the device relocation data in the meta state database:
$ metadevadm -v -u c1t2d0s0
Updating Solaris Volume Manager device relocation information for c1t2d0
Old device reloc information:
id1,sd@SSEAGATE_SX318203LC______LR869054____102424W2
New device reloc information:
id1,sd@SSEAGATE_SX318203LC______LRA45701____10272998
Once these activities complete, the metareplace(1m) utility can be used to "swap" the hot spare with the original device:
$ metareplace -e d5 c1t2d0s0
d5: device c1t2d0s0 is enabled
$ metastat d5
d5: RAID
State: Resyncing
Resync in progress: 0.1% done
Hot spare pool: hsp001
Interlace: 32 blocks
Size: 106085968 blocks (50 GB)
Original device:
Size: 106089600 blocks (50 GB)
Device     Start Block  Dbase  State      Reloc  Hot Spare
c1t1d0s0   5042         No     Okay       Yes
c1t2d0s0   3966         No     Resyncing  Yes    c1t5d0s0
c1t3d0s0   3966         No     Okay       Yes
c1t4d0s0   3966         No     Okay       Yes
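Rather than rerunning metastat(1m) by hand, the "Resync in progress" line can be polled from a script. This is a minimal sketch that assumes the output format shown above; resync_percent and wait_for_resync are hypothetical helper names, and the 60 second default interval is an arbitrary choice:

```shell
#!/bin/sh
# resync_percent (hypothetical helper): reads metastat(1m) output on
# stdin and prints the number from "Resync in progress: N% done".
# Prints nothing if no resync line is present.
resync_percent() {
    awk '/Resync in progress:/ { sub(/%/, "", $4); print $4; exit }'
}

# wait_for_resync (hypothetical helper): polls metastat for the given
# meta device until no resync line is reported. The loop also ends if
# metastat produces no output, so check the State line afterwards.
wait_for_resync() {
    md=$1
    interval=${2:-60}    # seconds between polls (assumed default)
    while pct=$(metastat "$md" | resync_percent) && [ -n "$pct" ]; do
        echo "$md: resync ${pct}% done"
        sleep "$interval"
    done
}

# Usage on a Solaris host:
#   wait_for_resync d5 30 && metastat d5
```

The loop only reports that the resync line has disappeared; it does not by itself confirm that the meta device state is Okay.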
Once the data is recreated on the original device, the metastat(1m) output returns to normal:
$ metastat d5
d5: RAID
State: Okay
Hot spare pool: hsp001
Interlace: 32 blocks
Size: 106085968 blocks (50 GB)
Original device:
Size: 106089600 blocks (50 GB)
Device     Start Block  Dbase  State      Reloc  Hot Spare
c1t1d0s0   5042         No     Okay       Yes
c1t2d0s0   3966         No     Okay       Yes
c1t3d0s0   3966         No     Okay       Yes
c1t4d0s0   3966         No     Okay       Yes
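After the original device is back in service, it is worth confirming that the spare has returned to the pool. The sketch below counts the spares reported as Available, assuming "metahs -i" output in the format shown earlier in this article; available_spares is a hypothetical helper name:

```shell
#!/bin/sh
# available_spares (hypothetical helper, not an SVM command):
# reads "metahs -i" output on stdin and prints how many hot spares
# have "Available" in the Status column (second field).
available_spares() {
    awk '$2 == "Available" { n++ } END { print n + 0 }'
}

# Usage on a Solaris host (assumes metahs is in PATH):
#   metahs -i hsp001 | available_spares
```

For the two-device pool built at the start of this article, a count lower than 2 would suggest a spare is still in use or needs attention.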
The SVM administrator's guide has a wealth of information on recoverability and hot spares, and is an extremely thorough piece of documentation (it is also valuable when preparing for the SVM certification). If you have questions or comments on the article, please feel free to e-mail the author.