Replacing failed disk devices with the Solaris Volume Manager
While reviewing a post I left on blogs.sun.com, I noticed that Jerry Jelinek replied to my post on replacing disks managed by the Solaris Volume Manager. I had run into the exact issue he covered in his blog, and was glad to see that the disk replacement annoyance was addressed in the latest Solaris Express.
The following example (per Jerry’s feedback and limited testing in my sandbox) shows how to replace a disk named c0t0d0 using the cfgadm(1m) and metareplace(1m) utilities. The first step is to remove any meta state databases that exist on the disk that needs to be replaced. To list the locations of all meta state databases, run the metadb(1m) command with the “-i” option:
$ metadb -i
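To narrow that listing to the disk being replaced, the output can be filtered on the device name. The following is a minimal sketch, assuming the replica's block device path is the last field on each metadb line; the function name replicas_on_disk is hypothetical and not part of SVM:

```shell
# replicas_on_disk DISK
# Reads metadb output on stdin and prints the unique slices of DISK
# (for example c0t0d0) that hold state database replicas.
# Assumption: the device path is the last whitespace-separated field.
replicas_on_disk() {
    awk -v disk="$1" '
        # Keep lines whose last field looks like /dev/dsk/<disk>s<N>.
        $NF ~ ("^/dev/dsk/" disk "s[0-9]+$") {
            n = split($NF, parts, "/")
            print parts[n]
        }' | sort -u
}

# Usage on a live system (not executed here):
#   metadb | replicas_on_disk c0t0d0
```

Anchoring the pattern on the slice suffix keeps c0t1d0 from matching c0t10d0.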
If meta state databases exist on the disk, you can run metadb(1m) with the “-d” option to remove them. The following example deletes all meta state databases on slice 7 of the disk (c0t0d0) that we are going to replace:
$ metadb -d c0t0d0s7
Once the meta state databases are removed, you can use cfgadm(1m)’s “-c unconfigure” option to remove an occupant (an entity that lives in a receptacle) from Solaris:
$ cfgadm -c unconfigure c0::dsk/c0t0d0
Once Solaris unconfigures the device, you can physically replace the disk. Once the drive is replaced, you can run cfgadm(1m) with the “-c configure” option to let Solaris know the occupant is available for use:
$ cfgadm -c configure c0::dsk/c0t0d0
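Before moving on, it can help to confirm that the occupant really is configured. The sketch below reads the output of "cfgadm -al" and prints the Occupant column for one attachment point. It assumes the usual column order (Ap_Id, Type, Receptacle, Occupant, Condition), and the function name occupant_state is made up for illustration:

```shell
# occupant_state AP_ID
# Reads `cfgadm -al` output on stdin and prints the Occupant column
# (for example "configured" or "unconfigured") for attachment point AP_ID.
# Assumption: columns are Ap_Id, Type, Receptacle, Occupant, Condition.
occupant_state() {
    awk -v ap="$1" '$1 == ap { print $4 }'
}

# Usage on a live system (not executed here):
#   cfgadm -al | occupant_state c0::dsk/c0t0d0
```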
Once Solaris knows the drive is available, you will need to write a VTOC to the new drive with fmthard(1m) or format(1m). This adds a disk label to the drive, which defines the partition types and sizes. Once a valid VTOC is installed, you can invoke the trusty old metareplace(1m) utility to replace the faulted meta devices. The following example re-enables the component c0t0d0s0 of meta device d10, and causes the meta device to start synchronizing data from the other half of the mirror (if RAID level 1 is used):
$ metareplace -e d10 c0t0d0s0
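The whole sequence can be sketched as a dry run that only prints the commands, which makes it easy to review the plan before touching a real system. The function name print_replace_plan and its argument layout are hypothetical; the commands it prints are the ones shown in the steps above:

```shell
# print_replace_plan DISK CTRL DBSLICE META DATASLICE
# Prints (does not execute) the replacement sequence for one disk.
# All argument values are illustrative; adjust them to the real layout.
print_replace_plan() {
    disk=$1        # disk being replaced, e.g. c0t0d0
    ctrl=$2        # controller attachment point, e.g. c0
    dbslice=$3     # slice holding state databases, e.g. s7
    meta=$4        # meta device to re-enable, e.g. d10
    dataslice=$5   # slice backing that meta device, e.g. s0

    echo "metadb -d ${disk}${dbslice}"
    echo "cfgadm -c unconfigure ${ctrl}::dsk/${disk}"
    echo "# physically swap the drive, then:"
    echo "cfgadm -c configure ${ctrl}::dsk/${disk}"
    echo "# label the new drive with format(1m) or fmthard(1m), then:"
    echo "metareplace -e ${meta} ${disk}${dataslice}"
}

print_replace_plan c0t0d0 c0 s7 d10 s0
```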
I dig the terminology cfgadm(1m) uses to describe the connections between components. “Receptacle,” “occupant” and “attachment point” should be used a bit more universally in the IT industry! :) I recommend testing all examples in a sandbox prior to use in production (i.e., I am not responsible for breaking your stuff)!

Vinay on March 28th, 2006
In the method described above, you delete the replicas on the disk (c0t0d0s7) before replacing it. After the new disk is inserted, I don’t see any step to recreate the replicas on the new disk.
The procedure I usually follow, at least in Solaris Volume Manager (Solaris 9 and upwards), is as follows.
To replace an SVM-controlled disk which is part of a mirror, the following steps must be followed:
1. Run ‘metadetach’ to detach all submirrors on the failing disk from their respective mirrors, and then run ‘metaclear’ on those submirror devices. If you don’t use the -f option, you will get the following message: “Attempt an operation on a submirror that has erred component”
metadetach -f
metaclear
In our case the failed disk is c1t0d0
metadetach -f d0 d10
metadetach -f d1 d11
metaclear d10
metaclear d11
You can verify there are no existing metadevices left on the disk by running:
metastat -p | grep c1t0d0
2. Next, remove the replicas on this disk (c1t0d0).
To view the replicas on this disk:
$ metadb |grep c1t0d0
Wm p l 16 8192 /dev/dsk/c1t0d0s7
W p l 8208 8192 /dev/dsk/c1t0d0s7
If there are any replicas on this disk, remove them using
metadb -d c#t#d#s#
In our case
metadb -d c1t0d0s7
You can verify there are no existing replicas left on the disk by running:
metadb | grep c1t0d0
3. If there are any open filesystems on this disk (not under SVM control), unmount them.
4. Run the ‘cfgadm’ command to remove the failed disk.
cfgadm -c unconfigure c#::dsk/c#t#d#
NOTE: if the message “Hardware specific failure: failed to
unconfigure SCSI device: I/O error” appears, check to make sure that
you cleared all replicas and metadevices from the disk, and that the
disk is not being accessed.
cfgadm -c unconfigure c1::dsk/c1t0d0
5. Insert and configure the new disk.
cfgadm -c configure c#::dsk/c#t#d#
cfgadm -al (just to confirm that disk is configured properly)
cfgadm -c configure c1::dsk/c1t0d0
cfgadm -al
6. Run ‘format’, or ‘prtvtoc’ together with ‘fmthard’, to put the desired partition table on the new disk.
Note: The VTOC (volume table of contents) on the root disk and root mirror must be the same. Copy the VTOC from the surviving disk (c1t1d0 here) using prtvtoc and fmthard.
$ prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
7. Run ‘metadevadm’ on the disk, which will update SVM with the new device ID.
metadevadm -u c#t#d#
NOTE: If you get the message “Open of /dev/dsk/c#t#d#s0 failed”, you
can safely ignore the message (this is a known bug pending a fix).
metadevadm -u c1t0d0
8. If necessary, recreate any replicas on the new disk:
metadb -a -c 2 c#t#d#s#
metadb -a -c 2 c1t0d0s7
9. Recreate each metadevice to be used as submirrors, and use ‘metattach’ to attach those submirrors to the mirrors to start the resync. Note: If the submirror was something other than a simple one-slice concat device, the metainit command will be different than shown here.
metainit 1 1
metattach
metainit d10 1 1 c1t0d0s0
metattach d0 d10
metainit d11 1 1 c1t0d0s1
metattach d1 d11
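
When several mirrors share the replaced disk, the metainit and metattach pairs in step 9 can be generated from a short list instead of typed by hand. This is only a sketch that prints the commands for review. The function name print_reattach and the MIRROR:SUBMIRROR:SLICE argument format are made up for illustration, and it assumes every submirror is a simple one-slice concat as in the example above:

```shell
# print_reattach DISK MIRROR:SUBMIRROR:SLICE...
# Prints (does not execute) one metainit/metattach pair per argument.
# Assumption: each submirror is a simple one-slice concat ("1 1").
print_reattach() {
    disk=$1
    shift
    for spec in "$@"; do
        # Split "d0:d10:s0" into mirror, submirror and slice.
        mirror=${spec%%:*}
        rest=${spec#*:}
        sub=${rest%%:*}
        slice=${rest#*:}
        echo "metainit ${sub} 1 1 ${disk}${slice}"
        echo "metattach ${mirror} ${sub}"
    done
}

print_reattach c1t0d0 d0:d10:s0 d1:d11:s1
```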