Replacing failed disk devices with the Solaris Volume Manager

While reviewing a post I left on blogs.sun.com, I noticed that Jerry Jelinek replied to my post on replacing disks managed by the Solaris Volume Manager. I had run into the exact issue he covered in his blog, and was glad to see that the disk replacement annoyance was addressed in the latest Solaris Express.

The following example (per Jerry’s feedback and limited testing in my sandbox) shows how to replace a disk named c0t0d0 using the cfgadm(1m) and metareplace(1m) utilities. The first step is to remove any meta state databases (if they exist) on the disk that needs to be replaced. To list the locations of all meta state databases, run the metadb(1m) command with the “-i” option:

$ metadb -i

If meta state databases exist on the disk, you can run metadb(1m) with the “-d” option to remove them. The following example deletes all meta state databases on slice 7 of the disk (c0t0d0) that we are going to replace:

$ metadb -d c0t0d0s7
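
If the metadb(1m) listing is long, a small filter can pull out just the slices on the disk being replaced. The sketch below is a hypothetical helper (replica_slices is not part of Solaris), and it assumes the device path is the last field on each replica line, as in typical metadb output. On older Solaris releases the stock awk may not accept -v, in which case nawk would be needed.

```shell
# replica_slices DISK
# Reads `metadb` output on stdin and prints each slice of DISK that holds
# a state database replica, once per slice. Hypothetical helper name.
replica_slices() {
    disk=$1
    awk -v disk="$disk" '
        # Assumption: the device path is the last field,
        # e.g. /dev/dsk/c0t0d0s7.
        $NF ~ ("/" disk "s[0-9]+$") {
            n = split($NF, parts, "/")
            if (!seen[parts[n]]++) print parts[n]
        }'
}
```

It would be used as `metadb | replica_slices c0t0d0`, and each slice it prints is a candidate argument for `metadb -d`.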

Once the meta state databases are removed, you can use cfgadm(1m)’s “-c unconfigure” option to remove an occupant (an entity that lives in a receptacle) from Solaris:

$ cfgadm -c unconfigure c0::dsk/c0t0d0

Once Solaris unconfigures the device, you can physically replace the disk. Once the drive is replaced, you can run cfgadm(1m) with the “-c configure” option to let Solaris know the occupant is available for use:

$ cfgadm -c configure c0::dsk/c0t0d0
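
To confirm that a configure or unconfigure step took effect, the Occupant column of `cfgadm -al` can be checked. The helper below is a hypothetical sketch (occupant_state is a made-up name). It assumes the usual five-column listing (Ap_Id, Type, Receptacle, Occupant, Condition) and reads the listing from standard input, so it can be tried against saved output first.

```shell
# occupant_state AP_ID
# Reads `cfgadm -al` output on stdin and prints the Occupant column
# (for example "configured" or "unconfigured") for the given attachment point.
# Assumption: Occupant is the fourth whitespace-separated column.
occupant_state() {
    awk -v ap="$1" '$1 == ap { print $4 }'
}
```

For example, `cfgadm -al | occupant_state c0::dsk/c0t0d0` should print the current state of the disk used in this walkthrough.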

Once Solaris knows the drive is available, you will need to write a VTOC to the new drive with fmthard(1m) or format(1m). This adds a disk label to the drive, which defines the partition types and sizes. Once a valid VTOC is installed, you can invoke the trusty old metareplace(1m) utility to replace the faulted meta devices. The following example will replace the device associated with meta device d10, and cause the meta device to start synchronizing data from the other half of the mirror (if RAID level 1 is used):

$ metareplace -e d10 c0t0d0s0
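
The steps above can be collected into a dry-run sketch that only prints the commands in order, which makes it easier to review the sequence before touching a real system. The function name print_replace_plan is hypothetical, and deriving the controller (c0) from the disk name is an assumption based on the c0::dsk/c0t0d0 example; the real attachment point ID should be confirmed with cfgadm(1m) first.

```shell
# print_replace_plan DISK REPLICA_SLICE METADEVICE DATA_SLICE
# Prints the commands from this walkthrough in order. It runs nothing.
print_replace_plan() {
    disk=$1 replica=$2 md=$3 slice=$4
    # Assumption: the controller is the part of the name before "t",
    # e.g. c0 from c0t0d0.
    ctrl=${disk%%t*}
    cat <<EOF
metadb -d ${replica}
cfgadm -c unconfigure ${ctrl}::dsk/${disk}
# physically swap the disk here
cfgadm -c configure ${ctrl}::dsk/${disk}
# label the new disk with format(1m) or fmthard(1m)
metareplace -e ${md} ${slice}
EOF
}
```

Calling `print_replace_plan c0t0d0 c0t0d0s7 d10 c0t0d0s0` prints the same commands used in this post.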

I dig the terminology cfgadm(1m) uses to describe the connections between components. “Receptacle,” “occupant” and “attachment point” should be used a bit more universally in the IT industry! :) I recommend testing all examples in a sandbox prior to use in production (i.e., I am not responsible for breaking your stuff)!

10 Comments

Vinay  on March 28th, 2006

In the method described above, you are deleting the replicas on this disk (c0t0d0s7) before replacing the disk. After the new disk is inserted, I don’t see any step to recreate the replicas on the new disk.
The procedure I usually follow, at least in Solaris Volume Manager (Solaris 9 and upwards), is as follows.
To replace an SVM-controlled disk which is part of a mirror, the following steps must be followed:

1. Run ‘metadetach’ to detach all submirrors on the failing disk from their respective mirrors, and then run ‘metaclear’ on those submirror devices (see below). If you don’t use the -f option, you will get the following message: “Attempt an operation on a submirror that has erred component”

metadetach -f
metaclear

In our case the failed disk is c1t0d0

metadetach -f d0 d10
metadetach -f d1 d11
metaclear d10
metaclear d11
You can verify there are no existing metadevices left on the disk by running:

metastat -p | grep c1t0d0

2. The next step is to remove the replicas on this disk, c1t0d0.
To view the replicas on this disk:
$ metadb |grep c1t0d0
Wm p l 16 8192 /dev/dsk/c1t0d0s7
W p l 8208 8192 /dev/dsk/c1t0d0s7
If there are any replicas on this disk, remove them using
metadb -d c#t#d#s#

In our case
metadb -d c1t0d0s7

You can verify there are no existing replicas left on the disk by
running
metadb | grep c1t0d0

3. If there are any open filesystems on this disk (not under SVM control), unmount them.

4. Run the ‘cfgadm’ command to remove the failed disk.
cfgadm -c unconfigure c#::dsk/c#t#d#
NOTE: if the message “Hardware specific failure: failed to
unconfigure SCSI device: I/O error” appears, check to make sure that
you cleared all replicas and metadevices from the disk, and that the
disk is not being accessed.

cfgadm -c unconfigure c1::dsk/c1t0d0

5. Insert the new disk and configure it.
cfgadm -c configure c#::dsk/c#t#d#
cfgadm -al (just to confirm that the disk is configured properly)
cfgadm -c configure c1::dsk/c1t0d0
cfgadm -al

6. Run ‘format’, or ‘prtvtoc’ together with ‘fmthard’, to put the desired partition table on the new disk.
Note: The VTOC (volume table of contents) on the root disk and root mirror must be the same. Copy the VTOC using prtvtoc and fmthard.
$ prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2

7. Run ‘metadevadm’ on the disk, which will update the new DevID.
metadevadm -u c#t#d#
NOTE: If you get the message “Open of /dev/dsk/c#t#d#s0 failed”, you
can safely ignore the message (this is a known bug pending a fix).
metadevadm -u c1t0d0

8. If necessary, recreate any replicas on the new disk:
metadb -a -c 2 c#t#d#s#

metadb -a -c 2 c1t0d0s7

9. Recreate each metadevice to be used as a submirror, and use ‘metattach’ to attach those submirrors to the mirrors to start the resync. Note: If the submirror was something other than a simple one-slice concat device, the metainit command will be different than shown here.
metainit 1 1
metattach

metainit d10 1 1 c1t0d0s0
metattach d0 d10
metainit d11 1 1 c1t0d0s1
metattach d1 d11
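
The per-submirror commands in steps 1 and 9 repeat for every mirror on the disk, so they can be generated from a short list. The sketch below is a dry run that only prints commands. The function names and the MIRROR:SUBMIRROR:SLICE argument format are hypothetical, and it assumes simple one-slice concat submirrors, as in the example above.

```shell
# Each argument is MIRROR:SUBMIRROR:SLICE, e.g. d0:d10:c1t0d0s0.
# Both functions only print commands. They run nothing.

# Step 1: detach every submirror first, then clear each one.
print_detach_cmds() {
    for spec in "$@"; do
        mirror=${spec%%:*}; rest=${spec#*:}; sub=${rest%%:*}
        printf 'metadetach -f %s %s\n' "$mirror" "$sub"
    done
    for spec in "$@"; do
        rest=${spec#*:}; sub=${rest%%:*}
        printf 'metaclear %s\n' "$sub"
    done
}

# Step 9: recreate each submirror as a one-slice concat and reattach it.
print_reattach_cmds() {
    for spec in "$@"; do
        mirror=${spec%%:*}; rest=${spec#*:}
        sub=${rest%%:*}; slice=${rest#*:}
        printf 'metainit %s 1 1 %s\n' "$sub" "$slice"
        printf 'metattach %s %s\n' "$mirror" "$sub"
    done
}
```

For the example in this comment, the arguments would be `d0:d10:c1t0d0s0 d1:d11:c1t0d0s1`.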

Luke  on August 31st, 2006

Wonderful directions! However, I would point out a pitfall that some may encounter (as I did) in step 4 above. If you have your swap space mirrored, you may get the following error when you try the cfgadm -c unconfigure command:

cfgadm: Component system is busy, try again: failed to offline:
Resource Information
------------------ -----------------------
/dev/dsk/c1t0d0s1 dump device (dedicated)

I actually had to call Sun about this and they advised me to first verify that the device is in fact configured as a dump device by running:

dumpadm

The output should indicate that the disk you are trying to unconfigure holds a slice that is set as the dump device. If it does, you need to tell Solaris to change the dump device to the corresponding slice on the good disk (in my case the other swap submirror was at c1t1d0s1).

dumpadm -d /dev/dsk/c1t1d0s1

Verify that the dump device is set to the new slice

dumpadm

And then go ahead and use the cfgadm -c unconfigure command to release the disk.
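
The dump device check can also be done before attempting the unconfigure. The sketch below uses hypothetical helper names and assumes that dumpadm prints a line of the form "Dump device: /dev/dsk/c1t0d0s1 (dedicated)". It reads dumpadm output from standard input so it can be tried on saved output.

```shell
# dump_device
# Reads `dumpadm` output on stdin and prints the dump device path.
# Assumption: the path follows the "Dump device:" label on one line.
dump_device() {
    awk -F': *' '/Dump device:/ { split($2, f, " "); print f[1] }'
}

# dump_on_disk DISK
# Succeeds (exit status 0) if the dump device read from stdin is a
# slice of DISK, e.g. c1t0d0.
dump_on_disk() {
    case $(dump_device) in
        */"$1"s[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is `dumpadm | dump_on_disk c1t0d0 && echo "move the dump device first"`.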

ram praveen  on October 31st, 2006

The concepts you have all discussed are excellent, but I have a question:
What if one disk in a mirror has failed and Solaris cannot detect it at all, not even to remove those replicas?

-wMz-  on January 16th, 2007

I used these directions recently. Thank you to all for great information :)

I came across a problem in which VxVM had control of the disk I wanted to replace. It turns out that only the multipathing feature of VxVM had control of it.

As a result the cfgadm command would not unconfigure the disk. I used vxdiskadm to disable Multipathing on the disk that needed to be replaced …

example steps …

bash-2.05$ vxdiskadm

option 17 – Prevent multipathing / suppress devices from VxVM’s view

yes (return key)

option 2 – Suppress a path from VxVM’s view

list (list all paths)

c3t1d0 (my example)

yes (return key)

no (return key)

q (return key)

q (return key)

example of disabled c3t1d0 …

bash-2.05$ vxdisk path

SUBPATH DANAME DMNAME GROUP STATE

c1t0d0s2 c1t0d0s2 se331001 se3310 ENABLED

c0t0d0s2 c1t0d0s2 se331001 se3310 ENABLED

c3t0d0s2 c3t0d0s2 - - ENABLED

c3t1d0s2 c3t1d0s2 - - DISABLED

c3t2d0s2 c3t2d0s2 - - ENABLED

c3t3d0s2 c3t3d0s2 - - ENABLED

I was then able to re-run cfgadm without a problem …

cfgadm -c unconfigure c3::dsk/c3t1d0

example of unconfigured c3t1d0 …

bash-2.05$ cfgadm -al

c3::dsk/c3t1d0 disk connected unconfigured unknown

NOTE: vxdmp automatically re-enables multipathing on the newly replaced disk

cfgadm -c configure c3::dsk/c3t1d0

bash-2.05$ sudo vxdisk path

c3t1d0s2 c3t1d0s2 - - ENABLED

cheers,
-wMz-

Jim Moore  on March 14th, 2008

For years, I’ve been under the impression that when a boot disk is under Solaris Volume Manager, for a core to be properly saved, the dump device, as defined through the dumpadm command, had to match the logical address as seen in the swap -l output. By default, the dump device would read /dev/dsk/c0t0d0s1, whereas swap’s logical address, when under SVM, would read /dev/md/dsk/d1. Playing on an E450 today, it seems like it doesn’t really matter. I have not been able to find any references to the situation via Sun. Any comments?

Sir June  on July 12th, 2008

I have a failing disk, c1t2d0, under Solaris 9:
d15: Concat/Stripe
Size: 573356544 blocks (273 GB)
Stripe 0: (interlace: 32 blocks)
Device Start Block Dbase Reloc
c1t2d0s2 0 No Yes
c1t3d0s2 20352 No Yes

How do I remove it from SVM and re-create the metadevice after the replacement? I am new to Solaris and SVM. Please provide the steps if you can.

Thank you.
sirjune

senthilkumar  on August 3rd, 2009

How do I identify a failed hard disk in SVM?

Anyone, please advise.

Thanks,
Senthilkumar.R

senthilkumar  on August 3rd, 2009

How can I find out how much RAM is currently in use on a Solaris box?

Thanks
Senthilkumar.R

anonyme  on May 13th, 2011

How about the installboot to make the disk bootable?

sahil  on November 29th, 2011

Thanks for the great information, but I have a small question here.

I have 2 disks, and 3 volumes (say v1, v2 and v3) are created on different slices under SVM. I found some errors in the v3 volume on disk1, so I am planning to replace my 1st disk. My question is: what will happen to the other volumes (v1 and v2), which also use this disk on different slices?

Please advise. Thanks in advance.
