Replacing failed disk devices with the Solaris Volume Manager

While reviewing a post I left on blogs.sun.com, I noticed that Jerry Jelinek had replied to my entry on replacing disks managed by the Solaris Volume Manager. I had run into the exact issue he covered in his blog, and was glad to see that the disk replacement annoyance was addressed in the latest Solaris Express.

The following example (per Jerry’s feedback and limited testing in my sandbox) shows how to replace a disk named c0t0d0 using the cfgadm(1m) and metareplace(1m) utilities. The first step is to remove any meta state databases (if they exist) on the disk that needs to be replaced. To locate all meta state databases, run the metadb(1m) command with the “-i” option:

$ metadb -i
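
Each line of output shows a replica’s status flags, its starting block and block count, and the device it lives on (metadb -i also prints a legend explaining the flags). Illustrative output for a single replica on slice 7 might look like:

        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7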

If meta state databases exist on the disk, you can run metadb(1m) with the “-d” option to remove them. The following example deletes all meta state databases on slice 7 of the disk (c0t0d0) that we are going to replace:

$ metadb -d c0t0d0s7

Once the meta state databases are removed, you can use cfgadm(1m)’s “-c unconfigure” option to remove an occupant (an entity that lives in a receptacle) from Solaris:

$ cfgadm -c unconfigure c0::dsk/c0t0d0

Once Solaris unconfigures the device, you can physically replace the disk. After the drive is replaced, run cfgadm(1m) with the “-c configure” option to let Solaris know the occupant is available for use:

$ cfgadm -c configure c0::dsk/c0t0d0
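
To confirm the occupant’s new state, you can list all attachment points with cfgadm(1m)’s “-al” option; a successfully configured disk shows up along these lines (illustrative output):

$ cfgadm -al
c0::dsk/c0t0d0                 disk         connected    configured   unknown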

Once Solaris knows the drive is available, you will need to write a VTOC to the new drive with fmthard(1m) or format(1m). This adds a disk label to the drive, which defines the partition types and sizes. Once a valid VTOC is installed, you can invoke the trusty old metareplace(1m) utility to replace the faulted meta devices. The following example replaces the device associated with meta device d10, and causes the meta device to start synchronizing data from the other half of the mirror (if RAID level 1 is used):

$ metareplace -e d10 c0t0d0s0
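
If you need a concrete starting point for the VTOC step above, prtvtoc(1m) and fmthard(1m) can copy the label from the surviving half of the mirror onto the replacement; the source disk c0t1d0 below is an assumption for illustration:

$ prtvtoc /dev/rdsk/c0t1d0s2 | fmthard -s - /dev/rdsk/c0t0d0s2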

I dig the terminology cfgadm(1m) uses to describe the connections between components. “Receptacle,” “occupant” and “attachment point” should be used a bit more universally in the IT industry! :) I recommend testing all examples in a sandbox prior to use in production (i.e., I am not responsible for breaking your stuff)!

10 thoughts on “Replacing failed disk devices with the Solaris Volume Manager”

  1. In the method described above, you are deleting the replicas on disk c0t0d0s7 before replacing the disk, but after the new disk is inserted I don’t see any step to recreate the replicas on the new disk.
    The procedure I usually follow, at least in Solaris Volume Manager (Solaris 9 and upwards), is as follows.
    To replace an SVM-controlled disk which is part of a mirror:

    1. Run ‘metadetach’ to detach all submirrors on the failing disk from their respective mirrors, and then run ‘metaclear’ on those submirror devices (see below). If you don’t use the -f option, you will get the following message: “Attempt an operation on a submirror that has erred component”.

    metadetach -f <mirror> <submirror>
    metaclear <submirror>

    In our case the failed disk is c1t0d0

    metadetach -f d0 d10
    metadetach -f d1 d11
    metaclear d10
    metaclear d11
    You can verify there are no existing metadevices left on the disk by running:

    metastat -p | grep c1t0d0
    2. The next step is to remove the replicas on this disk, c1t0d0.
    To view the replicas on this disk:
    $ metadb | grep c1t0d0
    Wm p l 16 8192 /dev/dsk/c1t0d0s7
    W p l 8208 8192 /dev/dsk/c1t0d0s7
    If there are any replicas on this disk, remove them using:
    metadb -d c#t#d#s#

    In our case
    metadb -d c1t0d0s7

    You can verify there are no existing replicas left on the disk by running:
    metadb | grep c1t0d0

    3. If there are any open filesystems on this disk (not under SVM control), unmount them.

    4. Run the ‘cfgadm’ command to remove the failed disk.
    cfgadm -c unconfigure c#::dsk/c#t#d#
    NOTE: if the message “Hardware specific failure: failed to
    unconfigure SCSI device: I/O error” appears, check to make sure that
    you cleared all replicas and metadevices from the disk, and that the
    disk is not being accessed.

    cfgadm -c unconfigure c1::dsk/c1t0d0
    5. Insert the new disk and configure it.
    cfgadm -c configure c#::dsk/c#t#d#
    cfgadm -al (just to confirm the disk is configured properly)
    cfgadm -c configure c1::dsk/c1t0d0
    cfgadm -al

    6. Run ‘format’, or use ‘prtvtoc’ and ‘fmthard’, to put the desired partition table on the new disk.
    Note: The VTOC (volume table of contents) on the root disk and root mirror must be the same. Copy the VTOC using prtvtoc and fmthard:
    $ prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
    7. Run ‘metadevadm’ on the disk, which will update the new DevID:
    metadevadm -u c#t#d#
    NOTE: If you get the message “Open of /dev/dsk/c#t#d#s0 failed”, you
    can safely ignore it (this is a known bug pending a fix).
    metadevadm -u c1t0d0
    8. If necessary, recreate any replicas on the new disk:
    metadb -a -c 2 c#t#d#s#

    metadb -a -c 2 c1t0d0s7
    9. Recreate each metadevice to be used as a submirror, and use ‘metattach’ to attach those submirrors to the mirrors to start the resync. Note: If the submirror was something other than a simple one-slice concat device, the metainit command will be different than shown here (see the sketch after the commands below).

    metainit <submirror> 1 1 <slice>
    metattach <mirror> <submirror>

    metainit d10 1 1 c1t0d0s0
    metattach d0 d10
    metainit d11 1 1 c1t0d0s1
    metattach d1 d11
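
    As a hypothetical illustration of the note in step 9, a submirror (say d20, attached to mirror d2; names and slices invented for the example) that had been built as a concat of two one-slice stripes would be recreated and reattached along these lines:

    metainit d20 2 1 c1t0d0s3 1 c1t1d0s3
    metattach d2 d20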

  2. Wonderful directions! However, I would point out a pitfall that some may encounter (as I did) in step 4 above. If you have your swap space mirrored, you may get the following error when you try the cfgadm -c unconfigure command:

    cfgadm: Component system is busy, try again: failed to offline:
         Resource              Information
    ------------------   -----------------------
    /dev/dsk/c1t0d0s1    dump device (dedicated)

    I actually had to call Sun about this and they advised me to first verify that the device is in fact configured as a dump device by running:

    dumpadm
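
    Run with no arguments, dumpadm prints the current crash dump configuration, roughly as follows (paths and hostname are illustrative); the “Dump device” line is the one to check:

          Dump content: kernel pages
           Dump device: /dev/dsk/c1t0d0s1 (dedicated)
    Savecore directory: /var/crash/hostname
      Savecore enabled: yes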

    The output should indicate that the disk you are trying to unconfigure holds a slice that is set as the dump device. If it does, you need to tell Solaris to change the dump device to the corresponding slice on the good disk (in my case the other swap submirror was at c1t1d0s1).

    dumpadm -d /dev/dsk/c1t1d0s1

    Verify that the dump device is set to the new slice

    dumpadm

    And then go ahead and use the cfgadm -c unconfigure command to release the disk.

  3. The concepts you have all discussed are excellent, but I have a doubt:
    what if one disk in a mirror has failed so completely that Solaris cannot detect it, even to remove its replicas?

  4. I used these directions recently. Thank you to all for great information :)

    I came across a problem in which VxVM had control of the disk I wanted to replace. It turns out VxVM merely had hold of it through its Multipathing (DMP) feature.

    As a result the cfgadm command would not unconfigure the disk. I used vxdiskadm to disable Multipathing on the disk that needed to be replaced …

    example steps …

    bash-2.05$ vxdiskadm

    option 17 – Prevent multipathing / suppress devices from VxVM’s view

    yes (return key)

    option 2 – Suppress a path from VxVM’s view

    list (list all paths)

    c3t1d0 (my example)

    yes (return key)

    no (return key)

    q (return key)

    q (return key)

    example of disabled c3t1d0 …

    bash-2.05$ vxdisk path

    SUBPATH     DANAME      DMNAME     GROUP    STATE
    c1t0d0s2    c1t0d0s2    se331001   se3310   ENABLED
    c0t0d0s2    c1t0d0s2    se331001   se3310   ENABLED
    c3t0d0s2    c3t0d0s2    -          -        ENABLED
    c3t1d0s2    c3t1d0s2    -          -        DISABLED
    c3t2d0s2    c3t2d0s2    -          -        ENABLED
    c3t3d0s2    c3t3d0s2    -          -        ENABLED

    I was then able to re-run cfgadm without a problem …

    cfgadm -c unconfigure c3::dsk/c3t1d0

    example of unconfigured c3t1d0 …

    bash-2.05$ cfgadm -al

    c3::dsk/c3t1d0 disk connected unconfigured unknown

    NOTE: vxdmp automatically re-enables the multipathing on the new replaced disk

    cfgadm -c configure c3::dsk/c3t1d0

    bash-2.05$ sudo vxdisk path
    c3t1d0s2    c3t1d0s2    -          -        ENABLED

    cheers,
    -wMz-

  5. For years, I’ve been under the impression that when a boot disk is under Solaris Volume Manager, for a core to be properly saved, the dump device, as defined through the dumpadm command, had to match the logical address seen in the swap -l output. By default, the dump device would read /dev/dsk/c0t0d0s1, whereas swap’s logical address, when under SVM, would read /dev/md/dsk/d1. Playing on an E450 today, it seems like it doesn’t really matter. I have not been able to find any references to the situation via Sun. Any comments?
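
    A quick way to check for the mismatch described above is to compare the “Dump device” line from dumpadm against the device column of swap -l (no options needed); if one reads /dev/dsk/c0t0d0s1 and the other /dev/md/dsk/d1, you are in exactly this situation:

    dumpadm
    swap -l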

  6. I have a failing disk, c1t2d0, under Solaris 9:
    d15: Concat/Stripe
    Size: 573356544 blocks (273 GB)
    Stripe 0: (interlace: 32 blocks)
        Device      Start Block   Dbase   Reloc
        c1t2d0s2              0   No      Yes
        c1t3d0s2          20352   No      Yes

    How do I remove it from SVM and re-create the metadevice after the replacement? I am new to Solaris and SVM; please provide steps if you can.

    Thank you.
    sirjune

  7. Thanks for the great information, but I have a small doubt here.

    I have 2 disks, and 3 volumes (say v1, v2 & v3) created on different slices under SVM. I found some errors in the v3 volume on disk 1, so I am planning to replace my first disk. My doubt is: what will happen to the other volumes (v1 & v2), which also use this disk on different slices?

    Please advise. Thanks in advance.
