Removing duplicate devices from vxdisk list

I replaced a disk in one of our A5200s with vxdiskadm last week, and noticed that vxdisk was displaying two entries for the device after the replacement:

$ vxdisk list

DEVICE       TYPE      DISK         GROUP        STATUS
c7t21d0s2    sliced    disk01       oradg        online
c7t22d0s2    sliced    disk02       oradg        error
c7t22d0s2    sliced    -            -            error
c7t23d0s2    sliced    disk03       oradg        online

To fix this annoyance, I first removed the disk disk02 from the oradg disk group:

$ vxdg -g oradg rmdisk disk02

Once the disk was removed, I ran "vxdisk rm" twice to remove both disk access records:

$ vxdisk rm c7t22d0s2

$ vxdisk rm c7t22d0s2

After both disk access records were removed, I executed "devfsadm -C" to clean up the Solaris device tree, and then ran "vxdctl enable" to have Veritas update the list of devices it knows about:
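$ devfsadm -C

$ vxdctl enable

After these operations completed, the device showed up only once in the vxdisk output: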

$ vxdisk list

DEVICE       TYPE      DISK         GROUP        STATUS
c7t21d0s2    sliced    disk01       oradg        online
c7t22d0s2    sliced    disk02       oradg        online
c7t23d0s2    sliced    disk03       oradg        online

I have seen cases where the Solaris device tree holds on to old entries, which unfortunately requires a reboot to fix. Luckily for me, that wasn't the case on this system. Shibby!

Locating disk drives in a sea of A5200s

I manage about a dozen Sun A5200 storage arrays, and periodically need to replace failed disk drives in them. To ensure that I replace the correct device, I first use the format utility to locate the physical device path of the faulted drive:

$ format

                    < ..... >
43. c7t22d0 
   /sbus@3,0/SUNW,socal@0,0/sf@0,0/ssd@w22000004cf995f6c,0

Once I know which device to replace, I use the luxadm "remove_device" subcommand to remove the drive for replacement, and then run luxadm with the "led_blink" subcommand to blink the amber LED next to the faulted drive:

$ luxadm led_blink "/devices/sbus@3,0/SUNW,socal@0,0/sf@0,0/ssd@w22000004cf995f6c,0:a,raw"
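The remove_device invocation isn't shown above; it accepts either an enclosure,slot pair or a device path, so on my systems it looks roughly like this (reusing the path from the format output, adjust for your drive):

$ luxadm remove_device "/devices/sbus@3,0/SUNW,socal@0,0/sf@0,0/ssd@w22000004cf995f6c,0:a,raw"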

Once I enable the led_blink option, I wander down to the data center, locate the drive with the blinking light, and swap out the failed disk with a new disk. Even though the A5200 is an extremely old storage array, I still thoroughly enjoy managing them.

Monitoring md device rebuilds

One super useful utility that ships with CentOS 4.4 is watch. It runs a command repeatedly at a fixed interval and displays the output each time, which is especially handy for keeping an eye on array rebuilds. To use watch, run it with the command to execute and an optional interval that controls how often the output is refreshed:

$ watch --interval=10 cat /proc/mdstat

Every 10.0s: cat /proc/mdstat                                                             Mon Mar  5 22:30:58 2007

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
      8385856 blocks [2/2] [UU]

md2 : active raid5 sdg1[5] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      976751616 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [=>...................]  recovery =  9.8% (24068292/244187904) finish=161.1min speed=22764K/sec

md0 : active raid1 sdb1[1] sda1[0]
      235793920 blocks [2/2] [UU]

unused devices: <none>
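Watch also has a -d (differences) flag that highlights what changed between refreshes, which works nicely with rebuild output:

$ watch -d --interval=10 cat /proc/mdstat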

Adding a hot spare to an md device

I am running CentOS 4.4 on some old servers, and each of these servers has multiple internal disk drives. Since system availability concerns me more than the amount of available storage, I decided to add a hot spare to the md device that stores my data (md2). To add the hot spare, I ran the mdadm utility with the "--add" option, the md device to add the spare to, and the spare device to use:

$ /sbin/mdadm --add /dev/md2 /dev/sdh1
mdadm: added /dev/sdh1

After the spare was added, the device showed up in the /proc/mdstat output with the "(S)" string to indicate that it's a hot spare:

$ cat /proc/mdstat

Personalities : [raid1] [raid6] [raid5] [raid4] 
md1 : active raid1 sdb2[1] sda2[0]
      8385856 blocks [2/2] [UU]
      bitmap: 0/128 pages [0KB], 32KB chunk

md2 : active raid5 sdh1[5](S) sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      976751616 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 3/233 pages [12KB], 512KB chunk

md0 : active raid1 sdb1[1] sda1[0]
      235793920 blocks [2/2] [UU]
      bitmap: 7/225 pages [28KB], 512KB chunk

unused devices: <none>
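To double check the spare from mdadm's perspective, you can also run mdadm with the "--detail" option against the array; the output lists each member device along with its state, and the new disk should be reported as a spare and counted under "Spare Devices":

$ /sbin/mdadm --detail /dev/md2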

Getting live upgrade to work with a separate /var

While performing a live upgrade from Nevada build 54 to Nevada build 57, I bumped into the following error:

$ lucreate -n Nevada_B57 -m /:/dev/dsk/c1d0s0:ufs -m /var:/dev/dsk/c1d0s3:ufs -m -:/dev/dsk/c1d0s1:swap

Discovering physical storage devices
Discovering logical storage devices
Cross referencing storage devices with boot environment configurations
Determining types of file systems supported
Validating file system requests
Preparing logical storage devices
Preparing physical storage devices
Configuring physical storage devices
Configuring logical storage devices
Analyzing system configuration.
Comparing source boot environment file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Searching /dev for possible boot environment filesystem devices

Template entry /var:/dev/dsk/c1d0s3:ufs skipped.

luconfig: ERROR: Template filesystem definition failed for /var, all devices are not applicable..
ERROR: Configuration of boot environment failed.

The error message provided little information about the actual problem, and when I removed "-m /var:/dev/dsk/c1d0s3:ufs" from the lucreate command line, everything worked as expected. Being extremely baffled by this problem, I started reading through the opensolaris.org installation forum, and eventually came across a post from Nils Nieuwejaar. Nils mentioned that he had debugged an issue where the partition flags weren't set to "wm", and that this had caused his live upgrade to fail. Using Nils's feedback, I went into format and changed the partition flags for the new /var file system to "wm".
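For anyone who hasn't adjusted partition flags before, the change is made from format's partition menu. The session below is only a rough sketch (the disk number, slice, and bracketed defaults will differ on your system); the important part is entering "wm" at the permission flags prompt and accepting the existing values everywhere else:

$ format
Specify disk (enter its number): <number of the disk, c1d0 in my case>
format> partition
partition> 3
Enter partition id tag[unassigned]: 
Enter partition permission flags[wu]: wm
Enter new starting cyl[...]: 
Enter partition size[...]: 
partition> label
Ready to label disk, continue? y
partition> quit
format> quit

Once I saved the new label and ran lucreate again, everything worked as expected: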

$ lucreate -n Nevada_B57 -m /:/dev/dsk/c1d0s0:ufs -m /var:/dev/dsk/c1d0s3:ufs -m -:/dev/dsk/c1d0s1:swap

Discovering physical storage devices
Discovering logical storage devices
Cross referencing storage devices with boot environment configurations
Determining types of file systems supported
Validating file system requests
Preparing logical storage devices
Preparing physical storage devices
Configuring physical storage devices
Configuring logical storage devices
Analyzing system configuration.
Comparing source boot environment  file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Searching /dev for possible boot environment filesystem devices
                             
Updating system configuration files.
The device  is not a root device for any boot environment; cannot get BE ID.
Creating configuration for boot environment .
Source boot environment is .
Creating boot environment .
Checking for GRUB menu on boot environment .
The boot environment  does not contain the GRUB menu.
Creating file systems on boot environment .
Creating  file system for  in zone  on .
Creating  file system for  in zone  on .
Mounting file systems for boot environment .
Calculating required sizes of file systems for boot environment .
Populating file systems on boot environment .
Checking selection integrity.
Integrity check OK.
Populating contents of mount point .
Populating contents of mount point .
    < ..... >

Now to convince the live upgrade developers to clean up their error messages. :)

Viewing function calls with whocalls

While catching up with various opensolaris.org mailing lists, I came across a post that described the whocalls utility. This nifty little utility prints the stack frames leading up to each call to a specific function while a program runs, which can be super useful for debugging. To view all of the code paths that lead to printf being called, whocalls can be run with the name of the function to trace and the executable to run:

$ whocalls printf /bin/ls

printf(0x80541b0, 0x8067800, 0x80653a8)
        /usr/bin/ls:pentry+0x593
        /usr/bin/ls:pem+0xb1
        /usr/bin/ls:pdirectory+0x266
        /usr/bin/ls:main+0x70e
        /usr/bin/ls:_start+0x7a
printf(0x80541b0, 0x8067a48, 0x80653a8)
        /usr/bin/ls:pentry+0x593
        /usr/bin/ls:pem+0xb1
        /usr/bin/ls:pdirectory+0x266
        /usr/bin/ls:main+0x70e
        /usr/bin/ls:_start+0x7a
      < ..... >
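Since whocalls actually executes the target program using the runtime linker's auditing support, anything after the executable name is passed along as arguments; for example, to watch printf calls while ls lists a specific directory (the directory here is just an example):

$ whocalls printf /bin/ls /etc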

Now to do some research on the runtime linker’s auditing facilities in /usr/lib/link_audit/*!