Blog O' Matty


Creating more /dev/md[0-9]+ entries

This article was posted by Matty on 2007-02-11 15:13:00 -0400

While upgrading my desktop this weekend to Fedora Core 6, I received the following error while attempting to start one of my md arrays:

$ /sbin/mdadm -A /dev/md3 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
mdadm: error opening /dev/md3: No such file or directory

To fix the issue, I had to cd into /dev and add some additional md entries with the MAKEDEV executable:

$ cd /dev && ./MAKEDEV md
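
If you want to confirm the device nodes were actually created before retrying the assemble, a quick ls of /dev should show the new md entries (the exact set of minor numbers MAKEDEV creates may vary between releases):

$ ls -l /dev/md[0-9]*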

Once I ran MAKEDEV, mdadm was able to start up the array:

$ /sbin/mdadm -A /dev/md3 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
mdadm: /dev/md3 has been started with 6 drives.

Instead of going through the hassle of running MAKEDEV, it looks like you can also use the mdadm “-a” option:

-a, --auto{=no,yes,md,mdp,part,p}{NN}
        Instruct mdadm to create the device file if needed, possibly
        allocating an unused minor number. "md" causes a non-partitionable
        array to be used. "mdp", "part" or "p" causes a partitionable
        array (2.6 and later) to be used. "yes" requires the named md
        device to have a ’standard’ format, and the type and minor number
        will be determined from this. See DEVICE NAMES below.
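
Based on the description above, passing “--auto=yes” during assembly should let mdadm create the missing device file on its own, so something along these lines would have avoided the MAKEDEV step entirely:

$ /sbin/mdadm -A --auto=yes /dev/md3 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh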

Recovering from Solaris hangs with the deadman timer

This article was posted by Matty on 2007-02-11 10:16:00 -0400

Periodically a nasty bug will rear its head with Solaris or the latest build of Nevada, and the operating system will hang for no apparent reason. Recovering from a hang typically requires the administrator to reboot the host, which can delay the time it takes to get the system back to a working state. One nice feature built into Solaris to assist with system hangs is the deadman timer. When enabled, the deadman timer will cause a level 15 interrupt to fire on each CPU every second, which will in turn cause the kernel lbolt variable to be updated. If the deadman timer detects that the lbolt variable hasn’t changed for a period of time (the default is 500 seconds), it will induce a panic, which will cause a core file to be written to /var/crash (or the location you configured with dumpadm). To enable the deadman timer, you can set the “snooping” variable to 1 in /etc/system:

set snooping=1

If you would like the deadman to wait more (or less) than 500 seconds prior to inducing a panic, you can set the “snoop_interval” variable to the desired number of seconds multiplied by 100000 (the following example will induce a panic if the lbolt variable hasn’t been updated after 90 seconds):

set snoop_interval=9000000
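
After the box is rebooted with these settings, you should be able to verify that they took effect by reading the kernel variables with mdb (this assumes the mdb debugger is installed on the host):

$ echo "snooping/D" | mdb -k
$ echo "snoop_interval/D" | mdb -k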

This is a great feature, and can help isolate nasty bugs that result in system hangs. Since this feature CAN result in a system panic, you should take this into account prior to using it. The author is not liable for misuse. ;)

Not grep!

This article was posted by Matty on 2007-02-09 22:11:00 -0400

While reviewing some shell scripts last week, I saw the infamous find | grep:

$ /usr/bin/find /foo -type f | egrep -v \.inp

I am not really sure why more people don’t leverage the logic operations built into find:

$ /usr/bin/find /foo -type f -not -name '*.inp'

This saves a fork() and exec(), and should be a bit faster. I am curious if folks use grep because it’s easier to read, or because they don’t know about the logic operations built into find. I shall need to investigate …
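
If anyone wants to check the “should be a bit faster” claim against their own data, timing the two pipelines with the shell’s time keyword is an easy (if unscientific) way to compare them:

$ time /usr/bin/find /foo -type f | egrep -v \.inp > /dev/null
$ time /usr/bin/find /foo -type f -not -name '*.inp' > /dev/null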

Using the ZFS RAIDZ2 and hot spare features on old storage arrays

This article was posted by Matty on 2007-02-09 21:58:00 -0400

I support some antiquated Sun disk arrays (D1000s, T3s, A5200s, etc), and due to the age of the hardware, it is somewhat prone to failure. FMA helps prevent outages due to CPU and memory problems, but it doesn’t yet support diagnosing disk drive errors. Since disk drives are prone to failure, I have started creating RAIDZ2 (dual parity RAIDZ) pools with multiple hot spares to protect our data. RAIDZ2 and hot spares are available in the 11/06 release of Solaris 10, and are both super easy to configure.

To create a RAIDZ2 pool, you can run the zpool utility with the “create” option, the “raidz2” keyword, the name of the pool to create, and the disks to add to the pool:

$ zpool create rz2pool raidz2 c1t9d0 c1t10d0 c1t12d0 c2t1d0 c2t2d0

Once the pool is created, the layout can be viewed with the zpool utility:

$ zpool status

  pool: rz2pool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0

errors: No known data errors

RAIDZ2 allows 2 disks in a pool to fail without data loss, which is ideal for sites that are more concerned with data integrity than performance. On my ancient storage subsystems, I like to combine RAIDZ2 with several hot spares to allow the pool to automatically recover each time a disk bites the dust. To add one or more hot spares to a pool, you can run the zpool utility with the “add” option, the “spare” keyword, and the device to turn into a spare:

$ zpool add rz2pool spare c2t3d0

$ zpool status

  pool: rz2pool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
        spares
          c2t3d0     AVAIL

errors: No known data errors

Luckily we are using most of the storage listed above in development, QE and testing environments, so performance isn’t super critical (downtime is much more costly).
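
Since the whole point of the hot spares is to kick in without any manual intervention, it is also handy to have a quick way to see whether anything has gone sideways. The zpool “status” option takes a “-x” flag that only reports on pools with problems, and it should print something similar to the following when everything is healthy:

$ zpool status -x
all pools are healthy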

Removing two RPMs with the same name

This article was posted by Matty on 2007-02-05 21:08:00 -0400

While performing an upgrade this weekend, I ran into an issue where a few packages were showing up twice in the rpm query output:

$ rpm -q -a | grep frysk
frysk-0.0.1.2006.10.02.rh1-1.fc6
frysk-0.0.1.2006.10.02.rh1-1.fc6

This was causing several package installations to fail, and was preventing me from upgrading several packages with known security issues. When I tried to remove the frysk package, I was greeted with the following error message:

$ rpm -e frysk-0.0.1.2006.10.02.rh1-1.fc6
error: "frysk-0.0.1.2006.10.02.rh1-1.fc6" specifies multiple packages

After reading through the rpm manual page, I came across the “--allmatches” option. This option causes the operation to apply to all packages that match the given name, and allowed me to finally remove both occurrences of the frysk package:

$ rpm -e --allmatches frysk-0.0.1.2006.10.02.rh1-1.fc6
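
If you want to double check that both copies are actually gone after the erase, a quick query should come back empty:

$ rpm -q frysk
package frysk is not installed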

RPM can be such a pain sometimes, and I am still baffled how it got itself into this state in the first place.