Recovering from Solaris hangs with the deadman timer

Periodically a nasty bug will rear its head in Solaris or the latest build of Nevada, and the operating system will hang for no apparent reason. Recovering from a hang typically requires the administrator to reboot the host, which lengthens the time it takes to get the system back to a working state. One nice feature built into Solaris to assist with system hangs is the deadman timer. When enabled, the deadman timer causes a level 15 interrupt to fire on each CPU every second, and the interrupt handler checks whether the kernel lbolt variable is still being updated. If the deadman timer detects that the lbolt variable hasn’t changed for a period of time (the default is 50 seconds), it will induce a panic, which will cause a crash dump to be saved under /var/crash (or the location you configured with dumpadm). To enable the deadman timer, you can set the “snooping” variable to 1 in /etc/system:

set snooping=1

If you would like the deadman to wait more (or less) than the default prior to inducing a panic, you can set the “snoop_interval” variable to the desired interval in microseconds, i.e. the number of seconds * 1000000 (the following example will induce a panic if the lbolt variable hasn’t been updated after 90 seconds):

set snoop_interval=90000000

This is a great feature, and can help isolate nasty bugs that result in system hangs. Since this feature CAN result in a system panic, you should take this into account prior to using it. The author is not liable for misuse. ;)

Not grep!

While reviewing some shell scripts last week, I saw the infamous find | grep:

$ /usr/bin/find /foo -type f | egrep -v '\.inp$'

I am not really sure why more people don’t leverage the logic operations built into find:

$ /usr/bin/find /foo -type f -not -name \*.inp

This saves a fork() and exec(), and should be a bit faster. I am curious if folks use grep because it’s easier to read, or because they don’t know about the logic operations built into find. I shall need to investigate …
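To convince myself the two forms really are interchangeable, here is a minimal sketch that runs both against a throwaway directory tree. The file names are made up for illustration, and I use “!” rather than “-not” on the assumption that the script may need to run on a find that only accepts the POSIX spelling:

```shell
#!/bin/sh
# Build a scratch tree (all names are hypothetical) to compare both approaches.
dir=$(mktemp -d)
mkdir "$dir/sub"
touch "$dir/a.inp" "$dir/b.txt" "$dir/sub/c.inp" "$dir/sub/d.log"

# Pipeline version: find lists every file, grep drops names ending in .inp.
with_grep=$(find "$dir" -type f | grep -v '\.inp$' | sort)

# Single-process version: find filters by itself; '!' negates the -name test.
with_find=$(find "$dir" -type f ! -name '*.inp' | sort)

if [ "$with_grep" = "$with_find" ]; then
    echo "same result"
fi

# Show the surviving files relative to the scratch directory.
echo "$with_find" | sed "s|^$dir/||"

rm -rf "$dir"
```

Both versions should leave only b.txt and sub/d.log, so the script prints “same result” followed by those two paths.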

Using the ZFS RAIDZ2 and hot spare features on old storage arrays

I support some antiquated Sun disk arrays (D1000s, T3s, A5200s, etc), and due to the age of the hardware, it is somewhat prone to failure. FMA helps prevent outages due to CPU and memory problems, but it doesn’t yet support diagnosing disk drive errors. Since disk drives are prone to failure, I have started creating RAIDZ2 (dual parity RAIDZ) pools with multiple hot spares to protect our data. RAIDZ2 and hot spares are available in the 11/06 release of Solaris 10, and are both super easy to configure.

To create a RAIDZ2 pool, you can run the zpool utility with the “create” subcommand, the name of the pool to create, the “raidz2” keyword, and the disks to add to the pool:

$ zpool create rz2pool raidz2 c1t9d0 c1t10d0 c1t12d0 c2t1d0 c2t2d0

Once the pool is created, the layout can be viewed with the zpool utility:

$ zpool status

  pool: rz2pool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0

errors: No known data errors

RAIDZ2 allows any two disks in a RAIDZ2 group to fail without data loss, which is ideal for sites that are more concerned with data integrity than performance. On my ancient storage subsystems, I like to combine RAIDZ2 with several hot spares to allow the pool to automatically recover each time a disk bites the dust. To add one or more hot spares to a pool, you can run the zpool utility with the “add” subcommand, the name of the pool, the “spare” keyword, and the device to turn into a spare:

$ zpool add rz2pool spare c2t3d0

$ zpool status

  pool: rz2pool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
        spares
          c2t3d0     AVAIL   

errors: No known data errors

Luckily we are using most of the storage listed above in development, QE and testing environments, so performance isn’t super critical (downtime is much more costly).
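Since I lean on the spares so heavily, it is handy to be able to pull them out of the status output from a script. The following is a rough sketch, not an official interface: the list_spares function name is made up, and the parsing rules are an assumption based purely on the layout of the zpool status output shown above.

```shell
#!/bin/sh
# list_spares: read `zpool status` text on stdin and print each hot spare
# with its state. The parsing is an assumption based on the output layout
# shown in this post, not on a documented format.
list_spares() {
    awk '
        $1 == "pool:"   { in_spares = 0 }         # a new pool resets the state
        $1 == "spares"  { in_spares = 1; next }   # start of the spares block
        $1 == "errors:" { in_spares = 0 }         # trailer ends the config
        in_spares && NF >= 2 { print $1, $2 }
    '
}

# Sample input trimmed from the status output in this post.
sample='  pool: rz2pool
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        rz2pool      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
        spares
          c2t3d0     AVAIL

errors: No known data errors'

spares=$(printf '%s\n' "$sample" | list_spares)
echo "$spares"

# On a live system the same function could be fed directly:
#   zpool status rz2pool | list_spares
```

A spare that has been pulled into service should show up with a state other than AVAIL, which makes this a cheap thing to check from cron.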

Removing two RPMs with the same name

While performing an upgrade this weekend, I ran into an issue where a few packages were showing up twice in the rpm query output:

$ rpm -q -a | grep frysk
frysk-0.0.1.2006.10.02.rh1-1.fc6
frysk-0.0.1.2006.10.02.rh1-1.fc6

This was causing several package installations to fail, and was preventing me from upgrading several packages with known security issues. When I tried to remove the frysk package, I was greeted with the following error message:

$ rpm -e frysk-0.0.1.2006.10.02.rh1-1.fc6
error: "frysk-0.0.1.2006.10.02.rh1-1.fc6" specifies multiple packages

After reading through the rpm manual page, I came across the “--allmatches” option. This option applies the operation to every installed package that matches the name you specify, and allowed me to finally remove both occurrences of the frysk package:

$ rpm -e --allmatches frysk-0.0.1.2006.10.02.rh1-1.fc6

RPM can be such a pain sometimes, and I am still baffled how it got itself into this state in the first place.
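If you want to check whether other packages are in the same boat, a sort and uniq pipeline will flag every name that shows up more than once. This is only a sketch: the find_dupes function name is my own, and the sample list (including the bash entry) is made up to stand in for real rpm query output.

```shell
#!/bin/sh
# find_dupes: read one package name per line on stdin and print every
# name that occurs more than once (uniq -d needs sorted input).
find_dupes() {
    sort | uniq -d
}

# Sample list standing in for `rpm -q -a` output; the bash entry is made up.
sample='frysk-0.0.1.2006.10.02.rh1-1.fc6
bash-3.1-16.1
frysk-0.0.1.2006.10.02.rh1-1.fc6'

dupes=$(printf '%s\n' "$sample" | find_dupes)
echo "$dupes"

# On a live system:
#   rpm -q -a | find_dupes
```

On a multilib system the same name can legitimately appear once per architecture, so treat anything this prints as a candidate to investigate rather than something to remove blindly.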

Customizing PHP builds

A few weeks back I helped a friend build PHP on a server with a non-standard directory structure. Changing the structure to use common defaults wasn’t an option, so we needed to adjust the PHP configure script to point to the pertinent places. Here is what we came up with:

$ export CPPFLAGS="-I/home/apps/include -I/home/apps/include/mysql"

$ export LDFLAGS="-L/home/apps/lib -L/home/apps/lib/mysql"

$ ./configure \
     --prefix=/home/apps/sfw/php-5.1.4 \
     --with-apxs2=/home/apps/httpd/bin/apxs \
     --with-libxml=/home/apps \
     --with-libxml-dir=/home/apps \
     --with-mysql=/home/apps \
     --with-zlib=/home/apps

This will build PHP using the apxs utility that resides in /home/apps/httpd/bin/apxs, and configure will look for the MySQL, libxml and zlib libraries and headers under /home/apps/lib and /home/apps/include.
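If the non-standard root is likely to move, one option is to derive the flags from a single variable so there is only one place to change. This is just a sketch: APPS_ROOT is a hypothetical variable name, and the directory check is a convenience I added, not something the PHP configure script requires.

```shell
#!/bin/sh
# APPS_ROOT is a hypothetical variable holding the non-standard prefix.
APPS_ROOT=/home/apps

# Derive the preprocessor and linker search paths from the single prefix.
CPPFLAGS="-I$APPS_ROOT/include -I$APPS_ROOT/include/mysql"
LDFLAGS="-L$APPS_ROOT/lib -L$APPS_ROOT/lib/mysql"
export CPPFLAGS LDFLAGS

# Warn early (on stderr) if a directory is missing, instead of finding
# out halfway through the configure run.
for d in "$APPS_ROOT/include" "$APPS_ROOT/lib"; do
    [ -d "$d" ] || echo "warning: $d does not exist" >&2
done

# Print the values so they can be checked before running ./configure.
echo "CPPFLAGS=$CPPFLAGS"
echo "LDFLAGS=$LDFLAGS"
```

With the variables exported, the configure invocation itself stays the same as the one shown earlier.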

Accessing Windows shares from the Solaris/Linux command line

Periodically I need to access a Windows share from a Solaris or Linux box. If Samba is installed on the system, this is easy to do with the smbclient utility. To access the Windows server named “milton” from the command line, you can run smbclient with the “-U” option, the name of the user to authenticate with, and the name of the server and share to access:

$ smbclient -U "domain\matty" //milton/foo

In this example, I am authenticating as the user matty in the domain “domain,” and accessing the share foo on the server milton. If smbclient is unable to resolve the server name, you will need to make sure that you have defined a WINS server, or that the server is listed in the lmhosts file. You can find the address of the WINS server by running ipconfig /all on a Windows desktop, or by reviewing the LAN traffic with Ethereal. To define a WINS server, you can add a line similar to the following to the smb.conf file:

wins server = 1.2.3.4

If you don’t want to use WINS to resolve names, you can add an entry similar to the following to the lmhosts file:

192.168.1.200 milton

Once you are connected to the server, you will be greeted with an “smb: \>” prompt. This prompt allows you to feed commands to the server, such as “pwd,” “dir,” “mget,” and “prompt.” To retrieve all of the files in the directory foo1, I can “cd” into the foo1 directory, use “prompt” to disable interactive prompts, and then run “mget” to retrieve all files in that directory:

smb: \> pwd
Current directory is \\milton\foo

smb: \> dir

received 10 entries (eos=1)
  .                                  DA        0  Mon May 22 07:19:21 2006
  ..                                 DA        0  Mon May 22 07:19:21 2006
  foo1                               DA        0  Sun Dec 11 04:51:12 2005
  foo2                               DA        0  Thu Nov  9 09:48:40 2006
         < ..... >

smb: \> cd foo1

smb: \foo1\> prompt
prompting is now off

smb: \foo1\> mget *

received 38 entries (eos=1)
getting file \foo1\yikes.tar of size 281768 as yikes.tar (411.3 kb/s) (average 411.3 kb/s)
    < ..... >

smb: \foo1\> exit

The smbclient manual page documents all of the available commands, and provides a great introduction to this super useful utility. If you bump into any issues connecting to a remote Windows server, you can add “-d” and a debug level (I like debug level 3) to the smbclient command line. This is perfect for debugging connectivity issues.
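The interactive session above can also be scripted, since smbclient accepts a semicolon-separated list of commands through its “-c” option. Here is a small sketch that builds that command string; the mget_commands function name is made up, and the smbclient invocation is left as a comment because it needs a reachable server:

```shell
#!/bin/sh
# mget_commands: build the semicolon-separated command string that
# smbclient accepts through its -c option. The function name is made up.
mget_commands() {
    # cd into the directory, toggle prompting off, then fetch everything.
    printf 'cd %s; prompt; mget *\n' "$1"
}

cmds=$(mget_commands foo1)
echo "$cmds"

# Not run here, since it needs a reachable server and valid credentials:
#   smbclient -U 'domain\matty' //milton/foo -c "$cmds"
```

Keep in mind that “prompt” is a toggle, so this assumes prompting starts out enabled, as it did in the interactive session.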