Recovering from Solaris hangs with the deadman timer


Periodically a nasty bug will rear its head in Solaris or the latest build of Nevada, and the operating system will hang for no apparent reason. Recovering from a hang typically requires the administrator to reboot the host, which delays getting the system back to a working state. One nice feature built into Solaris to assist with system hangs is the deadman timer. When enabled, the deadman timer fires a level 15 interrupt on each CPU every second and checks whether the kernel lbolt variable (the clock tick counter) is still being updated. If the deadman timer detects that the lbolt variable hasn’t changed for a period of time (the default is 50 seconds), it will induce a panic, which will cause a crash dump to be written to /var/crash (or the location you configured with dumpadm). To enable the deadman timer, you can set the “snooping” variable to 1 in /etc/system:

set snooping=1
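
If the countdown behavior is easier to follow in code, here is a minimal Python sketch of the idea. It is a toy model and not the kernel implementation: the names DeadmanModel, tick, and first_panicking_check are made up for illustration, and the timeout is passed in as a parameter rather than read from any real tunable.

```python
class DeadmanModel:
    """Toy model of a deadman countdown. All names here are hypothetical."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.countdown = timeout_seconds
        self.last_lbolt = None

    def tick(self, lbolt):
        """Simulate one once-per-second check.

        Returns True if this check would induce a panic.
        """
        if lbolt != self.last_lbolt:
            # The clock advanced since the last check, so reset the countdown.
            self.last_lbolt = lbolt
            self.countdown = self.timeout_seconds
            return False
        # The clock looks stuck, so count down toward the panic.
        self.countdown -= 1
        return self.countdown <= 0


def first_panicking_check(lbolt_samples, timeout_seconds):
    """Return the index of the first check that would panic, or None."""
    model = DeadmanModel(timeout_seconds)
    for index, lbolt in enumerate(lbolt_samples):
        if model.tick(lbolt):
            return index
    return None
```

Feeding the model a series of lbolt samples that stops changing, with a timeout of 3, makes it report a panic on the third consecutive check that sees the stale value. A series that keeps changing never triggers it.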

If you would like the deadman to wait longer (or shorter) than the default prior to inducing a panic, you can set the “snoop_interval” variable to the desired number of microseconds (the following example will induce a panic if the lbolt variable hasn’t been updated after 90 seconds):

set snoop_interval=90000000
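
Since it is easy to drop or add a zero when writing the value by hand, here is a small Python sketch that builds the /etc/system line from a timeout in seconds. It assumes snoop_interval is expressed in microseconds, and the helper name snoop_interval_line is invented for this example.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def snoop_interval_line(seconds):
    """Build a "set snoop_interval=..." line for /etc/system.

    Assumes the tunable is expressed in microseconds. The function
    name is hypothetical and exists only for this example.
    """
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive whole number")
    return f"set snoop_interval={seconds * MICROSECONDS_PER_SECOND}"


print(snoop_interval_line(90))
```

For a 90 second timeout this prints set snoop_interval=90000000.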

This is a great feature, and can help isolate nasty bugs that result in system hangs. Since this feature CAN result in a system panic, you should take this into account prior to using it. The author is not liable for misuse. ;)

This article was posted by Matty on 2007-02-11 10:16:00 -0400