One of the things I love about Solaris is its ability to generate a core file when a system panics. Core files are an invaluable resource for figuring out what caused a host to panic, and are often the first thing OS vendor support organizations will request when you open a support case. Linux provides the kdump, diskdump and netdump tools to collect a core file when a system panics, and although they are not quite as seamless as their Solaris counterpart, they work relatively well.
I’m not a huge fan of diskdump and netdump, since they have special prerequisites (e.g., operational networking or a supported storage controller) that need to be met to ensure a core file is captured. Kdump does not. Kdump works by reserving a chunk of memory for a crash kernel, and then booting into this kernel when a box panics. Since the crash kernel uses a chunk of memory that is unused and reserved for this specific purpose, it can be sure that the memory it is using won’t taint the previous kernel. This approach also provides full access to the previous kernel’s memory, which the crash kernel reads and writes out to disk or a network-accessible location.
To configure kdump to write core files to the /var/crash directory on local disk, you will first need to install the kexec-tools package:
yum install kexec-tools
Once the package is installed, you will need to add a crashkernel argument to the kernel boot arguments. This argument specifies the amount of memory to reserve for the crash kernel and, after the @, the offset at which to reserve it. It should look similar to the following (you may need to increase the amount of memory depending on the platform you are using):
title CentOS (2.6.18-128.el5)
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.18-128.el5 ro root=LABEL=/ console=ttyS0 crashkernel=128M@16M
        initrd /boot/initrd-2.6.18-128.el5.img
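After the next reboot, it can be worth confirming that the crashkernel argument actually made it onto the kernel command line. The following is a minimal sketch, assuming a POSIX shell. The helper name crashkernel_arg is my own and not part of kexec-tools, and it only parses a command line string; it does not prove the memory was really reserved:

```shell
# Print the value of the crashkernel= argument found in a kernel command
# line string, or return non-zero if the argument is absent.
# crashkernel_arg is a hypothetical helper name used for illustration.
crashkernel_arg() {
    for arg in $1; do
        case "$arg" in
            crashkernel=*)
                # Strip the "crashkernel=" prefix and print the rest.
                printf '%s\n' "${arg#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example with the boot arguments shown above:
crashkernel_arg "ro root=LABEL=/ console=ttyS0 crashkernel=128M@16M"
# prints 128M@16M
```

On a live host you would pass it the running kernel's command line, e.g. crashkernel_arg "$(cat /proc/cmdline)".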
To allow you to get a core file if a box hangs, you can enable sysrq magic key sequences by setting “kernel.sysrq” to 1 in /etc/sysctl.conf (you can also use the sysctl “-w” option to enable this feature on an active host):
kernel.sysrq = 1
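If you manage a lot of hosts, a quick audit of the persistent setting can help. This is a rough sketch, assuming a POSIX shell and a sysctl.conf-style file of "key = value" lines; sysrq_setting is a hypothetical helper name, not a standard tool:

```shell
# Print the last value assigned to kernel.sysrq in a sysctl.conf-style
# file. Commented-out lines are ignored because they do not start with
# the key. Prints nothing if the key is not set in the file.
# sysrq_setting is a hypothetical helper name used for illustration.
sysrq_setting() {
    sed -n 's/^[[:space:]]*kernel\.sysrq[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1" |
        tail -n 1
}

# Usage (not run here):
#   sysrq_setting /etc/sysctl.conf     # value that will apply at boot
#   cat /proc/sys/kernel/sysrq         # value active on the running kernel
```

Comparing the two values shows whether a host was changed at runtime without the change being made persistent, or the other way around.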
Once these settings are in place, you can enable the kdump service with the chkconfig and service commands:
chkconfig kdump on
service kdump start
If you want to verify that kdump is working, you can type “alt + sysrq + c” on the console, or echo a “c” character to the sysrq-trigger proc entry:
echo "c" > /proc/sysrq-trigger
SysRq : Trigger a crashdump
Linux version 2.6.18-128.el5 (firstname.lastname@example.org) ....
This will force a panic, which should result in a core file (vmcore) being written to a timestamped subdirectory under /var/crash:
total 317240
drwxr-xr-x 2 root root      4096 Jul 5 18:32 .
drwxr-xr-x 3 root root      4096 Jul 5 18:31 ..
-r-------- 1 root root 944203448 Jul 5 18:32 vmcore
If you are like me and prefer to be notified when a box panics, you can configure your log monitoring solution to look for the string “kdump: saved a vmcore” in /var/log/messages:
Jul 5 18:32:08 kvmnode1 kdump: saved a vmcore to /var/crash/2009-07-05-18:31
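If you don't have a log monitoring solution handy, a small shell helper run from cron can do the job. The following is a minimal sketch, assuming a POSIX shell and the message format shown above; vmcore_dirs is a hypothetical helper name, and how you deliver the notification (mail, pager, etc.) is left up to you:

```shell
# Print the dump directory from every "kdump: saved a vmcore to ..."
# line in a syslog-style file, one directory per line. Prints nothing
# if the file contains no such line.
# vmcore_dirs is a hypothetical helper name used for illustration.
vmcore_dirs() {
    sed -n 's/^.*kdump: saved a vmcore to //p' "$1"
}

# Usage (not run here):
#   vmcore_dirs /var/log/messages
```

For the log entry above, this would print /var/crash/2009-07-05-18:31, which is the directory holding the vmcore.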
Kdump is pretty sweet, and it’s definitely one of those technologies that every RAS-savvy engineer should be configuring on each server he or she deploys.