Software fails, and it often fails at the worst possible time. When a failure occurs I want to understand why, and I usually start by piecing together the events that led up to it. Some application issues can be root-caused by reviewing logs, but catastrophic crashes often require the admin to sit down with gdb and review a core file, if one exists.
Solaris has always led the charge when it comes to reliably creating core files during crashes. System crashes cause core files to be dumped to /var/crash, and the coreadm utility can be used to save application core files. Linux has been playing catch-up in this realm, and only in the past couple of years started providing diskdump and netdump to generate kernel crash dumps when an OOPS occurs (though in my experience these tools aren’t as reliable as their Solaris counterparts).
In the latest releases of Fedora, the automated bug-reporting tool (abrt) infrastructure was added to generate core files when processes crashed. The abrt website provides the following description for their automated crash dump collection infrastructure:
“abrt is a daemon that watches for application crashes. When a crash occurs, it collects the crash data (core file, application’s command line etc.) and takes action according to the type of application that crashed and according to the configuration in the abrt.conf configuration file. There are plugins for various actions: for example to report the crash to Bugzilla, to mail the report, to transfer the report via FTP or SCP, or to run a specified application.”
This is a welcome addition to the Linux family, and it works pretty well from my initial testing. To see abrt in action I created a script and sent it a SIGSEGV:
$ cat loop
#!/bin/bash
while :; do
sleep 1
done
$ ./loop &
[1] 22790
$ kill -SIGSEGV 22790
[1]+ Segmentation fault (core dumped) ./loop
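If you want to script the same experiment, something like the following sketch should do it. It assumes a POSIX shell; crash_status is a name I made up for illustration, and it signals a throwaway sleep process rather than the loop script. The shell reports death-by-signal as 128 plus the signal number, so SIGSEGV (signal 11) shows up as exit status 139.

```shell
# crash_status (hypothetical helper): start a background process,
# send it SIGSEGV, and print the exit status the shell sees.
crash_status() {
    sleep 30 &
    pid=$!
    kill -s SEGV "$pid"
    status=0
    # wait returns 128 + signal number when the child died from a signal
    wait "$pid" || status=$?
    echo "$status"
}
```

Calling crash_status should print 139 on Linux, which matches the "signal 11" that shows up later in the gdb output.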
Once the fault occurred I tailed /var/log/messages to get the crash dump location:
$ tail -20 /var/log/messages
Jan 17 09:52:08 theshack abrt[20136]: Saved core dump of pid 20126 (/bin/bash) to /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126 (487424 bytes)
Jan 17 09:52:08 theshack abrtd: Directory 'ccpp-2012-01-17-09:52:08-20126' creation detected
Jan 17 09:52:10 theshack abrtd: New problem directory /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126, processing
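Rather than eyeballing the log, the dump directory can be pulled out with sed. This is only a sketch: abrt_dump_dirs is a hypothetical helper name, and the pattern assumes the "Saved core dump of pid ..." message format shown above, which may differ between abrt versions.

```shell
# abrt_dump_dirs (hypothetical helper): print the dump directory from every
# "Saved core dump" line in the given files (or stdin if none are given).
abrt_dump_dirs() {
    sed -n 's/.*Saved core dump of pid [0-9]* (.*) to \([^ ]*\) (.*/\1/p' "$@"
}
```

For example, abrt_dump_dirs /var/log/messages lists one directory per crash that abrt recorded in that log.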
If I change into the directory referenced above I will see a wealth of debugging data, including a core file from the application (bash) that crashed:
$ cd /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126
$ ls -la
total 360
drwxr-x---. 2 abrt root 4096 Jan 17 09:52 .
drwxr-xr-x. 4 abrt abrt 4096 Jan 17 09:52 ..
-rw-r-----. 1 abrt root 5 Jan 17 09:52 abrt_version
-rw-r-----. 1 abrt root 4 Jan 17 09:52 analyzer
-rw-r-----. 1 abrt root 6 Jan 17 09:52 architecture
-rw-r-----. 1 abrt root 16 Jan 17 09:52 cmdline
-rw-r-----. 1 abrt root 4 Jan 17 09:52 component
-rw-r-----. 1 abrt root 487424 Jan 17 09:52 coredump
-rw-r-----. 1 abrt root 1 Jan 17 09:52 count
-rw-r-----. 1 abrt root 649 Jan 17 09:52 dso_list
-rw-r-----. 1 abrt root 2110 Jan 17 09:52 environ
-rw-r-----. 1 abrt root 9 Jan 17 09:52 executable
-rw-r-----. 1 abrt root 8 Jan 17 09:52 hostname
-rw-r-----. 1 abrt root 19 Jan 17 09:52 kernel
-rw-r-----. 1 abrt root 2914 Jan 17 09:52 maps
-rw-r-----. 1 abrt root 25 Jan 17 09:52 os_release
-rw-r-----. 1 abrt root 18 Jan 17 09:52 package
-rw-r-----. 1 abrt root 5 Jan 17 09:52 pid
-rw-r-----. 1 abrt root 4 Jan 17 09:52 pwd
-rw-r-----. 1 abrt root 51 Jan 17 09:52 reason
-rw-r-----. 1 abrt root 10 Jan 17 09:52 time
-rw-r-----. 1 abrt root 1 Jan 17 09:52 uid
-rw-r-----. 1 abrt root 5 Jan 17 09:52 username
-rw-r-----. 1 abrt root 40 Jan 17 09:52 uuid
-rw-r-----. 1 abrt root 620 Jan 17 09:52 var_log_messages
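Most of these files are tiny text files, so a quick way to get an overview is to print a handful of them with their names. The sketch below assumes the file names from the listing above; abrt_summary is a hypothetical helper, and files that are missing or unreadable are simply skipped.

```shell
# abrt_summary (hypothetical helper): print selected metadata files
# from an abrt problem directory as "name: contents" lines.
abrt_summary() {
    dir=$1
    for f in reason executable cmdline package kernel; do
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
}
```

Run it as a user who can read the directory (the files above are only readable by the abrt user and the root group), for example abrt_summary /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126.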
If we were debugging the crash we could poke around the saved environment files and then fire up gdb with the core dump to see where it crashed:
$ gdb /bin/bash coredump
GNU gdb (GDB) Fedora (7.3.50.20110722-10.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/bash...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 20126]
Core was generated by `/bin/bash ./loop'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000344aabb83e in waitpid () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bash-4.2.20-1.fc16.x86_64
(gdb) backtrace
#0 0x000000344aabb83e in waitpid () from /lib64/libc.so.6
#1 0x0000000000440679 in ?? ()
#2 0x00000000004417cf in wait_for ()
#3 0x0000000000432736 in execute_command_internal ()
#4 0x0000000000434dae in execute_command ()
#5 0x00000000004351c5 in ?? ()
#6 0x000000000043150b in execute_command_internal ()
#7 0x0000000000434dae in execute_command ()
#8 0x000000000041e0b1 in reader_loop ()
#9 0x000000000041c8ef in main ()
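The same backtrace can be produced without an interactive session, which is handy when you have several dumps to look at. This sketch uses gdb's -batch and -ex options; core_backtrace is a hypothetical wrapper name, and it assumes you pass the executable and the core file in that order.

```shell
# core_backtrace (hypothetical helper): run gdb non-interactively,
# print the backtrace for the given executable and core file, then exit.
core_backtrace() {
    gdb -batch -ex 'backtrace' "$1" "$2"
}
```

With the dump above that would be core_backtrace /bin/bash coredump, run from inside the problem directory.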
From there you could navigate through the saved memory image to see what caused the program to die an unexpected death. You may be asking yourself how exactly abrt works. After digging through the source code I figured out that it installs a custom hook using the core_pattern kernel entry point:
$ cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e
Each time a crash occurs the kernel will invoke the hook above, passing a number of arguments to the program listed in the core_pattern proc entry. From what I have gathered from the source so far, there are currently hooks for C/C++ and Python applications, and work is in progress to add support for Java. This is really cool stuff.
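To make sense of the arguments in a piped core_pattern, here is a small sketch that splits the pattern and annotates each % specifier. The descriptions are my paraphrase of the core(5) manual page, describe_core_pattern is a hypothetical helper name, and anything it does not recognize is reported as a literal argument.

```shell
# describe_core_pattern (hypothetical helper): explain a piped core_pattern.
describe_core_pattern() {
    case $1 in
        '|'*) ;;
        *) echo "not a pipe handler: $1"; return 1 ;;
    esac
    pattern=${1#|}          # drop the leading pipe character
    set -f                  # no glob expansion while word-splitting
    set -- $pattern
    set +f
    echo "handler: $1"
    shift
    for spec in "$@"; do
        case $spec in
            %s) desc='number of the signal causing the dump' ;;
            %c) desc='core file size soft limit of the crashing process' ;;
            %p) desc='PID of the dumped process' ;;
            %u) desc='real UID of the dumped process' ;;
            %g) desc='real GID of the dumped process' ;;
            %t) desc='time of the dump, in seconds since the epoch' ;;
            %e) desc='executable filename' ;;
            *)  desc='literal argument, passed through unchanged' ;;
        esac
        printf '%s  %s\n' "$spec" "$desc"
    done
}
```

On a live system you could feed it the current setting with describe_core_pattern "$(cat /proc/sys/kernel/core_pattern)".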