Using the automated bug-reporting tool (abrt) to generate core dumps when a Linux process fails

Software fails, and it often fails at the worst possible time. When a failure occurs I want to understand why, and I will usually start by piecing together the events that led up to the issue. Some application issues can be root-caused by reviewing logs, but catastrophic crashes often require the admin to sit down with gdb and review a core file, if one exists.

Solaris has always led the charge when it comes to reliably creating core files during crashes. System crashes cause core files to be dumped to /var/crash, and the coreadm utility can be used to save application core files. Linux has been playing catch-up in this realm, and only in the past couple of years started providing diskdump and netdump to generate kernel crash dumps when an OOPS occurs (in my experience these tools aren't as reliable as their Solaris counterparts, though).

In the latest releases of Fedora, the automated bug-reporting tool (abrt) infrastructure was added to generate core files when processes crash. The abrt website provides the following description of its automated crash dump collection infrastructure:

“abrt is a daemon that watches for application crashes. When a crash occurs, it collects the crash data (core file, application’s command line etc.) and takes action according to the type of application that crashed and according to the configuration in the abrt.conf configuration file. There are plugins for various actions: for example to report the crash to Bugzilla, to mail the report, to transfer the report via FTP or SCP, or to run a specified application.”

This is a welcome addition to the Linux family, and it works pretty well from my initial testing. To see abrt in action I created a script and sent it a SIGSEGV:

$ cat loop

#!/bin/bash

while :
do
   sleep 1
done

$ ./loop &
[1] 22790

$ kill -SIGSEGV 22790
[1]+ Segmentation fault (core dumped) ./loop

Once the fault occurred I tailed /var/log/messages to get the crash dump location:

$ tail -20 /var/log/messages

Jan 17 09:52:08 theshack abrt[20136]: Saved core dump of pid 20126 (/bin/bash) to /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126 (487424 bytes)
Jan 17 09:52:08 theshack abrtd: Directory 'ccpp-2012-01-17-09:52:08-20126' creation detected
Jan 17 09:52:10 theshack abrtd: New problem directory /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126, processing

If I change into the directory referenced above I will see a wealth of debugging data, including a core file from the application (bash) that crashed:

$ cd /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126

$ ls -la

total 360
drwxr-x---. 2 abrt root   4096 Jan 17 09:52 .
drwxr-xr-x. 4 abrt abrt   4096 Jan 17 09:52 ..
-rw-r-----. 1 abrt root      5 Jan 17 09:52 abrt_version
-rw-r-----. 1 abrt root      4 Jan 17 09:52 analyzer
-rw-r-----. 1 abrt root      6 Jan 17 09:52 architecture
-rw-r-----. 1 abrt root     16 Jan 17 09:52 cmdline
-rw-r-----. 1 abrt root      4 Jan 17 09:52 component
-rw-r-----. 1 abrt root 487424 Jan 17 09:52 coredump
-rw-r-----. 1 abrt root      1 Jan 17 09:52 count
-rw-r-----. 1 abrt root    649 Jan 17 09:52 dso_list
-rw-r-----. 1 abrt root   2110 Jan 17 09:52 environ
-rw-r-----. 1 abrt root      9 Jan 17 09:52 executable
-rw-r-----. 1 abrt root      8 Jan 17 09:52 hostname
-rw-r-----. 1 abrt root     19 Jan 17 09:52 kernel
-rw-r-----. 1 abrt root   2914 Jan 17 09:52 maps
-rw-r-----. 1 abrt root     25 Jan 17 09:52 os_release
-rw-r-----. 1 abrt root     18 Jan 17 09:52 package
-rw-r-----. 1 abrt root      5 Jan 17 09:52 pid
-rw-r-----. 1 abrt root      4 Jan 17 09:52 pwd
-rw-r-----. 1 abrt root     51 Jan 17 09:52 reason
-rw-r-----. 1 abrt root     10 Jan 17 09:52 time
-rw-r-----. 1 abrt root      1 Jan 17 09:52 uid
-rw-r-----. 1 abrt root      5 Jan 17 09:52 username
-rw-r-----. 1 abrt root     40 Jan 17 09:52 uuid
-rw-r-----. 1 abrt root    620 Jan 17 09:52 var_log_messages

If we were debugging the crash we could poke around the saved environment files and then fire up gdb with the core dump to see where it crashed:

$ gdb /bin/bash coredump

GNU gdb (GDB) Fedora (7.3.50.20110722-10.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 20126]
Core was generated by `/bin/bash ./loop'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000344aabb83e in waitpid () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bash-4.2.20-1.fc16.x86_64
(gdb) backtrace
#0  0x000000344aabb83e in waitpid () from /lib64/libc.so.6
#1  0x0000000000440679 in ?? ()
#2  0x00000000004417cf in wait_for ()
#3  0x0000000000432736 in execute_command_internal ()
#4  0x0000000000434dae in execute_command ()
#5  0x00000000004351c5 in ?? ()
#6  0x000000000043150b in execute_command_internal ()
#7  0x0000000000434dae in execute_command ()
#8  0x000000000041e0b1 in reader_loop ()
#9  0x000000000041c8ef in main ()

From there you could navigate through the saved memory image to see what caused the program to die an unexpected death. Now you may be asking yourself how exactly abrt works. After digging through the source code, I figured out that it installs a custom hook using the core_pattern kernel entry point:

$ cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e

Each time a crash occurs, the kernel invokes the hook above, passing a number of arguments to the program listed in the core_pattern proc entry. From what I have gathered from the source so far, there are currently hooks for C/C++ and Python applications, and work is in progress to add support for Java. This is really cool stuff!
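
To get a feel for how a piped core_pattern handler works in general, here is a minimal hook of my own devising (the script name and paths below are made up, and this is not abrt's actual helper). The kernel runs the program named after the "|", expands %-specifiers such as %s (signal number) and %p (pid), and writes the core image to the handler's stdin (see core(5)):

$ cat /usr/local/bin/simple-core-hook

#!/bin/bash
# $1 = signal number, $2 = pid of the crashing process; the core image
# arrives on stdin, so just save it somewhere useful
cat > "/var/tmp/core.${2}.sig${1}"

$ echo '|/usr/local/bin/simple-core-hook %s %p' | sudo tee /proc/sys/kernel/core_pattern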

Using exec-shield to protect your Linux servers from stack, heap and integer overflows

I've been a long-time follower of the OpenBSD project and their amazing work on detecting stack and heap overflows and protecting the kernel and applications from them. Several of the concepts developed by the OpenBSD team were made available in Linux by way of the exec-shield project. Of the many useful security features that are part of exec-shield, the two that can be controlled by a SysAdmin are kernel address space randomization and the exec-shield operating mode.

Address space randomization is controlled through the kernel.randomize_va_space sysctl tunable, which defaults to 1 on my CentOS systems:

$ sysctl kernel.randomize_va_space
kernel.randomize_va_space = 1
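
One quick way to see address space randomization in action is to print the stack mapping of a short-lived process a couple of times; with randomization enabled, the addresses should change on every run (the ranges below are illustrative only, and your output will differ):

$ awk '/\[stack\]/ {print $1}' /proc/self/maps
7fff9c0d8000-7fff9c0f9000

$ awk '/\[stack\]/ {print $1}' /proc/self/maps
7ffc85a31000-7ffc85a52000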

The exec-shield operating mode is controlled through the kernel.exec-shield sysctl value, and can be set to one of the following four modes (the descriptions below came from Steve Grubb’s excellent post on exec-shield operating modes):

– A value of 0 completely disables ExecShield and Address Space Layout Randomization
– A value of 1 enables them ONLY if the application bits for these protections are set to “enable”
– A value of 2 enables them by default, except if the application bits are set to “disable”
– A value of 3 enables them always, whatever the application bits

The default exec-shield value on my CentOS servers is 1, which enables exec-shield for applications that have been compiled to support it:

$ sysctl kernel.exec-shield
kernel.exec-shield = 1
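
If you want to run with a different setting, sysctl -w changes the value on the running system, and adding the entry to /etc/sysctl.conf makes it persistent across reboots (the value below is just an example):

$ sudo sysctl -w kernel.exec-shield=2
kernel.exec-shield = 2

$ sudo sh -c 'echo "kernel.exec-shield = 2" >> /etc/sysctl.conf'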

To view the list of running processes that have exec-shield enabled, you can run Ingo Molnar and Ulrich Drepper’s lsexec utility:

$ lsexec --all | more

init, PID      1, UID root: no PIE, no RELRO, execshield enabled
httpd, PID  11689, UID apache: DSO, no RELRO, execshield enabled
httpd, PID  11691, UID apache: DSO, no RELRO, execshield enabled
httpd, PID  11692, UID apache: DSO, no RELRO, execshield enabled
httpd, PID  11693, UID apache: DSO, no RELRO, execshield enabled
httpd, PID  12224, UID apache: DSO, no RELRO, execshield enabled
httpd, PID  12236, UID apache: DSO, no RELRO, execshield enabled
pickup, PID  16181, UID postfix: DSO, partial RELRO, execshield enabled
appLoader, PID   2347, UID root: no PIE, no RELRO, execshield enabled
auditd, PID   2606, UID root: DSO, partial RELRO, execshield enabled
audispd, PID   2608, UID root: DSO, partial RELRO, execshield enabled
restorecond, PID   2629, UID root: DSO, partial RELRO, execshield enabled

In this day and age of continuous security threats there is little reason not to be using these amazing technologies. When you combine exec-shield, SELinux, proper patching, and security best practices, you can really limit the attack vectors that can be used to break into your systems.

Fcron, a feature rich cron and anacron replacement

I've been looking at some open source scheduling packages, and while doing my research I came across the fcron package. Fcron is a replacement for Vixie cron and anacron, and provides a number of super useful features (a sample entry illustrating a few of them follows the list):

– Run jobs based on the system load average.
– Serialize jobs.
– Set the nice value of the process that is fork()’ed.
– Options to control how results are mailed.
– And several more …
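
To give a rough idea of how these options look in practice, here is a sketch of an fcrontab entry. The job path and thresholds are made up, and you should double-check the exact option names and syntax against fcrontab(5) for your version; options are listed before the usual five time fields, and the "&" prefix scopes them to that single line:

# Run niced, serialized with other serial jobs, only when the 5-minute
# load average is below 0.9, and don't mail the output
&nice(10),serial(true),lavg5(0.9),mail(false) 0 4 * * * /usr/local/bin/nightly-report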

My initial testing has been positive, and I definitely plan to keep this package in my back pocket. I'm still looking at various open source schedulers, and if you have any experience in this area please leave me a comment. I'm curious which solutions have worked well for my readers. :)

Configuring wget to use a proxy server

Periodically I need to download files on servers that aren't directly connected to the Internet. If the server has wget installed, I will usually execute it, passing it the URL of the resource I want to retrieve:

$ wget prefetch.net/iso.dvd

If the system resides behind a proxy server, the http_proxy environment variable needs to be set to the server name and port of the proxy:

$ export http_proxy=proxy.prefetch.net:3128

If your proxy requires a username and password you can pass those on the command line:

$ wget --proxy-user=foo --proxy-password=bar prefetch.net

Or you can set the proxy-user and proxy-password variables in your ~/.wgetrc file.
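
The same settings can also live in ~/.wgetrc so you don't have to supply them on every invocation (the host, port and credentials below are placeholders):

$ cat ~/.wgetrc

use_proxy = on
http_proxy = http://proxy.prefetch.net:3128/
proxy_user = foo
proxy_password = bar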

Four super cool utilities that are part of the psmisc package

There are a ton of packages available for the various Linux distributions. Some of these packages aren't as well known as others, though they contain some crazy awesome utilities. One package that fits into this category is psmisc. Psmisc contains several tools that can be used to print process statistics, look at file descriptor activity, see which process IDs have a file or directory open, kill all processes that match a pattern, and print the process table as a tree. Here is the list of utilities provided by psmisc:

$ rpm -q -l psmisc | grep bin
/sbin/fuser
/usr/bin/killall
/usr/bin/peekfd
/usr/bin/prtstat
/usr/bin/pstree
/usr/bin/pstree.x11

Starting at the top, the fuser tool will allow you to view the processes that have a file or directory open:

$ fuser -c -v /home/matty

/home/matty:         root     kernel mount /home
                     matty      1652 F.c.. sh
                     matty      1723 ....m imsettings-daem
                     matty      1810 F.c.m gnome-screensav
                     matty      1812 F.c.. xfce4-session
                     matty      1818 F.c.. xfsettingsd
                     .....

Peekfd will display reads and writes to a set of file descriptors passed on the command line, or all of the file descriptors opened by a process:

$ peekfd 24983

reading fd 0:
c
writing fd 2:
c
reading fd 0:

writing fd 2:
 [08]  [1b] [K

Next up we have prtstat. This super handy tool will display the contents of /proc/<pid>/stat in a nicely formatted ASCII display:

$ prtstat 16860

Process: bash          		State: S (sleeping)
  CPU#:  0  		TTY: 136:1	Threads: 1
Process, Group and Session IDs
  Process ID: 16860		  Parent ID: 16855
    Group ID: 16860		 Session ID: 16860
  T Group ID: 16860

Page Faults
  This Process    (minor major):      996         1
  Child Processes (minor major):     6651         0
CPU Times
  This Process    (user system guest blkio):   0.00   0.01   0.00   0.00
  Child processes (user system guest):        14.30  21.06   0.00
Memory
  Vsize:       119 MB    
  RSS:         716 kB     		 RSS Limit: 18446744073709 MB
  Code Start:  0x400000  		 Code Stop:  0x4d8528  
  Stack Start: 0x7fff4b027150
  Stack Pointer (ESP): 0x7fff4b025dd8	 Inst Pointer (EIP): 0x3d0c0d3090
Scheduling
  Policy: normal
  Nice:   0 		 RT Priority: 0 (non RT)

And the last tool I love from this package is pstree. Pstree will print a tree-like structure of the process table, allowing you to easily see what started a given process:

$ pstree --ascii | more

systemd-+-NetworkManager-+-dhclient
        |                `-2*[{NetworkManager}]
        |-Terminal-+-bash-+-more
        |          |      `-pstree
        |          |-bash
        |          |-bash---ssh
        |          |-gnome-pty-helpe
        |          `-{Terminal}
        |-Thunar---2*[{Thunar}]
        |-VBoxSVC-+-VBoxNetDHCP
        |         |-VirtualBox---22*[{VirtualBox}]
        |         |-2*[VirtualBox---21*[{VirtualBox}]]
        |         `-10*[{VBoxSVC}]
        .....

So there you go, my friends: four amazing tools that don't get the recognition they deserve. Which packages do you use that don't get the street cred they deserve?

Figuring out how long a Linux process has been alive

I've bumped into a few problems in the past where processes that were supposed to be short-lived encountered an issue and never died. Over time these processes would build up, and if it wasn't for a cleanup task I developed, the process table would eventually have filled up (the bug that caused this was eventually fixed).

Now how would you go about checking to see how long a process has been alive? There are actually several ways to get the time a process started on a Linux host. You can look at the fifth field (STIME) in the SysV-style ps output:

$ ps -ef | tail -5

matty    29501 28486  0 Nov02 pts/6    00:00:00 ssh 192.168.56.101
matty    29666 28085  0 Nov02 pts/7    00:00:00 bash
matty    29680 29666  0 Nov02 pts/7    00:00:00 vim
root     29854     2  0 Oct31 ?        00:00:00 [kdmflush]
matty    29986 20521  0 Nov02 ?        00:00:07 java

You can get an abbreviated start date with the "bsdstart" output option:

$ ps ax -o pid,command,bsdstart,bsdtime | tail -5

29501 ssh 192.168.56.101          Nov  2   0:00
29666 bash                        Nov  2   0:00
29680 vim                         Nov  2   0:00
29854 [kdmflush]                  Oct 31   0:00
29986 java                        Nov  2   0:07

Or you can get the full date and time a process started with the "lstart" option:

$ ps ax -o pid,command,lstart | tail -5

29501 ssh 192.168.56.101          Wed Nov  2 14:16:23 2011
29666 bash                        Wed Nov  2 14:56:48 2011
29680 vim                         Wed Nov  2 14:56:49 2011
29854 [kdmflush]                  Mon Oct 31 10:56:54 2011
29986 java                        Wed Nov  2 15:54:05 2011

Now you may be asking yourself where ps gets the time from. It opens /proc/<pid>/stat and reads the starttime value, which records (in clock ticks) when the process was started relative to system boot (see proc(5) for further detail). I'm sure there are numerous other ways to view how long a process has been running, but the ones listed above have been sufficient to deal with most aging issues (a.k.a. bugs) I've encountered. :)
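
If you are curious, you can derive the start time by hand from field 22 of /proc/<pid>/stat and the btime (boot time) line in /proc/stat. Here is a rough sketch; the pid is just the java process from the example above, and pulling field 22 with awk assumes the command name in field 2 contains no spaces:

$ pid=29986
$ ticks=$(awk '{print $22}' /proc/$pid/stat)      # start time, in clock ticks after boot
$ hz=$(getconf CLK_TCK)                           # clock ticks per second (typically 100)
$ boot=$(awk '/^btime/ {print $2}' /proc/stat)    # system boot time as a Unix timestamp
$ date -d "@$((boot + ticks / hz))"               # should match the lstart value shown above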