Using the automated bug-reporting tool (abrt) to generate core dumps when a Linux process fails

Software fails, and it usually happens at the worst possible time. When a failure occurs I want to understand why, and I’ll typically start by piecing together the events that led up to the issue. Some application issues can be root caused by reviewing logs, but catastrophic crashes will often require the admin to sit down with gdb and review a core file, if one exists.

Solaris has always led the charge when it comes to reliably creating core files during crashes. System crashes will cause core files to be dumped to /var/crash, and the coreadm utility can be used to save application core files. Linux has been playing catch-up in this realm, and only in the past couple of years started providing diskdump and netdump to generate kernel crash dumps when an oops occurs (in my experience these tools aren’t as reliable as their Solaris counterparts, though).

In recent Fedora releases, the automated bug-reporting tool (abrt) infrastructure was added to generate core files when processes crash. The abrt website provides the following description of its automated crash dump collection infrastructure:

“abrt is a daemon that watches for application crashes. When a crash occurs, it collects the crash data (core file, application’s command line etc.) and takes action according to the type of application that crashed and according to the configuration in the abrt.conf configuration file. There are plugins for various actions: for example to report the crash to Bugzilla, to mail the report, to transfer the report via FTP or SCP, or to run a specified application.”

This is a welcome addition to the Linux family, and based on my initial testing it works pretty well. To see abrt in action I created a simple script and sent it a SIGSEGV:

$ cat loop

#!/bin/bash

while :
do
   sleep 1
done

$ ./loop &
[1] 22790

$ kill -SIGSEGV 22790
[1]+ Segmentation fault (core dumped) ./loop

Once the fault occurred I tailed /var/log/messages to get the crash dump location:

$ tail -20 /var/log/messages

Jan 17 09:52:08 theshack abrt[20136]: Saved core dump of pid 20126 (/bin/bash) to /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126 (487424 bytes)
Jan 17 09:52:08 theshack abrtd: Directory 'ccpp-2012-01-17-09:52:08-20126' creation detected
Jan 17 09:52:10 theshack abrtd: New problem directory /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126, processing

If I change into the directory referenced above I will see a wealth of debugging data, including a core file from the application (bash) that crashed:

$ cd /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126

$ ls -la

total 360
drwxr-x---. 2 abrt root   4096 Jan 17 09:52 .
drwxr-xr-x. 4 abrt abrt   4096 Jan 17 09:52 ..
-rw-r-----. 1 abrt root      5 Jan 17 09:52 abrt_version
-rw-r-----. 1 abrt root      4 Jan 17 09:52 analyzer
-rw-r-----. 1 abrt root      6 Jan 17 09:52 architecture
-rw-r-----. 1 abrt root     16 Jan 17 09:52 cmdline
-rw-r-----. 1 abrt root      4 Jan 17 09:52 component
-rw-r-----. 1 abrt root 487424 Jan 17 09:52 coredump
-rw-r-----. 1 abrt root      1 Jan 17 09:52 count
-rw-r-----. 1 abrt root    649 Jan 17 09:52 dso_list
-rw-r-----. 1 abrt root   2110 Jan 17 09:52 environ
-rw-r-----. 1 abrt root      9 Jan 17 09:52 executable
-rw-r-----. 1 abrt root      8 Jan 17 09:52 hostname
-rw-r-----. 1 abrt root     19 Jan 17 09:52 kernel
-rw-r-----. 1 abrt root   2914 Jan 17 09:52 maps
-rw-r-----. 1 abrt root     25 Jan 17 09:52 os_release
-rw-r-----. 1 abrt root     18 Jan 17 09:52 package
-rw-r-----. 1 abrt root      5 Jan 17 09:52 pid
-rw-r-----. 1 abrt root      4 Jan 17 09:52 pwd
-rw-r-----. 1 abrt root     51 Jan 17 09:52 reason
-rw-r-----. 1 abrt root     10 Jan 17 09:52 time
-rw-r-----. 1 abrt root      1 Jan 17 09:52 uid
-rw-r-----. 1 abrt root      5 Jan 17 09:52 username
-rw-r-----. 1 abrt root     40 Jan 17 09:52 uuid
-rw-r-----. 1 abrt root    620 Jan 17 09:52 var_log_messages
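
Most of these are small text files, so a quick cat shows the basics of what crashed, how it was invoked, and why (the file names come from the listing above; the contents will obviously vary from crash to crash):

$ cat reason cmdline executable uid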

If we were debugging the crash we could poke around the saved environment files and then fire up gdb with the core dump to see where it crashed:

$ gdb /bin/bash coredump

GNU gdb (GDB) Fedora (7.3.50.20110722-10.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 20126]
Core was generated by `/bin/bash ./loop'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000344aabb83e in waitpid () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bash-4.2.20-1.fc16.x86_64
(gdb) backtrace
#0  0x000000344aabb83e in waitpid () from /lib64/libc.so.6
#1  0x0000000000440679 in ?? ()
#2  0x00000000004417cf in wait_for ()
#3  0x0000000000432736 in execute_command_internal ()
#4  0x0000000000434dae in execute_command ()
#5  0x00000000004351c5 in ?? ()
#6  0x000000000043150b in execute_command_internal ()
#7  0x0000000000434dae in execute_command ()
#8  0x000000000041e0b1 in reader_loop ()
#9  0x000000000041c8ef in main ()
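
The “??” frames are a side effect of debugging a stripped binary. gdb helpfully prints the exact command needed to pull in the matching debug symbols; run it as root (assuming the Fedora debuginfo repositories are enabled) and then re-open the core:

$ debuginfo-install bash-4.2.20-1.fc16.x86_64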

From there you could navigate through the saved memory image to see what caused the program to die an unexpected death. Now you may be asking yourself how exactly abrt works. After digging through the source code, I figured out that it installs a custom hook using the kernel’s core_pattern entry point:

$ cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e

Each time a crash occurs, the kernel invokes the hook above, passing a number of arguments to the program listed in the core_pattern proc entry. From what I can tell from the source so far, there are currently hooks for C/C++ and Python applications, and work is in progress to add Java support. This is really cool stuff!
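
If you want to see the mechanism in isolation, here is a minimal (and completely made up) core_pattern pipe hook of your own. When core_pattern starts with a “|”, the kernel runs the listed program as root, writes the core image to its standard input, and expands the % specifiers into arguments (see the core(5) manual page). The /usr/local/bin/core-hook and /var/tmp/cores names below are just placeholders for this sketch:

$ cat /usr/local/bin/core-hook

#!/bin/sh
# Toy core_pattern hook: the kernel pipes the core image to our stdin and
# passes the expanded %p (pid), %e (executable name) and %t (timestamp)
# specifiers as $1, $2 and $3.
DIR=/var/tmp/cores
PID=$1 EXE=$2 STAMP=$3
mkdir -p "$DIR"
cat > "$DIR/core.${EXE}.${PID}.${STAMP}"

$ echo '|/usr/local/bin/core-hook %p %e %t' > /proc/sys/kernel/core_pattern

Keep in mind that echoing a new value into core_pattern (as root) replaces the abrt hook until you change it back or reboot, so this is strictly a play-at-home experiment.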

Dealing with xauth “error in locking authority file” errors

I recently logged into one of my servers and received the following error:

$ ssh foo
matty@foo's password:
Last login: Tue Nov 1 13:42:52 2011 from 10.10.56.100
/usr/bin/xauth: error in locking authority file /home/matty/.Xauthority

I hadn’t seen this one before, but based on “locking issues” I’ve encountered in the past, I ran strace against xauth to see if a stale lock file existed:

$ strace xauth

In the output I saw the following, which confirmed my suspicion:

open("/home/matty/.Xauthority-c", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)

A quick check with ls revealed the same thing:

$ ls -l /home/lnx/matty/.Xauthority*

-rw------- 2 matty SysAdmins 0 Nov  1 13:42 /home/lnx/matty/.Xauthority-c
-rw------- 2 matty SysAdmins 0 Nov  1 13:42 /home/lnx/matty/.Xauthority-l

I removed both of the lock files and haven’t seen the error again. Thought I would pass this on in case someone else encountered this issue.
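
For reference, the cleanup amounted to nothing more than deleting the two stale lock files shown above:

$ rm /home/lnx/matty/.Xauthority-c /home/lnx/matty/.Xauthority-l

xauth also has a “-b” option that asks it to break stale locks on the authority file itself, which can be handy if you would rather not remove the files by hand.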

One way to avoid tcpdump “packets dropped by kernel” messages

I have been knee-deep this week debugging a rather complex DNS issue. I’ll do a full write-up on that next week. While I was debugging the issue I needed to fire up tcpdump to watch the DNS queries from one of my authoritative servers to various servers on the Internet. When I fed the capture into wireshark, I noticed periods of time with no data, and I wasn’t quite sure why at first.

Based on what I could find on the tcpdump and BPF sites, when tcpdump is busy processing existing data and can’t pull packets captured by BPF out of the queue fast enough, the kernel will drop them. When this occurs, you will see the “packets dropped by kernel” counter in tcpdump’s exit summary become non-zero:

$ tcpdump -w dns-capture.cap -s 1520 -ttt -vvv -i bond0 port 53
……
9559 packets captured
12533 packets received by filter
2974 packets dropped by kernel

I started to do some digging to see why tcpdump couldn’t keep up, and after a bit of profiling I noticed that the program was spending an excessive amount of time resolving IPs to names. This processing kept the program from reading more data from the queue, which resulted in packets being dropped. Once I ran tcpdump with the “-n” option (do not resolve IPs to names), I no longer experienced the issue:

$ tcpdump -w dns-capture.cap -s 1520 -ttt -vvv -n -i bond0 port 53
…..
9339 packets captured
9339 packets received by filter
0 packets dropped by kernel
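
If name resolution isn’t the bottleneck, another knob worth knowing about is the tcpdump “-B” option (available in newer tcpdump/libpcap releases), which enlarges the kernel capture buffer (the value is in KiB) and gives tcpdump more headroom before packets start getting dropped. Something along these lines:

$ tcpdump -B 4096 -n -w dns-capture.cap -s 1520 -ttt -vvv -i bond0 port 53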

Running with “-n” stopped the gaps from occurring in my wireshark display, and since wireshark can resolve IPs to names itself, all was well. It’s really crazy how you can start debugging one issue and wind up debugging 3 – 4 more prior to solving the original problem. Debugging issues is definitely fun, and I guess it gives me plenty to write about. :) This past week I’ve learned more about the DNS protocol and its various implementations than I have in the past 10 years. It’s amazing how many cool nuggets of information are buried in the various DNS RFCs!

Why isn’t Oracle using huge pages on my Redhat Linux server?

I am currently working on upgrading a number of Oracle RAC nodes from RHEL4 to RHEL5. After I upgraded the first node in the cluster, my DBA contacted me because the RHEL5 node was extremely sluggish. When I looked at top, I saw that a number of kswapd processes were consuming CPU:

$ top

top - 18:04:20 up 6 days,  3:22,  7 users,  load average: 14.25, 12.61, 14.41
Tasks: 536 total,   2 running, 533 sleeping,   0 stopped,   1 zombie
Cpu(s): 12.9%us, 19.2%sy,  0.0%ni, 20.9%id, 45.0%wa,  0.1%hi,  1.9%si,  0.0%st
Mem:  16373544k total, 16334112k used,    39432k free,     4916k buffers
Swap: 16777208k total,  2970156k used, 13807052k free,  5492216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                        
  491 root      10  -5     0    0    0 D 55.6  0.0  67:22.85 kswapd0                         
  492 root      10  -5     0    0    0 S 25.8  0.0  37:01.75 kswapd1                         
  494 root      11  -5     0    0    0 S 24.8  0.0  42:15.31 kswapd3                         
 8730 oracle    -2   0 8352m 3.5g 3.5g S  9.9 22.4 139:36.18 oracle                          
 8726 oracle    -2   0 8352m 3.5g 3.5g S  9.6 22.5 138:13.54 oracle                          
32643 oracle    15   0 8339m  97m  92m S  9.6  0.6   0:01.31 oracle                          
  493 root      11  -5     0    0    0 S  9.3  0.0  43:11.31 kswapd2                         
 8714 oracle    -2   0 8352m 3.5g 3.5g S  9.3 22.4 137:14.96 oracle                          
 8718 oracle    -2   0 8352m 3.5g 3.5g S  8.9 22.3 137:01.91 oracle                          
19398 oracle    15   0 8340m 547m 545m R  7.9  3.4   0:05.26 oracle                          
 8722 oracle    -2   0 8352m 3.5g 3.5g S  7.6 22.5 139:18.33 oracle              

The kswapd processes are responsible for scanning memory to locate free pages and for scheduling dirty pages to be written to disk. Periodic kswapd invocations are fine, but seeing kswapd continuously appearing in the top output is a really, really bad thing. Since this host should have had plenty of free memory, I was perplexed by the following output (the free output didn’t match up with the values on the other nodes):

$ free

             total       used       free     shared    buffers     cached
Mem:      16373544   16268540     105004          0       1520    5465680
-/+ buffers/cache:   10801340    5572204
Swap:     16777208    2948684   13828524

To start debugging the issue, I first looked at ipcs to see how much shared memory the database had allocated. In the output below, we can see that there is a 128MB shared memory segment and an 8GB shared memory segment allocated:

$ ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x62e08f78 0          oracle    640        132120576  16                      
0xdd188948 32769      oracle    660        8592031744 87         

The first segment is dedicated to the Oracle ASM instance, and the second to the actual database. When I checked the number of huge pages allocated to the machine, I saw something a bit odd:

$ grep Huge /proc/meminfo

HugePages_Total:  4106
HugePages_Free:   4051
HugePages_Rsvd:      8
Hugepagesize:     2048 kB

While our DBA had set vm.nr_hugepages to what appeared to be a sufficiently large value in /etc/sysctl.conf, the database was using only a small portion of those pages. This meant the database’s shared memory was being allocated out of normal (non huge page) memory, the memory Linux had dedicated to the huge page pool was sitting idle (it is wasted if nothing uses it), and inactive pages were being paged out to disk. After a bit of bc’ing (I love doing my calculations with bc), I noticed that the total amount of memory reserved for huge pages was 8610906112 bytes:

$ grep vm.nr_hugepages /etc/sysctl.conf
vm.nr_hugepages=4106

$ bc
4106*(1024*1024*2)
8610906112

If we add the totals from the two shared memory segments above:

$ bc
8592031744+132120576
8724152320

We can see that we don’t have enough huge page memory to support both shared memory segments. Yikes! After adjusting vm.nr_hugepages to account for both segments (there’s a quick sketch of the arithmetic after the list below), the system no longer swapped and database performance increased. This debugging adventure taught me a few things:

1. Double check system values people send you

2. Solaris does a MUCH better job of handling large page sizes (huge pages are used transparently)

3. The Linux tools for investigating huge page allocations are severely lacking

4. Oracle is able to allocate a contiguous 8GB chunk of shared memory on RHEL5, but not on RHEL4 (I need to do some research to find out why)
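
For reference, the fix itself was just the reverse of the arithmetic above: divide the total shared memory requirement by the 2MB huge page size (rounding up if it doesn’t divide evenly), then raise the sysctl as root. A rough sketch with the numbers from this system:

$ bc
(8592031744+132120576)/(1024*1024*2)
4160

$ sysctl -w vm.nr_hugepages=4160

The new value also needs to go into /etc/sysctl.conf so it survives a reboot, the kernel may not be able to reserve that many pages on a heavily fragmented system, and the database has to be restarted before it will attach its shared memory to the huge page pool.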

Hopefully more work will go into the Linux huge page implementation, and allow me to scratch the second and third items off of my list. Viva la problem resolution!

Using the Linux parted utility to re-create a lost partition table

I ran into an issue last week where two nodes using shared storage lost the partition table on one of the storage devices they were accessing. This was extremely evident in the output from fdisk:

$ fdisk -l /dev/sdb

Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Fortunately the system in question still had the /proc/partitions data intact, so I stashed that in a safe location and came up with a plan to re-create the partition table using this data. Given the following /proc/partitions data for the device listed above:

$ grep sdb /proc/partitions

   8    16  104857600 sdb
   8    17   99161148 sdb1
   8    18    5695042 sdb2

I can see that the device had two partition table entries and was approximately 100GB in size. To re-create them, I used the parted mkpart command, passing it the starting and ending sector numbers I wanted written to the partition table:

$ parted /dev/sdb

(parted) mkpart primary ext3 128s 198322424s

(parted) mkpart primary ext3 198322425s 209712509s
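
Before quitting parted it doesn’t hurt to double check the new table while you’re still inside the utility (unit s switches the display to sectors):

(parted) unit s

(parted) print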

Now you may be asking yourself where I got the starting and ending sector numbers. In my specific case I start the first partition at sector 128 to ensure alignment with the storage arrays we use, and the end sector was calculated by taking the partition’s entry from /proc/partitions and multiplying it by 2 (the /proc/partitions sizes are in 1K blocks, and sectors are 512 bytes). I then added the 128-sector offset to that value, giving the resulting end sector for sdb1:

99161148 * 2 + 128 = 198322424 end sector for sdb1

To get the values for sdb2, I added one to the number produced above, and then multiplied the sdb2 value from /proc/partitions by two to get the following numbers:

198322424 + 1 = 198322425 start sector of sdb2

198322425 + 11390084 = 209712509 end sector of sdb2
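
If you have more than a couple of partitions to rebuild, the 1K-block to 512-byte-sector conversion is easy to script. A quick sketch using the /proc/partitions columns (major, minor, #blocks, name):

$ awk '$4 ~ /^sdb[0-9]/ { printf "%-5s %d sectors\n", $4, $3 * 2 }' /proc/partitions

sdb1  198322296 sectors
sdb2  11390084 sectors

These are the partition sizes only; the start and end sectors still need the offsets applied as described above.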

Once I wrote out my changes, I verified that they were correct with sfdisk:

$ sfdisk -uS -l /dev/sdb

Disk /dev/sdb: 13054 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1           128 198322424  198322297  83  Linux
/dev/sdb2     198322425 209712509   11390085  83  Linux
/dev/sdb3             0         -          0   0  Empty
/dev/sdb4             0         -          0   0  Empty

And verified that /proc/partitions matched up with what was previously there:

$ grep sdb /proc/partitions

   8    16  104857600 sdb
   8    17   99161148 sdb1
   8    18    5695042 sdb2

Now that the partition table was restored, I used the fsck "-n" option to verify the file systems were consistent (make sure to read the fsck manual page before running it, as the "-n" option WILL still modify some file systems!!):

$ fsck -n /dev/sdb1
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sdb1: clean, 11/12402688 files, 437281/24790287 blocks

$ fsck -n /dev/sdb2
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sdb2: clean, 11/712448 files, 57952/1423760 blocks

And then proceeded to mount them up:

$ mount /dev/sdb1 /a

$ mount /dev/sdb2 /b

$ df -h | grep sdb

/dev/sdb1              94G  188M   89G   1% /a
/dev/sdb2             5.4G  139M  5.0G   3% /b

I’m not a data recovery person / service (just a guy who reads a lot and understands the internals of how file systems and devices are structured), so use this information at your own risk!

Watching process creation on Linux hosts

I have been debugging a problem with Redhat cluster, and was curious if a specific process was getting executed. On my Solaris 10 hosts I can run execsnoop to observe system-wide process creation, but there isn’t anything comparable on my Linux hosts. The best I’ve found is systemtap, which provides the kprocess.exec probe for monitoring exec() calls. To use this probe, you can stash the following in a file of your choosing:

probe kprocess.exec {
    # Print the name and pid of the process calling exec(), along with the
    # program it is about to execute.
    printf("%s (pid: %d) is exec'ing %s\n", execname(), pid(), filename)
}

Once the file is created, you can execute the stap program to enable the exec probe:

$ stap -v exec.stp

This will produce output similar to the following:

modclusterd (pid: 5125) is exec'ing /usr/sbin/clustat
clurgmgrd (pid: 5129) is exec'ing /usr/share/cluster/clusterfs.sh
clusterfs.sh (pid: 5130) is exec'ing /usr/bin/dirname
clusterfs.sh (pid: 5131) is exec'ing /usr/bin/dirname
clusterfs.sh (pid: 5132) is exec'ing /usr/bin/dirname
clusterfs.sh (pid: 5134) is exec'ing /bin/basename
clusterfs.sh (pid: 5135) is exec'ing /sbin/consoletype
clusterfs.sh (pid: 5138) is exec'ing /usr/bin/readlink
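
If you only care about one command, the probe body can be gated with a conditional (isinstr() comes from SystemTap’s string tapset); a small variation on the script above:

probe kprocess.exec {
    # Only report exec()s of the program we are interested in.
    if (isinstr(filename, "clustat"))
        printf("%s (pid: %d) is exec'ing %s\n", execname(), pid(), filename)
}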

While systemtap is missing various features that are available in DTrace, it’s still a super useful tool!