Getting more out of your Linux servers with moreutils

I stumbled across the moreutils package a few years back, and the day I did my life changed forever. This package contains some killer utilities and fills some nice gaps in the *NIX toolchain. Here is a list of the binaries in the package (descriptions taken from the man page of each utility):

chronic - runs a command quietly unless it fails
combine - combine sets of lines from two files using boolean operations
errno - look up errno names and descriptions
ifdata - get network interface info without parsing ifconfig output
ifne - Run command if the standard input is not empty
isutf8 - check whether files are valid UTF-8
lckdo - run a program with a lock held
mispipe - pipe two commands, returning the exit status of the first
pee - tee standard input to pipes
sponge - soak up standard input and write to a file
ts - timestamp input
vidir - edit directory
vipe - edit pipe
zrun - automatically uncompress arguments to command
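
To pick one example from the list above, sponge solves the classic problem of rewriting a file in place from a pipeline that also reads that file; a plain shell redirect would truncate the file before sort got a chance to read it. A minimal sketch (/tmp/hosts.txt is just a stand-in for whatever file you are munging):

$ sort -u /tmp/hosts.txt | sponge /tmp/hosts.txt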

I'm especially fond of errno, chronic and pee. But my favorite utilities have to be ifne and ts. ifne only runs the command you give it if there is something on standard input, which is perfect when a pipeline may or may not produce output. One such use is e-mailing the admins only when a monitoring program spits out an error:

$ hardware_monitor | ifne mail -s "Problem detected with the hardware on `/bin/hostname`" admins
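
chronic scratches a similar itch for cron jobs: it runs a command quietly and only lets the output through if the command fails, so cron only mails you when something actually breaks. A hypothetical crontab entry (nightly_backup.sh is just a placeholder):

0 2 * * * chronic /usr/local/bin/nightly_backup.sh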

The ts utility is just as useful. Say you have a program that randomly spits out lines of gik and you want to know when each line occurred. To get a timestamp prepended to every line, you can pipe the program's output to ts:

$ gik_generator | ts
Nov 02 09:55:11 The world needs more cow bell!
Nov 02 09:55:12 The world needs more cow bell!
Nov 02 09:55:13 The world needs more cow bell!

I love coming across tools that make troubleshooting and shell scripting more enjoyable. Now we just need more cow bell!

Viewing Linux tape drive statistics with tapestat

A while back I wrote a blog entry showing how to get tape drive statistics with systemtap. The script wasn't very reliable and would frequently crash after collecting just a few samples. Due to the work of some amazing Linux kernel engineers I no longer have to touch systemtap. Recent Linux kernels expose a number of incredibly useful statistics through the /sys file system:

$ pwd
/sys/class/scsi_tape/nst0/stats

$ ls -l
total 0
-r--r--r-- 1 root root 4096 Oct 10 16:15 in_flight
-r--r--r-- 1 root root 4096 Oct 10 16:15 io_ns
-r--r--r-- 1 root root 4096 Oct 10 16:15 other_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 read_byte_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 read_cnt
-r--r--r-- 1 root root 4096 Oct 10 16:15 read_ns
-r--r--r-- 1 root root 4096 Oct 10 16:15 resid_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 write_byte_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 write_cnt
-r--r--r-- 1 root root 4096 Oct 10 16:15 write_ns
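
Each of these files is a plain text counter, so if you just want the raw numbers you can read them directly (I'm leaving the output out since the values will obviously vary from system to system):

$ for f in *; do echo "$f: $(cat $f)"; done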

There is also a tapestat utility in the sysstat package that can be used to summarize these statistics (the -z flag below tells it to skip drives with no activity, and the trailing 1 is the sampling interval in seconds):

$ tapestat -z 1

Linux 2.6.32-642.1.1.el6.x86_64 (wolfie)        10/10/2016      _x86_64_        (24 CPU)

Tape:     r/s     w/s   kB_read/s   kB_wrtn/s %Rd %Wr %Oa    Rs/s    Ot/s
st0         0     370           0       94899   0  22  22       0       0
st1         0     367           0       93971   0  18  19       0       0
st2         0     315           0       80885   0  19  19       0       0
st3         0      27           0        6979   0   1   1       0       0

Tape:     r/s     w/s   kB_read/s   kB_wrtn/s %Rd %Wr %Oa    Rs/s    Ot/s
st0         0     648           0      165888   0  30  30       0       0
st2         0     362           0       92928   0  17  17       0       0

This is a useful addition and I no longer have to worry about systemtap croaking when I’m tracking down issues.

Sudo insults — what a fun feature!

I think humor plays a big role in life, especially the life of a SysAdmin. This weekend I was cleaning up some sudoers files and came across a reference to the "insults" option in the documentation. Here is what the manual says:

"insults - If set, sudo will insult users when they enter an incorrect password. This flag is off by default."

This of course piqued my curiosity and got me wondering what kind of insults sudo would spit out. To test this feature out I compiled sudo with the complete set of insults:

$ ./configure --prefix=/usr/local --with-insults --with-all-insults

$ gmake

$ gmake install

To enable insults I added “Defaults insults” to my sudoers file. This resulted in me laughing myself silly:

$ sudo /bin/false

Password: 
Take a stress pill and think things over.
Password: 
This mission is too important for me to allow you to jeopardize it.
Password: 
I feel much better now.
sudo: 3 incorrect password attempts

$ sudo /bin/false

Password: 
Have you considered trying to match wits with a rutabaga?
Password: 
You speak an infinite deal of nothing
Password: 
You speak an infinite deal of nothing
sudo: 3 incorrect password attempts

Life without laughter is pretty much a useless life. You can quote me on that one! ;)

Summarizing system call activity on Linux systems

Linux has a gazillion debugging utilities available. One of my favorite tools for debugging problems is strace, which allows you to observe the system calls a process is making in real time. strace also has a "-c" option to summarize system call activity:

$ strace -c -p 28009

Process 28009 attached
Process 28009 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.17    0.001278           2       643           select
 14.80    0.000242           1       322           read
  7.03    0.000115           0       322           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.001635                  1287           total

The output contains the percentage of time spent in each system call, the total time in seconds, the microseconds per call, the total number of calls, a count of the number of errors and the type of system call that was made. This output has numerous uses, and is a great way to observe how a process is interacting with the kernel. Good stuff!
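
If you don't have a PID to attach to, strace will also launch the command for you, and the -f flag follows any children it forks. A quick sketch (ls here is just a stand-in for whatever you're actually debugging):

$ strace -c -f ls / > /dev/null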

Breaking the telnet addiction with netcat

After many years of use it’s become almost second nature to type ‘telnet <HOST> <PORT>’ when I need to see if a system has TCP port <PORT> open. Newer systems no longer install telnet by default:

$ telnet google.com 80
-bash: telnet: command not found

I can't think of a compelling reason to keep the telnet client around for this (there are probably still valid use cases for it elsewhere). Since netcat and tcpdump are a billion times better for debugging TCP issues, I need to apply newer microcode to my brain and perform a 's/telnet/nc -v/g' each time I need to test if a TCP port is open:

$ nc -v google.com 80
Connection to google.com 80 port [tcp/http] succeeded!
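
If you only want the open/closed verdict without leaving the connection hanging around, the OpenBSD flavor of netcat also has a -z (scan) flag, assuming that is the variant your distribution ships:

$ nc -zv google.com 80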

Anyone else have a telnet attachment they just can’t break? :)

Using the automated bug-reporting tool (abrt) to generate core dumps when a Linux process fails

Software fails, and it often occurs at the wrong time. When failures occur I want to understand why, and will usually start piecing together the events that led up to the issue. Some application issues can be root caused by reviewing logs, but catastrophic crashes will often require the admin to sit down with gdb and review a core file, if one exists.

Solaris has always led the charge when it comes to reliably creating core files during crashes. System crashes will cause core files to be dumped to /var/crash, and the coreadm utility can be used to save application core files. Linux has been playing catch up in this realm, and just in the past couple of years started providing diskdump and netdump to generate kernel crash dumps when an OOPS occurs (in my experience these tools aren’t as reliable as their Solaris counterparts though).

In the latest releases of Fedora, the automated bug-reporting tool (abrt) infrastructure was added to generate core files when processes crash. The abrt website provides the following description of its automated crash dump collection infrastructure:

“abrt is a daemon that watches for application crashes. When a crash occurs, it collects the crash data (core file, application’s command line etc.) and takes action according to the type of application that crashed and according to the configuration in the abrt.conf configuration file. There are plugins for various actions: for example to report the crash to Bugzilla, to mail the report, to transfer the report via FTP or SCP, or to run a specified application.”

This is a welcome addition to the Linux family, and it works pretty well from my initial testing. To see abrt in action I created a script and sent it a SIGSEGV:

$ cat loop

#!/bin/bash

while :
do
   sleep 1
done

$ ./loop &
[1] 22790

$ kill -SIGSEGV 22790
[1]+ Segmentation fault (core dumped) ./loop

Once the fault occurred I tailed /var/log/messages to get the crash dump location:

$ tail -20 /var/log/messages

Jan 17 09:52:08 theshack abrt[20136]: Saved core dump of pid 20126 (/bin/bash) to /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126 (487424 bytes)
Jan 17 09:52:08 theshack abrtd: Directory 'ccpp-2012-01-17-09:52:08-20126' creation detected
Jan 17 09:52:10 theshack abrtd: New problem directory /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126, processing

If I change into the directory referenced above I will see a wealth of debugging data, including a core file from the application (bash) that crashed:

$ cd /var/spool/abrt/ccpp-2012-01-17-09:52:08-20126

$ ls -la

total 360
drwxr-x---. 2 abrt root   4096 Jan 17 09:52 .
drwxr-xr-x. 4 abrt abrt   4096 Jan 17 09:52 ..
-rw-r-----. 1 abrt root      5 Jan 17 09:52 abrt_version
-rw-r-----. 1 abrt root      4 Jan 17 09:52 analyzer
-rw-r-----. 1 abrt root      6 Jan 17 09:52 architecture
-rw-r-----. 1 abrt root     16 Jan 17 09:52 cmdline
-rw-r-----. 1 abrt root      4 Jan 17 09:52 component
-rw-r-----. 1 abrt root 487424 Jan 17 09:52 coredump
-rw-r-----. 1 abrt root      1 Jan 17 09:52 count
-rw-r-----. 1 abrt root    649 Jan 17 09:52 dso_list
-rw-r-----. 1 abrt root   2110 Jan 17 09:52 environ
-rw-r-----. 1 abrt root      9 Jan 17 09:52 executable
-rw-r-----. 1 abrt root      8 Jan 17 09:52 hostname
-rw-r-----. 1 abrt root     19 Jan 17 09:52 kernel
-rw-r-----. 1 abrt root   2914 Jan 17 09:52 maps
-rw-r-----. 1 abrt root     25 Jan 17 09:52 os_release
-rw-r-----. 1 abrt root     18 Jan 17 09:52 package
-rw-r-----. 1 abrt root      5 Jan 17 09:52 pid
-rw-r-----. 1 abrt root      4 Jan 17 09:52 pwd
-rw-r-----. 1 abrt root     51 Jan 17 09:52 reason
-rw-r-----. 1 abrt root     10 Jan 17 09:52 time
-rw-r-----. 1 abrt root      1 Jan 17 09:52 uid
-rw-r-----. 1 abrt root      5 Jan 17 09:52 username
-rw-r-----. 1 abrt root     40 Jan 17 09:52 uuid
-rw-r-----. 1 abrt root    620 Jan 17 09:52 var_log_messages

If we were debugging the crash we could poke around the saved environment files and then fire up gdb with the core dump to see where it crashed:

$ gdb /bin/bash coredump

GNU gdb (GDB) Fedora (7.3.50.20110722-10.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 20126]
Core was generated by `/bin/bash ./loop'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000344aabb83e in waitpid () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bash-4.2.20-1.fc16.x86_64
(gdb) backtrace
#0  0x000000344aabb83e in waitpid () from /lib64/libc.so.6
#1  0x0000000000440679 in ?? ()
#2  0x00000000004417cf in wait_for ()
#3  0x0000000000432736 in execute_command_internal ()
#4  0x0000000000434dae in execute_command ()
#5  0x00000000004351c5 in ?? ()
#6  0x000000000043150b in execute_command_internal ()
#7  0x0000000000434dae in execute_command ()
#8  0x000000000041e0b1 in reader_loop ()
#9  0x000000000041c8ef in main ()
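
If you want those ?? frames resolved into real function names, you can pull down the debugging symbols gdb suggested above (debuginfo-install ships with yum-utils and assumes the debuginfo repositories are enabled):

$ sudo debuginfo-install bash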

From there you could navigate through the saved memory image to see what caused the program to die an unexpected death. Now you may be asking yourself: how exactly does abrt work? After digging through the source code I figured out that it installs a custom hook using the core_pattern kernel entry point:

$ cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e

Each time a crash occurs the kernel will invoke the hook above, passing a number of arguments to the program listed in the core_pattern proc entry. From what I've gathered from the source so far, there are currently hooks for C/C++ and Python applications, and work is in progress to add support for Java. This is really cool stuff!
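
As an aside, if you ever want to bypass abrt and fall back to plain old core files, you can point core_pattern at a filename template yourself. A minimal sketch (the path and template are arbitrary, and you also need a non-zero core file size limit in the shell that spawns the process):

$ ulimit -c unlimited
$ echo "/var/tmp/core.%e.%p" | sudo tee /proc/sys/kernel/core_pattern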