Conditionally restarting systemd services

In a previous post I discussed how one of my systemd services was being continuously restarted, causing the CPU to spike. This isn’t ideal, and after re-reading the systemd manual pages I came across a couple of useful options for controlling when and how frequently a systemd service is restarted. The first option is RestartSec, which controls how long systemd will wait to restart a process after a failure occurs. Systemd also has the RestartForceExitStatus and RestartPreventExitStatus options, which let you define the exit statuses and signals that should, or should not, cause a restart. To ensure that Logstash isn’t continuously restarted when a configuration issue is present, I set RestartPreventExitStatus to 1. Abnormal signals like SIGSEGV, SIGABRT, etc. will still trigger a restart, but a return code of 1 will not.
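As a sketch, a drop-in override along these lines captures the idea (the path and the values here are illustrative, not copied from my actual unit file):

```ini
# /etc/systemd/system/logstash.service.d/override.conf  (hypothetical path)
[Service]
Restart=on-failure
# Wait 30 seconds between restart attempts instead of respawning immediately
RestartSec=30
# An exit status of 1 (e.g. a configuration error) will NOT trigger a restart;
# abnormal terminations (SIGSEGV, SIGABRT, etc.) still will
RestartPreventExitStatus=1
```

After adding a drop-in like this, run `systemctl daemon-reload` so systemd picks up the change.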

Troubleshooting a bizarre logstash CPU problem

This week I was converting some legacy shell scripts to ansible roles and wandered into a bizarre issue with one of my elasticsearch servers. After committing a couple of changes my CI system rejected the commit due to a system resource issue. When I logged into the system to troubleshoot the issue I noticed the CPU was pegged:

$ mpstat 1

Linux 3.10.0-514.26.2.el7.x86_64 (elastic02) 	09/02/2017 	_x86_64_	(2 CPU)

11:32:59 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:33:00 AM  all   94.03    0.00    3.98    0.00    0.00    0.00    0.00    0.00    0.00    1.99
11:33:01 AM  all   94.00    0.00    3.50    0.00    0.00    0.00    0.00    0.00    0.00    2.50
11:33:02 AM  all   93.50    0.00    5.50    0.00    0.00    0.00    0.00    0.00    0.00    1.00

This system is used solely to test changes so it should have been 100% idle. Htop showed the Logstash java process as the top CPU consumer so I ventured off to the log directory to see if anything was awry. Sure enough, Logstash was complaining about a missing file:

[2017-09-02T11:37:43,884][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Something is wrong with your configuration."}
[2017-09-02T11:37:52,529][ERROR][logstash.filters.geoip   ] Invalid setting for geoip filter plugin:

  filter {
    geoip {
      # This setting must be a path
      # File does not exist or cannot be opened /elk/logstash/geoip/GeoLite2-City.mmdb
      database => "/elk/logstash/geoip/GeoLite2-City.mmdb"
      ...
    }
  }

One of my commits automated the GeoIP database download process. This subtle change contained a bug (more on that in a future post) which was preventing the file from being downloaded and extracted. By default, Logstash should exit if it detects a configuration issue. It shouldn’t spin on a CPU. To see what was going on I fired up my good friend strace and attempted to attach to the PID I retrieved from ps:

$ strace -f -p 4288

strace: attach: ptrace(PTRACE_ATTACH, ...): No such process

Now that is odd. When I ran ps again and looked closer at the pid column I noticed that the java process had a new PID. I ran it a second time and the PID changed again. So it appeared that Logstash was dying and then being restarted over and over. The Logstash service on this machine is managed by systemd, and it immediately dawned on me that the “Restart=on-failure” unit file directive (which restarts a failed process) was the cause of my troubles! After a bit of troubleshooting I located the problem with my ansible role, corrected it, and lo and behold the system load returned to normal. Bugs and cascading failures happen, and this problem never made its way to production because my automated test suite let me know that something was wrong. Thank you, Jenkins!
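The strace failure above is just ESRCH (“No such process”) surfacing: any system call aimed at a PID that has already exited fails the same way. A quick Python sketch reproduces the race without strace:

```python
import errno
import os
import subprocess

# Spawn a short-lived child and wait for it to exit (and be reaped)
p = subprocess.Popen(["true"])
p.wait()

# Signal 0 performs an existence check without delivering a signal --
# roughly what ptrace(PTRACE_ATTACH, ...) attempted against the dead PID
try:
    os.kill(p.pid, 0)
    print("process still exists")
except ProcessLookupError as e:
    print("No such process:", e.errno == errno.ESRCH)
```

This is exactly the window I kept landing in: by the time I copied the PID out of ps, systemd had already replaced the process.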

How I saved myself 3 months of frustration by finding and fixing a sendmail bug on a vendor’s appliance

I recently upgraded a vendor’s appliance to a new version of code to address a couple of known issues. After the upgrade I noticed that the appliance wasn’t sending e-mail alerts (yes, some vendors still use e-mail) when problems were detected. Luckily the appliance runs Linux and provides shell access through SSH if you enable it. My previous support experiences have been less than stellar, so I figured I would root cause the problem and send them a bug report instead of opening a support case. In this post I’ll walk through what I did to figure out the problem.

To start my detective work I SSH’ed into the device and reviewed the release file to see which distribution they were using. In this case it was a Fedora derivative. Given this piece of information, the server was most likely using: 1) sendmail, 2) postfix, or 3) a custom-written application to send e-mail. When I checked the system logs (messages and maillog) I found this useful nugget:

Aug 30 16:23:33 localhost sendmail[11217]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon MTA: cannot bind: Address already in use

Bingo! The host is using sendmail. Running ps revealed that services were managed by systemd. To get detailed service status for the sendmail process I ran systemctl with the status option:

$ systemctl status -l sendmail

● sendmail.service - Sendmail Mail Transport Agent
   Loaded: loaded (/etc/systemd/system/sendmail.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2017-08-30 14:56:07 UTC; 14min ago
  Process: 24095 ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG (code=exited, status=0/SUCCESS)
 Main PID: 24097 (code=exited, status=71)

Aug 30 14:55:47 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:52 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:52 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:57 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:57 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:02 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:56:02 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Main process exited, code=exited, status=71/n/a
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Unit entered failed state.
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Failed with result 'exit-code'.

The logs indicate that sendmail isn’t able to start because something is already bound to the address it’s trying to use. Let’s see who is using it:

$ netstat -pant | grep ":25"
$

No dice. So something in the initialization process must be the cause of the "cannot bind: Address already in use" error (more on that later). Next up, let's see how sendmail is started:

$ systemctl show sendmail | grep ExecStart

ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG

The systemctl show command displays the properties of a unit, and the ExecStart property contains the command line used to start the daemon. To get more information I ran the command line through strace:

$ strace -f -e trace=socket,bind /usr/sbin/sendmail -bd -q1h

socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
+++ exited with 0 +++
Process 12216 attached
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Process 12217 attached
[pid 12217] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12217, si_status=0, si_utime=0, si_stime=0} ---
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=0, ptr=0x7fd900000000}} ---
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

If you aren’t familiar with UNIX Network Programming (man I miss Richard Stevens) the bind() system call associates an address with a socket. In the strace output above the first bind() associates 0.0.0.0 (otherwise known as INADDR_ANY or the wildcard address) with the socket on file descriptor 4. In the second call to bind() the daemon tries to associate 127.0.0.1 (the loopback interface) with the socket on file descriptor 5. That fails because the wildcard address (INADDR_ANY) encompasses all IPv4 addresses including 127.0.0.1.
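This conflict is easy to reproduce outside of sendmail. Here’s a minimal Python sketch (it lets the kernel pick an unprivileged ephemeral port, so no root is needed):

```python
import errno
import socket

# First socket binds the wildcard address (INADDR_ANY) on a kernel-chosen port
wild = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wild.bind(("0.0.0.0", 0))
port = wild.getsockname()[1]

# Second socket tries to bind the loopback address on the same port.
# Without SO_REUSEADDR this collides with the wildcard binding.
loop = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    loop.bind(("127.0.0.1", port))
except OSError as e:
    print(e.errno == errno.EADDRINUSE)   # the same failure sendmail hit
finally:
    loop.close()
    wild.close()
```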

Having spent a number of years supporting postfix and sendmail I had a good suspicion this was due to a misconfiguration. I popped over to /etc/mail and paged down until I came across the following two entries:

O DaemonPortOptions=Name=MTA
O DaemonPortOptions=Port=smtp,Addr=127.0.0.1, Name=MTA

If you aren’t familiar with sendmail, it uses the DaemonPortOptions setting to control which addresses and ports to listen on. In the first entry sendmail isn’t explicitly given an address to listen on, so it defaults to INADDR_ANY (I deduced this by reading through the sendmail daemon.c source). We can see that in the first bind():

bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

The second setting instructs sendmail to bind to address 127.0.0.1:

bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

Since the first bind operation is using INADDR_ANY (which includes 127.0.0.1), the second bind() fails and sendmail eventually exits (I need to read more code this weekend to see why). To get e-mail working again I commented out the first line and restarted sendmail. Lo and behold, my inbox got flooded. Yay mail. :)

It’s always fun pondering how things like this come to be. I’m curious if the upgrade added the second line to stop sendmail from binding to INADDR_ANY (a nice addition though it’s not called out in the release notes). I’m also curious if the code that handles this failed and didn’t remove the other line or if this change wasn’t rigorously tested before it was released. I’ll never know the answer to my pontifications (yes I make up words) unfortunately. What I do know is that I truly appreciate how elegant the UNIX Network programming model is and wish I would have been able to meet Richard Stevens when he was alive. He will always be one of the most influential technical people in my life and I’m going to pour out a 40 and re-read UNIX Network Programming in October to celebrate his awesomeness!

Riemann and the case of the unhappy system call

This past weekend I spent a good deal of time playing with Riemann. Riemann is a powerful stream processor and I’m hoping to use it to correlate and analyze metrics from disparate data sources. After downloading and installing it I received the following error:

$ ./riemann
ERROR [2016-10-10 12:21:36,614] main - riemann.bin - Couldn't start
java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError: /tmp/libnetty-transport-native-epoll3719688563306389605.so: /tmp/libnetty-transport-native-epoll3719688563306389605.so: failed to map segment from shared object

Not exactly what I was expecting on our first date, but I guess Riemann plays hard to get. :) To make a little more sense out of this error I fired up strace to retrieve the errno value and to see which system call was failing:

$ strace -f ./riemann
.....
[pid 9826] open("/tmp/libnetty-transport-native-epoll27354233456383270.so", O_RDONLY|O_CLOEXEC) = 45
[pid 9826] read(45, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240>\0\0\0\0\0\0"..., 832) = 832
[pid 9826] fstat(45, {st_mode=S_IFREG|0644, st_size=63749, ...}) = 0
[pid 9826] mmap(NULL, 2146168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 45, 0) = -1 EPERM (Operation not permitted)
.....

Now that hits the spot! mmap() is getting an EPERM which the mmap(2) manual page describes as:

"EPERM The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec."

One problem solved! The tmp file system is mounted with the noexec flag, which prevents mmap() from mapping pages with PROT_EXEC set. Here’s a picture of tmp:

$ mount | grep "/tmp"
tmpfs on /tmp type tmpfs (rw,noexec)
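If you want to check for noexec programmatically rather than eyeballing mount output, Python’s os.statvfs exposes the mount flags (os.ST_NOEXEC is Linux-specific; this is just a sketch):

```python
import os

def is_noexec(path):
    """Return True if the filesystem holding path is mounted noexec."""
    return bool(os.statvfs(path).f_flag & os.ST_NOEXEC)

# On a box with /tmp mounted noexec (like mine above) this prints True
print(is_noexec("/tmp"))
```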

I’m not one to disable security measures so I went looking for a workaround. After reading through the netty source code for 20 minutes I came across this nugget:

NativeLibraryLoader.java
            f = toDirectory(SystemPropertyUtil.get("java.io.tmpdir"));
            if (f != null) {
                logger.debug("-Dio.netty.tmpdir: " + f + " (java.io.tmpdir)");
                return f;
            }

Netty uses the java.io.tmpdir property to craft the temporary file location. Further digging showed that riemann passes properties to the Java runtime through the EXTRA_JAVA_OPTS variable:

exec java $EXTRA_JAVA_OPTS $OPTS -cp "$JAR" riemann.bin "$COMMAND" "$CONFIG"

Furthermore, you can pass properties to the Java runtime via the "-D" option. So to make Riemann happy I set the EXTRA_JAVA_OPTS environment variable to "-Djava.io.tmpdir=/path/to/new/tmpdir" and fired off Riemann:

$ cd $RIEMANN_HOME && EXTRA_JAVA_OPTS=-Djava.io.tmpdir=/sfw/riemann/latest/tmp ./riemann

Bingo! Riemann came up and I was able to start working with it. Clojure takes me back to my Lisp programming days in college. Glad vim has () matching built in! :)

Ganglia never met a quote it didn’t like. Wait, it did …

This week I encountered a weird issue while developing a new Ganglia plug-in. After moving my Ganglia processes to a Docker container I noticed that the grid overview images weren’t displaying. This was a new Ganglia installation, so I figured I had typo’ed something in the gmetad.conf configuration file. I reviewed the file in my git repo and everything looked perfectly fine. Any Apache errors? Gak!:

$ tail -6 /var/log/httpd/error_log

sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file
sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file
sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file

Well, those messages are super helpful lol. Something is being run through /bin/sh and it’s missing a matching single quote. But what command or script? Luckily I follow the world-famous performance expert Brendan Gregg’s blog, so I was aware of the Linux eBPF and bcc project. If you aren’t familiar with this amazing project, here is a short blurb from their website:

"BCC is a toolkit for creating efficient kernel tracing and manipulation programs, and includes several useful tools and examples. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above."

I knew Brendan ported a large number of the DTraceToolkit scripts to bcc, so I went searching for execsnoop to see what was being executed:

$ ./execsnoop

PCOMM            PID    PPID   RET ARGS
sh               18748  18350    0   /usr/bin/rrdtool
rrdtool          18748  18350    0 /usr/bin/rrdtool
sh               18749  18350    0   /usr/bin/rrdtool graph /dev/null  --start -3600s --end now DEF:avg='/ganglia/rrds// __SummaryInfo__/cpu_num.rrd':'sum':AVERAGE PR

Bingo! Now I have a lead, but it appears execsnoop limits the arguments displayed (note to self: need to figure out why). So let’s go to the Ganglia source code to see what’s up:

$ cd /usr/share/ganglia

$ grep rrdtool *.php
*** lots of output ***

In the output above I noticed several exec() statements in graph.php so I decided to begin there. Starting from the main code block and reading down I came across the following:

if ($debug) {
  error_log("Final rrdtool command:  $command");
}

Digging further I noticed that debug output (including the full string that is being passed to the shell) could be set through a query string variable:

$debug = isset($_GET['debug']) ? clean_number(sanitize($_GET["debug"])) : 0;

Sweet! Adding "&debug=1" to the end of the Ganglia URL got me a useful log entry in my error_log:

[Fri Oct 07 13:18:01.429310 2016] [:error] [pid 19884] [client 192.168.1.178:51124] Final rrdtool command:  /usr/bin/rrdtool graph -   --start '-3600s' --end now --width 650 --height 300 --title 'Matty\\'s Bodacious Grid Load last hour' --vertical-label 'Loads/Procs' --lower-limit 0 --slope-mode   DEF:'a0'='/ganglia/rrds//__SummaryInfo__/load_one.rrd':'sum':AVERAGE  DEF:'a1'='/ganglia/rrds//__SummaryInfo__/cpu_num.rrd':'num':AVERAGE  DEF:'a2'='/ganglia/rrds//__SummaryInfo__/cpu_num.rrd':'sum':AVERAGE  DEF:'a3'='/ganglia/rrds//__SummaryInfo__/proc_run.rrd':'sum':AVERAGE   AREA:'a0'#BBBBBB:'1-min' VDEF:a0_last=a0,LAST VDEF:a0_min=a0,MINIMUM VDEF:a0_avg=a0,AVERAGE VDEF:a0_max=a0,MAXIMUM GPRINT:'a0_last':'Now\\:%5.1lf%s' GPRINT:'a0_min':'Min\\:%5.1lf%s' GPRINT:'a0_avg':'Avg\\:%5.1lf%s' GPRINT:'a0_max':'Max\\:%5.1lf%s\\l' LINE2:'a1'#00FF00:'Nodes' VDEF:a1_last=a1,LAST VDEF:a1_min=a1,MINIMUM VDEF:a1_avg=a1,AVERAGE VDEF:a1_max=a1,MAXIMUM GPRINT:'a1_last':'Now\\:%5.1lf%s' GPRINT:'a1_min':'Min\\:%5.1lf%s' GPRINT:'a1_avg':'Avg\\:%5.1lf%s' GPRINT:'a1_max':'Max\\:%5.1lf%s\\l' LINE2:'a2'#FF0000:'CPUs ' VDEF:a2_last=a2,LAST VDEF:a2_min=a2,MINIMUM VDEF:a2_avg=a2,AVERAGE VDEF:a2_max=a2,MAXIMUM GPRINT:'a2_last':'Now\\:%5.1lf%s' GPRINT:'a2_min':'Min\\:%5.1lf%s' GPRINT:'a2_avg':'Avg\\:%5.1lf%s' GPRINT:'a2_max':'Max\\:%5.1lf%s\\l' LINE2:'a3'#2030F4:'Procs' VDEF:a3_last=a3,LAST VDEF:a3_min=a3,MINIMUM VDEF:a3_avg=a3,AVERAGE VDEF:a3_max=a3,MAXIMUM GPRINT:'a3_last':'Now\\:%5.1lf%s' GPRINT:'a3_min':'Min\\:%5.1lf%s' GPRINT:'a3_avg':'Avg\\:%5.1lf%s' GPRINT:'a3_max':'Max\\:%5.1lf%s\\l' 

Running that manually from a shell prompt generated the same error as the one I originally found! Excellent, now I know exactly what is causing the error. I opened the output above in vim and searched for a single quote. That highlighted all single quotes and my eyes immediately picked up the following:

--title 'Matty\'s Bodacious Grid Load last hour'

Oh snap, I found a gem here. A LOOOONG time ago in a shell scripting book (possibly in UNIX shell programming?) far, far away, I remembered reading that you can never have a single quote inside a pair of single quotes. Googling this turned up the POSIX Shell Command Language standard, which states:

"A single-quote cannot occur within single-quotes."

When I removed the single quote after Matty and re-ran rrdtool, it ran without error and spit out a nice PNG. So could that be the issue? I removed the single quote from the gridname directive in gmetad.conf, bounced gmetad, and lo and behold all of my graphs started showing up. Now to finish up my plug-in.
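If the single quote had to stay in the title, the portable escape is to close the single-quoted string, emit the quote in a double-quoted chunk, and reopen the single quotes. Python’s shlex.quote does exactly this dance, which makes it a handy way to build shell command lines from untrusted strings:

```python
import shlex
import subprocess

title = "Matty's Bodacious Grid Load last hour"

# shlex.quote wraps the string in single quotes and rewrites each embedded
# single quote as '"'"' (close quote, double-quoted quote, reopen quote)
quoted = shlex.quote(title)
print(quoted)

# The shell now sees the literal string, apostrophe and all
out = subprocess.run(["sh", "-c", "echo " + quoted],
                     capture_output=True, text=True)
print(out.stdout.strip() == title)
```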

Summarizing system call activity on Linux systems

Linux has a guadzillion debugging utilities available. One of my favorite tools for debugging problems is strace, which allows you to observe the system calls a process is making in real time. Strace also has a "-c" option to summarize system call activity:

$ strace -c -p 28009

Process 28009 attached
Process 28009 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.17    0.001278           2       643           select
 14.80    0.000242           1       322           read
  7.03    0.000115           0       322           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.001635                  1287           total

The output contains the percentage of time spent in each system call, the total time in seconds, the microseconds per call, the total number of calls, a count of the number of errors, and the type of system call that was made. This output has numerous uses and is a great way to observe how a process is interacting with the kernel. Good stuff!
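The fixed-width summary is also easy to post-process. Here’s a sketch that parses the table above into dicts (the column layout is assumed from the strace output shown; the errors column is blank when no calls failed):

```python
# Body rows copied from the strace -c run above
summary = """\
 78.17    0.001278           2       643           select
 14.80    0.000242           1       322           read
  7.03    0.000115           0       322           write"""

rows = []
for line in summary.splitlines():
    # Columns: % time, seconds, usecs/call, calls, [errors], syscall
    pct, seconds, usecs, calls, *rest = line.split()
    rows.append({
        "syscall": rest[-1],
        "percent": float(pct),
        "calls": int(calls),
        "errors": int(rest[0]) if len(rest) == 2 else 0,
    })

busiest = max(rows, key=lambda r: r["percent"])
print(busiest["syscall"], busiest["calls"])   # select 643
```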