How I saved myself three months of frustration by finding and fixing a sendmail bug on a vendor's appliance

I recently upgraded a vendor's appliance to a new version of code to address a couple of known issues. After the upgrade I noticed that the appliance wasn't sending e-mail alerts (yes, some vendors still use e-mail) when problems were detected. Luckily the appliance runs Linux and provides shell access through SSH if you enable it. My previous support experiences have been less than stellar, so I figured I would root cause the problem and send them a bug report instead of opening a support case. In this post I'll walk through what I did to figure out the problem.

To start my detective work I SSH'ed into the device and reviewed the release file to see which distribution they were using. In this case it was a Fedora derivative. Given this piece of information, the server was most likely sending e-mail with: 1) sendmail, 2) postfix, or 3) a custom-written application. When I checked the system logs (messages and maillog) I found this useful nugget:

Aug 30 16:23:33 localhost sendmail[11217]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon MTA: cannot bind: Address already in use

Bingo! The host is using sendmail. Running ps revealed that services were managed by systemd. To get detailed service status for the sendmail process I ran systemctl with the status option:

$ systemctl status -l sendmail

● sendmail.service - Sendmail Mail Transport Agent
   Loaded: loaded (/etc/systemd/system/sendmail.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2017-08-30 14:56:07 UTC; 14min ago
  Process: 24095 ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG (code=exited, status=0/SUCCESS)
 Main PID: 24097 (code=exited, status=71)

Aug 30 14:55:47 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:52 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:52 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:57 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:57 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:02 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:56:02 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Main process exited, code=exited, status=71/n/a
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Unit entered failed state.
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Failed with result 'exit-code'.

The logs indicate that sendmail isn’t able to start because something is already bound to the address it’s trying to use. Let’s see who is using it:

$ netstat -pant | grep ":25"
$
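
On newer Fedora-derived systems netstat is deprecated in favor of ss from iproute2, so it's worth double-checking with both tools that nothing is actually sitting on the SMTP port:

```shell
# Double-check with ss (netstat's modern replacement) that no process
# is listening on port 25:
listeners=$(ss -tlnp 2>/dev/null | grep ':25 ' || true)
if [ -z "$listeners" ]; then
    echo "nothing listening on :25"
else
    echo "$listeners"
fi
```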

No dice. So something in the initialization process must be the cause of the “cannot bind: Address already in use” error (more on that later). Next up, let's see how sendmail is started:

$ systemctl show sendmail | grep ExecStart

ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG

The systemctl show command displays the properties of a unit. The ExecStart property contains the command line used to start the daemon. To get more information I ran that command line through strace:

$ strace -f -e trace=socket,bind /usr/sbin/sendmail -bd -q1h

socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
+++ exited with 0 +++
Process 12216 attached
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Process 12217 attached
[pid 12217] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12217, si_status=0, si_utime=0, si_stime=0} ---
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=0, ptr=0x7fd900000000}} ---
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

If you aren't familiar with UNIX network programming (man, I miss Richard Stevens), the bind() system call associates an address with a socket. In the strace output above the first bind() associates 0.0.0.0 (otherwise known as INADDR_ANY, or the wildcard address) with the socket on file descriptor 4. In the second call to bind() the daemon tries to associate 127.0.0.1 (the loopback address) with the socket on file descriptor 5. That fails because the wildcard address (INADDR_ANY) encompasses all local IPv4 addresses, including 127.0.0.1.
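
You can reproduce the conflict outside of sendmail with a few lines of Python driven from the shell. I'm using an unprivileged port (2525) here so it runs without root; the semantics are the same as port 25:

```shell
# Bind the wildcard address on a port, then try to bind the loopback
# address on the same port. The second bind() fails with EADDRINUSE,
# exactly like sendmail's second socket did.
out=$(python3 - <<'EOF'
import socket, errno

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("0.0.0.0", 2525))        # wildcard bind, like sendmail's fd 4

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", 2525))  # loopback is covered by the wildcard
except OSError as e:
    print("bind failed:", errno.errorcode[e.errno])
EOF
)
echo "$out"
```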

Having spent a number of years supporting postfix and sendmail, I had a good suspicion this was due to a misconfiguration. I popped over to /etc/mail, opened up sendmail.cf, and paged down until I came across the following two entries:

O DaemonPortOptions=Name=MTA
O DaemonPortOptions=Port=smtp,Addr=127.0.0.1, Name=MTA

If you aren't familiar with sendmail, it uses the DaemonPortOptions setting to control which addresses and ports to listen on. In the first entry sendmail isn't explicitly given an address to listen on, so it defaults to INADDR_ANY (I deduced this by reading through the sendmail daemon.c source). We can see that in the first bind():

bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

The second setting instructs sendmail to bind to address 127.0.0.1:

bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

Since the first bind operation is using INADDR_ANY (which includes 127.0.0.1), the second system call fails and sendmail eventually exits (I need to read more code this weekend to see why). To get e-mail working again I commented out the first line and restarted sendmail. Lo and behold, my inbox got flooded. Yay mail. :)
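
For the record, here's the edit, rehearsed on a scratch file so you can see exactly what changes before touching the real /etc/mail/sendmail.cf (the path on a stock Fedora layout; the appliance's may differ):

```shell
# Comment out the wildcard DaemonPortOptions entry, leaving the
# loopback-only entry in place. Rehearsed here on a scratch copy.
cf=$(mktemp)
printf 'O DaemonPortOptions=Name=MTA\nO DaemonPortOptions=Port=smtp,Addr=127.0.0.1, Name=MTA\n' > "$cf"
sed -i 's/^O DaemonPortOptions=Name=MTA$/#&/' "$cf"
cat "$cf"
```

Run the same sed (with -i.bak for a backup) against the real sendmail.cf, then restart the service with systemctl restart sendmail.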

It's always fun pondering how things like this come to be. I'm curious if the upgrade added the second line to stop sendmail from binding to INADDR_ANY (a nice addition, though it's not called out in the release notes). I'm also curious if the code that handles this failed and didn't remove the other line, or if this change wasn't rigorously tested before it was released. Unfortunately I'll never know the answer to my pontifications. What I do know is that I truly appreciate how elegant the UNIX network programming model is and wish I had been able to meet Richard Stevens when he was alive. He will always be one of the most influential technical people in my life and I'm going to pour out a 40 and re-read UNIX Network Programming in October to celebrate his awesomeness!

Using Ansible to verify remote file checksums with get_url, lookup() and stat

Being an extremely security-minded operations guy, I take every precaution to verify that the files I download are legit. In this day and age of servers and data getting compromised, this should be an operational standard. There are numerous ways to verify checksums: you can use openssl's various hashing options or a simple wrapper script similar to this. I prefer to automate everything, so I typically offload these types of tasks to ansible and chef. Both configuration management systems give you a number of ways to tackle this, and I thought I would discuss a couple of my favorite ansible recipes in this blog post.

To illustrate how easy this process is with ansible, let's say you are setting up an elasticsearch cluster and want to keep up to date with the latest GeoIP database from maxmind.net. To retrieve the database and verify the checksum you can use the ansible get_url module with the checksum parameter and a url lookup:

---
- hosts: localhost
  connection: local
  vars:
    geoip_dir: "/tmp/geoip"
    geoip_db_url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.tar.gz"
    geoip_db_md5sum_url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.tar.gz.md5"
    geoip_db_compressed_file_name: "{{ geoip_db_url | basename }}"
    geoip_db_md5sum: "md5: {{ lookup('url', geoip_db_md5sum_url) }}"
  gather_facts: false
  tasks:
     - name: Create geoip directory if it doesn't exist
       file:
         path: "{{ geoip_dir }}"
         state: directory
         mode: 0700

     - name: "Downloading the latest GeoIP and comparing it to checksum {{ geoip_db_md5sum }}"
       get_url:
         url: "{{ geoip_db_url }}"
         dest: "{{ geoip_dir }}/{{ geoip_db_compressed_file_name }}"
         mode: 0600
         checksum: "{{ geoip_db_md5sum }}"

In the example above the url lookup retrieves the MD5 checksum from a remote file and assigns it to the checksum parameter passed to get_url. The downloaded file is only kept in the directory specified by the dest parameter if its checksum matches the value of geoip_db_md5sum. This is useful for showing how versatile ansible is, but the security-conscious admin should be bugging out about retrieving a payload and a checksum from the same server. Taking this a step further, let's say you retrieved the checksum from a secure source and assigned it to the variable geoip_db_md5sum. This variable can then be referenced by the get_url checksum parameter:

---
- hosts: localhost
  connection: local
  vars:
    geoip_dir: "/tmp/geoip"
    geoip_db_url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.tar.gz"
    geoip_db_md5sum_url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.tar.gz.md5"
    geoip_db_compressed_file_name: "{{ geoip_db_url | basename }}"
    geoip_db_uncompressed_file_name: "{{ geoip_db_url | basename | replace('.tar.gz','')}}"
    geoip_db_md5sum: "md5: ca82582c02c4a4e57ec9d23a97adaa72"
  gather_facts: false
  tasks:
     - name: Create geoip directory if it doesn't exist
       file:
         path: "{{ geoip_dir }}"
         state: directory
         mode: 0700

     - name: "Downloading the latest GeoIP file"
       get_url:
         url: "{{ geoip_db_url }}"
         dest: "{{ geoip_dir }}/{{ geoip_db_compressed_file_name }}"
         checksum: "{{ geoip_db_md5sum }}"
         mode: 0600

Simple, elegant, but wait, there's more! Ansible also has a stat module which you can use to retrieve the checksum of a file. This can be combined with get_url to achieve a similar result (this isn't the ideal way to solve this problem, but it's included to show the power of ansible):

---
- hosts: localhost
  connection: local
  vars:
    geoip_dir: "/tmp/geoip"
    geoip_db_url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.tar.gz"
    geoip_db_md5sum_url: "http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.tar.gz.md5"
    geoip_db_compressed_file_name: "{{ geoip_db_url | basename }}"
    geoip_db_uncompressed_file_name: "{{ geoip_db_url | basename | replace('.tar.gz','')}}"
    geoip_db_md5sum: "ca82582c02c4a4e57ec9d23a97adaa72"
  gather_facts: false
  tasks:
     - name: Create geoip directory if it doesn't exist
       file:
         path: "{{ geoip_dir }}"
         state: directory
         mode: 0700

     - name: "Downloading the latest GeoIP file"
       get_url:
         url: "{{ geoip_db_url }}"
         dest: "{{ geoip_dir }}/{{ geoip_db_compressed_file_name }}"
         mode: 0600

     - name: "Checking {{ geoip_db_compressed_file_name }} against checksum {{ geoip_db_md5sum }}"
       stat:
          path: "{{ geoip_dir }}/{{ geoip_db_compressed_file_name }}"
       register: st

     - name: extract the geo IP database
       unarchive:
         src: "{{ geoip_dir }}/{{ geoip_db_compressed_file_name }}"
         dest: "{{ geoip_dir }}"
       when: geoip_db_md5sum == st.stat.md5

In the example above the GeoIP database is downloaded, and then the stat module is called to retrieve a variety of file metadata. One piece of metadata is the md5 value, which contains an MD5 hash of the file passed as an argument to path. This value is compared against the value stored in geoip_db_md5sum, and if the comparison succeeds the next task (unarchive in this case) will run. Ansible supports a number of stronger hash algorithms (SHA256, SHA512, etc.) which should be used in place of MD5 if you have the ability to do so. Gotta love you some Ansible!
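
If you want to see what get_url and stat are doing under the hood, the whole comparison is a few lines of shell. The scratch file here is a stand-in for the downloaded database:

```shell
# Record a file's MD5 hash, then verify a copy against it: the same
# compare-then-proceed logic the playbooks above delegate to ansible.
f=$(mktemp)
printf 'GeoLite2 stand-in payload\n' > "$f"
expected=$(md5sum "$f" | awk '{print $1}')

# Later: verify before unpacking, and bail out on a mismatch.
actual=$(md5sum "$f" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "checksum ok"
else
    echo "checksum mismatch" >&2
fi
```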

Retrieving a file name from a URL with ansible

Ansible has several extremely powerful modules for interacting with web resources. The get_url module allows you to retrieve resources, uri allows you to interact with web services and the templating and filtering capabilities provided by Jinja2 allow you to slice and dice the results in a variety of ways. One pattern that I find useful is applying the basename filter to the URL to grab a file name:

---
- hosts: 127.0.0.1
  connection: local
  vars:
    url: "http://foo.com/a/b/c/filename.tar.gz"
    file_name: "{{ url | basename }}"
  gather_facts: false
  tasks:
     - name: "Grab the file name from {{ url }}"
       debug:
         msg: "Base name is {{ file_name }}"

After the filter is applied file_name will contain the last component in the “/” separated URL:

$ ansible-playbook get-basename.yml

PLAY [127.0.0.1] **************************************************************************************************

TASK [Grab the file name from http://foo.com/a/b/c/filename.tar.gz] ***********************************************
ok: [127.0.0.1] => {
    "msg": "Base name is filename.tar.gz"
}

PLAY RECAP ********************************************************************************************************
127.0.0.1                  : ok=1    changed=0    unreachable=0    failed=0
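
The same trick works in plain shell, since basename is purely string-based and doesn't care that its argument is a URL rather than a filesystem path:

```shell
# Grab the last "/"-separated component of a URL, just like Jinja2's
# basename filter does in the playbook above.
url="http://foo.com/a/b/c/filename.tar.gz"
file_name=$(basename "$url")
echo "$file_name"    # filename.tar.gz
```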

I’ll share some valid uses for this in a future blog post.

How elasticsearch bootstrap checks affect development and production mode

One of my friends reached out to me earlier this week to help him with an elasticsearch issue. He was trying to bring up a new cluster to see how ES compares to splunk and was getting a “bootstrap checks failed” error at startup. This was preventing his elasticsearch Java processes from starting and binding to the hostname he assigned to the network.host value. Here is a snippet of what I saw when I reviewed the logs:

[2017-08-19T11:31:25,457][ERROR][o.e.b.Bootstrap          ] [elastic01] node validation exception
[2] bootstrap checks failed
[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Elasticsearch has two modes of operation: development and production. In development mode elasticsearch binds to localhost, allowing you to tinker with settings, test features and break things without impacting other nodes on your network. In production mode elasticsearch binds to an external interface, allowing it to communicate with other nodes and form clusters. The mode is determined by your network settings: binding to a non-loopback address puts elasticsearch in production mode, where a number of bootstrap checks are enforced at startup instead of merely logged. These checks are put in place to protect your server from data corruption and network partitions, failure modes the developers have seen more than once:

“Collectively, we have a lot of experience with users suffering unexpected issues because they have not configured important settings. In previous versions of Elasticsearch, misconfiguration of some of these settings were logged as warnings. Understandably, users sometimes miss these log messages. To ensure that these settings receive the attention that they deserve, Elasticsearch has bootstrap checks upon startup.”

The settings the documentation is referring to are described in the important settings and system settings documentation. In my friend's case he hadn't increased vm.max_map_count or the number of file descriptors available to the elasticsearch Java process. Once he got these fixed up, his test cluster fired right up.
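
If you want to check a box before starting elasticsearch, both limits are easy to read from the shell. The thresholds (65536 file descriptors, vm.max_map_count of 262144) come straight from the bootstrap check errors above:

```shell
# Read the two limits the bootstrap checks complained about. Raising
# them is done in limits.conf (or via the systemd LimitNOFILE= setting)
# and with "sysctl -w vm.max_map_count=262144" respectively.
fds=$(ulimit -n)
map=$(cat /proc/sys/vm/max_map_count)
echo "nofile=$fds (need >= 65536), vm.max_map_count=$map (need >= 262144)"
```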

Forwarding the systemd journal to syslog

I am a big fan of the ELK stack and use it daily for various parts of my job. All of the logs from my physical systems, VMs and containers get funneled into elasticsearch, indexed and are available for me to slice and dice with Kibana. In addition to syslog data I also like to funnel the systemd journal into elasticsearch. This is easily accomplished by changing the journald.conf ForwardToSyslog configuration directive to yes:

$ sed -i.bak 's/^#ForwardToSyslog=.*/ForwardToSyslog=yes/' /etc/systemd/journald.conf

This small change will cause all journal entries to get routed to the local syslog daemon once systemd-journald is restarted. Once they are there you can set up your favorite log shipping solution to get them into your elasticsearch cluster.
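
Since a careless sed expression can mangle journald.conf (the stock file ships the directive commented out as #ForwardToSyslog=yes), I like to rehearse the substitution on a scratch copy first:

```shell
# Rehearse the journald.conf edit on a scratch copy: replace the whole
# commented-out line so we don't end up with "ForwardToSyslog=yes=yes".
conf=$(mktemp)
printf '[Journal]\n#ForwardToSyslog=yes\n' > "$conf"
sed -i 's/^#ForwardToSyslog=.*/ForwardToSyslog=yes/' "$conf"
grep ForwardToSyslog "$conf"    # ForwardToSyslog=yes
```

After making the same change to the real file, systemctl restart systemd-journald applies it.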

Becoming a GDB power user is a truly valuable thing

I've been around Linux and UNIX for quite some time, and one thing that has always piqued my interest is debugging broken software. Bryan Cantrill made some excellent points on why postmortem debugging is needed at DockerCon, and the following video is a must watch:

His point about restarting a broken container without root causing the source of the failure is SPOT ON! I also love his mad cow analogy. I've had the same mindset since I started managing infrastructure, and I find the whole root cause process exciting and fun. Who doesn't love looking at backtraces, registers and memory on the stack?! Most admins I've met like debugging, but they dread seeing the following:

$ ./badproc
Segmentation fault (core dumped) ./badproc

I on the other hand start to drool when a piece of software I manage (but didn’t write) encounters a fatal condition that leads to its demise. If I can’t locate a bug report with a fix I’ll grab a cup of coffee, ensure debugging symbols are present and fire up gdb to root cause the failure. My first experience root causing a segmentation violation was with snort. This was an extremely valuable learning experience and at the time the Internet had limited resources explaining stack layouts, memory organization and how gdb can be used to locate problems. Now that conferences and individuals are posting high quality material to Youtube two clicks will get you access to amazing gdb resources like this (all three videos are definitely worth watching):

We also have access to step-by-step software debugging guides like the one Brendan Gregg posted to his blog last year. This coming weekend I will be immersing myself in another epic debugging session and I can’t wait to see what I find (and learn). We all need to learn to embrace the unhappy signals that take down our applications. You learn a TON by doing so and make the opensource world better at the same time.