How I saved myself 3-months of frustration by finding and fixing a sendmail bug on a vendors appliance


I recently upgraded a vendors appliance to a new version of code to address a couple of known isues. After the upgrade I noticed that the appliance wasn’t sending e-mail (yes some vendors still use e-mail) alerts when problems were detected. Luckily the appliance runs Linux and provides shell access through SSH if you enable it. My previous support experiences have been less than stellar so I figured I would root cause the problem and send them a bug report instead of opening a support case. In this post I’ll walk through what I did to figure out the problem.

To start my detective work I SSH’ed into the device and reviewed the release file to see which distribution they were using. In this case it was a Fedora-derivative. Given this piece of information the server was most likely using: 1) sendmail 2) postfix or 3) a custom written application to send e-mail. When I checked the system logs (messages and maillog) I found this useful nugget:

Aug 30 16:23:33 localhost sendmail[11217]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon MTA: cannot bind: Address already in use

Bingo! The host is using sendmail. Running ps revealed that services were managed by systemd. To get detailed service status for the sendmail process I ran systemctl with the status option:

$ systemctl status -l sendmail

● sendmail.service - Sendmail Mail Transport Agent
Loaded: loaded (/etc/systemd/system/sendmail.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2017-08-30 14:56:07 UTC; 14min ago
Process: 24095 ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG (code=exited, status=0/SUCCESS)
Main PID: 24097 (code=exited, status=71)

Aug 30 14:55:47 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:52 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:52 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:57 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:57 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:02 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:56:02 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Main process exited, code=exited, status=71/n/a
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Unit entered failed state.
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Failed with result 'exit-code'.

The logs indicate that sendmail isn’t able to start because something is already bound to the address it’s trying to use. Let’s see who is using it:

$ netstat -pant | grep ":25"
$

No dice. So something in the initialization process must be the cause of the “cannot bind: Address already in use” error (more on that later). Next up, lets see how sendmail is started:

$ systemctl show sendmail | grep ExecStart

ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG

The systemctl show command will display the unit file for a service. The ExecStart line contains the command line used to start the daemon. To get more information I ran the command line through strace:

$ strace -f -e trace=socket,bind /usr/sbin/sendmail -bd -q1h

socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
+++ exited with 0 +++
Process 12216 attached
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Process 12217 attached
[pid 12217] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12217, si_status=0, si_utime=0, si_stime=0} ---
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=0, ptr=0x7fd900000000}} ---
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

If you aren’t familiar with UNIX Network Programming (man I miss Richard Stevens) the bind() system call associates an address with a socket. In the strace output above the first bind() associates 0.0.0.0 (otherwise known as INADDR_ANY or the wildcard address) with the socket on file descriptor 4. In the second call to bind() the daemon tries to associate 127.0.0.1 (the loopback interface) with the socket on file descriptor 5. That fails because the wildcard address (INADDR_ANY) encompasses all IPv4 addresses including 127.0.0.1.

Having spent a number of years supporting postfix and sendmail I had a good suspicion this was due to a misconfiguration. I popped over to /etc/mail and paged down until I came across the following two entries:

O DaemonPortOptions=Name=MTA
O DaemonPortOptions=Port=smtp,Addr=127.0.0.1, Name=MTA

If you aren’t familiar with sendmail it uses the DaemonPortOptions setting to control which addresses and ports to listen on. In the first entry sendmail isn’t explicilty given an address to listen on so it defaults to INADDR_ANY (I deduced this by reading through the sendmail daemon.c source). We can see that in the first bind():

bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0>")}, 16) = 0

The second setting instructs sendmail to bind to address 127.0.0.1:

bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

Since the first bind operation is using INADDR_ANY (which includes 127.0.0.1) the system call fails and sendmail eventually exits (need to read more code this weekend to see why). To get e-mail working again I commented out the first line and restarted sendmail. Low and behold my inbox got flooded. Yay mail. :)

It’s always fun pondering how things like this come to be. I’m curious if the upgrade added the second line to stop sendmail from binding to INADDR_ANY (a nice addition though it’s not called out in the release notes). I’m also curious if the code that handles this failed and didn’t remove the other line or if this change wasn’t rigorously tested before it was released. I’ll never know the answer to my pontifications (yes I make up words) unfortunately. What I do know is that I truly appreciate how elegant the UNIX Network programming model is and wish I would have been able to meet Richard Stevens when he was alive. He will always be one of the most influential technical people in my life and I’m going to pour out a 40 and re-read UNIX Network Programming in October to celebrate his awesomeness!

This article was posted by Matty on 2017-08-30 15:55:00 -0400 -0400