How I saved myself 3-months of frustration by finding and fixing a sendmail bug on a vendors appliance

I recently upgraded a vendors appliance to a new version of code to address a couple of known isues. After the upgrade I noticed that the appliance wasn’t sending e-mail (yes some vendors still use e-mail) alerts when problems were detected. Luckily the appliance runs Linux and provides shell access through SSH if you enable it. My previous support experiences have been less than stellar so I figured I would root cause the problem and send them a bug report instead of opening a support case. In this post I’ll walk through what I did to figure out the problem.

To start my detective work I SSH’ed into the device and reviewed the release file to see which distribution they were using. In this case it was a Fedora-derivative. Given this piece of information the server was most likely using: 1) sendmail 2) postfix or 3) a custom written application to send e-mail. When I checked the system logs (messages and maillog) I found this useful nugget:

Aug 30 16:23:33 localhost sendmail[11217]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon MTA: cannot bind: Address already in use

Bingo! The host is using sendmail. Running ps revealed that services were managed by systemd. To get detailed service status for the sendmail process I ran systemctl with the status option:

$ systemctl status -l sendmail

‚óŹ sendmail.service - Sendmail Mail Transport Agent
   Loaded: loaded (/etc/systemd/system/sendmail.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2017-08-30 14:56:07 UTC; 14min ago
  Process: 24095 ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG (code=exited, status=0/SUCCESS)
 Main PID: 24097 (code=exited, status=71)

Aug 30 14:55:47 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:52 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:52 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:55:57 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:55:57 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:02 gorp sendmail[24097]: NOQUEUE: SYSERR(root): opendaemonsocket: daemon Daemon1: cannot bind: Address already in use
Aug 30 14:56:02 gorp sendmail[24097]: daemon Daemon1: problem creating SMTP socket
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Main process exited, code=exited, status=71/n/a
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Unit entered failed state.
Aug 30 14:56:07 gorp systemd[1]: sendmail.service: Failed with result 'exit-code'.

The logs indicate that sendmail isn’t able to start because something is already bound to the address it’s trying to use. Let’s see who is using it:

$ netstat -pant | grep “:25”
$

No dice. So something in the initialization process must be the cause of the “cannot bind: Address already in use” error (more on that later). Next up, lets see how sendmail is started:

$ systemctl show sendmail | grep ExecStart

 
ExecStart=/usr/sbin/sendmail -bd -q $QUEUE $SENDMAIL_OPTARG

The systemctl show command will display the unit file for a service. The ExecStart line contains the command line used to start the daemon. To get more information I ran the command line through strace:

$ strace -f -e trace=socket,bind /usr/sbin/sendmail -bd -q1h

socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
+++ exited with 0 +++
Process 12216 attached
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Process 12217 attached
[pid 12217] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12217, si_status=0, si_utime=0, si_stime=0} ---
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=0, ptr=0x7fd900000000}} ---
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

If you aren’t familiar with UNIX Network Programming (man I miss Richard Stevens) the bind() system call associates an address with a socket. In the strace output above the first bind() associates 0.0.0.0 (otherwise known as INADDR_ANY or the wildcard address) with the socket on file descriptor 4. In the second call to bind() the daemon tries to associate 127.0.0.1 (the loopback interface) with the socket on file descriptor 5. That fails because the wildcard address (INADDR_ANY) encompasses all IPv4 addresses including 127.0.0.1.

Having spent a number of years supporting postfix and sendmail I had a good suspicion this was due to a misconfiguration. I popped over to /etc/mail and paged down until I came across the following two entries:

O DaemonPortOptions=Name=MTA
O DaemonPortOptions=Port=smtp,Addr=127.0.0.1, Name=MTA

If you aren’t familiar with sendmail it uses the DaemonPortOptions setting to control which addresses and ports to listen on. In the first entry sendmail isn’t explicilty given an address to listen on so it defaults to INADDR_ANY (I deduced this by reading through the sendmail daemon.c source). We can see that in the first bind():

bind(4, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("0.0.0.0>")}, 16) = 0

The second setting instructs sendmail to bind to address 127.0.0.1:

bind(5, {sa_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

Since the first bind operation is using INADDR_ANY (which includes 127.0.0.1) the system call fails and sendmail eventually exits (need to read more code this weekend to see why). To get e-mail working again I commented out the first line and restarted sendmail. Low and behold my inbox got flooded. Yay mail. :)

It’s always fun pondering how things like this come to be. I’m curious if the upgrade added the second line to stop sendmail from binding to INADDR_ANY (a nice addition though it’s not called out in the release notes). I’m also curious if the code that handles this failed and didn’t remove the other line or if this change wasn’t rigorously tested before it was released. I’ll never know the answer to my pontifications (yes I make up words) unfortunately. What I do know is that I truly appreciate how elegant the UNIX Network programming model is and wish I would have been able to meet Richard Stevens when he was alive. He will always be one of the most influential technical people in my life and I’m going to pour out a 40 and re-read UNIX Network Programming in October to celebrate his awesomeness!

Printing the entries in a sendmail queue

Since I began managing sendmail installations 8-years ago, I have used the mailq command to print the entries in the sendmail queue. While reading through the bat book to find the root cause of my sendmail DNS problems, I came across sendmail’s “-bp” option:

$ /usr/sbin/sendmail -bp

/var/spool/mqueue is empty
                Total requests: 0

This provides the same results as the mailq command:

$ mailq

/var/spool/mqueue is empty
                Total requests: 0

I am still amazed at how complex the sendmail program is. Rule 0, Rule 3, Rule 5, 8000 macros … ge gads!

Optimizing sendmail smart hosts

While debugging an issue last week, I noticed some odd behavior when the Sendmail Smart host configuration directive (DS) was used. Smart hosts allow nodes to delegate mail delivery services to upstream devices, alleviating the need to run a full blown mail relay on each host. Smart hosts are configured with the Sendmail MC SMART_HOST directive, or through the sendmail.cf/submit.cf DS variable:

$ grep DS /etc/mail/submit.cf
DSsmtp-relay.dns.net

While I was performing a packet capture to isolate a problem with mail delivery, I noticed that each outbound message triggered several odd DNS lookups:

$ snoop -d eri0 port 53

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.net. Internet MX ?
dnsserver.dns.net -> s1 DNS R

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.net.internal.dns.net. Internet MX ?
dnsserver.dns.net -> s1 DNS R Error: 3(Name Error)

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.net.corp.dns.net. Internet MX ?
dnsserver.dns.net -> s1 DNS R Error: 3(Name Error) 

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.net.dns.net. Internet MX ?
dnsserver.dns.net -> s1 DNS R Error: 3(Name Error)

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.net. Internet Unknown (28) ?
dnsserver.dns.net -> s1 DNS R

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.net. Internet Addr ?
dnsserver.dns.net -> s1 DNS R smtp-relay.dns.net. Internet Addr 1.2.3.4 

Upon closer inspection, I noticed that sendmail (actually the resolution libraries) were attempting to resolve the SMTP relay along with each entry in the search path. To fix this issue ( I think it’s an issue, since it’s wasting bandwidth and CPU cycles), I fully qualified the smart host in my submit.cf:

$ grep DS /etc/mail/submit.cf
DSsmtp-relay.dns.net.

Once this was changed, and sendmail was restarted, the resolver only requested entries for the FQDN:

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.com. Internet MX ?
dnsserver.dns.net -> s1 DNS R

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.com. Internet Unknown (28) ?
dnsserver.dns.net -> s1 DNS R

s1 -> dnsserver.dns.net DNS C smtp-relay.dns.com. Internet Addr ?
dnsserver.dns.net -> s1 DNS R smtp-relay.dns.com. Internet Addr 1.2.3.4

I just dusted off my bat book, and hope to find an MC setting to disable MX lookups for the smart host. That should kill four more UDP datagrams. :)

Binding sendmail to the loopback interface

The sendmail SMTP server comes with the vast majority of UNIX Operating systems, and is configured to listen for new connections on TCP ports *.25 (SMTP) and *.587 (MSP) by default. For workstation and servers that aren’t responsible for mail delivery, this can cause chaos when a new sendmail exploit is released into the wild. This behavior can be changed by adjusting the “DaemonPortOptions” in the sendmail configuration file (usually /etc/mail/sendmail.cf). The default configuration looks similar to the following:

O DaemonPortOptions=Name=MTA-v4, Family=inet
O DaemonPortOptions=Port=587, Name=MSA, M=E

If we add “Addr=127.0.0.1” to each entry, sendmail will only listen for new connections on the loopback interface:

O DaemonPortOptions=Addr=127.0.0.1,Port=25,Name=MTA
O DaemonPortOptions=Addr=127.0.0.1,Port=587,Name=MSA, M=E

Once the changes are integrated into the sendmail.cf file ( hand editing the sendmail.cf file or using M4 macros ), sendmail needs to be restarted. Once sendmail is restarted, we can view the new behavior with the netstat command:

$ netstat -an | egrep LISTEN | egrep ‘(25|587)’
127.0.0.1.25 *.* 0 0 49152 0 LISTEN
127.0.0.1.587 *.* 0 0 49152 0 LISTEN

Back in the sendmail 8.10/8.11 days, a smart relay could be used to forward mail, alleviating the need to run sendmail as a daemon. I am still trying to find a way to revert back to the old behavior, but the MSP seems to cause some issues when smart relays are in use. More to come …