Creating Bind query log statistics with dnsrecon

A month or two back I was investigating a production issue and wanted to visualize our Bind query logs. The Bind statistics channel looked useful, but it didn't expose enough data to help me troubleshoot my issue. In the spirit of software reuse I looked at a few open-source query log parsing utilities, but the programs I found required MySQL and, once again, didn't provide the data I needed. My needs were pretty simple:

– Summarize queries by record type
– Show the top N records requested
– Show the top N clients querying the server
– Print DNS query histograms by minute
– Print DNS query histograms by hour
– Have an extended feature to list all of the clients querying a record
– Allow the records to be filtered by time periods

Instead of mucking around with these solutions I wrote dnsrecon. It takes one or more log files as arguments and produces a compact DNS query log report that can be viewed in a terminal window:

$ dnsrecon.py logs/* --histogram

Processing logfile ../logs/named.queries
Processing logfile ../logs/named.queries.0
Processing logfile ../logs/named.queries.1
Processing logfile ../logs/named.queries.2
Processing logfile ../logs/named.queries.3
Processing logfile ../logs/named.queries.4
Processing logfile ../logs/named.queries.5
Processing logfile ../logs/named.queries.6
Processing logfile ../logs/named.queries.7
Processing logfile ../logs/named.queries.8
Processing logfile ../logs/named.queries.9

Summary for 05-Nov-2016 10:31:36.230 - 08-Nov-2016 14:15:51.426

Total DNS_QUERIES processed : 9937837
  PTR    records requested : 6374013
  A      records requested : 3082344
  AAAA   records requested : 372332
  MX     records requested : 32593
  TXT    records requested : 23508
  SRV    records requested : 19815
  SOA    records requested : 19506
  NS     records requested : 6661
  DNSKEY records requested : 2286

Top  100  DNS names requested:
  prefetch.net : 81379
  sheldon.prefetch.net : 75244
  penny.prefetch.net : 54637
  ..... 

Top  100  DNS clients:
  blip.prefetch.net :  103680
  fmep.prefetch.net :  92486
  blurp.prefetch.net : 32456
  gorp.prefetch.net : 12324
  .....

Queries per minute:
  00: ******************* (149807)
  01: ******************* (149894)
  02: ******************************* (239495)
  03: *********************************************** (356239)
  04: ********************************************** (351916)
  05: ********************************************* (346121)
  06: ************************************************ (362635)
  07: ************************************************** (377293)
  08: ********************************************* (343376)
  .....

Queries per hour:
  00: ********* (325710)
  01: ********** (363579)
  02: ******** (304630)
  03: ******** (302274)
  04: ******** (296872)
  05: ******** (295430)
  .....
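
For the curious, the heart of the script is just a regular expression and a few counters. Here is a minimal sketch of the idea, not the actual dnsrecon source; the regex assumes BIND query log lines of the form "client 10.1.1.2#4321: query: prefetch.net IN A + (10.1.1.1)" and may need tweaking for your BIND version:

import re
import sys
from collections import Counter

# Pull the client address, query name, and record type out of a
# BIND query log line. Adjust the expression to match your format.
QUERY_RE = re.compile(r"client (?P<client>[^\s#]+)#\d+.*query: "
                      r"(?P<name>\S+) IN (?P<rtype>\S+)")

types, names, clients = Counter(), Counter(), Counter()

for path in sys.argv[1:]:
    print("Processing logfile", path)
    with open(path) as logfile:
        for line in logfile:
            match = QUERY_RE.search(line)
            if match:
                types[match.group("rtype")] += 1
                names[match.group("name")] += 1
                clients[match.group("client")] += 1

print("Total DNS_QUERIES processed :", sum(types.values()))
for rtype, count in types.most_common():
    print("  %-6s records requested : %d" % (rtype, count))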

Over the course of my IT career I can't recall how many times I've been asked IF a record is in use and WHO is using it. To help answer that question you can add the "--matrix" option to print domain names along with the names / IPs of the clients requesting those records. This produces a list similar to this:

prefetch.net
  |-- leonard.prefetch.net 87656
  |-- howard.prefetch.net 23456
  |-- bernadette.prefetch.net 3425

The top entry is the domain being requested and the entries below it are the clients asking questions about it. I'm looking to add the record type requested to the resolution matrix, as well as --start and --end arguments to allow data to be summarized during a specific time period. Shoot me a pull request if you enhance the script or see a better way to do something.

The subtleties between the NXDOMAIN, NOERROR and NODATA DNS response codes

This past weekend I spent some time troubleshooting a DNS timeout issue. During my debugging session I noticed some subtle differences when querying an A and AAAA record with dig:

$ dig  +answer @216.92.61.30 prefetch.net a
.......
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9624
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

$ dig  +answer @216.92.61.30 prefetch.net aaaa
.......
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16528
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

When I was interpreting the results I expected dig to report a response code of NODATA when I asked the DNS server for a resource record that didn't exist. Obviously that didn't happen. This led me to ask myself: what is the technical difference between NODATA, NOERROR and NXDOMAIN? Let's take a peek at the RCODE definitions in RFC 1035:

RCODE           Response code - this 4 bit field is set as part of
                responses.  The values have the following
                interpretation:

                0               No error condition

                1               Format error - The name server was
                                unable to interpret the query.

                2               Server failure - The name server was
                                unable to process this query due to a
                                problem with the name server.

                3               Name Error - Meaningful only for
                                responses from an authoritative name
                                server, this code signifies that the
                                domain name referenced in the query does
                                not exist.

NXDOMAIN (RCODE 3 above) is pretty straightforward. When this code is present, the name being requested doesn't exist in any shape or form:

$ dig +answer @216.92.61.30 gak.prefetch.net a

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 8782
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Now what about NODATA and NOERROR? Well, I found out there isn't an RCODE associated with NODATA. It's a pseudo response code (RFC 2308 describes it in detail) that has to be inferred from the contents of the answer, and dig represents NODATA by displaying NOERROR with an ANSWER count of zero. So what does NOERROR with an ANSWER of 0 actually represent? It means one or more resource records exist for the name, but there isn't a record matching the resource record type you asked for (A, AAAA, MX, etc.). This was a useful clarification for me and helped me isolate and fix the issue I was debugging. Sometimes the devil is in the details.
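
To see how this difference surfaces in code, here is a quick sketch using the dnspython library (version 2.x); this is my own illustration, not output from the troubleshooting session above:

import dns.resolver

try:
    # Ask for a AAAA record; swap in any name / type you are debugging
    for rdata in dns.resolver.resolve("prefetch.net", "AAAA"):
        print(rdata)
except dns.resolver.NXDOMAIN:
    # RCODE 3: the name doesn't exist in any shape or form
    print("NXDOMAIN")
except dns.resolver.NoAnswer:
    # NOERROR with ANSWER: 0 -- the name exists, but has no AAAA records
    print("NODATA")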

Bind’s strict zone checking feature is part of CentOS 6

I recently moved a bind installation from CentOS 5 to CentOS 6. As part of the move I built out a new server with CentOS 6, staged the bind chroot packages and then proceeded to copy all of the zone files from the CentOS 5 server to the CentOS 6 server. Once all the pieces were in place I attempted to start up bind. This failed, and I was greeted with the following error:

$ service named start
Starting named:
Error in named configuration: [FAILED]

There wasn’t anything in /var/log/messages to specifically state what the problem was, though when I reviewed the bind log file I noticed there were several “not loaded due to errors” messages in it:

$ grep "not loaded due to errors" named.log

07-Jan-2012 21:00:03.505 general: error: zone prefetch.net/IN: NS 'ns1.prod.prefetch.net' has no address records (A or AAAA)
07-Jan-2012 21:00:03.505 general: error: zone prefetch/IN: NS 'ns2.prod.prefetch.net' has no address records (A or AAAA)
07-Jan-2012 21:00:03.505 general: error: zone prefetch/IN: not loaded due to errors.

After reviewing the errors I noticed that the problematic zone files (I was not the original author of these) used forward references to entries in subzone files. This is a no-no, and while the CentOS 5 init script would happily start bind with these zones in place, the CentOS 6 init script runs the configuration and zone files through bind's zone checker before starting named and bails out if a zone fails to load. To allow me to bring up the server while I tracked down all of the offending zone files, I set DISABLE_ZONE_CHECKING to yes in /etc/sysconfig/named:

$ grep DISABLE /etc/sysconfig/named
DISABLE_ZONE_CHECKING="yes"

This allowed me to test the server to make sure it worked, and I will get the zone files corrected and run them through a zone file consistency check utility in the coming days. If you are moving from CentOS 5 to CentOS 6 you might want to watch out for this (ideally you would already have properly structured zone files!).
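
If you want to catch these problems before the init script does, the same checks can be run by hand with the named-checkconf and named-checkzone utilities that ship with bind (the zone file path below is illustrative):

$ named-checkconf -z /etc/named.conf
$ named-checkzone prefetch.net /var/named/prefetch.net.zone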

Configuring NSCD to cache DNS host lookups

I haven’t really spent that much time configuring nscd, so I thought I would take a crack at it this morning while sipping my cup of joe.

Looking at one of my production hosts, I queried the "hosts" cache statistics. This is the nscd cache that holds DNS lookup results. With the nscd daemon running, you can query the size and performance of the caches with the -g flag:


$ nscd -g   
CACHE: hosts

         CONFIG:
         enabled: yes
         per user cache: no
         avoid name service: no
         check file: yes
         check file interval: 0
         positive ttl: 0
         negative ttl: 0
         keep hot count: 20
         hint size: 2048
         max entries: 0 (unlimited)

         STATISTICS:
         positive hits: 0
         negative hits: 0
         positive misses: 0
         negative misses: 0
         total entries: 0
         queries queued: 0
         queries dropped: 0
         cache invalidations: 0
         cache hit rate:        0.0

Ugh! No bueno! So, out of the box, nscd isn't configured to cache anything. This means that every lookup this machine performs hits one of the DNS servers listed in /etc/resolv.conf. This adds load to our DNS servers, and increases the time the applications running on this box have to wait to do something useful. Looking at the configuration options for the "hosts" cache…


$ grep hosts /etc/nscd.conf 
        enable-cache            hosts           yes
        positive-time-to-live   hosts           0
        negative-time-to-live   hosts           0
        keep-hot-count          hosts           20
        check-files             hosts           yes

Hm. So positive-time-to-live is set to zero. Looking at the man page for /etc/nscd.conf…

positive-time-to-live cachename value
Sets the time-to-live for positive entries (successful
queries) in the specified cache. value is in integer
seconds. Larger values increase cache hit rates and
reduce mean response times, but increase problems with
cache coherence. Note that sites that push (update) NIS
maps nightly can set the value to be the equivalent of
12 hours or more with very good performance implications.

Ok, so let's set the cache age here to 60 seconds. It seems like a decent starting value…
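
The change itself is a one-liner in /etc/nscd.conf; the restart command below assumes Solaris's SMF-managed nscd and will differ on other platforms:

$ grep positive-time-to-live /etc/nscd.conf
        positive-time-to-live   hosts           60

$ svcadm restart name-service-cache

After making this change and restarting the daemon, here are some performance statistics of the hosts cache.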


CACHE: hosts

         CONFIG:
         enabled: yes
         per user cache: no
         avoid name service: no
         check file: yes
         check file interval: 0
         positive ttl: 60
         negative ttl: 0
         keep hot count: 20
         hint size: 2048
         max entries: 0 (unlimited)

        STATISTICS:
         positive hits: 143
         negative hits: 1
         positive misses: 20
         negative misses: 41
         total entries: 20
         queries queued: 0
         queries dropped: 0
         cache invalidations: 0
         cache hit rate:       70.2

Crazy. With only a 60 second cache enabled, we are now performing roughly 70% fewer DNS lookups; the hit rate is (positive hits + negative hits) / total lookups, or (143 + 1) / (143 + 1 + 20 + 41) ≈ 70.2%. This should be a significant performance win. By default, the keep-hot-count setting is 20. This is the number of entries nscd keeps current in the "hosts" cache. Looking at the man page for nscd.conf…


keep-hot-count cachename value

This attribute allows the administrator to set the
number of entries nscd(1M) is to keep current in the
specified cache. value is an integer number which should
approximate the number of entries frequently used during
the day.

So, raising positive-time-to-live to, say, 5 minutes won't have much value unless keep-hot-count is also raised. The cache age and the number of objects within the cache both need to be increased. Doing so will help keep your DNS servers idle, and your applications happy.
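
If you do raise both, the relevant entries in /etc/nscd.conf might look something like this (the keep-hot-count value is an illustrative guess; size it to roughly the number of distinct names your host looks up frequently):

        positive-time-to-live   hosts           300
        keep-hot-count          hosts           2048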

Using dnscap to debug DNS problems on Linux hosts

DNS can often make a SysAdmin's life difficult, since a misconfigured entry or a broken authoritative DNS server will cause things to fail in odd ways. If you are fortunate enough to use Linux on your servers and desktops, you have a slew of utilities available for digging into problems. I've discussed a few of my favourite DNS debugging utilities in past posts, and recently added the dnscap utility to this list.

Dnscap is a command line utility that allows you to view ALL of the DNS requests sent over an interface, in either a dig-like or binary format. While tcpdump and company simply display traffic to UDP and TCP port 53, dnscap actually decodes the entries and gives you everything you need to debug an issue in one place.

To use this super useful tool, run it with the "-i" option to specify the interface to monitor, along with "-g" (dump the output in dig format) or "-b" (dump the output in binary):

$ dnscap -i eth0 -g

;@ 2011-01-26 16:33:21.892326 - 56 octets via eth0 (msg #0)
;: [192.168.144.91]:56239 -> [192.168.86.2]:53
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62131
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;;	google.com, type = A, class = IN
--
;@ 2011-01-26 16:33:21.896426 - 240 octets via eth0 (msg #1)
;: [192.168.86.2]:53 -> [192.168.144.91]:56239
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62131
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 4, ADDITIONAL: 4
;;	google.com, type = A, class = IN
google.com.		1m31s IN A	74.125.157.99
google.com.		1m31s IN A	74.125.157.104
google.com.		1m31s IN A	74.125.157.147
google.com.		1d6h57m32s IN NS  ns2.google.com.
google.com.		1d6h57m32s IN NS  ns3.google.com.
google.com.		1d6h57m32s IN NS  ns4.google.com.
google.com.		1d6h57m32s IN NS  ns1.google.com.
ns1.google.com.		1d6h51m10s IN A  216.239.32.10
ns2.google.com.		1d6h51m10s IN A  216.239.34.10

The utility displays all of the DNS requests on your console, and you can review the detailed query and response data (including the authority and additional sections) along with the record information. This is extremely handy for debugging problems, and I'm glad I came across this awesome little utility!
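
If dnscap isn't packaged for your distribution, you can approximate the dig-style output with a few lines of Python and scapy. This is a rough sketch of my own, not part of dnscap; the interface name is an assumption, and sniffing typically requires root:

from scapy.all import DNS, sniff

def show_dns(pkt):
    # The qr flag is 0 for queries and 1 for responses
    if pkt.haslayer(DNS) and pkt[DNS].qd is not None:
        kind = "response" if pkt[DNS].qr else "query"
        print("%s: %s (answers: %d)" %
              (kind, pkt[DNS].qd.qname.decode(), pkt[DNS].ancount))

sniff(iface="eth0", filter="udp port 53", prn=show_dns, store=False)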

Configuring a caching only DNS server on Solaris hosts

While investigating a performance issue a few weeks back, I noticed that a couple of our Solaris hosts were sending tens of thousands of DNS requests to our authoritative DNS servers. Since the application was broken and unable to cache DNS records, I decided to configure a local caching only DNS server to reduce the load on our DNS servers.

Creating a caching only name server on a Solaris host is a piece of cake. To begin, you will need to create a directory to store the bind zone files:

$ mkdir -p /var/named/conf

After this directory is created, you will need to place the 127.0.0.1, localhost and root.hints files in the conf directory. You can grab the 127.0.0.1 and localhost files from my site, and the root.hints file can be generated with the dig utility:

$ dig @a.root-servers.net . ns > /var/named/conf/root.hints
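
If grabbing the files isn't an option, they are tiny and easy to recreate. Here is a sketch of a minimal localhost zone; the 127.0.0.1 file is the analogous reverse zone for 0.0.127.in-addr.arpa, with a PTR record pointing back to localhost:

$TTL 86400
@       IN      SOA     localhost. root.localhost. (
                        1        ; serial
                        3600     ; refresh
                        900      ; retry
                        604800   ; expire
                        86400 )  ; minimum
        IN      NS      localhost.
        IN      A       127.0.0.1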

Next you will need to create a BIND configuration file (a sample bind configuration file is also available on my site). The BIND packages that ship with Solaris check for this file in /etc/named.conf by default, so it's easiest to create it there (you can also hack the SMF start script, but that can get overwritten in the future and wipe out your changes).
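
Here's a sketch of what a minimal caching only configuration might look like, assuming the /var/named/conf layout created above (adjust the file names if yours differ):

options {
        directory "/var/named/conf";
        recursion yes;
        allow-query { localhost; };
};

zone "." IN {
        type hint;
        file "root.hints";
};

zone "localhost" IN {
        type master;
        file "localhost";
};

zone "0.0.127.in-addr.arpa" IN {
        type master;
        file "127.0.0.1";
};

To start the caching only DNS server, you can enable the dns/server SMF service: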

$ svcadm enable dns/server

If things started up properly, you should see log entries similar to the following in /var/adm/messages:

Jun 18 10:26:57 server named[7819]: [ID 873579 daemon.notice] starting BIND 9.6.1-P3
Jun 18 10:26:57 server named[7819]: [ID 873579 daemon.notice] built with --prefix=/usr --with-libtool --bindir=/usr/sbin --sbindir=/usr/sbin --libdir=/usr/lib/dns --sysconfdir=/etc --localstatedir=/var --with-openssl=/usr/sfw --enable-threads=yes --enable-devpoll=yes --enable-fixed-rrset --disable-openssl-version-check -DNS_RUN_PID_DIR=0

To test the caching only DNS server, you can use our trusty friend dig:

$ dig @127.0.0.1 a cnn.com

If that returns the correct A record, it's a safe bet that the caching only name server is doing its job! To point the host at the local caching server, you will need to replace the nameserver entries in /etc/resolv.conf with the following:

nameserver 127.0.0.1

This will force resolution to the DNS server bound to localhost, and allow the local machine to cache query responses. DNS caching is good stuff, and setting this up on a Solaris machine is a piece of cake!