Exporting Bind query statistics though XML and JSON

Bind 9.10 introduced a statistics server which exports a number of useful metrics through a web UI, XML and JSON. The statistics server is configured via the “statistics-channels” directive which contains the ip and port to export statistics on and an ACL to control who can read statistics from the server. Here is a sample configuration for reference:

acl "stats_hosts" {
        192.168.1.0/24;
};

statistics-channels { 
        inet 10.10.0.1  port 8080 allow { stats_hosts; }; 
};

Once the statistics server is enabled you can view the statistics in a web browser by surfing to the IP:PORT the server is configured to export statistics through. To retrieve statistics through XML or JSON you can append “/xml” or “/json” to the URL:

# Retrieve statistics through XML
$ curl http://bind:8080/xml

# Retrieve statistics through JSON
$ curl http://bind:8080/json

The statistics server exports several useful metrics. To view everything you can pipe the output of curl to jq:

$ curl -j http://bind:8080/json 2>/dev/null | jq '.' | more
{
  "json-stats-version": "1.2",
  "boot-time": "2017-09-10T13:24:35.411Z",
  "config-time": "2017-09-10T13:24:35.484Z",
  "current-time": "2017-09-10T13:35:44.401Z",
  "version": "9.11.2",
  "opcodes": {
    "QUERY": 389,
    "IQUERY": 0,
    "STATUS": 0,
    .....

If you want to get specific fields you can can adjust the filter passed to jq. To get just the query response codes you can retrieve the rcodes field:

$ curl -j http://bind:8080/json 2>/dev/null | jq '.rcodes'
{
  "NOERROR": 307,
  "FORMERR": 0,
  "SERVFAIL": 0,
  "NXDOMAIN": 0,
  "NOTIMP": 0,
  "REFUSED": 0,
  .....
}

To get the types of queries sent to the server you can retrieve qtypes:

$ curl -j http://bind:8080/json 2>/dev/null | jq '.qtypes'
{
  "A": 369,
  "NS": 1,
  "PTR": 1,
  "MX": 1,
  "AAAA": 11
}

To get overall name server statistics you can grab nsstats:

$ curl -j http://bind:8080/json 2>/dev/null | jq '.nsstats'
{
  "Requestv4": 385,
  "ReqEdns0": 361,
  "RecQryRej": 8,
  "Response": 385,
  "RespEDNS0": 361,
  "QrySuccess": 369,
  "QryAuthAns": 17,
  "QryNoauthAns": 360,
  "QryNxrrset": 8,
  "QryRecursion": 3,
  "QryFailure": 8,
  "QryUDP": 377
}

The statistics server also exports zone data and zone, network and memory statistics. Funneling this data into metricbeats or prometheus and using kibana and grafana to visualize it can provide some amazing insight into your DNS infrastructure.

Monitoring DNS domain name expiration with dns-domain-expiration-checker

Several years ago I wrote a simple bash script to check the expiration date of the DNS domains I own. At the time I wrote this purely for my own needs but after receiving 100s of e-mails from folks who were using it (and submitting patches) I decided to enhance it to be more useful. As time has gone on Registrar WHOIS data formats have changed and I came to the realization that there is no standard time format for expiration records. I needed a more suitable solution so I spent last weekend re-writing my original script in Python. The new version is available on github and solves all of the issues I previously encountered.

To work around the Registrar WHOIS data format issue I created a list of strings which can be easily extended as new ones are encountered (or changed):

EXPIRE_STRINGS = [ "Registry Expiry Date:",
                   "Expiration:",
                   "Domain Expiration Date"
                 ]

These strings are checked against the WHOIS data returned from `which whois` and if a match is found I save off $NF which should be the date. To get around the date formatting issues I pulled in the dateutil module which does an AMAZING job of normalizing dates. This module allows me to feed it random date formats which are then normalized to datetime objects which I can perform math on. The github README contains several examples showing how to use the script. The most basic form allows you to see expiration data for a domain or a set of domains in a file:

$ dns-domain-expiration-checker.py --domainname prefetch.net --interactive
Domain Name                Registrar             Expiration Date                 Days Left
prefetch.net               DNC Holdings, Inc.    2020-06-23 16:56:05             1056

The script also provides SMTP alerting and Nagios support will be added in the near future. If you use the script and encounter any bugs please shoot me an issue on github.

Creating Bind query log statistics with dnsrecon

A month or two back I was investigating a production issue and wanted to visualize our Bind query logs. The Bind statistics channel looked useful but there wasn’t enough data to help me troubleshoot my issue. In the spirit of software re-use I looked at a few opensource query log parsing utilities. The programs I found used MySQL and once again they didn’t have enough data to fit my needs. My needs were pretty simple:

– Summarize queries by record type
– Show the top # records requested
– Show the top # of clients querying the server
– Print DNS query histograms by minute
– Print DNS query histograms by hour
– Have an extended feature to list all of the clients querying a record
– Allow the records to be filtered by time periods

Instead of mucking around with these solutions I wrote dnsrecon. Dnsrecon takes one or more logs as an argument and produces a compact DNS query log report which can be viewed in a terminal window:

$ dnsrecon.py logs/* --histogram

Processing logfile ../logs/named.queries
Processing logfile ../logs/named.queries.0
Processing logfile ../logs/named.queries.1
Processing logfile ../logs/named.queries.2
Processing logfile ../logs/named.queries.3
Processing logfile ../logs/named.queries.4
Processing logfile ../logs/named.queries.5
Processing logfile ../logs/named.queries.6
Processing logfile ../logs/named.queries.7
Processing logfile ../logs/named.queries.8
Processing logfile ../logs/named.queries.9

Summary for 05-Nov-2016 10:31:36.230 - 08-Nov-2016 14:15:51.426

Total DNS_QUERIES processed : 9937837
  PTR    records requested : 6374013
  A      records requested : 3082344
  AAAA   records requested : 372332
  MX     records requested : 32593
  TXT    records requested : 23508
  SRV    records requested : 19815
  SOA    records requested : 19506
  NS     records requested : 6661
  DNSKEY records requested : 2286

Top  100  DNS names requested:
  prefetch.net : 81379
  sheldon.prefetch.net : 75244
  penny.prefetch.net : 54637
  ..... 

Top  100  DNS clients:
  blip.prefetch.net :  103680
  fmep.prefetch.net :  92486
  blurp.prefetch.net : 32456
  gorp.prefetch.net : 12324
  .....

Queries per minute:
  00: ******************* (149807)
  01: ******************* (149894)
  02: ******************************* (239495)
  03: *********************************************** (356239)
  04: ********************************************** (351916)
  05: ********************************************* (346121)
  06: ************************************************ (362635)
  07: ************************************************** (377293)
  08: ********************************************* (343376)
  .....

Queries per hour:
  00: ********* (325710)
  01: ********** (363579)
  02: ******** (304630)
  03: ******** (302274)
  04: ******** (296872)
  05: ******** (295430)
  .....

Over the course of my IT career I can’t recall how many times I’ve been asked IF a record is in use and WHO is using it. To help answer that question you can add the “–matrix” option to print domain names along with the names / IPs of the clients requesting those records. This produces a list similar to this:

prefetch.net
  |-- leonard.prefetch.net 87656
  |-- howard.prefetch.net 23456
  |-- bernadette.prefetch.net 3425

The top entry is the domain being requested and the entries below it are the clients asking questions about it. I’m looking to add the record type requested to the resolution matrix as well as –start and –end arguments to allow data to be summarized during a specific time period. Shoot me a pull request if you enhance the script or see a better way to do something.

The subtleties between the NXDOMAIN, NOERROR and NODATA DNS response codes

This past weekend I spent some time troubleshooting a DNS timeout issue. During my debugging session I noticed some subtle differences when querying an A and AAAA record with dig:

$ dig  +answer @216.92.61.30 prefetch.net a
.......
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9624
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

$ dig  +answer @216.92.61.30 prefetch.net aaaa
.......
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16528
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

When I was interpreting the results I was expecting dig to provide a response code of NODATA when I asked the DNS server for a resource record that didn’t exist. Obviously that didn’t happen. This led me to ask myself what is the technical difference between NODATA, NOERROR and NXDOMAIN? Let’s take a peak at the descriptions of each in RFC 1035:

RCODE           Response code - this 4 bit field is set as part of
                responses.  The values have the following
                interpretation:

                0               No error condition

                1               Format error - The name server was
                                unable to interpret the query.

                2               Server failure - The name server was
                                unable to process this query due to a
                                problem with the name server.

                3               Name Error - Meaningful only for
                                responses from an authoritative name
                                server, this code signifies that the
                                domain name referenced in the query does
                                not exist.

NXDOMAIN (RCODE 3 above) is pretty straight forward. When this code is present the record being requested doesn’t exist in any shape or form:

$ dig +answer @216.92.61.30 gak.prefetch.net a

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 8782
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Now what about NODATA and NOERROR? Well I found out there isn’t an RCODE associated with NODATA. This value is computed based on the answers to one or more questions and dig represents NODATA by displaying NOERROR with an ANSWER of zero. So what does NOERROR with an ANSWER of 0 actually represent? It means one or more resource records exist for this domain but there isn’t a record matching the resource record type (A, AAAA, MX, etc.). This was a useful clarification for me and helped me isolate and fix the issue I was debugging. Sometimes the devil is in the details.

Bind’s strict zone checking feature is part of CentOS 6

I recently moved a bind installation from CentOS 5 to CentOS 6. As part of the move I built out a new server with CentOS 6, staged the bind chroot packages and then proceeded to copy all of the zone files from the CentOS 5 server to the CentOS 6 server. Once all the pieces were in place I attempted to start up bind. This failed, and I was greeted with the following error:

$ service named start
Starting named:
Error in named configuration: [FAILED]

There wasn’t anything in /var/log/messages to specifically state what the problem was, though when I reviewed the bind log file I noticed there were several “not loaded due to errors” messages in it:

$ grep “not loaded due to errors” named.log

07-Jan-2012 21:00:03.505 general: error: zone prefetch.net/IN: NS 'ns1.prod.prefetch.net' has no address records (A or AAAA)
07-Jan-2012 21:00:03.505 general: error: zone prefetch/IN: NS 'ns2.prod.prefetch.net' has no address records (A or AAAA)
07-Jan-2012 21:00:03.505 general: error: zone prefetch/IN: not loaded due to errors.

After reviewing the errors I noticed that the problematic zone files (I was not the original author of these) were configured to use forward references to entries in subzone files. This is a no no, and it looks like CentOS 5 bind allows you to use forward references and CentOS 6 bind does not. To allow me to bring up the server while I tracked down all of the offending zone files I set DISABLE_ZONE_CHECKING to yes in /etc/sysconfig/named:

$ grep DISABLE /etc/sysconfig/named
DISABLE_ZONE_CHECKING=”yes”

This allowed me to test the server to make sure it worked, and I will get the zone files corrected and run through a zone file consistency check utility in the coming days. If you are moving from CentOS 5 to CentOS 6 you might want to watch out for this (ideally you would already have properly structured zone files!).

Configuring NSCD to cache DNS host lookups

I haven’t really spent that much time configuring nscd, so I thought I would take a crack at it this morning while sipping my cup of joe.

Looking at one of my production hosts, I queried for the “host” cache statistics. This is the nscd cache which keeps DNS lookups. With the nscd daemon running, you can query the size / performance of the caches with the -g flag.


$ nscd -g   
CACHE: hosts

         CONFIG:
         enabled: yes
         per user cache: no
         avoid name service: no
         check file: yes
         check file interval: 0
         positive ttl: 0
         negative ttl: 0
         keep hot count: 20
         hint size: 2048
         max entries: 0 (unlimited)

         STATISTICS:
         positive hits: 0
         negative hits: 0
         positive misses: 0
         negative misses: 0
         total entries: 0
         queries queued: 0
         queries dropped: 0
         cache invalidations: 0
         cache hit rate:        0.0

Ugh! No bueno! So, out of the box, nscd isn’t configured to cache anything. This means that every request this machines does is hitting a DNS server in /etc/resolv.conf. This adds overhead to our DNS servers, and increases the time the applications running on this box have to wait to do something useful. Looking at the configuration options for the “host” cache…


$ grep hosts /etc/nscd.conf 
        enable-cache            hosts           yes
        positive-time-to-live   hosts           0
        negative-time-to-live   hosts           0
        keep-hot-count          hosts           20
        check-files             hosts           yes

Hm. So positive-time-to-live is set to zero. Looking at the man page for /etc/nscd.conf…

positive-time-to-live cachename value
Sets the time-to-live for positive entries (successful
queries) in the specified cache. value is in integer
seconds. Larger values increase cache hit rates and
reduce mean response times, but increase problems with
cache coherence. Note that sites that push (update) NIS
maps nightly can set the value to be the equivalent of
12 hours or more with very good performance implica-
tions.

Ok, so lets set the cache age here to 60 seconds. It seems like a decent starting value… After making this change, and restarting the daemon, here are some performance statistics of the host cache.


CACHE: hosts

         CONFIG:
         enabled: yes
         per user cache: no
         avoid name service: no
         check file: yes
         check file interval: 0
         positive ttl: 60
         negative ttl: 0
         keep hot count: 20
         hint size: 2048
         max entries: 0 (unlimited)

        STATISTICS:
         positive hits: 143
         negative hits: 1
         positive misses: 20
         negative misses: 41
         total entries: 20
         queries queued: 0
         queries dropped: 0
         cache invalidations: 0
         cache hit rate:       70.2

Crazy. Enabling only a 60s cache, we are now performing 70% less DNS lookups. This is going to have a significant performance improvement. By default, the setting keep-hot-count is set to 20. This is the number of objects allowed in the “hosts” cache. Looking at the man page for nscd.conf…


keep-hot-count cachename value

This attribute allows the administrator to set the
number of entries nscd(1M) is to keep current in the
specified cache. value is an integer number which should
approximate the number of entries frequently used during
the day.

So, raising positive-time-to-live to say, 5 minutes wont have much value unless keep-hot-count is also raised. The cache age, and the number of objects within the cache both need to be increased. Doing so will help keep your DNS servers idle, and applications happy.