Configuring NSCD to cache DNS host lookups

I haven’t really spent that much time configuring nscd, so I thought I would take a crack at it this morning while sipping my cup of joe.

Looking at one of my production hosts, I queried the statistics for the “hosts” cache, which is the nscd cache that stores the results of DNS lookups. With the nscd daemon running, you can query the size and performance of the caches with the -g flag.


$ nscd -g   
CACHE: hosts

         CONFIG:
         enabled: yes
         per user cache: no
         avoid name service: no
         check file: yes
         check file interval: 0
         positive ttl: 0
         negative ttl: 0
         keep hot count: 20
         hint size: 2048
         max entries: 0 (unlimited)

         STATISTICS:
         positive hits: 0
         negative hits: 0
         positive misses: 0
         negative misses: 0
         total entries: 0
         queries queued: 0
         queries dropped: 0
         cache invalidations: 0
         cache hit rate:        0.0

Ugh! No bueno! So, out of the box, nscd isn’t configured to cache anything. This means that every lookup this machine makes hits a DNS server listed in /etc/resolv.conf. That adds load to our DNS servers, and increases the time the applications running on this box have to wait before they can do something useful. Looking at the configuration options for the “hosts” cache…


$ grep hosts /etc/nscd.conf 
        enable-cache            hosts           yes
        positive-time-to-live   hosts           0
        negative-time-to-live   hosts           0
        keep-hot-count          hosts           20
        check-files             hosts           yes

Hm. So positive-time-to-live is set to zero. Looking at the man page for /etc/nscd.conf…

positive-time-to-live cachename value
Sets the time-to-live for positive entries (successful
queries) in the specified cache. value is in integer
seconds. Larger values increase cache hit rates and
reduce mean response times, but increase problems with
cache coherence. Note that sites that push (update) NIS
maps nightly can set the value to be the equivalent of
12 hours or more with very good performance implications.

Ok, so let’s set the cache age here to 60 seconds. It seems like a decent starting value…
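On this box that amounts to a one-line edit to /etc/nscd.conf followed by a daemon restart. Here is a minimal sketch of the change; the svcadm command assumes a Solaris 10 host, and older releases or Linux distributions will use their own init scripts instead:

$ grep hosts /etc/nscd.conf | grep positive
        positive-time-to-live   hosts           60

$ svcadm restart name-service-cache

After making this change and restarting the daemon, here are some performance statistics for the hosts cache.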


CACHE: hosts

         CONFIG:
         enabled: yes
         per user cache: no
         avoid name service: no
         check file: yes
         check file interval: 0
         positive ttl: 60
         negative ttl: 0
         keep hot count: 20
         hint size: 2048
         max entries: 0 (unlimited)

        STATISTICS:
         positive hits: 143
         negative hits: 1
         positive misses: 20
         negative misses: 41
         total entries: 20
         queries queued: 0
         queries dropped: 0
         cache invalidations: 0
         cache hit rate:       70.2

Crazy. With only a 60 second cache enabled, we are now performing roughly 70% fewer DNS lookups. This should translate into a significant performance improvement. By default, the keep-hot-count setting is 20, which controls how many entries nscd keeps current in the “hosts” cache. Looking at the man page for nscd.conf…


keep-hot-count cachename value

This attribute allows the administrator to set the
number of entries nscd(1M) is to keep current in the
specified cache. value is an integer number which should
approximate the number of entries frequently used during
the day.

So, raising positive-time-to-live to, say, 5 minutes won’t have much value unless keep-hot-count is also raised. The cache age and the number of entries within the cache both need to be increased. Doing so will help keep your DNS servers idle, and your applications happy.
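If I wanted to push this further, the hosts section of /etc/nscd.conf might end up looking something like the following. The 300 second TTL and 2048 entry hot count are illustrative values, not recommendations; per the man page, keep-hot-count should approximate the number of hosts the box talks to during the day:

$ grep hosts /etc/nscd.conf
        enable-cache            hosts           yes
        positive-time-to-live   hosts           300
        negative-time-to-live   hosts           0
        keep-hot-count          hosts           2048
        check-files             hosts           yes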

Who says Linux isn’t stable?

I have been replacing some old hardware over the past few months, and recently noticed that we had several machines with uptimes in the hundreds of days (one over 800 days). For the longest time I thought only Solaris and AIX provided this kind of stability, but over the past few years I’ve started to include Linux in this list as well. You gotta love it when you see this:

$ uptime
2:08pm up 428 days, 3:00, 1 user, load average: 0.63, 0.50, 0.36

Now if only ksplice would make it into the Enterprise distributions! That would be rad, and I’m sure some fun “my uptime is better than your uptime” threads would ensue. :)

Why the ext3 fsck’s after X days or Y mounts?

Reading through my RSS feeds, I came across a blog post describing how one Linux administrator used tune2fs to disable the “please run fsck on this file system after X days or Y mounts” behavior.

I’ve got to admit, this is kind of annoying. I’ve taken production-critical Linux boxes down for maintenance, only to have the downtime extended by 15-30 minutes because the file system was configured to run a fsck. Googling this topic even turns up other administrators trying all sorts of stupid tactics to avoid the fsck on reboot.

Is there really any value in having fsck run after some period of time? On Unix-based systems (and even on Windows), fsck (or chkdsk) normally only runs when the kernel notices that a file system is in an inconsistent state. So then I ask, why did the Linux community decide to run fsck on file systems that are in a consistent state? ZFS has a “scrub” operation that can be run against a dataset, but even that works by verifying block-level checksums. Ext2/3, ReiserFS, and XFS don’t perform block-level checksums (btrfs does), so why the need to run fsck after some period of time? Does running fsck give folks the warm n’ fuzzies that their data is clean, or is there some deeper technical reason why this is scheduled? If you have any answers / historical data, please feel free to share. =)
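For reference, these checks are driven by two counters stored in the superblock, which you can inspect and (if you’re so inclined) disable with tune2fs. The /dev/sda1 below is just a placeholder device:

# Show the current mount-count and time-based check settings
$ tune2fs -l /dev/sda1 | egrep -i 'mount count|check'

# Disable both the mount-count and the interval-based checks
$ tune2fs -c 0 -i 0 /dev/sda1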

Initng speeds up Linux boot times / provides service resiliency

One feature I really liked in Solaris 10 was SMF.  It provides a framework that uses service manifests to automatically respawn services should they die off.  It handles dependencies and restarts, and provides a single unified command set for managing the system: svcs, svccfg, and svcadm.
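As a quick illustration of what that unified command set looks like on a Solaris 10 host (ssh is just an example service):

# Check the state of a service
$ svcs ssh

# Explain why any services are offline or in maintenance
$ svcs -xv

# Restart the service; SMF will respawn it if it dies unexpectedly
$ svcadm restart ssh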

It looks like the Linux community has started to integrate some of these features with Initng, a modified init daemon that not only restarts defined services, but also improves boot time.  I’m going to be checking Initng out and will post some further findings.

Duplicate RPM names showing up in the rpm query output

I had to install some software last night on one of my 64-bit CentOS Linux hosts, and noticed that glibc was listed twice in my rpm query:

$ rpm -q -a | grep glibc-2.5-34
glibc-2.5-34
glibc-2.5-34

At first I thought my RPM package database was borked, but then it dawned on me that there were probably 32- and 64-bit packages installed. To verify this, I used a custom rpm query string that displayed the architecture in addition to the package name and version:

$ rpm -qa --qf "%{name}-%{version}-%{release}.%{arch}\n" | grep glibc-2.5-34
glibc-2.5-34.i686
glibc-2.5-34.x86_64

This was indeed the case, and listing the files in each package showed that the 32-bit libraries went into /lib, while the 64-bit libraries got stashed in /lib64. I’m not sure why the default rpm query output doesn’t contain the package architecture, but adding the following entry to /etc/rpm/macros appears to fix this (credit to the CentOS mailing list for the macro):

$ grep query /etc/rpm/macros
%_query_all_fmt %%{name}-%%{version}-%%{release}.%%{arch}

$ rpm -q -a | grep glibc-2.5
glibc-2.5-34.i686
glibc-2.5-34.x86_64