Brendan Gregg amazes me again

I spent some time last night catching up with mailing lists, and saw that Brendan Gregg recently added DTrace SDT probes to one of the JavaScript engines. If you don’t know who Brendan is, he is a brilliant guy, and the author of the DTraceToolkit among other things. I always love reading about his work, and I find it interesting that he recently joined Sun Microsystems. Sun is extremely lucky to have such a great guy working for them, and I can’t say enough about how much I respect Brendan and his work.

NFSv3 guarded writes

While debugging another NFSv3 problem this week, I came across a create procedure with the “GUARDED” flag set:

$ snoop -d hme0 host netapp1

yappy -> netapp1      NFS C CREATE3 FH=C5D2 (GUARDED) file.dat
     netapp1 -> yappy NFS R CREATE3 OK FH=E700

This was the first time I had reviewed an NFSv3 packet capture with the GUARDED flag set, so I decided to read RFC 1813 to see how the create procedure should be implemented. Here is what the RFC says about the GUARDED flag:

“Creation modes:
One of UNCHECKED, GUARDED, and EXCLUSIVE. UNCHECKED means that the file should be created without checking for the existence of a duplicate file in the same directory. In this case, how.obj_attributes is a sattr3 describing the initial attributes for the file. GUARDED specifies that the server should check for the presence of a duplicate file before performing the create and should fail the request with NFS3ERR_EXIST if a duplicate file exists. If the file does not exist, the request is performed as described for UNCHECKED. EXCLUSIVE specifies that the server is to follow exclusive creation semantics, using the verifier to ensure exclusive creation of the target. No attributes may be provided in this case, since the server may use the target file metadata to store the createverf3 verifier.”

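This is rather interesting. Note that the creation mode is separate from the permission mode passed as the second argument to creat(): per the text above, the initial attributes travel in how.obj_attributes for UNCHECKED and GUARDED creates, while the creation mode tells the server how to treat a file that already exists. RFC 1813 is extremely well written, and a great read if you’re looking to learn more about NFS!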
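To make the three creation modes concrete, here is a minimal Python sketch of how a server might act on them. It is a toy model under stated assumptions, not a real NFS server: ToyDirectory, Nfs3ErrExist and the in-memory dictionary are hypothetical names invented for illustration, and the verifier handling for EXCLUSIVE (a retransmitted request with the same verifier succeeds) is my reading of RFC 1813 rather than something shown in the capture.

```python
from enum import Enum


class CreateMode(Enum):
    """createmode3 values as defined in RFC 1813."""
    UNCHECKED = 0
    GUARDED = 1
    EXCLUSIVE = 2


class Nfs3ErrExist(Exception):
    """Stands in for the NFS3ERR_EXIST status code."""


class ToyDirectory:
    """Hypothetical in-memory directory, for illustration only."""

    def __init__(self):
        self.files = {}  # name -> {"attrs": ..., "verf": ...}

    def create(self, name, mode, attrs=None, verf=None):
        exists = name in self.files
        if mode is CreateMode.GUARDED and exists:
            # GUARDED: fail if a duplicate is already present.
            raise Nfs3ErrExist(name)
        if mode is CreateMode.EXCLUSIVE:
            if exists:
                # Same verifier means a retransmission of our own create.
                if self.files[name]["verf"] == verf:
                    return name
                raise Nfs3ErrExist(name)
            # No attributes are accepted; only the verifier is recorded.
            self.files[name] = {"attrs": None, "verf": verf}
            return name
        # UNCHECKED, or GUARDED with no duplicate: create with the
        # supplied attributes (simplified: an existing entry is replaced).
        self.files[name] = {"attrs": attrs, "verf": None}
        return name
```

With this model, a second GUARDED create of file.dat raises Nfs3ErrExist, while an UNCHECKED create of the same name goes through.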

Asynchronous writes in NFSv3

While debugging an interesting NFSv3 problem last week, I came across the following information in one of my snoop captures:

$ snoop -d hme0 host netapp1
yappy -> netapp1 NFS C WRITE3 FH=94EC at 0 for 824 (ASYNC)
netapp1 -> yappy NFS R WRITE3 OK 824 (FSYNC)
yappy -> nfsfly1 NFS C WRITE3 FH=30D9 at 0 for 285 (ASYNC)
netapp1 -> yappy NFS R WRITE3 OK 285 (FSYNC)
yappy -> nfsfly1 NFS C WRITE3 FH=36C6 at 0 for 231 (ASYNC)
netapp1 -> yappy NFS R WRITE3 OK 231 (FSYNC)

Prior to reviewing the snoop capture, I didn’t know that NFSv3 supported asynchronous I/O. While reading up on the topic, I came across the following nugget:

“NFS Version 3 introduces the concept of “safe asynchronous writes.” A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write.”

This explained why we were seeing what we were seeing (more on that later), and is one of those things to keep in the back of your mind. One interesting tidbit to note: the Solaris snoop command displays the string “ASYNC” for an asynchronous write procedure, but RFC 1813 uses the term “UNSTABLE” for the same thing. It’s a small nit, but the term mismatch can lead to some confusion if you’re using the RFCs verbatim to interpret the output of snoop “-vv”.
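To keep the two vocabularies straight, a small lookup table helps. The sketch below maps the labels snoop prints to the stable_how names and values from RFC 1813. The ASYNC and FSYNC labels come from the capture above; the DSYNC label for DATA_SYNC is an assumption on my part, and rfc_name is a hypothetical helper. Since the reply carries the same enum, the “(FSYNC)” in the responses suggests the filer committed the data with FILE_SYNC stability even though the client only asked for UNSTABLE.

```python
from enum import IntEnum


class StableHow(IntEnum):
    """stable_how values as defined in RFC 1813."""
    UNSTABLE = 0
    DATA_SYNC = 1
    FILE_SYNC = 2


# snoop label -> RFC 1813 name
SNOOP_LABELS = {
    "ASYNC": StableHow.UNSTABLE,
    "FSYNC": StableHow.FILE_SYNC,
    "DSYNC": StableHow.DATA_SYNC,  # assumption: label not seen in the capture
}


def rfc_name(snoop_label):
    """Translate a snoop label such as "(ASYNC)" to the RFC 1813 term."""
    return SNOOP_LABELS[snoop_label.strip("()").upper()].name


print(rfc_name("(ASYNC)"))  # UNSTABLE
print(rfc_name("(FSYNC)"))  # FILE_SYNC
```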

Limiting how much memory BIND can use

I support BIND on a few servers, and when run as a caching name server, BIND can consume a fair amount of memory if you have lots of clients. There are two ways to restrict the amount of memory BIND uses. The first method, which is described in Pro DNS and BIND, is to set the “datasize” variable to the total amount of memory you want to allocate to BIND. The book provides an awesome description of this variable:

“The maximum amount of data memory the server may use. The default is default. This is a hard limit on server memory usage. If the server attempts to allocate memory in excess of this limit, the allocation will fail, which may in turn leave the server unable to perform DNS service. Therefore, this option is rarely useful as a way of limiting the amount of memory used by the server, but it can be used to raise an operating system data size limit that is too small by default. If you wish to limit the amount of memory used by the server, use the max-cache-size and recursive-clients options instead.”

The datasize variable is definitely useful in some cases, but can lead to server failures if BIND attempts to allocate memory above the threshold defined by datasize. A better method to limit memory is to use the max-cache-size variable, which will cause BIND to expire entries when it approaches the memory limit defined by max-cache-size. The Pro DNS and BIND book provides the following description of max-cache-size:

“The maximum amount of memory to use for the server’s cache, in bytes. When the amount of data in the cache reaches this limit, the server will cause records to expire prematurely so that the limit is not exceeded. In a server with multiple views, the limit applies separately to the cache of each view. The default is unlimited, meaning that records are purged from the cache only when their TTLs expire.”
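As a rough illustration, the options described above would sit in the options block of named.conf along these lines. The 256M and 1000 values are placeholders I picked for the example, not recommendations, so size them for your own servers.

```
options {
    // Cap the cache; records are expired early once the limit is reached.
    max-cache-size 256M;

    // Limit the number of simultaneous recursive lookups for clients.
    recursive-clients 1000;
};
```

Because max-cache-size applies per view, a server with several views can end up using a multiple of this figure.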

If you manage servers running BIND, I highly recommend picking up a copy of Pro DNS and BIND. It is an AWESOME book, and should be on every DNS admin’s bookshelf.

New version of ssl-cert-check

I got a couple of patches for ssl-cert-check, and released version 3.4 to my website. The patches address a couple of annoying bugs, and I changed the global binary paths to work by default on Solaris and BSD systems. If you haven’t used ssl-cert-check before, you can check out my article Proactively handling SSL certificate expiration with ssl-cert-check to see what it does.

Determining if an application is using random vs. sequential I/O

The DTraceToolkit comes with two super useful scripts to observe the “randomness” or “sequentialness” of an application. The first script is iopattern, which provides a system-wide view of random and sequential I/O, the total amount of I/O generated, and an I/O size distribution:

$ iopattern 5

%RAN %SEQ  COUNT    MIN    MAX    AVG     KR     KW
 100    0      9    512   4096   1706      0     15
 100    0     19    512   1024    592      0     11
 100    0      4    512    512    512      0      2
 100    0      8    512   4096   1856      0     14

The second script is seeksize.d. Seeksize.d provides a histogram of the number of sectors traversed between I/O operations for each process on the system:

$ seeksize.d

    1615  /usr/lib/ssh/sshd

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@                                  2
            8192 |@@@@@@@@@@@                              3
           16384 |@@@@@@@@@@@                              3
           32768 |@@@@@@@@@@@                              3
           65536 |                                         0

    [ ..... ]

While discussing these scripts last night, I didn’t provide many details on how they actually work. Each script uses the io provider to detect when an I/O occurs, and the value of “b_blkno” to determine which block is being read into memory or written to disk. The previous block is saved in a variable so it can be compared with the current block, and the result is stored in a nifty DTrace data type called an aggregation. This is pretty interesting stuff, and I apologize for not providing further details during my talk.
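The idea is easier to see outside of D. Here is a minimal Python sketch of the same technique: remember the previous block number, compute the distance to the current one, and drop the result into power-of-two buckets the way a quantize() aggregation does. The function names and the sample block numbers are made up for illustration, and the sketch ignores the size of each I/O, so it is a simplification rather than a port of either script.

```python
from collections import Counter


def seek_distances(blocks):
    """Yield the absolute distance between consecutive block numbers."""
    prev = None
    for blk in blocks:
        if prev is not None:
            yield abs(blk - prev)
        prev = blk  # remember this block for the next comparison


def quantize(values):
    """Count values in power-of-two buckets keyed by their lower bound."""
    hist = Counter()
    for v in values:
        bucket = 0 if v == 0 else 1 << (v.bit_length() - 1)
        hist[bucket] += 1
    return dict(sorted(hist.items()))


# Hypothetical block numbers: three nearby I/Os, a long seek, one more nearby.
blocks = [100, 108, 116, 5000, 5008]
print(quantize(seek_distances(blocks)))  # {8: 3, 4096: 1}
```

A mostly sequential workload piles up in the small buckets, while a random one spreads across the large ones, which is what the sshd histogram above is showing.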