Finding busy disks with iostat

The iostat(1M) utility provides several I/O statistics, which can be useful for analyzing I/O workloads and troubleshooting performance problems. When investigating I/O problems, I usually start by reviewing the number of reads and writes to a device, which are available in iostat’s “r/s” and “w/s” columns:

$ iostat -zxnM 5

                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0

Once I know how many reads and writes are being issued, I like to find the number of megabytes read from and written to each device. This information is available in iostat’s “Mr/s” and “Mw/s” columns:

$ iostat -zxnM 5

                    extended device statistics              
   r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0
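
Dividing the throughput by the operation rate gives the average transfer size, which is handy for telling small, random I/O apart from large, sequential I/O. Using the sample output above with bc(1):

$ bc
scale=2
10.6 / 85.2
.12
2.6 / 22.3
.11

In this sample the reads and writes both average roughly 0.12 megabytes (around 120k) apiece.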

After reviewing these items, I like to check iostat’s “wait” value to see the I/O queue depth for each device:

$ iostat -zxnM 5

                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0

To see how these can be applied to a real problem, I captured the following data from device c0t0d0 a week or two back:

$ iostat -zxnM 5

               extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2   71.2    0.0    7.7 787.3  2.0 11026.8   28.0 100 100 c0t0d0

Device c0t0d0 was overloaded, had 787 I/O operations waiting to be serviced, and was causing application latency (the application in question performed lots of reads and writes, and its files were opened with O_SYNC). Once iostat returned the statistics above, I used ps(1) to find the processes that were causing the excessive disk activity, and used kill(1) to terminate them!

* The Solaris iostat utility was used to produce this output.
** The first iostat line contains averages since the system was booted, and should be ignored.
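
As an aside, on systems where DTrace is available there is a quicker way than ps(1) to attribute disk activity to processes: the io provider can count I/O requests by process name. The following one-liner is a sketch (run as root), not the exact command I used at the time:

$ dtrace -n 'io:::start { @[execname] = count(); }'

Let it run for a few seconds, hit Ctrl-C, and the busiest processes will be printed at the bottom of the aggregation.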

mpstat’s wt (wait on I/O) statistic

I have used mpstat over the years to check system performance metrics, and always wanted to know what went on behind the scenes. With the recent release of opensolaris, I can finally find out how mpstat gathers the statistics it displays (it uses the kstat interfaces). One statistic that has always been confusing is “wt.” According to the mpstat man page, the wt column means:

wt percent wait time

This definition is rather vague, so I read Jim Mauro’s article to get additional information on what wt meant. After reading Jim’s article, I started to think that wt was worthless, but decided to verify my hypothesis with the experts in the opensolaris performance forums. Two minutes after posting my wt question, I received the following reply:

From: johansen
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Wait I/O Calculations

%wio is a meaningless statistic. On Solaris 9, each thread that was blocked in biowait(), waiting for I/O, would lead to a counter being incremented. When the I/O completed, the counter was decremented. This counter was what was used to generate %wio. The problem is that we attempted to normalize the I/O-wait as portion of CPU time, when it is really the count of threads sleeping for block-I/O. Attempting to decompose the number of threads blocked for I/O into a percentage of CPU-time-used turns out to be both confusing and not particularly useful.

Because %wio has led to so much confusion, and didn’t represent any real quantity of CPU time, we’ve removed the statistic in Solaris 10. %wio will now always return zero. If you’re interested in more useful I/O statistics, check out iostat(1M), which will correctly show you the percentage of time each I/O transaction spends waiting to be serviced.

This reaffirmed my beliefs, and comes from the people who wrote the code that runs your servers. :) The opensolaris project is rad!
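
If you are curious about the raw counters mpstat reads, the kstat(1M) utility will dump them. This is just a sketch, and the exact statistic names vary between releases:

$ kstat -m cpu_stat -i 0

On Solaris 10 the wait-related counters should still appear in this output for compatibility, but as the reply above explains they are no longer incremented, which is why the wt column always shows zero.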

Solaris B_FAILFAST messages

After patching my box a few months back, I noticed the following messages on the console after I rebooted:

Apr 15 15:23:15 winnie md_mirror: [ID 437521 kern.info] NOTICE: md: d1: B_FAILFAST I/O disabled
Apr 15 15:23:15 winnie md_mirror: [ID 437521 kern.info] NOTICE: md: d2: B_FAILFAST I/O disabled
Apr 15 15:23:15 winnie md_mirror: [ID 437521 kern.info] NOTICE: md: d1: B_FAILFAST I/O disabled
Apr 15 15:23:15 winnie md_mirror: [ID 437521 kern.info] NOTICE: md: d2: B_FAILFAST I/O disabled

Due to the nature of the message, I grew somewhat frantic, and began feverishly scouring the Internet for answers. It didn’t take long to stumble across the following Sunsolve article. According to Sunsolve, these are not errors, and the metadevices are not faulted:

“These are not errors. These are informational (NOTICE) messages telling you that the driver underlying your metadevices does not support the B_FAILFAST option and, as a result, that capability will not be active for speedier recovery. There is no way to disable these messages.”

Growing Solaris UFS file systems

I recently needed to grow a Solaris UFS file system, and accomplished this with the growfs(1M) utility. The growfs(1M) utility takes two arguments. The first argument (the value passed to “-M”) is the mount point of the file system to grow. The second argument is the raw device that backs this mount point. The following example will grow “/test” to the maximum size available on the metadevice d100:

$ growfs -M /test /dev/md/rdsk/d100

To see how many sectors will be available on d100 after the grow operation completes, you can run newfs with the “-N” option and compare that with the current value reported by df(1M):

$ newfs -N /dev/md/dsk/d100
/dev/md/rdsk/d0: 232331520 sectors in 56944 cylinders of 16 tracks, 255 sectors
113443.1MB in 2191 cyl groups (26 c/g, 51.80MB/g, 6400 i/g)

This will report the number of sectors, cylinders and MBs that would be allocated if a new file system was created on metadevice d100. As always, test everything on a non-critical system prior to making changes to critical boxen.
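
Since newfs reports 512-byte sectors and df -k reports kilobytes, a quick conversion makes the two values easier to compare. Using the sector count from the newfs output above:

$ bc
scale=1
232331520 / 2
116165760.0
232331520 / 2048
113443.1

The first figure is the device size in kilobytes (which can be lined up, minus file system overhead, against the output of df -k), and the second is the same value in megabytes, matching the 113443.1MB that newfs printed.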

Solaris’s maxcontig setting

After reading through the following UFS tuning information:

http://www.solarisinternals.com/si/reading/fs2/fs2.html

I started playing with the UFS “maxcontig” tunable. This value controls the maximum number of contiguous file system blocks that will be read or written in a single I/O operation. Each UFS file system contains a maxcontig value, which can be printed with the Solaris “fstyp” command:

$ fstyp -v /dev/md/dsk/d0 |more

ufs
magic   11954   format  dynamic time    Fri Jan 14 09:47:19 2005
sblkno  16      cblkno  24      iblkno  32      dblkno  832
sbsize  2048    cgsize  8192    cgoffset 128    cgmask  0xfffffff0
ncg     2191    size    116165760       blocks  114377853
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
frag    8       shift   3       fsbtodb 1
minfree 1%      maxbpg  2048    optim   time
maxcontig 16    rotdelay 0ms    rps     90
csaddr  832     cssize  35840   shift   9       mask    0xfffffe00
ntrak   16      nsect   255     spc     4080    ncyl    56944
cpg     26      bpg     6630    fpg     53040   ipg     6400
nindir  2048    inopb   64      nspf    2
nbfree  7624085 ndir    12806   nifree  13904878        nffree  90216
cgrotor 1454    fmod    0       ronly   0       logbno  1824
version 0
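
Given the bsize of 8192 bytes and the maxcontig of 16 shown above, each maxcontig-sized cluster works out to 128k:

$ bc
16 * 8192
131072
131072 / 1024
128

That 128k figure is the transfer size to keep an eye out for in the iostat output below.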

To see if maxcontig needs to be increased, you can run “iostat,” and watch the transfer sizes:

$ iostat -zxn 5

                 extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   63.2    0.0 7782.9 52.3  2.0  827.9   31.4  97  99 c0t0d0
    0.2   63.2    1.6 7782.9 52.9  2.0  834.1   31.4  98 100 c0t2d0
    0.2   62.4    1.6 7782.5  0.2 54.7    3.6  873.1  23 100 d0
    0.0   62.4    0.0 7782.5  0.0 54.1    0.0  866.7   0  99 d1
    0.2   62.4    1.6 7782.5  0.0 54.7    0.0  873.0   0 100 d2

If we divide the kilobytes written per second (kw/s) by the number of writes per second (w/s), we can derive the average size (in kilobytes) of each physical write:

$ bc
scale=2
7782/63
123.52

Give or take a few kilobytes, we are pushing a full maxcontig’s worth of data (128k) with each write operation. If you have sequential workloads, increasing the value of maxcontig may allow your Solaris box to read or write more data at once (reducing the total number of I/O operations). You can adjust the size of maxcontig with the “tunefs” utility:

$ tunefs -a 128 /dev/md/dsk/d0
maximum contiguous block count changes from 16 to 128

This will cause 128 file system blocks (1 MB) to be read and written with each I/O operation. In order for this value to be effective, you need to increase the maximum size of a SCSI/SVM I/O operation. This is done by adding the following tunables to /etc/system:

set maxphys=1048576
set md:md_maxphys=1048576
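
After a reboot, you can verify that the new maxphys value took effect by reading the variable out of the running kernel with mdb(1). This is a sketch and needs to be run as root:

$ echo "maxphys/D" | mdb -k

This should print maxphys and its current decimal value (1048576 once the /etc/system change is active); md_maxphys can be checked the same way when the md module is loaded.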

ALL tunables should be tested on a development/QE box before implementing on important systems. I tried bumping maxcontig to 128 on my Ultra5, and immediately saw corruption on several metadevices. Digging through sunsolve.sun.com, I learned that maxcontig can only be set to “16” on IDE devices, and “128” for SCSI devices:

http://sunsolve.sun.com/search/document.do?assetkey=1-26-23429-1&searchclause=maxcontig

Luckily the Ultra5 was a test system, so recovering was relatively straightforward. Test all tunables before you deploy them :)