Finding busy disks with iostat

The iostat(1M) utility provides several I/O statistics, which can be useful for analyzing I/O workloads and troubleshooting performance problems. When investigating I/O problems, I usually start by checking the number of reads and writes issued to each device, which are available in iostat’s “r/s” and “w/s” columns:

$ iostat -zxnM 5

                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0

Once I know how many reads and writes are being issued, I like to see how many megabytes are read from and written to each device. This information is available in iostat’s “Mr/s” and “Mw/s” columns:

$ iostat -zxnM 5

                    extended device statistics              
   r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0

After reviewing these items, I like to check iostat’s “wait” value, which shows the average number of I/O operations sitting in each device’s wait queue:

$ iostat -zxnM 5

                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0

To see how these statistics can be applied to a real problem, I captured the following data from device c0t0d0 a week or two back:

$ iostat -zxnM 5

               extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2   71.2    0.0    7.7 787.3  2.0 11026.8   28.0 100 100 c0t0d0

Device c0t0d0 was overloaded, had 787 I/O operations waiting to be serviced, and was causing application latency (the application in question performed lots of reads and writes, and its files were opened O_SYNC). Once iostat returned the above statistics, I used ps(1) to find the processes that were causing the excessive disk activity, and used kill(1) to terminate them!
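
As a sanity check, these numbers are consistent with Little’s Law (average queue length = arrival rate x average time in the queue), assuming I’m applying it correctly here:

   (0.2 r/s + 71.2 w/s) * 11026.8 ms wsvc_t = 71.4 ops/s * 11.03 s = ~787 queued operations

which matches the “wait” column almost exactly. Put another way, requests were sitting in the wait queue for roughly 11 seconds before the device even started servicing them.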

* The Solaris iostat utility was used to produce this output.
** The first iostat line contains averages since the system was booted, and should be ignored.
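
When I need to keep an eye on several systems, I sometimes wrap iostat in a little awk to flag just the busy devices. The following one-liner is only a sketch: the thresholds (a wait queue deeper than 100 entries, or a device more than 90% busy) are numbers I made up and would need tuning for your workload, and it doesn’t skip the first sample, which (as noted above) contains the averages since boot:

$ iostat -zxnM 5 | awk '$1 ~ /^[0-9]/ && ($5 > 100 || $10 >= 90) {print $11, "wait="$5, "%b="$10}'

On a healthy system this prints nothing; if a device shows up repeatedly, it is worth a closer look using the columns described above.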

mpstat’s wt (wait on I/O) statistic

I have used mpstat over the years to check system performance metrics, and have always wanted to know what goes on behind the scenes. With the recent release of OpenSolaris, I can finally find out how mpstat gathers the statistics it displays (it uses the kstat interfaces). One statistic that has always been confusing is “wt.” According to the mpstat man page, the wt column means:

wt percent wait time

This definition is rather vague, so I read Jim Mauro’s article to get additional information on what wt meant. After reading Jim’s article, I started to think that wt was worthless, but decided to verify my hypothesis with the experts in the OpenSolaris performance forums. Two minutes after posting my question about wt, I received the following reply:

From: johansen
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Wait I/O Calculations

%wio is a meaningless statistic. On Solaris 9, each thread that was blocked in biowait(), waiting for I/O, would lead to a counter being incremented. When the I/O completed, the counter was decremented. This counter was what was used to generate %wio. The problem is that we attempted to normalize the I/O-wait as a portion of CPU time, when it is really the count of threads sleeping for block-I/O. Attempting to decompose the number of threads blocked for I/O into a percentage of CPU-time-used turns out to be both confusing and not particularly useful.

Because %wio has led to so much confusion, and didn’t represent any real quantity of CPU time, we’ve removed the statistic in Solaris 10. %wio will now always return zero. If you’re interested in more useful I/O statistics, check out iostat(1M), which will correctly show you the percentage of time each I/O transaction spends waiting to be serviced.

This re-affirmed my beliefs, and it came from the people who wrote the code that runs your servers. :) The OpenSolaris project is rad!
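
Since mpstat pulls its numbers from kstat, you can poke at the same per-CPU counters directly with the kstat(1M) utility. The command below is a sketch: it assumes your release’s kstat command knows how to decode the raw cpu_stat structures, and the exact names of the wait-related statistics vary between releases:

$ kstat -p -m cpu_stat -i 0 | grep -i wait

On Solaris 10 I’d expect the wait counters to simply read zero, given the change described in the reply above.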

Veritas hostid debugging

I was approached by a colleague last week to investigate a problem with Veritas Volume Manager simultaneously importing a disk group on two different nodes (this is OK if you’re using CVM, but we weren’t). As you can imagine, this is a BAD thing, and can lead to chaos and data corruption (depending on what gets modified). I started my investigation by running “vxdisk list DISKID” to review the disk configuration records on each node:

node1 $ vxdisk list c3t22d0s2

Device:    c3t22d0s2
devicetag: c3t22d0
type:      sliced
hostid:    dmp
disk:      name=c3t22d0s2 id=1117829023.1341.dmp
group:     name=DG id=1030035763.1312.node1
flags:     online ready private autoconfig autoimport imported
pubpaths:  block=/dev/vx/dmp/c3t22d0s4 char=/dev/vx/rdmp/c3t22d0s4
privpaths: block=/dev/vx/dmp/c3t22d0s3 char=/dev/vx/rdmp/c3t22d0s3
version:   2.2
iosize:    min=512 (bytes) max=256 (blocks)
public:    slice=4 offset=0 len=104847360
private:   slice=3 offset=1 len=2559
update:    time=1117841684 seqno=0.9
headers:   0 248
configs:   count=1 len=1865
logs:      count=1 len=282
Defined regions:
 config   priv 000017-000247[000231]: copy=01 offset=000000 enabled
 config   priv 000249-001882[001634]: copy=01 offset=000231 enabled
 log      priv 001883-002164[000282]: copy=01 offset=000000 enabled
Multipathing information:
numpaths:   4
c3t22d0s2       state=enabled   type=secondary
c3t21d0s2       state=enabled   type=primary
c4t20d0s2       state=enabled   type=secondary
c4t21d0s2       state=enabled   type=primary

node2 $ vxdisk list c2t0d45s2
Device:    c2t0d45s2
devicetag: c2t0d45
type:      sliced
hostid:    dmp
disk:      name=c2t0d45s2 id=1030485832.1770.dmp
group:     name=DG id=1030485575.1756.node2
flags:     online ready private autoconfig autoimport imported
pubpaths:  block=/dev/vx/dmp/c2t0d45s4 char=/dev/vx/rdmp/c2t0d45s4
privpaths: block=/dev/vx/dmp/c2t0d45s3 char=/dev/vx/rdmp/c2t0d45s3
version:   2.2
iosize:    min=512 (bytes) max=256 (blocks)
public:    slice=4 offset=0 len=28445760
private:   slice=3 offset=1 len=2879
update:    time=1113472254 seqno=0.49
headers:   0 248
configs:   count=1 len=2104
logs:      count=1 len=318
Defined regions:
 config   priv 000017-000247[000231]: copy=01 offset=000000 disabled
 config   priv 000249-002121[001873]: copy=01 offset=000231 disabled
 log      priv 002122-002439[000318]: copy=01 offset=000000 disabled
Multipathing information:
numpaths:   2
c2t0d45s2       state=enabled
c3t1d45s2       state=enabled

Upon reviewing the output, my mental alarm immediately sounded when I saw that the “hostid” was the same on both nodes. If you are new to Veritas, VxVM uses the hostid to record which node has a disk group (and the devices in it) imported, which prevents nodes with non-matching hostids from automatically importing a disk group when the “autoimport” flag is set. The hostid is stored in the file /etc/vx/volboot:

node1 $ cat /etc/vx/volboot

volboot 3.1 0.7 30
hostid dmp
end
###############################################################
###############################################################
###############################################################
###############################################################
###############################################################
###############################################################
###############################################################
#############################

After reviewing the SAN configuration that had been in place before I started, I noticed that both servers were able to see the LUNs in the DG disk group. Since the autoimport flag was set, and both nodes thought they were hostid dmp, both boxes happily imported the disk group.

To resolve the problem, we deported the DG disk group from node1, and changed the hostid with the following command:

$ vxdctl hostid SOMETHING_OR_ANOTHER
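
For reference, the overall sequence on node1 looked roughly like the following. The hostid value is still a placeholder (normally you would use the node’s own hostname), and the vxdctl list step I’ve added to verify the new volboot contents is from memory, so check it against your VxVM documentation:

node1 $ vxdg deport DG
node1 $ vxdctl hostid SOMETHING_OR_ANOTHER
node1 $ vxdctl list

Once each node has a unique hostid, the autoimport logic can no longer pull the same disk group onto both boxes.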

After reviewing the vxdg man page, it looks like no data corruption occurred, since node1 never accessed the volumes in the DG disk group. My work life has been filled with all kinds of interesting problems the past few weeks!

Printing PICL information

The Solaris operating system comes with the picld daemon, which allows hardware-specific data to be presented in a generic way. When clients read data from a PICL implementation, they can be assured that “temperature” (or any counter value, for that matter) means the same thing across different hardware implementations. To print the PICL tree, you can use the prtpicl(1M) utility:

$ prtpicl -v |more

  / (picl, 3800000001)
  :PICLVersion   1.100000 
  :_class        picl 
  :name  / 
     platform (upa, 3800000005)
      :PlatformGroup     sun4u 
      :OS-Version        Generic_118558-06 
      :OS-Release        5.9 
      :HostName  winnie 
      :OS-Name   SunOS 
      :device_type       upa 
      :banner-name       Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz) 
      :clock-frequency   0x4f64b50 
      :model     SUNW,375-0066 
      :energystar-v2
      :#size-cells       0x2 
      :breakpoint-trap   0x7f 
      :#address-cells    0x2 
      :#interrupt-cells  0x1 
      :interrupt-map-mask        00  00  00  3e  00  00  00  00  00  00  00  00 
      :interrupt-map    
 00  00  00  3c  00  00  00  00  00  00  00  00  f0  05  fa  24  80  00  00  33 
      :idprom   
 01  80  08  00  20  c0  f6  e9  00  00  00  00  c0  f6  e9  a9  00  00  00  00 
 00  00  00  00  00  00  00  00  00  00  00  00 
      :reset-reason      S-POR 
      :boot-retained-page
      :PAGESIZE  8192 
      :MMU_PAGESIZE      8192 
      :MMU_PAGEOFFSET    8191 
      :relative-addressing       1 
      :pm-hardware-state         no-suspend-resume 
      :devfs-path        / 
      :driver-name       rootnex 
      :binding-name      SUNW,Ultra-5_10 
      :bus-addr   
      :instance  -1 
      :PlatformName      SUNW,Ultra-5_10 
      :_class    upa 
      :name      platform 
         memory (memory, 38000000c4)
          :Size  0x40000000 
          :UnitAddress   0,0 
          :available    
     [ ... ]

I find myself using this all the time to find a wide variety of information. Nifty utility!
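
When you only care about one type of node, prtpicl’s “-c” option limits the output to nodes of a given PICL class. For example, to print just the memory node shown in the tree above (the class name is the second field inside the parentheses), something like the following should do the trick:

$ prtpicl -v -c memory

The same approach works for platform-specific classes such as temperature sensors, which is where PICL’s promise of consistent naming really pays off.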

Making Solaris DHCP server IP addresses usable

While working through a few issues with our x86 JumpStart setup, I saw the following error messages when one of the clients attempted to PXE boot:

429f62d0: Datagram received on network device: ge0(limited broadcast)
429f62d0: (01000D609C8888,1.2.3.4) currently marked as unusable.

To make the IP address usable again, I used the Solaris pntadm utility:

$ pntadm -M 1.2.3.4 -f PERMANENT 1.2.3.0

The “-M” option instructs pntadm to modify the entry, and the “-f PERMANENT” option sets the entry’s flag field to PERMANENT, which clears the UNUSABLE flag that was blocking the client. The final argument is the network (1.2.3.0) whose table contains the IP address. To verify the command worked, we can use pntadm’s “-P” (print) option:

$ pntadm -P 1.2.3.0

Client ID       Flags  Client IP  Server IP  Lease Expiration  Macro    Comment
01000D609C8888  01     1.2.3.4    1.2.3.200  Forever           1.2.3.0  me

Once I adjusted the flags, everything started working. Jeah!
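
A couple of related pntadm invocations are worth keeping in your back pocket (the DYNAMIC keyword below is from my reading of the pntadm(1M) man page, so verify it on your release): “-L” lists the networks the DHCP server manages, and modifying the entry with “-f DYNAMIC” should hand the address back to the dynamic pool once you are done debugging:

$ pntadm -L
$ pntadm -M 1.2.3.4 -f DYNAMIC 1.2.3.0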