When direct I/O attacks!


I was recently asked by one of my colleagues to investigate a slow disk subsystem on one of our servers. The administrator was getting horrible I/O throughput when unzipping files, and the Oracle installer was erroring out when attempting to perform a database upgrade. To get a better idea of just how bad the situation was, I decided to run iostat while he unzipped the file:

$ iostat -zxn 5

extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 43.8 0.0 419.0 0.0 0.3 0.0 6.5 0 28 c4t18d0
0.0 26.4 0.0 393.9 0.0 0.2 0.0 6.0 0 16 c4t17d0
0.8 36.0 6.4 460.4 0.0 0.3 0.0 7.2 0 25 c4t19d0
21.2 43.8 169.6 419.0 0.0 0.4 0.0 5.4 0 35 c5t2d0
20.4 36.0 163.2 460.4 0.0 0.3 0.0 5.5 0 30 c5t3d0
20.8 26.4 166.4 393.9 0.0 0.2 0.0 4.1 0 19 c5t1d0

The iostat data shows the busiest spindle moving at most about 620 kB/s of combined reads and writes, and the average I/O size works out to roughly 9-11 kB. For some reason I found this amusing, so I put my current task on hold and started to debug the problem (I stayed late to finish that task, though).
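
As a quick sanity check, you can divide each device's combined throughput (kr/s + kw/s) by its combined IOPS (r/s + w/s) to get the average I/O size. Here is the arithmetic for the two busiest spindles, run through bc(1) with the numbers from the iostat sample above:

$ echo 'scale=1; (169.6 + 419.0) / (21.2 + 43.8)' | bc
9.0

$ echo 'scale=1; (163.2 + 460.4) / (20.4 + 36.0)' | bc
11.0

To start debugging the issue, I decided to time the unzip operation on the problematic RAID 1+0 volume (which was using VxFS):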

$ date ; unzip -q fj_9_Recommended.zip ; date
Fri Jan 20 15:04:22 EST 2006
Fri Jan 20 15:16:34 EST 2006
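
As an aside, Solaris ships a ptime(1) utility that prints the elapsed wall-clock time of a command directly, so something along these lines would have saved the mental date arithmetic:

$ ptime unzip -q fj_9_Recommended.zip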

Ouch – it took 12 minutes to unzip a 200MB file (roughly 280 kB/s), which is DEFINITELY not normal. To see how “off” we were, I unzipped the same file on a RAID 1 volume with two mirrored disks (also using VxFS):

$ date ; unzip -q fj_9_Recommended.zip ; date
Fri Jan 20 16:43:22 EST 2006
Fri Jan 20 16:44:54 EST 2006

This operation completed in about a minute and a half, and with far fewer disks. I attributed the speed increase to the Solaris page cache, and decided to run vxtrace to get a better idea of which I/O operations were generated as part of the unzip operation:

$ vxtrace -g oradg -l -o disk | grep START

309986 394877809 START write disk c5t1d0s2 op 0 block 20517834 len 2
309987 394877809 START write disk c4t17d0s2 op 0 block 20517834 len 2
309988 394877809 START read disk c5t1d0s2 op 0 block 3487120 len 16
309989 394877810 START write disk c5t1d0s2 op 0 block 2567 len 5
309990 394877810 START write disk c4t17d0s2 op 0 block 2567 len 5
309991 394877811 START read disk c5t1d0s2 op 0 block 3487136 len 16
309992 394877812 START write disk c5t1d0s2 op 0 block 20517836 len 12
309993 394877812 START write disk c4t17d0s2 op 0 block 20517836 len 12
309994 394877813 START read disk c5t1d0s2 op 0 block 3487152 len 16
309995 394877814 START write disk c5t1d0s2 op 0 block 2572 len 5
309996 394877814 START write disk c4t17d0s2 op 0 block 2572 len 5
309997 394877815 START read disk c5t1d0s2 op 0 block 3487168 len 16
309998 394877816 START write disk c5t1d0s2 op 0 block 20517848 len 4
309999 394877816 START write disk c4t17d0s2 op 0 block 20517848 len 4
310000 394877816 START read disk c5t1d0s2 op 0 block 3487184 len 16

Well, that’s odd: it looks like there are a bunch of small writes and numerous 16k reads. Since the box had plenty of free memory, and the previous unzip operation had completed successfully, I started to wonder why any reads were being issued at all (the file’s blocks should have been sitting in the Solaris page cache). To get another frame of reference for comparison, I ran vxtrace while I unzipped the file on the RAID 1 volume:

$ vxtrace -g rootdg -l -o disk | grep START | more

385161 395719986 START write disk c5t0d0s2 op 0 block 19756 len 128
385162 395719986 START write disk c4t16d0s2 op 0 block 19756 len 128
385163 395719987 START write disk c5t0d0s2 op 0 block 25868032 len 64
385164 395719987 START write disk c4t16d0s2 op 0 block 25868032 len 64
385165 395719987 START write disk c5t0d0s2 op 0 block 25832064 len 64
385166 395719987 START write disk c4t16d0s2 op 0 block 25832064 len 64
385167 395719987 START write disk c5t0d0s2 op 0 block 25832128 len 128
385168 395719987 START write disk c4t16d0s2 op 0 block 25832128 len 128
385169 395719987 START write disk c5t0d0s2 op 0 block 25832256 len 128
385170 395719987 START write disk c4t16d0s2 op 0 block 25832256 len 128
385171 395719988 START write disk c5t0d0s2 op 0 block 25832384 len 128
385172 395719988 START write disk c4t16d0s2 op 0 block 25832384 len 128
385173 395719988 START write disk c5t0d0s2 op 0 block 25832512 len 128
385174 395719988 START write disk c4t16d0s2 op 0 block 25832512 len 128
385175 395719988 START write disk c5t0d0s2 op 0 block 25832640 len 128
385176 395719988 START write disk c4t16d0s2 op 0 block 25832640 len 128

As you can see from the vxtrace output, the unzip operation on the RAID 1 volume generated zero reads and a steady stream of large 64k and 128k writes. Once I reviewed the data, I began to suspect the file system configuration, and decided to run mount and fstyp to view the mount options and file system layout:

$ mount | grep u02

/u02 on /dev/vx/dsk/oradg/oradgvol01 read/write/setuid/mincache=direct/convosync=direct/delaylog/largefiles/ioerror=mwdisable/dev=3bd4ff0 on Mon Dec 5 22:21:31 2005

$ fstyp -v /dev/vx/dsk/oradg/oradgvol01 |more

vxfs
magic a501fcf5 version 5 ctime Mon Apr 25 16:46:13 2005
logstart 0 logend 0
bsize 1024 size 320046080 dsize 320046080 ninode 0 nau 0
defiextsize 0 ilbsize 0 immedlen 96 ndaddr 10
aufirst 0 emap 0 imap 0 iextop 0 istart 0
bstart 0 femap 0 fimap 0 fiextop 0 fistart 0 fbstart 0
nindir 2048 aulen 32768 auimlen 0 auemlen 8
auilen 0 aupad 0 aublocks 32768 maxtier 15
inopb 4 inopau 0 ndiripau 0 iaddrlen 8 bshift 10
inoshift 2 bmask fffffc00 boffmask 3ff checksum da5e4db
oltext1 35 oltext2 16642 oltsize 1 checksum2 422d
free 286678607 ifree 0
efree 5 3 11 9 3 1 0 1 1 0 1 1 1 0 1 26 17 12 8 4 4 1 1 0 2 1 1 1 0 0 0 0

Well, there we go. The volume was mounted with the VxFS “mincache=direct” and “convosync=direct” options, which enable direct I/O on VxFS file systems. Since this volume was primarily used to process files, direct I/O was causing a physical I/O operation to be issued for each read() and write(). Because direct I/O also has the side effect of disabling write clustering and file system read-ahead, I knew immediately that this was the problem. I ended up fixing the issue by remounting the file system without the “mincache” and “convosync” options. After the remount completed, I once again unzipped the file on the RAID 1+0 volume, and the operation took all of 30 seconds to complete (thank you, read-ahead and write clustering). I love debugging!
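
For reference, here is a minimal sketch of the fix, assuming the “mincache=direct” and “convosync=direct” options have already been removed from the volume’s /etc/vfstab entry (the device and mount point come from the mount output above). VxFS can refresh the options on a live file system via its remount option:

$ mount -F vxfs -o remount /dev/vx/dsk/oradg/oradgvol01 /u02

Running mount | grep u02 again should show the file system mounted without the two direct I/O options.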

This article was posted by Matty on 2006-02-08 23:29:00 -0400