With the recent release of the Solaris 10 Operating System, Sun unleashed the Dynamic Tracing Facility (DTrace) on the world. Dynamic tracing can be used by administrators, developers, and quality engineers to track down reliability and performance problems, correlate disparate events, and observe applications and the Solaris kernel in ways that have never before been possible.
The DTrace framework works by integrating several providers into the Solaris kernel. Each provider contains a set of probes, which are locations in the kernel where actions (e.g., capture a stack trace, record an argument to a function, etc.) can be applied when the probe fires. Probes are enabled programmatically through libdtrace or, more commonly, through the dtrace(1m) command-line utility.
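For example, the following dtrace(1m) one-liner (a minimal illustration of my own, not part of any toolkit) enables a probe at the entry to the open(2) system call; the action between the braces, written in the D language described below, records the pathname argument along with the name of the executable that passed it:

$ dtrace -n 'syscall::open:entry { printf("%s opened %s", execname, copyinstr(arg0)); }'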
DTrace comes bundled with a language called “D”. D is a scripting language that is well-suited for developing tools to observe application resources and system behavior. The DTraceToolkit is a collection of D scripts that includes tools to view the virtual memory subsystem, network utilization, CPU activity, locking behavior, and the I/O subsystem. In this article, I’ll introduce the DTraceToolkit and demonstrate how to use the Toolkit to analyze I/O behavior on a Solaris server.
The DTraceToolkit is available as a compressed tar archive and can be installed to a directory of your choice with the wget and gtar utilities:
$ wget -q -O - http://www.brendangregg.com/DTraceToolkit-latest.tar.gz | gtar xzf -
All of the scripts in the DTraceToolkit are organized into directories that identify their purpose (e.g., Cpu, Disk, Mem, etc.) and are linked into a Bin directory for easy access:
$ ls -la
total 304
drwxr-xr-x 34 matty matty 1156 Jul 26 01:51 .
drwx------ 12 matty matty 408 Aug 7 21:23 ..
drwxr-xr-x 84 matty matty 2856 Jul 26 01:43 Bin
drwxr-xr-x 9 matty matty 306 Jun 28 00:24 Cpu
drwxr-xr-x 9 matty matty 306 Jul 25 06:40 Disk
drwxr-xr-x 16 matty matty 544 Jul 26 01:44 Docs
drwxr-xr-x 3 matty matty 102 Jun 11 08:52 Extra
-rw-r--r-- 1 matty matty 2392 Jul 26 01:43 Guide
drwxr-xr-x 9 matty matty 306 Jun 13 15:32 Kernel
lrwxr-xr-x 1 matty matty 14 Aug 7 21:23 License -> Docs/cddl1.txt
drwxr-xr-x 4 matty matty 136 May 16 03:32 Locks
drwxr-xr-x 3 matty matty 102 Jun 13 08:36 Man
drwxr-xr-x 12 matty matty 408 Jul 25 02:57 Mem
drwxr-xr-x 10 matty matty 340 Jul 25 09:06 Net
drwxr-xr-x 25 matty matty 850 Jul 25 13:36 Proc
lrwxr-xr-x 1 matty matty 5 Aug 7 21:23 README -> Guide
drwxr-xr-x 7 matty matty 238 Jul 25 11:05 System
drwxr-xr-x 4 matty matty 136 May 15 04:16 User
-rw-r--r-- 1 matty matty 40 Jul 26 01:51 Version
drwxr-xr-x 4 matty matty 136 Jul 26 01:41 Zones
-rwxr-xr-x 1 matty matty 7143 Jul 23 07:51 dappprof
-rwxr-xr-x 1 matty matty 7626 Jul 23 07:52 dapptrace
-rwxr-xr-x 1 matty matty 13576 Jul 23 07:56 dexplorer
-rwxr-xr-x 1 matty matty 13461 Jul 23 08:02 dtruss
-rwxr-xr-x 1 matty matty 6224 Jul 23 08:03 dvmstat
-rwxr-xr-x 1 matty matty 4356 Jul 23 08:04 errinfo
-rwxr-xr-x 1 matty matty 4883 Jul 23 08:05 execsnoop
-rwxr-xr-x 1 matty matty 4275 May 15 01:03 install
-rwxr-xr-x 1 matty matty 11983 Jul 25 07:26 iosnoop
-rwxr-xr-x 1 matty matty 10954 Jul 26 00:56 iotop
-rwxr-xr-x 1 matty matty 6189 Jul 23 08:34 opensnoop
-rwxr-xr-x 1 matty matty 6043 Jul 23 08:49 procsystime
-rwxr-xr-x 1 matty matty 5483 Jul 23 21:46 rwsnoop
-rwxr-xr-x 1 matty matty 6514 Jul 23 14:51 rwtop
Most of the scripts support a “-h” (help) option, which prints a short description of the available flags and options:
$ iopattern -h
USAGE: iopattern [-v] [-d device] [-f filename] [-m mount_point]
[interval [count]]
-v # print timestamp
-d device # instance name to snoop
-f filename # snoop this file only
-m mount_point # this FS only
eg,
iopattern # default output, 1 second samples
iopattern 10 # 10 second samples
iopattern 5 12 # print 12 x 5 second samples
iopattern -m / # snoop events on filesystem / only
The README file in the top-level directory provides a description of each directory, and the Docs/Contents file contains a description of each script. Now that we know how the toolkit is laid out, let’s put this bad boy to work!
One of the biggest challenges we face as systems administrators is identifying I/O performance problems. I/O problems are often caused by improper tuning, over-utilized disk resources, and memory shortages. When I/O performance problems surface, they usually lead to application latency and reduced database throughput. The DTraceToolkit comes with the iosnoop script, which can be used to monitor physical I/O operations, measure I/O latency, and correlate physical I/O activity to individual processes:
$ iosnoop -e
DEVICE UID PID D BLOCK SIZE COMM PATHNAME
dad1 100 522 R 2821088 8192 bash /usr/bin/su
md0 100 522 R 2821088 8192 bash /usr/bin/su
dad0 100 522 R 332144 8192 su /lib/libbsm.so.1
md0 100 522 R 332144 8192 su /lib/libbsm.so.1
dad1 100 522 R 288484 6144 su /lib/libmd5.so.1
md0 100 522 R 288484 6144 su /lib/libmd5.so.1
dad1 100 522 R 334944 8192 su /lib/libmp.so.2
md0 100 522 R 334944 8192 su /lib/libmp.so.2
dad1 25 527 R 10114400 8192 sendmail /usr/lib/sendmail
md0 25 527 R 10114400 8192 sendmail /usr/lib/sendmail
dad0 25 527 R 10114160 8192 sendmail /usr/lib/sendmail
md0 25 527 R 10114160 8192 sendmail /usr/lib/sendmail
dad1 25 527 R 10114448 8192 sendmail /usr/lib/sendmail
md0 25 527 R 10114448 8192 sendmail /usr/lib/sendmail
When the “-e” (print device name) option is passed to iosnoop, the name of the device to which each I/O operation is targeted is printed in the first column. The second and third columns contain the UID and process ID of the process that issued the read or write. The fourth column contains an “R” for a read operation or a “W” for a write operation. The fifth column displays the starting logical block in the file system where data is being read or written, and the sixth column displays the size of the operation. The seventh column displays the name of the executable that performed the read or write, and the last column contains the name of the file being read or written.
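Under the hood, iosnoop builds on DTrace’s io provider. The following stripped-down D sketch (my own illustration modeled on the io provider examples in the Solaris Dynamic Tracing Guide, not the actual iosnoop source) prints a similar set of columns:

#!/usr/sbin/dtrace -s
/* iosnoop-style sketch using the io provider; not the real script */
#pragma D option quiet

io:::start
{
        /* args[0] = bufinfo_t, args[1] = devinfo_t, args[2] = fileinfo_t */
        printf("%-8s %5d %1s %10d %8d %s\n",
            args[1]->dev_statname,                  /* device name */
            pid,                                    /* process ID */
            args[0]->b_flags & B_READ ? "R" : "W",  /* direction */
            (int)args[0]->b_blkno,                  /* starting block */
            (int)args[0]->b_bcount,                 /* size in bytes */
            args[2]->fi_pathname);                  /* file name */
}

Because physical writes are frequently issued asynchronously, I/O flushed on behalf of other processes is attributed to sched or fsflush at the io:::start probe, as the iotop output shown later in this article illustrates.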
When iosnoop is invoked with the “-o” (print disk delta time, us) option, the number of microseconds required to service each physical read and write operation will be displayed in the “DTIME” column:
$ iosnoop -e -o
DEVICE DTIME UID PID D BLOCK SIZE COMM PATHNAME
sd5 5555 101 377 W 6927040 8192 oracle /u03/oradata/proddb/redo03.log
sd6 8618 101 377 W 6927040 8192 oracle /u03/oradata/proddb/redo03.log
md120 10967 101 377 W 6927040 8192 oracle /u03/oradata/proddb/redo03.log
sd1 12300 101 460 R 14771248 8192 oracle /u01/app/oracle/product/10.0.2.0/bin/oracle
sd2 12712 101 460 R 14771248 8192 oracle /u01/app/oracle/product/10.0.2.0/bin/oracle
sd3 9229 101 460 W 21360480 8192 oracle /u02/oradata/proddb/temp01.dbf
sd4 12442 101 460 W 21360480 8192 oracle /u02/oradata/proddb/temp01.dbf
md110 13751 101 460 W 21360480 8192 oracle /u02/oradata/proddb/temp01.dbf
The iosnoop script will continuously display physical read and write operations for all processes on a system, which can often overwhelm a terminal screen on a busy server. To limit the output to a single process (possibly one that is causing problems), the process name can be passed to iosnoop’s “-n” (this process name only) option:
$ iosnoop -a -n sshd
STRTIME DEVICE MAJ MIN UID PID D BLOCK SIZE PATHNAME ARGS
2007 Jun 23 15:52:21 sd1 32 67 0 10945 W 52464 4096 /var/adm/lastlog /usr/lib/ssh/sshd\0
2007 Jun 23 15:52:21 sd1 32 67 0 10945 W 52464 4096 /var/adm/lastlog /usr/lib/ssh/sshd\0
2007 Jun 23 15:52:22 sd1 32 67 0 10945 W 26080 4096 /var/adm/utmpx /usr/lib/ssh/sshd\0
The continuously scrolling display is useful for capturing data over long periods of time. When a point-in-time snapshot is required, the DTraceToolkit’s iotop script can be used to view physical I/O activity in a “top”-like display:
$ iotop 2
2005 Aug 13 11:36:43, load: 0.16, disk_r: 96 Kb, disk_w: 109890 Kb
UID PID PPID CMD DEVICE MAJ MIN D BYTES
0 0 0 sched dad0 136 7 W 512
0 3 0 fsflush dad0 136 7 W 512
0 3 0 fsflush dad1 136 15 W 512
0 0 0 sched dad1 136 15 W 512
0 3 0 fsflush dad1 136 8 W 3072
0 3 0 fsflush md0 85 0 W 3072
0 3 0 fsflush dad0 136 0 W 3072
100 502 484 mkfile dad0 136 7 W 3584
100 502 484 mkfile dad1 136 15 W 3584
0 501 498 dtrace dad1 136 8 R 8192
0 501 498 dtrace dad0 136 0 R 8192
0 501 498 dtrace md0 85 0 R 16384
100 502 484 mkfile dad1 136 8 R 16384
100 502 484 mkfile dad0 136 0 R 16384
100 502 484 mkfile md0 85 0 R 32768
100 502 484 mkfile md0 85 0 W 34881536
100 502 484 mkfile dad1 136 8 W 34881536
100 502 484 mkfile dad0 136 0 W 34881536
The output of iotop is refreshed every five seconds by default. To adjust this interval, a positive integer value can be passed as an argument to iotop (as in the invocation above). Because the iosnoop and iotop scripts were designed to monitor physical I/O activity, reads serviced by the page cache and operations targeted at pseudo-devices will not appear in their output. The rwsnoop and rwtop scripts, described next, provide a more holistic view of read and write activity, including accesses to pseudo-devices and cached data.
When a process needs to read from or write to a file or device, it invokes one of the kernel’s read or write system calls. The read or write may be directed at a file, a pseudo-device, or one of the physical devices attached to the system. The rwsnoop script captures this read and write activity, which is helpful for monitoring I/O when file system caches are in use and for viewing accesses to pseudo-devices and virtual file systems that reside in memory. To view read and write operations along with the process that generated each request, the rwsnoop script can be invoked:
$ rwsnoop
UID PID CMD D BYTES FILE
101 379 oracle W 16384 /u01/oradata/proddb/control01.ctl
101 379 oracle W 16384 /u01/oradata/proddb/control02.ctl
101 379 oracle W 16384 /u01/oradata/proddb/control03.ctl
101 460 oracle W 1048576 /u02/oradata/proddb/tbs01.dbf
101 460 oracle W 1048576 /u02/oradata/proddb/rbs01.dbf
101 460 oracle W 1048576 /u02/oradata/proddb/tbs01.dbf
101 377 oracle W 1024 /u03/oradata/proddb/redo03.log
101 377 oracle W 1024 /u03/oradata/proddb/redo03.log
101 371 oracle R 416 /proc/371/psinfo
101 371 oracle R 416 /proc/373/psinfo
101 371 oracle R 416 /proc/375/psinfo
rwsnoop displays which process performed each read or write, the size and type (read or write) of the operation, and the file or device being accessed. Like iosnoop, the output is displayed continuously on the terminal screen. To view a summary of read and write activity in a “top”-like display, the rwtop script can be used:
$ rwtop 2
2005 Jul 24 10:47:26, load: 0.18, app_r: 9 Kb, app_w: 8 Kb
UID PID PPID CMD D BYTES
100 922 920 bash R 3
100 922 920 bash W 15
100 902 899 sshd R 1223
100 926 922 ls R 1267
100 902 899 sshd W 1344
100 926 922 ls W 2742
100 920 917 sshd R 2946
100 920 917 sshd W 4819
0 404 1 vxsvc R 5120
The screen will be refreshed every five seconds by default, which can be adjusted by passing a new refresh interval to rwtop.
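Internally, rwsnoop and rwtop instrument the system-call layer rather than the physical-I/O layer, which is why they see cached reads and pseudo-device activity. A simplified D sketch of the rwtop idea (my own illustration using only the standard syscall provider, not the actual rwtop source) looks like this:

#!/usr/sbin/dtrace -s
/* rwtop-style sketch using the syscall provider; not the real script */
#pragma D option quiet

/* on return, arg0 holds the number of bytes transferred (or -1) */
syscall::read:return
/arg0 > 0/
{
        @reads[pid, execname] = sum(arg0);
}

syscall::write:return
/arg0 > 0/
{
        @writes[pid, execname] = sum(arg0);
}

/* print and reset the per-process totals every 5 seconds */
tick-5sec
{
        printf("\n%1s %6s %-16s %10s\n", "D", "PID", "CMD", "BYTES");
        printa("R %6d %-16s %@10d\n", @reads);
        printa("W %6d %-16s %@10d\n", @writes);
        trunc(@reads);
        trunc(@writes);
}

The toolkit versions are more thorough, but this aggregation is the essential mechanism.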
As systems administrators, we are often asked to identify the size of I/O operations issued by each application. This information is used by systems and application administrators to tune file system block and cluster sizes, SCSI and LVM transfer sizes, and RAID 0 stripe widths to match application I/O patterns.
Before the introduction of the DTraceToolkit, the vxtrace(1m), prex(1m), and truss(1m) utilities were the best methods to find this information. The output from these utilities was often extremely verbose, and the process of correlating data and determining I/O size distributions was cumbersome and prone to error.
Now that the DTraceToolkit is available, this information can be gathered easily with the bitesize.d script, which records I/O size distributions for every process on a Solaris system and prints a histogram of the number of I/O operations by size:
$ bitesize.d
Sampling... Hit Ctrl-C to end.
377 ora_lgwr_proddb
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@ 604
1024 | 2
2048 | 0
4096 | 2
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@ 1218
16384 |@@ 102
32768 | 0
65536 | 0
131072 | 12
262144 | 0
7312 dbload pages=128
value ------------- Distribution ------------- count
16 | 0
32 | 2
64 | 0
128 | 0
256 | 0
512 | 2
1024 | 0
2048 | 0
4096 | 0
8192 | 0
16384 | 0
32768 | 0
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 76947
262144 | 0
This display shows the output from the Oracle log writer and a database load process. The histogram shows that the Oracle log writer was issuing predominantly 8k-16k I/Os. The database loader, on the other hand, was requesting 1-MB chunks (128 pages).
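The core of bitesize.d is a single quantize() aggregation on the size of each physical I/O. A minimal sketch of the idea (not the actual script, which appears to key each histogram on the full command line, as the loader’s arguments in the output above suggest) is:

#!/usr/sbin/dtrace -s
/* bitesize.d-style sketch: histogram of I/O sizes per process */
#pragma D option quiet

io:::start
{
        @sizes[pid, execname] = quantize(args[0]->b_bcount);
}

/* fires when the script is interrupted with Ctrl-C */
dtrace:::END
{
        printa("%8d %s\n%@d\n", @sizes);
}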
When deploying new applications to Solaris systems, it is essential to understand how an application accesses data. Access patterns can be random or sequential and often vary in size. Understanding access patterns can help when tuning direct I/O and caching attributes and are invaluable for finding the optimal block sizes for applications.
Before the introduction of DTrace, an administrator had to discuss access patterns with vendors and development teams or parse through mounds of truss(1m) and vxtrace(1m) data to estimate how applications access files and volumes. The data was often skewed, took considerable time to distill into useful metrics, and was prone to error.
Now that DTrace is able to capture data throughout the kernel, the job of finding access patterns has been greatly simplified. The DTraceToolkit includes the iopattern and seeksize.d scripts, which display system-wide and per-process access patterns, respectively. To get system-wide access patterns, the iopattern script can be used:
$ iopattern
%RAN %SEQ COUNT MIN MAX AVG KR KW
95 5 101 512 131072 30649 2792 231
60 40 174 512 131072 61557 951 9509
58 42 187 512 131072 35369 2232 4227
53 47 229 512 131072 65880 3992 10741
42 58 170 512 131072 89798 5328 9580
67 33 190 512 131072 52135 3864 5809
80 20 195 512 131072 6516 1207 34
84 16 224 512 24576 7860 1574 145
89 11 199 1024 8192 7682 1481 12
iopattern displays the percentage of random and sequential I/O, the number of I/O operations performed during each sample interval, the minimum, maximum, and average I/O sizes, and the number of kilobytes read (KR) and written (KW). To get the seek distribution for each process, the seeksize.d script can be used:
$ seeksize.d
Sampling... Hit Ctrl-C to end.
^C
PID CMD
7312 dd if=/dev/dsk/c1t1d0s2 of=/dev/null bs=1048576
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1762
1 | 0
PID CMD
377 ora_dbw0_proddb
value ------------- Distribution ------------- count
-8388608 | 0
-4194304 | 3
-2097152 | 28
-1048576 | 44
-524288 | 95
-262144 | 102
-131072 | 166
-65536 |@ 363
-32768 |@@ 585
-16384 |@ 328
-8192 | 141
-4096 | 105
-2048 | 57
-1024 | 35
-512 | 34
-256 | 40
-128 | 31
-64 | 35
-32 | 47
-16 | 22
-8 | 6
-4 | 20
-2 | 0
-1 | 0
0 |@@@@@@@@@@@@ 4164
1 | 0
2 |@@@ 926
4 |@ 208
8 |@@@@@@ 2148
16 |@ 440
32 |@ 410
64 |@ 318
128 |@ 225
256 |@ 190
512 | 148
1024 | 139
2048 | 155
4096 | 168
8192 |@ 183
16384 |@ 377
32768 |@@ 657
65536 |@ 406
131072 |@ 199
262144 | 151
524288 | 111
1048576 | 37
2097152 | 14
4194304 | 3
8388608 | 0
Seeksize.d measures the seek distance between consecutive reads and writes and prints a histogram of the number of sectors traveled between consecutive operations. Applications that read or write data sequentially produce a narrow distribution (e.g., dd in the example above). Applications that read or write data randomly force the disk heads to seek between operations, producing a wide distribution (e.g., the Oracle database writer in the example above).
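The mechanism behind seeksize.d is straightforward: on every physical I/O, quantize the distance between the new request’s starting block and the block where the previous request to the same device ended. A simplified sketch (modeled loosely on the published script, not a verbatim copy) is:

#!/usr/sbin/dtrace -s
/* seeksize.d-style sketch: seek distance between consecutive I/Os */
#pragma D option quiet

io:::start
/self->last[args[0]->b_edev] != 0/
{
        /* distance from where the previous I/O on this device ended */
        @dist[pid, execname] = quantize(
            (int)(args[0]->b_blkno - self->last[args[0]->b_edev]));
}

io:::start
{
        /* remember where this I/O ends; b_blkno is in 512-byte units */
        self->last[args[0]->b_edev] = args[0]->b_blkno +
            args[0]->b_bcount / 512;
}

The quantized distance is signed, which is why the sample output above includes negative buckets for backward seeks.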
With the continued growth in storage capacity and complexity, it is important to have tools to analyze and debug I/O problems. This article introduced several scripts in the DTraceToolkit that can be used to easily observe I/O behavior in ways that were previously difficult or impossible. Because Solaris 10, DTrace, and the DTraceToolkit are free to download and use, truly understanding your applications is just a few clicks away. If you have questions or comments on the article, please feel free to e-mail the author.
Ryan thanks Brendan Gregg for reviewing this article and writing the DTraceToolkit. Ryan would also like to thank the DTrace developers for their awesome contribution to system and application observation.
Originally published in the December ‘05 issue of SysAdmin Magazine