Breaking down system time usage in the Solaris kernel


I am frequently asked (or paged) to review system performance issues on our Solaris 10 hosts. I use the typical set of Solaris performance tools to observe what my systems are doing, and start drilling down once I know if the problem is with userland applications or in the kernel itself. When I observe issues with the Solaris kernel (these are typically represented by high system time values in vmstat), the first thing I typically do is fire up lockstat to see where the kernel is spending its time:

$ lockstat -kIW /bin/sleep 5


Profiling interrupt: 31424 events in 5.059 seconds (6212 events/sec)

Count indv cuml rcnt nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
28962 92% 92% 0.00 3765 cpu[61] cpu_halt
1238 4% 96% 0.00 3747 cpu[22] lzjb_compress
165 1% 97% 0.00 2655 cpu[37] copy_pattern
124 0% 97% 0.00 2849 cpu[55] copyout
89 0% 97% 0.00 3682 cpu[63] fletcher_4_native
49 0% 97% 0.00 2565 cpu[37] copyin
45 0% 98% 0.00 3079 cpu[0] tcp_rput_data
39 0% 98% 0.00 3597 cpu[0] mutex_enter
30 0% 98% 0.00 3692 cpu[0] nxge_start
28 0% 98% 0.00 3701 cpu[59]+6 nxge_receive_packet
28 0% 98% 0.00 2935 cpu[0] disp_getwork
25 0% 98% 0.00 3110 cpu[0] bcopy

If I see a function that stands out from the rest, I will use the lockstat ‘-f” option to drill down by the kernel function with that name, and use the “-s” option to print the call stack leading up this function:

$ lockstat -kIW -f lzjb_compress -s5 /bin/sleep 5


Profiling interrupt: 703 events in 2.058 seconds (342 events/sec)

-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest CPU+PIL Caller
130 18% 18% 0.00 3625 cpu[28] lzjb_compress

nsec ------ Time Distribution ------ count Stack
2048 | 2
4096 |@@@@@@@@@@@@@@@@@@@@@@@@ 107
8192 |@@@@ 21
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest CPU+PIL Caller
20 3% 21% 0.00 3529 cpu[37] lzjb_compress

nsec ------ Time Distribution ------ count Stack
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18 0x74
8192 |@@@ 2 zio_compress_data
zio_write_bp_init
zio_execute
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest CPU+PIL Caller
19 3% 24% 0.00 3696 cpu[28] lzjb_compress

nsec ------ Time Distribution ------ count Stack
4096 |@@@@@@@@@@@@@@@@@@@@@@ 14 0x50
8192 |@@@@@@@ 5 zio_compress_data
zio_write_bp_init
zio_execute
-------------------------------------------------------------------------------

I use lockstat quite a bit to observe what the kernel is doing, and to help me figure out where I should look for answers in the opensolaris source code. It’s also useful for determining if you are encountering a kernel bug, since you can compare the backtrace returned from lockstat with the OpenSolaris bug databases.

This article was posted by Matty on 2010-03-08 19:55:00 -0400 -0400