Summarizing system call activity on Solaris hosts

I previously described how to use strace to to summarize system call activity on Linux hosts. Solaris provides similar data with the truss “-c” option:

$ truss -c -p 26396
syscall               seconds   calls  errors
read                     .000       3
write                   7.671   25038
time                     .000      21
stat                     .000      15
lseek                    .460   24944
getpid                   .000      15
kill                     .162    7664
sigaction                .004     237
writev                   .000       3
lwp_sigmask              .897   49887
pollsys                  .476    7667
                     --------  ------   ----
sys totals:             9.674  115494      0
usr time:               2.250
elapsed:              180.940

The output contains the total elapsed time, a breakdown of user and system time, the number of errors that occurred, the number of times each system call was invoked and the total accrued time for each system call. This has numerous uses, and allows you to easily see how a process is intereacting with the kernel. Sweet!

Getting VMWare tools to configure on Solaris 10 guests

I built out a Solaris 10 update 10 guest on a vSphere 4.1 machine this past week. When I went to configure VMWare tools I received the following error:

$ /usr/bin/vmware-config-tools.pl

Initializing...

Stopping VMware Tools services in the virtual machine:
   Guest operating system daemon:                                      done
   Guest memory manager:                                               done


Unable to create symlink "/mnt/hgfs" pointing to file "/hgfs".

Execution aborted.

After doing some debugging I noticed that /mnt is used temporarily and if you happen to be using it (or if volfs is) the installer will fail. I temporarily stopped volfs and unmounted a device that was mounted on /mnt and the installer completed without issue. Figure I would share this with others so you don’t have to waste your time debugging this problem.

Ridding your Solaris host of zombie processes

We encountered a nasty bug in our backup software this week. When this bug is triggered, each job (one process is created per job) that completes will turn into a zombie. After a few days we will have hundreds or even thousands of zombie processes, which if left unchecked will eventually lead to the system-side process table filling up. Solaris comes with a nifty tool to help deal with zombies (no, they don’t ship you a shotgun with your media kit), and it comes by the name preap. To use preap, you can pass it the PID of the zombie process you want to reap:

$ ps -ef | grep defunct

    root   646   426   0        - ?           0:00 <defunct>
    root  1489 12335   0 09:32:54 pts/1       0:00 grep defunct

$ preap 646
646: exited with status 0

This will cause the process to exit, and the kernel can then free up the resources that were allocated by that process. On a related note, if you haven’t seen the movie zombieland you are missing out!!!! That movie is hilarious!

Solaris reporting multiple devices sharing IRQ assignments

One of my co-workers this week was fighting disk failure on a Solaris 10 x86 host.  I was checking /var/adm/messages and came across something interesting.

Apr 11 03:29:21 sinatra.fatkitty.com nge: [ID 801725 kern.info] NOTICE: nge1: Using FIXED interrupt type
Apr 11 03:29:21 sinatra.fatkitty.com unix: [ID 954099 kern.info] NOTICE: IRQ20 is being shared by drivers with different interrupt levels.
Apr 11 03:29:21 sinatra.fatkitty.com This may result in reduced system performance.
Apr 11 03:29:21 sinatra.fatkitty.com mac: [ID 469746 kern.info] NOTICE: nge1 registered

Weird.  So, x86 hardware assigns different IRQ assignments to devices.  This usually happens at the BIOS level. Anyways, I was kind of curious as to what devices were sharing IRQ20.  We can invoke the kernel debugger and run a dcmd to find this out.

root@sinatra:/var/adm#mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp ufs ip hook neti sctp arp usba fcp fctl lofs zfs random nfs md cpc fcip crypto logindmux ptm ]
> ::interrupts -d
IRQ  Vector IPL Bus   Type  CPU Share APIC/INT# Driver Name(s)
3    0xb0   12  ISA   Fixed 3   1     0x0/0x3   asy#1
4    0xb1   12  ISA   Fixed 3   1     0x0/0x4   asy#0
9    0x81   9   PCI   Fixed 1   1     0x0/0x9   acpi_wrapper_isr
16   0x60   6   PCI   Fixed 7   1     0x0/0x10  bge#0
17   0x62   6   PCI   Fixed 7   1     0x0/0x11  bge#1
20   0x63   6   PCI   Fixed 2   2     0x0/0x14  nge#1, nv_sata#0
21   0x20   1   PCI   Fixed 4   1     0x0/0x15  ehci#0
22   0x21   1   PCI   Fixed 5   1     0x0/0x16  ohci#0
23   0x61   6   PCI   Fixed 0   1     0x0/0x17  nge#0
24   0x82   7         MSI   6   1     –         pcie_pci#3
25   0x83   7         MSI   6   1     –         pcie_pci#3
160  0xa0   0         IPI   ALL 0     –         poke_cpu
192  0xc0   13        IPI   ALL 1     –         xc_serv
208  0xd0   14        IPI   ALL 1     –         kcpc_hw_overflow_intr
209  0xd1   14        IPI   ALL 1     –         cbe_fire
210  0xd3   14        IPI   ALL 1     –         cbe_fire
240  0xe0   15        IPI   ALL 1     –         xc_serv
241  0xe1   15        IPI   ALL 1     –         apic_error_intr

Bleh.  So, the SATA device and the nVidia gigabit NIC in port 1 are both sharing IRQ20.    We’re using nge1 on this host for IPMP.  Usually, we just stick with the Broadcom NICs, but, this is one of those one-off cases.  We wanted to use IPmP over the different chipsets to maximize redundancy on the host.
Anywho, the output above shows IRQ20 bound to CPU2, so we can poke at it with intrstat to see the interrupt activity happening on that processor.  Here, I invoke interstat to just output CPU2 over 10s intervals.

root@sinatra:/var/adm#intrstat -c 2 10
device |      cpu2 %tim
————-+—————
asy#1 |         0  0.0
bge#0 |         0  0.0
ehci#0 |         0  0.0
nge#0 |         0  0.0
nge#1 |       101  1.0
nv_sata#0 |       101  0.0
device |      cpu2 %tim
————-+—————
asy#1 |         0  0.0
bge#0 |         0  0.0
ehci#0 |         0  0.0
nge#0 |         0  0.0
nge#1 |       158  1.5
nv_sata#0 |       158  0.0
ohci#0 |         0  0.0
device |      cpu2 %tim
————-+—————
asy#1 |         0  0.0
bge#0 |         0  0.0
ehci#0 |         0  0.0
nge#0 |         0  0.0
nge#1 |        99  1.0
nv_sata#0 |        99  0.0
ohci#0 |         0  0.0
device |      cpu2 %tim
————-+—————
asy#1 |         0  0.0
bge#0 |         0  0.0
ehci#0 |         0  0.0
nge#0 |         0  0.0
nge#1 |       108  1.0
nv_sata#0 |       108  0.0
ohci#0 |         0  0.0
device |      cpu2 %tim
————-+—————
asy#1 |         0  0.0
bge#0 |         0  0.0
ehci#0 |         0  0.0
nge#0 |         0  0.0
nge#1 |       160  1.5
nv_sata#0 |       160  0.0
ohci#0 |         0  0.0
^C

So, its really minimal interrupt activity going on here. The application on this host is completely based on ramdisk, and the primary IPmP NIC is on bge0, so both devices really aren’t competing for resources.  But, if nge1 was super busy and so was disk, this would be a bottleneck.  Anyways, I thought it was interesting.  I learned a new mdb dcmd and had one of those rare use cases for poking at intrstat.

Fixing Solaris hosts that boot to a grub> prompt

I applied the latest recommended patch bundle this week to two X4140 servers running Solaris 10. When I rebooted, I was greeted with a grub> prompt instead of the grub menu:

grub>

This wasn’t so good, and for some reason the stage1 / stage2 loaders weren’t installed correctly (or the zpool upgrade caused some issues). To fix this issue, I booted to single user mode by inserting a Solaris 10 update 8 CD and adding “console=ttya -s” to the end of the boot line. Once my box booted, I ran ‘zpool status’ to verify my pool was available:

$ zpool status

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

To re-install the grub stage1 and stage2 loaders, I ran installgrub (you can get the device to use from ‘zpool status’):

$ /sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0

stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 272 sectors starting at 50 (abs 16115)

To ensure that the boot archive was up to date, I ran ‘bootadm update-archive’:

$ bootadm update-archive -f -R /a

Creating boot_archive for /a
updating /a/platform/i86pc/boot_archive

Once these changes were made, I init 6’ed the system and it booted successfully. I’ve created quite a grub cheat sheet over the years (this made recovery a snap), and will post it here once I get it cleaned up.

Figuring out what a hung Solaris process is doing inside the kernel

I had a a process hang last week on one of my Solaris hosts, and was curious what each thread was doing. The mdb utility is perfect for locating this information, since you an combine pid2proc with the walk and findstack dcmds to get the call stack of each thread in a process (in the example below, I am examining process id 48):

$ mdb -k

> 0t48::pid2proc |::walk thread |::findstack -v
stack pointer for thread ffffff02d6fd1080: ffffff000fac1df0
[ ffffff000fac1df0 _resume_from_idle+0xf1() ]
  ffffff000fac1e20 swtch+0x147()
  ffffff000fac1e80 cv_wait_sig_swap_core+0x170(ffffff02d6fd1256, ffffff02d6fd1258, 0)
  ffffff000fac1ea0 cv_wait_sig_swap+0x18(ffffff02d6fd1256, ffffff02d6fd1258)
  ffffff000fac1ec0 pause+0x48()
  ffffff000fac1f10 sys_syscall32+0x101()
stack pointer for thread ffffff02d6fccc00: ffffff0011184d10
[ ffffff0011184d10 _resume_from_idle+0xf1() ]
  ffffff0011184d50 swtch_to+0xe5(ffffff02eae948e0)
  ffffff0011184db0 shuttle_resume+0x328(ffffff02eae948e0, ffffffffc00c2ed0)
  ffffff0011184e50 door_return+0x21a(fedae9f8, 408, 0, 0, fedaee00, f5f00)
  ffffff0011184ec0 doorfs32+0x134(fedae9f8, 408, 0, fedaee00, f5f00, a)
  ffffff0011184f10 sys_syscall32+0x101()
stack pointer for thread ffffff02d6b18c20: ffffff000ff00d30
[ ffffff000ff00d30 _resume_from_idle+0xf1() ]
  ffffff000ff00d60 swtch+0x147()
  ffffff000ff00db0 shuttle_swtch+0x259(ffffffffc00c2ed0)
  ffffff000ff00e50 door_return+0x242(0, 0, 0, 0, fec9ee00, f5f00)
  ffffff000ff00ec0 doorfs32+0x134(0, 0, 0, fec9ee00, f5f00, a)
  ffffff000ff00f10 sys_syscall32+0x101()
stack pointer for thread ffffff02d6abb400: ffffff000f9dab30
[ ffffff000f9dab30 _resume_from_idle+0xf1() ]
  ffffff000f9dab60 swtch+0x147()
  ffffff000f9dabe0 cv_timedwait_sig_internal+0x1d6(ffffff02d60a30dc, ffffff02d60a30e0, fb24, 0)
  ffffff000f9dac10 cv_timedwait_sig+0x1f(ffffff02d60a30dc, ffffff02d60a30e0, fb24)
  ffffff000f9dac40 kcf_svc_wait+0x86(ffffff000f9dac5c)
  ffffff000f9dacc0 cryptoadm_ioctl+0xe0(9a00000000, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4)
  ffffff000f9dad00 cdev_ioctl+0x45(9a00000000, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4)
  ffffff000f9dad40 spec_ioctl+0x83(ffffff02d68eca00, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4, 0)                
  ffffff000f9dadc0 fop_ioctl+0x7b(ffffff02d68eca00, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4, 0)
  ffffff000f9daec0 ioctl+0x18e(3, 790d, feb91fc0)
  ffffff000f9daf10 sys_syscall32+0x101()
stack pointer for thread ffffff02d69fa540: ffffff000f9c1b30
[ ffffff000f9c1b30 _resume_from_idle+0xf1() ]
  ffffff000f9c1b60 swtch+0x147()
  ffffff000f9c1be0 cv_timedwait_sig_internal+0x1d6(ffffff02d616e5e0, ffffff02d616e5e8, fb24, 0)
  ffffff000f9c1c10 cv_timedwait_sig+0x1f(ffffff02d616e5e0, ffffff02d616e5e8, fb24)
  ffffff000f9c1c40 kcf_svc_do_run+0x7d()
  ffffff000f9c1cc0 cryptoadm_ioctl+0x9b(9a00000000, 790e, 8, 100003, ffffff02d5747248, 
  ffffff000f9c1de4)
  ffffff000f9c1d00 cdev_ioctl+0x45(9a00000000, 790e, 8, 100003, ffffff02d5747248, ffffff000f9c1de4)
  ffffff000f9c1d40 spec_ioctl+0x83(ffffff02d68eca00, 790e, 8, 100003, ffffff02d5747248, 
  ffffff000f9c1de4, 0)
  ffffff000f9c1dc0 fop_ioctl+0x7b(ffffff02d68eca00, 790e, 8, 100003, ffffff02d5747248, 
  ffffff000f9c1de4, 0)
  ffffff000f9c1ec0 ioctl+0x18e(3, 790e, 8)
  ffffff000f9c1f10 sys_syscall32+0x101()



It turns out the issue I encountered was due to a bug, which will hopefully be fixed in the near future.