Problems growing RAID6 MD devices on RHEL5 systems

I attempted to grow an existing RAID6 MD device this week, and ran into the following error when I performed the grow operation:

$ mdadm --grow --raid-devices=5 --backup-file=/tmp/mdadmgrow.tmp /dev/md0
mdadm: Need to backup 384K of critical section..
mdadm: Cannot set device size/shape for /dev/md0: Invalid argument

It appears the ability to grow a RAID6 device was added in the Linux 2.6.21 kernel, and this feature has yet to be backported to RHEL5 (the mdadm manual page implies that this should work, so I reckon there is a documentation mismatch). If you are encountering this error, you will need to switch to a newer kernel in order to grow RAID6 devices on RHEL5 systems. 2.6.33 worked like a champ, and I hope this issue is addressed when RHEL6 ships.
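Before attempting a grow, it can help to check whether the running kernel is at least 2.6.21. The following is a minimal sketch, not a definitive test: the version_at_least helper is a name I made up for illustration, it only compares the leading dotted number reported by uname -r, and a vendor kernel may carry or lack features regardless of its version string.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if version $1 is at least version $2.
# Only the leading numeric dotted part is compared, so "2.6.18-194.el5"
# is treated as 2.6.18.
version_at_least() {
    have=$(echo "$1" | sed 's/[^0-9.].*$//')
    want=$(echo "$2" | sed 's/[^0-9.].*$//')
    # Compare up to four dotted fields numerically; missing fields count as 0.
    echo "$have $want" | awk '{
        n = split($1, a, "."); m = split($2, b, ".")
        for (i = 1; i <= 4; i++) {
            x = (i <= n) ? a[i] + 0 : 0
            y = (i <= m) ? b[i] + 0 : 0
            if (x > y) exit 0
            if (x < y) exit 1
        }
        exit 0
    }'
}

# 2.6.21 is the kernel version the post cites for RAID6 grow support.
if version_at_least "$(uname -r)" 2.6.21; then
    echo "kernel $(uname -r): RAID6 grow should be supported"
else
    echo "kernel $(uname -r): likely too old to grow a RAID6 array"
fi
```

A check like this only tells you what the version number suggests. The mdadm error above remains the authoritative sign that the running kernel cannot reshape the array.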

Debugging syslog-ng problems

While debugging the syslog-ng issue I mentioned previously, I needed to be able to observe the syslog-ng pattern matches as they occurred. The syslog-ng daemon has a couple of useful options to assist with this. The first is the “-e” option, which causes the daemon to log its internal messages to stderr. The second is the “-F” option, which stops the daemon from forking. When you combine these options with the “-d” (debug) and “-v” (verbose) options, syslog-ng will print each log message it receives along with the rule processing logic that is applied to that message:

$ /opt/syslog-ng/sbin/syslog-ng -e -F -d -v > /tmp/syslog-ng.out 2>&1

Incoming log entry; line='<85>sshd2[382]: Public key /root/.ssh/ used.\x0a'
Filter rule evaluation begins; filter_rule='f_web_hosts'
Filter node evaluation result; filter_result='match', filter_type='level'
Filter node evaluation result; filter_result='not-match'
Filter node evaluation result; filter_result='not-match'
Filter node evaluation result; filter_result='not-match', filter_type='OR'
Filter node evaluation result; filter_result='not-match', filter_type='AND'
Filter rule evaluation result; filter_result='not-match', filter_rule='f_web_hosts'
Filter rule evaluation begins; filter_rule='f_app_hosts'

When a syslog message matches a given rule, you will see the filter_result string change from not-match to match:

Filter rule evaluation result; filter_result='match', filter_rule='f_db_hosts'
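The debug output gets large quickly, so it can be handy to pull out just the rules that matched. Here is a small sketch that assumes the output was captured to /tmp/syslog-ng.out as in the command above and that the lines use the format shown (straight single quotes). The matched_rules function name is my own, not something syslog-ng provides.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every filter rule that matched,
# one per line, from a captured syslog-ng debug log.
matched_rules() {
    # 'not-match' results are skipped because the pattern requires ='match'.
    grep "Filter rule evaluation result; filter_result='match'" "$1" |
        sed "s/.*filter_rule='\([^']*\)'.*/\1/"
}

# Count how often each rule matched in the captured output.
if [ -r /tmp/syslog-ng.out ]; then
    matched_rules /tmp/syslog-ng.out | sort | uniq -c | sort -rn
fi
```

Only the rule-level result lines are considered. The per-node lines ("Filter node evaluation result") are ignored, since a node can match while the rule as a whole does not.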

Syslog-ng is pretty sweet, and you can check out my centralized logging presentation if you are interested in learning more about how this awesome piece of software works!

Breaking down system time usage in the Solaris kernel

I am frequently asked (or paged) to review system performance issues on our Solaris 10 hosts. I use the typical set of Solaris performance tools to observe what my systems are doing, and start drilling down once I know if the problem is with userland applications or in the kernel itself. When I observe issues with the Solaris kernel (these are typically represented by high system time values in vmstat), the first thing I typically do is fire up lockstat to see where the kernel is spending its time:

$ lockstat -kIW /bin/sleep 5

Profiling interrupt: 31424 events in 5.059 seconds (6212 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller                  
28962  92%  92% 0.00     3765 cpu[61]                cpu_halt                
 1238   4%  96% 0.00     3747 cpu[22]                lzjb_compress           
  165   1%  97% 0.00     2655 cpu[37]                copy_pattern            
  124   0%  97% 0.00     2849 cpu[55]                copyout                 
   89   0%  97% 0.00     3682 cpu[63]                fletcher_4_native       
   49   0%  97% 0.00     2565 cpu[37]                copyin                  
   45   0%  98% 0.00     3079 cpu[0]                 tcp_rput_data           
   39   0%  98% 0.00     3597 cpu[0]                 mutex_enter             
   30   0%  98% 0.00     3692 cpu[0]                 nxge_start              
   28   0%  98% 0.00     3701 cpu[59]+6              nxge_receive_packet     
   28   0%  98% 0.00     2935 cpu[0]                 disp_getwork            
   25   0%  98% 0.00     3110 cpu[0]                 bcopy                   

If I see a function that stands out from the rest, I will use the lockstat “-f” option to drill down by the kernel function with that name, and use the “-s” option to print the call stack leading up to this function:

$ lockstat -kIW -f lzjb_compress -s5 /bin/sleep 5

Profiling interrupt: 703 events in 2.058 seconds (342 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller                  
  130  18%  18% 0.00     3625 cpu[28]                lzjb_compress           

      nsec ------ Time Distribution ------ count     Stack                   
      2048 |                               2         
      4096 |@@@@@@@@@@@@@@@@@@@@@@@@       107       
      8192 |@@@@                           21        
Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller                  
   20   3%  21% 0.00     3529 cpu[37]                lzjb_compress           

      nsec ------ Time Distribution ------ count     Stack                   
      4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@    18        0x74                    
      8192 |@@@                            2         zio_compress_data       
Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller                  
   19   3%  24% 0.00     3696 cpu[28]                lzjb_compress           

      nsec ------ Time Distribution ------ count     Stack                   
      4096 |@@@@@@@@@@@@@@@@@@@@@@         14        0x50                    
      8192 |@@@@@@@                        5         zio_compress_data       

I use lockstat quite a bit to observe what the kernel is doing, and to help me figure out where I should look for answers in the OpenSolaris source code. It’s also useful for determining if you are encountering a kernel bug, since you can compare the backtrace returned from lockstat with the OpenSolaris bug databases.
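On a mostly idle box, cpu_halt dominates the profile and the interesting callers all round down to a few percent. The sketch below re-ranks the callers with the idle samples excluded. It assumes the profile was saved to a file first (for example with the lockstat “-o” option) and that the rows follow the column layout shown above. The busy_callers name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: summarise non-idle callers from saved
# `lockstat -kIW` output read on stdin. Data rows look like:
#   Count indv cuml rcnt nsec Hottest-CPU Caller
busy_callers() {
    awk '
        # Keep only data rows (a sample count followed by a percentage)
        # and drop the idle loop.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+%$/ && $NF != "cpu_halt" {
            count[$NF] += $1
            total += $1
        }
        END {
            # Print: samples, share of non-idle samples, caller.
            for (c in count)
                printf "%d %.1f%% %s\n", count[c], 100 * count[c] / total, c
        }
    ' | sort -rn
}

# Example usage (file name is an assumption):
#   lockstat -kIW -o /tmp/lockstat.out /bin/sleep 5
#   busy_callers < /tmp/lockstat.out
```

The percentages here are shares of the non-idle samples only, so they will not line up with the indv column in the raw lockstat output.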

Great write-up on AMD’s RVI (Rapid Virtualization Indexing) hardware assisted virtualization feature

I came across an awesome Q&A where Tim Mueting from AMD described the hardware virtualization features in AMD Opteron CPUs. The following excerpt from the interview was especially interesting:

“Prior to the introduction of RVI, software solutions used something called shadow paging to translate a virtual machine “guest” physical address to the system’s physical address. Because the original page table architecture wasn’t designed with virtualization in mind, a mirror of the page tables had to be created in software, called shadow page tables, to keep information about the physical location of “guest” memory. With shadow paging, the hypervisor must keep the shadow page tables “in sync” with the page tables in hardware. Every time the guest OS modifies its page mapping, the hypervisor must adjust the shadow page tables to reflect the modification. The constant updating of the shadow pages tables takes a lot of CPU cycles. As you might expect, for memory intensive applications, this process can make up the largest part of the performance overhead for virtualization.”

“With Rapid Virtualization indexing the virtual memory (Guest OS) to physical memory (Guest OS) and the physical memory (Guest OS) to real physical memory translations are cached in the TLB. As described earlier, we also added a new identifier to the TLB called an Address Space Identifier (ASID) which assigns each entry to a specific VM. With this tag, the TLB entries do not need to be flushed each time execution switches from one VM to another. This simplifies the work that the hypervisor needs to do and removes the need for the hypervisor to update shadow page tables. We can now rely on the hardware to determine the physical location of the guest memory.”

I just ordered a second AMD Opteron 1354 for my lab, and am looking forward to testing out the VMware fault tolerance feature once I receive my new CPU. Viva la virtualization!

Viewing the scripts that run when you install a Linux RPM

RPM packages can run scripts before and after a package is installed or removed. These scripts can perform actions like adding or removing users, cleaning up temporary files, or checking to make sure a software component that is contained within a package isn’t running. To view the contents of the scripts that will be run, you can use the rpm “--scripts” option:

$ rpm -q --scripts -p VirtualBox-3.1-3.1.4_57640_fedora11-1.x86_64.rpm |more

preinstall scriptlet (using /bin/sh):
# defaults
[ -r /etc/default/virtualbox ] && . /etc/default/virtualbox

# check for active VMs
if pidof VBoxSVC > /dev/null 2>&1; then
  echo "A copy of VirtualBox is currently running.  Please close it and try again. Please not
  echo "that it can take up to ten seconds for VirtualBox (in particular the VBoxSVC daemon) 
  echo "finish running."
  exit 1

RPM provides four types of install and uninstall scriptlets that can be run:

preinstall scriptlet — this will run before a package is installed
postinstall scriptlet — this will run after a package is installed
preuninstall scriptlet — this will run before a package is uninstalled
postuninstall scriptlet — this will run after a package is uninstalled
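If you only want to know which of these scriptlets a package carries, without paging through their bodies, you can filter the header lines out of the --scripts output. This is a rough sketch that relies on the header format shown above ("preinstall scriptlet (using /bin/sh):"). The scriptlet_types name is hypothetical, and other rpm versions may word the headers differently.

```shell
#!/bin/sh
# Hypothetical helper: print the scriptlet types found in
# `rpm -q --scripts` output read from stdin, one per line.
scriptlet_types() {
    awk '$1 ~ /^(pre|post)(un)?install$/ && $2 == "scriptlet" { print $1 }'
}

# Example usage (package.rpm is a placeholder file name):
#   rpm -q --scripts -p package.rpm | scriptlet_types
```

A package with no scriptlets produces no output.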

There are some awesome RPM options buried in the documentation, and you will definitely want to read through the various RPM resources prior to creating RPMs.

Creating a bootable OpenSolaris USB thumb drive

This past week, I needed to install OpenSolaris on a host using a USB thumb drive. To create a bootable USB drive, I first needed to snag the distribution constructor tools via Mercurial (I ran these commands from an OpenSolaris host):

$ pkg install SUNWmercurial

$ hg clone ssh://

The caiman slim source Mercurial repository contains a script named usbcopy, which you can use to copy a USB image from the genunix site to your USB drive:

$ usbcopy /nfs/images/osol-0811.usb

Found the following USB devices:
0:      /dev/rdsk/c9t0d0p0      7.6 GB  Patriot Memory   PMAP
Enter the number of your choice: 0

WARNING: All data on your USB storage will be lost.
Are you sure you want to install to
Patriot Memory PMAP, 7600 MB at /dev/rdsk/c9t0d0p0 ?  (y/n) y
Copying and verifying image to USB device
Finished 824 MB in 336 seconds (2.4MB/s)
0 block(s) re-written due to verification failure
Installing grub to USB device /dev/rdsk/c9t0d0s0
Completed copy to USB

After the image was copied, I plugged the drive into my machine and it booted to the OpenSolaris desktop without issue. From there I did an install and everything appears to be working flawlessly! Nice.