The importance of cleaning up disk headers after testing

Yesterday I was running some benchmarks against a new MySQL server configuration. As part of my testing I wanted to see how things looked with ZFS as the back-end, so I loaded up some SSDs and attempted to create a ZFS pool. Zpool spit out a “device busy” error when I tried to create the pool, leading to a confused and bewildered matty. After a bit of tracing I noticed that mdadm was laying claim to my devices:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active (auto-read-only) raid5 sde[0] sdc[2] sdb[1]
1465148928 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

I had previously done some testing with mdadm, and it dawned on me that the md headers might still be resident on the disks. Sure enough, they were:

$ mdadm -E /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : dc867613:8e75d8e8:046b61bf:26ec6fc5
  Creation Time : Tue Apr 28 19:25:16 2009
     Raid Level : raid5
  Used Dev Size : 732574464 (698.64 GiB 750.16 GB)
     Array Size : 1465148928 (1397.27 GiB 1500.31 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 127

    Update Time : Fri Oct 21 17:18:22 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
       Checksum : d2cbaad - correct
         Events : 4

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       16        1      active sync   /dev/sdb

   0     0       8       64        0      active sync   /dev/sde
   1     1       8       16        1      active sync   /dev/sdb
   2     2       8       32        2      active sync   /dev/sdc

I didn’t run ‘mdadm --zero-superblock’ after my testing, so of course md thought it still owned these devices. After I zeroed the md superblocks I was able to create my pool without issue. Fun times in the debugging world. :)
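For reference, the cleanup itself is only a couple of commands: stop the stale array, then clear the superblock on each member. Something along these lines should do it (device names taken from the mdstat output above):

$ mdadm --stop /dev/md127

$ mdadm --zero-superblock /dev/sdb
$ mdadm --zero-superblock /dev/sdc
$ mdadm --zero-superblock /dev/sde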

Displaying device multi-path statistics on Linux hosts

Linux provides a number of useful utilities for viewing I/O statistics. Atop, iostat, iotop, etc. are extremely useful, but they don’t allow you to easily see I/O statistics for each path to a mapper device. To illustrate this, let’s say you have a mapper device named mpatha with four paths (represented by sdcl, sdnf, sdeq and sdpk in the output below):

$ multipath -ll mpatha
mpatha (12345456789) dm-95 MYARRAY, FastBizzle
size=4.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 1:0:1:25  sdcl 69:144  active ready running
  |- 3:0:2:25  sdnf 71:272  active ready running
  |- 1:0:2:25  sdeq 129:32  active ready running
  `- 3:0:3:25  sdpk 130:416 active ready running

To visualize the amount of I/O traversing down each path to a mapper device we need to combine iostat and grep:

$ iostat -xN 1 | egrep '(mpatha|sdcl|sdnf|sdeq|sdpk)'
sdcl              0.00     0.00    0.20    0.04     2.06     7.35    39.19     0.00    0.88    0.56    2.57   0.86   0.02
sdnf              0.00     0.00    0.20    0.04     2.43     7.09    39.66     0.00    0.81    0.45    2.68   0.79   0.02
sdeq              0.00     0.00    0.20    0.04     1.99     7.06    37.68     0.00    0.90    0.59    2.57   0.88   0.02
sdpk              0.00     0.00    0.20    0.04     1.93     7.03    37.33     0.00    0.79    0.45    2.63   0.77   0.02
mpatha     0.06     0.00    0.52    0.15     5.96    28.53    51.28     0.00    0.98    0.49    2.67   0.86   0.06

This solution requires you to look up the device you want to monitor, and the default output is terrible! I wanted a cleaner solution, so I wrote mpathstat. Mpathstat parses ‘multipath -ll’ to get the mapper devices on a system, as well as the block devices that represent the paths to each mapper device. The script then iterates over an iostat sample and combines everything in one location:

$ mpathstat.py 1
Device Name                                  Reads     Writes    KBytesR/S  KBytesW/S  Await   
mpatha                                       0.00      44.00     0.00       912.00     0.82    
|- sdcj                                      0.00      11.00     0.00       256.00     0.73    
|- sdpi                                      0.00      11.00     0.00       224.00     0.91    
|- sdeo                                      0.00      11.00     0.00       192.00     0.73    
|- sdnd                                      0.00      11.00     0.00       240.00     0.91    
mpathb                                       0.00      3.00      0.00       384.00     1.00    
|- sdmv                                      0.00      1.00      0.00       128.00     1.00    
|- sdcb                                      0.00      0.00      0.00       0.00       0.00    
|- sdpa                                      0.00      1.00      0.00       128.00     1.00    
|- sdeg                                      0.00      1.00      0.00       128.00     1.00    
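If you just need a one-off and don’t want to grab the script, a rough approximation is to build the egrep pattern from ‘multipath -ll’ yourself. Something like the following should work (it keys off the H:C:T:L field in the path lines, and uses the mpatha device from the example above):

$ paths=$(multipath -ll mpatha | awk '/[0-9]+:[0-9]+:[0-9]+:[0-9]+/ {print $3}' | paste -s -d'|')

$ iostat -xN 1 | egrep "(mpatha|${paths})"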

This has helped me track down a couple of issues with one of our enterprise storage arrays. If you make any enhancements to the script, please send me a pull request on GitHub!

Printing tape drive serial numbers

While automating a process this week I needed a way to get the serial numbers off a batch of tape drives. At first I thought I could retrieve this information through /sys, but after a bunch of poking around with cat, systool and udevadm I realized I couldn’t get what I wanted through /sys. One such failure:

$ udevadm info -a -n /dev/nst0 | grep -i serial

If I can’t get the serial # through /sys I can always poke the drive directly, right? Definitely! SCSI vital product data (VPD) is stored on the drive in a series of VPD pages, which can be viewed with the sg_vpd utility:

$ sg_vpd --page=0x80 /dev/nst0

VPD INQUIRY: Unit serial number page
  Unit serial number: 123456789012

The command above retrieves VPD page 0x80 (Unit Serial Number) and displays it in an easily parsed format. Seagate’s SCSI Commands Reference Manual is an excellent resource for understanding the SCSI protocol.
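Since I needed serials for a whole batch of drives, the natural next step is to wrap sg_vpd in a small loop. A quick sketch (the /dev/nst* glob and the awk parsing are just one way to slice it):

#!/bin/sh

# Print the unit serial number (VPD page 0x80) for each tape drive
for tape in /dev/nst*
do
    serial=`sg_vpd --page=0x80 ${tape} | awk '/Unit serial number:/ {print $NF}'`
    echo "${tape}: ${serial}"
done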

Creating happy little block devices with UDEV

I’m a long time admirer of Bob Ross and the amazing paintings he produced on his hit TV show The Joy of Painting. I’m equally a fan of udev and the power it places in administrators’ hands. While Bob painted amazing clouds, seascapes and mountains with a swipe of his brush, I’m able to make /dev just as beautiful with a few keyboard strokes (I guess that makes my keyboard my easel).

This past weekend I decided to clean up and automate the device creation process on several of my database servers. These servers had hundreds of mapper devices, which were defined in multipath {} sections in /etc/multipath.conf. The aliases used in these blocks mixed upper- and lower-case names, so the udev rules file had become a bit of a mess. Here is a snippet from /etc/multipath.conf to illustrate:

multipath {
    wwid  09876543210
    alias happydev_lun001
}

multipath {
    wwid  01234567890
    alias Happydev_lun002
}

…. more devs ….

By default udev creates /dev/dm-[0-9]+ entries in /dev. I prefer to have the friendly names (happydev_lunXXX in this case) added as well, so I don’t have to cross-reference the two when I’m dealing with storage issues. To ensure that the friendly names in /dev were consistently cased, I created the following udev rule:

$ cd /etc/udev/rules.d && cat 90-happydevs.rules

ENV{DM_NAME}=="[Hh]appydev*", NAME="%c", OWNER:="bob", GROUP:="ross", MODE:="660", PROGRAM="/usr/local/bin/lc_devname.sh $env{DM_NAME}"

The rule above checks DM_NAME to see if it starts with the string [Hh]appydev. If this check succeeds, udev will call the /usr/local/bin/lc_devname.sh helper script, which in turn calls tr to lower-case the name passed as an argument. Here are the contents of the lc_devname.sh script:

$ cat /usr/local/bin/lc_devname.sh

#!/bin/sh

# Lower-case the device name udev passes in as the first argument
echo "${1}" | tr '[A-Z]' '[a-z]'

The stdout from lc_devname.sh gets assigned to the %c variable, which I reference in the NAME key to create a friendly-named device in /dev. We can run udevadm test to make sure this works as expected:

$ udevadm test /block/dm-2 2>&1 | grep DEVNAME
udevadm_test: DEVNAME=/dev/happydev_lun001
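To apply the rule to devices that already exist without rebooting, reloading the rules and re-triggering the block devices should do the trick on anything recent enough to ship udevadm:

$ udevadm control --reload-rules

$ udevadm trigger --subsystem-match=block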

In the spirit of Bob Ross I udev’ed some pretty little block devices. :)

Locating WWPNs on Linux servers

I do a lot of storage-related work and often need to grab WWPNs to zone hosts and to mask storage. To gather the WWPNs I would often use the following script on my RHEL and CentOS servers:

#!/bin/sh

FC_PATH="/sys/class/fc_host"

for fc_adapter in `ls ${FC_PATH}`
do
    echo "${FC_PATH}/${fc_adapter}:"
    # The first field of symbolic_name is the adapter model; port_name holds the WWPN
    NAME=$(awk '{print $1}' ${FC_PATH}/${fc_adapter}/symbolic_name)
    echo "  ${NAME} `cat ${FC_PATH}/${fc_adapter}/port_name`"
done

This produced consolidated output similar to:

$ print_wwpns

/sys/class/fc_host/host5:
  QLE2562 0x21000024ff45afc2
/sys/class/fc_host/host6:
  QLE2562 0x21000024ff45afc3

While doing some research I came across sysfsutils, which contains the incredibly useful systool utility. This nifty little tool allows you to query sysfs values, and can be used to display all of the sysfs attributes for your FC adapters:

$ systool -c fc_host -v

Class = "fc_host"

  Class Device = "host5"
  Class Device path = "/sys/class/fc_host/host5"
    fabric_name         = "0x200a547fee9e5e01"
    issue_lip           = 
    node_name           = "0x20000024ff45afc2"
    port_id             = "0x0a00c0"
    port_name           = "0x21000024ff45afc2"
    port_state          = "Online"
    port_type           = "NPort (fabric via point-to-point)"
    speed               = "8 Gbit"
    supported_classes   = "Class 3"
    supported_speeds    = "1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit"
    symbolic_name       = "QLE2562 FW:v5.06.03 DVR:v8.03.07.09.05.08-k"
    system_hostname     = ""
    tgtid_bind_type     = "wwpn (World Wide Port Name)"
    uevent              = 

  Class Device = "host6"
  Class Device path = "/sys/class/fc_host/host6"
    fabric_name         = "0x2014547feeba9381"
    issue_lip           = 
    node_name           = "0x20000024ff45afc3"
    port_id             = "0x9700a0"
    port_name           = "0x21000024ff45afc3"
    port_state          = "Online"
    port_type           = "NPort (fabric via point-to-point)"
    speed               = "8 Gbit"
    supported_classes   = "Class 3"
    supported_speeds    = "1 Gbit, 2 Gbit, 4 Gbit, 8 Gbit"
    symbolic_name       = "QLE2562 FW:v5.06.03 DVR:v8.03.07.09.05.08-k"
    system_hostname     = ""
    tgtid_bind_type     = "wwpn (World Wide Port Name)"
    uevent              = 

This output is extremely useful for storage administrators, and provides everything you need in a nice consolidated form. +1 for systool!
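If all you are after are the WWPNs, systool can also be told to print a single attribute, which trims the output down considerably (assuming your sysfsutils build supports the -A flag):

$ systool -c fc_host -A port_name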

Installing ZFS on a CentOS 6 Linux server

As most of my long-term readers know, I am a huge Solaris fan. How can you not love an operating system that comes with ZFS, DTrace, Zones, FMA and network virtualization, amongst other things? I use Linux during my day job, and I’ve been hoping for quite some time that Oracle would port one or more of these technologies to Linux. Well, the first salvo has been fired, though it wasn’t from Oracle. It comes by way of the ZFS on Linux project, which is an in-kernel implementation of ZFS (this project is different from the FUSE ZFS port).

I had some free time this weekend to play around with ZFS on Linux, and my initial impressions are quite positive. The port is based on the latest version of ZFS that is part of OpenSolaris (pool version 28), so things like snapshots, deduplication, improved performance and ZFS send and recv are available out of the box. There are a few missing items, but from what I can tell from the documentation there is plenty more coming.

The ZFS file system for Linux comes as source code, which you build into loadable kernel modules (this is how they get around the license incompatibilities). The implementation also contains the userland utilities (zfs, zpool, etc.) most Solaris admins are used to, and they act just like their Solaris counterparts! Nice!

My testing occurred on a CentOS 6 machine, specifically 6.2:

$ cat /etc/redhat-release
CentOS release 6.2 (Final)

The build process is quite easy. Prior to compiling the source code you will need to install a few dependencies:

$ yum install kernel-devel zlib-devel libuuid-devel libblkid-devel libselinux-devel parted lsscsi

Once these are installed you can retrieve and build spl and zfs packages:

$ wget http://github.com/downloads/zfsonlinux/spl/spl-0.6.0-rc6.tar.gz

$ tar xfvz spl-0.6.0-rc6.tar.gz && cd spl*6

$ ./configure && make rpm

$ rpm -Uvh *.x86_64.rpm

Preparing...                ########################################### [100%]
   1:spl-modules-devel      ########################################### [ 33%]
   2:spl-modules            ########################################### [ 67%]
   3:spl                    ########################################### [100%]

$ wget http://github.com/downloads/zfsonlinux/zfs/zfs-0.6.0-rc6.tar.gz

$ tar xfvz zfs-0.6.0-rc6.tar.gz && cd zfs*6

$ ./configure && make rpm

$ rpm -Uvh *.x86_64.rpm

Preparing...                ########################################### [100%]
   1:zfs-test               ########################################### [ 17%]
   2:zfs-modules-devel      ########################################### [ 33%]
   3:zfs-modules            ########################################### [ 50%]
   4:zfs-dracut             ########################################### [ 67%]
   5:zfs-devel              ########################################### [ 83%]
   6:zfs                    ########################################### [100%]

If everything went as planned you now have the ZFS kernel modules and userland utilities installed! To begin using ZFS you will first need to load the kernel modules with modprobe:

$ modprobe zfs

To verify the modules loaded you can tail /var/log/messages:

Feb 12 17:54:27 centos6 kernel: SPL: Loaded module v0.6.0, using hostid 0x00000000
Feb 12 17:54:27 centos6 kernel: zunicode: module license 'CDDL' taints kernel.
Feb 12 17:54:27 centos6 kernel: Disabling lock debugging due to kernel taint
Feb 12 17:54:27 centos6 kernel: ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5

And run lsmod to verify they are there:

$ lsmod | grep -i zfs

zfs                  1038053  0 
zcommon                42478  1 zfs
znvpair                47487  2 zfs,zcommon
zavl                    6925  1 zfs
zunicode              323120  1 zfs
spl                   210887  5 zfs,zcommon,znvpair,zavl,zunicode

To create our first pool we can use the zpool utility’s create option:

$ zpool create mysqlpool mirror sdb sdc

The example above created a mirrored pool out of the sdb and sdc block devices. We can see this layout in the output of `zpool status`:

$ zpool status -v

  pool: mysqlpool
 state: ONLINE
 scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	mysqlpool   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdc     ONLINE       0     0     0

errors: No known data errors

Awesome! Since we are at pool version 28, let’s disable atime updates and enable compression and deduplication:

$ zfs set compression=on mysqlpool

$ zfs set dedup=on mysqlpool

$ zfs set atime=off mysqlpool
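To double check that the properties took effect, you can read them back with zfs get:

$ zfs get compression,dedup,atime mysqlpool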

For a somewhat real-world test, I stopped one of my MySQL slaves, mounted the pool on /var/lib/mysql, synchronized the previous data over to the ZFS file system and then started MySQL. No errors to report, and MySQL is working just fine. Next up, I trashed one side of the mirror and verified that resilvering works:

$ dd if=/dev/zero of=/dev/sdb

$ zpool scrub mysqlpool

I let this run for a few minutes, then ran `zpool status` to verify the scrub fixed everything:

$ zpool status -v

  pool: mysqlpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scan: scrub repaired 966K in 0h0m with 0 errors on Sun Feb 12 18:54:51 2012
config:

	NAME        STATE     READ WRITE CKSUM
	mysqlpool   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdb     ONLINE       0     0   175
	    sdc     ONLINE       0     0     0

I beat on the pool pretty good and didn’t encounter any hangs or kernel oopses. The file system port is still in its infancy, so I won’t be trusting it with production data quite yet. Hopefully it will mature in the coming months, and if we’re lucky maybe one of the major distributions will begin including it! That would be killer!!