Generating NetBackup throughput reports

If you support NetBackup at your site, I’m sure you’ve had to look into issues with slow clients and failed backups. The nbstatus script I mentioned in a previous post is useful for identifying connection problems, but it doesn’t tell you how well your clients are performing. To get a better picture of how much data my clients are pushing, and how quickly, I wrote the nbthroughput shell script:

$ nbthroughput

Top 5 hosts by data written

Policy                Schedule              Storage Unit     Bytes       Bytes/s
--------------------  --------------------  ---------------  ----------  -------
ZeusFileSystem        Full-Quarterly        med01-hcart3-    9199683296  39376  
VMWareMBackups        Full-Weekly           med01-hcart3-    1756762304  84219  
VMWareBackups         Cumulative-Increment  med01-hcart3-    1035153514  34155  
VMWareBackups         Cumulative-Increment  med01-hcart3-    879121280   68009  
ApolloFileSystem      Full-Weekly           med01-disk       771900576   19919  

Fastest 5 clients (processed 10MB+)

Policy                Schedule              Storage Unit     Bytes       Bytes/s
--------------------  --------------------  ---------------  ----------  -------
Oracle01FileSystem    Differential-Increme  med01-hcart3-    3609248     128365 
App01FileSystem       Differential-Increme  med01-hcart3-    3569984     128250 
Web01FileSystem       Differential-Increme  med01-hcart3-    3550592     126423 
VMWareBackups         Cumulative-Increment  med01-disk       335576832   100559
ZeusFileSystem        Default-Application-  med01-disk       104857632   93847

Slowest 5 clients (processed 10MB+)

Policy                Schedule              Storage Unit     Bytes       Bytes/s
--------------------  --------------------  ---------------  ----------  -------
W2k3-1FileSystem      Differential-Increme  med01-disk       1298912     333 
W2k3-2FileSystem      Differential-Increme  med01-disk       1482752     2000 
W2k3-3FileSystem      Differential-Increme  med01-disk       1095936     2083 
W2k3-4FileSystem      Differential-Increme  med01-disk       4114880     2425 
W2k3-5FileSystem      Differential-Increme  med01-disk       3496576     2483 

The script will display the fastest clients, the slowest clients, and how much data your clients are pushing to your media servers. I find it useful, so I thought I would post it here for others to use.
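
If you want to roll your own version, the basic approach is simple: pull the completed jobs out of bpdbjobs, compute bytes per second from the kilobytes written and the elapsed time, and sort the results. Here is a rough sketch of the idea (this is not the actual script, and the comma-separated field positions it pulls out of bpdbjobs are assumptions, so verify them against your release):

#!/bin/bash
#
# Rough sketch of the nbthroughput approach -- NOT the actual script.
# The field positions below (policy, schedule, storage unit, KB written,
# elapsed seconds) are assumptions; check them against your bpdbjobs output.

BPDBJOBS=/usr/openv/netbackup/bin/admincmd/bpdbjobs

echo "Top 5 hosts by data written"
$BPDBJOBS -report -all_columns | awk -F, '
    $10 > 0 {                          # assumed field: elapsed seconds
        bytes = $15 * 1024             # assumed field: kilobytes written
        printf("%-20.20s  %-20.20s  %-15.15s  %-10d  %d\n",
               $5, $6, $12, bytes, bytes / $10)
    }' | sort -k4,4rn | head -5

Reversing the sort and keying on the last column (with a minimum-size filter) gives you the fastest and slowest client lists.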

Fixing Solaris hosts that boot to a grub> prompt

I applied the latest recommended patch bundle this week to two X4140 servers running Solaris 10. When I rebooted, I was greeted with a grub> prompt instead of the grub menu:

grub>

This wasn’t so good, and for some reason the stage1 / stage2 loaders were no longer installed correctly (or the zpool upgrade caused some issues). To fix this, I booted to single user mode from a Solaris 10 update 8 CD, adding “console=ttya -s” to the end of the boot line. Once the box was up, I ran ‘zpool status’ to verify my root pool was available:

$ zpool status

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors
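
One note: the bootadm command below references /a, so the boot environment needs to be mounted there. If the pool doesn’t show up automatically from the miniroot, something along these lines will import it under an alternate root and mount the root dataset (the boot environment name is just an example):

# import rpool under the alternate root /a and mount the boot environment
$ zpool import -f -R /a rpool
$ zfs list -r rpool/ROOT
$ zfs mount rpool/ROOT/s10x_u8wos_08a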

To re-install the grub stage1 and stage2 loaders, I ran installgrub (you can get the device to use from ‘zpool status’):

$ /sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0

stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 272 sectors starting at 50 (abs 16115)

To ensure that the boot archive was up to date, I ran ‘bootadm update-archive’:

$ bootadm update-archive -f -R /a

Creating boot_archive for /a
updating /a/platform/i86pc/boot_archive

Once these changes were made, I init 6’ed the system and it booted successfully. I’ve built up quite a grub cheat sheet over the years (it made this recovery a snap), and will post it here once I get it cleaned up.

Watching process creation on Linux hosts

I have been debugging a problem with Red Hat cluster, and was curious whether a specific process was getting executed. On my Solaris 10 hosts I can run execsnoop to observe system-wide process creation, but there isn’t anything comparable on my Linux hosts. The best I’ve found is systemtap, which provides the kprocess.exec probe to monitor exec() calls. To use this probe, you can stash the following in a file of your choosing (exec.stp in the example below):

# print the name and pid of the process calling exec(), along with
# the file that is being executed
probe kprocess.exec {
    printf("%s (pid: %d) is exec'ing %s\n", execname(), pid(), filename)
}

Once the file is created, you can execute the stap program to enable the exec probe:

$ stap -v exec.stp

This will produce output similar to the following:

modclusterd (pid: 5125) is exec'ing /usr/sbin/clustat
clurgmgrd (pid: 5129) is exec'ing /usr/share/cluster/clusterfs.sh
clusterfs.sh (pid: 5130) is exec'ing /usr/bin/dirname
clusterfs.sh (pid: 5131) is exec'ing /usr/bin/dirname
clusterfs.sh (pid: 5132) is exec'ing /usr/bin/dirname
clusterfs.sh (pid: 5134) is exec'ing /bin/basename
clusterfs.sh (pid: 5135) is exec'ing /sbin/consoletype
clusterfs.sh (pid: 5138) is exec'ing /usr/bin/readlink

While systemtap is missing various features that are available in DTrace, it’s still a super useful tool!

Cleaning up space used by the OpenSolaris pkg utility

I’ve been experimenting with the new OpenSolaris package manager (pkg), and ran into an odd issue last weekend. The flash drive I was running image-update on filled up, and after poking around I noticed that /var/pkg had some large directories:

$ cd /var/pkg

$ du -sh *

4.6M    catalog
1.5K    cfg_cache
892M    download
1.5K    file
4.5K    history
19M     index
1.5K    lost+found
146M    pkg
318K    state

In this specific case, pkg downloaded close to 900MB of packages to the download directory, but failed to remove them once the image was updated. :( The pkg tool currently doesn’t have a purge option to clean up this directory, so I had to go in and remove the cached files by hand. It appears bug #2266 is open to address this, and clearing out the download directory is safe (at least according to a post I read on the pkg mailing list). I think I prefer yum to the pkg tool, but hopefully I can be swayed once pkg matures a bit more!
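
For reference, the manual cleanup boiled down to something like this (clear the cache at your own risk):

# remove the package downloads that image-update left behind, then verify
$ rm -rf /var/pkg/download/*
$ du -sh /var/pkg/download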

Scanning Linux hosts for newly added ESX storage devices

I currently support a number of Linux hosts that run inside VMware vSphere. Periodically I need to add new storage devices to these hosts, which requires me to log in to the vSphere client and add the device through the “edit settings” selection. The cool thing about vSphere is that the LUNs are dynamically added to the guest, and the guest will see the devices once the SCSI bus has been scanned. There are several ways to scan for storage devices, but the simplest I’ve found is the rescan-scsi-bus.sh shell script that comes with the sg3_utils package.

To use rescan-scsi-bus.sh, you will first need to install the sg3_utils package:

$ yum install sg3_utils

Once installed, you can run fdisk or lsscsi to view the devices on your system:

$ lsscsi

[0:0:0:0]    disk    VMware   Virtual disk     1.0   /dev/sda
[1:0:0:0]    disk    VMware   Virtual disk     1.0   /dev/sdb

$ fdisk -l

Disk /dev/sda: 9663 MB, 9663676416 bytes
255 heads, 63 sectors/track, 1174 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1         651     5229126   83  Linux
/dev/sda2             652        1173     4192965   82  Linux swap / Solaris

Disk /dev/sdb: 19.3 GB, 19327352832 bytes
255 heads, 63 sectors/track, 2349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        2349    18868311   83  Linux

As you can see above, the system currently has two disk devices connected to it. To detect a device I just added through the vSphere client, we can run rescan-scsi-bus.sh on the host:

$ /usr/bin/rescan-scsi-bus.sh -r

Host adapter 0 (mptspi) found.
Host adapter 1 (mptspi) found.
Scanning SCSI subsystem for new devices
 and remove devices that have disappeared
Scanning host 0 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning for device 0 0 0 0 ...
OLD: Host: scsi0 Channel: 00 Id: 00 Lun: 00
      Vendor: VMware   Model: Virtual disk     Rev: 1.0 
      Type:   Direct-Access                    ANSI SCSI revision: 02
Scanning host 1 channels  0 for  SCSI target IDs  0 1 2 3 4 5 6 7, all LUNs
Scanning for device 1 0 0 0 ...
OLD: Host: scsi1 Channel: 00 Id: 00 Lun: 00
      Vendor: VMware   Model: Virtual disk     Rev: 1.0 
      Type:   Direct-Access                    ANSI SCSI revision: 02
Scanning for device 1 0 1 0 ...
NEW: Host: scsi1 Channel: 00 Id: 01 Lun: 00
      Vendor: VMware   Model: Virtual disk     Rev: 1.0 
      Type:   Direct-Access                    ANSI SCSI revision: 02
Scanning for device 1 0 1 0 ...
OLD: Host: scsi1 Channel: 00 Id: 01 Lun: 00
      Vendor: VMware   Model: Virtual disk     Rev: 1.0 
      Type:   Direct-Access                    ANSI SCSI revision: 02
0 new device(s) found.               
0 device(s) removed.                 

The scan output shows the devices it finds, and as you can see above, it located three “Virtual disk” devices, including the newly added one (flagged NEW). To verify the OS sees the new drive, we can run lsscsi and fdisk again:

$ lsscsi

[0:0:0:0]    disk    VMware   Virtual disk     1.0   /dev/sda
[1:0:0:0]    disk    VMware   Virtual disk     1.0   /dev/sdb
[1:0:1:0]    disk    VMware   Virtual disk     1.0   /dev/sdc

$ fdisk -l

Disk /dev/sda: 9663 MB, 9663676416 bytes
255 heads, 63 sectors/track, 1174 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1         651     5229126   83  Linux
/dev/sda2             652        1173     4192965   82  Linux swap / Solaris

Disk /dev/sdb: 19.3 GB, 19327352832 bytes
255 heads, 63 sectors/track, 2349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        2349    18868311   83  Linux

Disk /dev/sdc: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

There are various ways to check for new SCSI devices (Emulex has a tool, QLogic has a tool, you can scan for devices by hand, etc.), but the rescan script has proven to be the easiest solution for me. Since I didn’t write the rescan script, use this information at your own risk. It *should* work flawlessly, but you’re on your own when you run it!
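
If you would rather scan by hand (or don’t have sg3_utils installed), you can ask the kernel to rescan an adapter directly through sysfs; the host number below is just an example:

# the three dashes wildcard the channel, target and LUN on host1
$ echo "- - -" > /sys/class/scsi_host/host1/scan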

Displaying a file or device in hex

A while back I came across the hexdump utility, which allows you to dump the contents of a device or file in hexadecimal:

$ dd if=/dev/hda count=1 | hexdump -x

1+0 records in
1+0 records out
0000000    48eb    1090    d08e    00bc    b8b0    0000    d88e    c08e
0000010    befb    7c00    00bf    b906    0200    a4f3    21ea    0006
0000020    be00    07be    0438    0b75    c683    8110    fefe    7507
0000030    ebf3    b416    b002    bb01    7c00    80b2    748a    0203
0000040    0080    8000    e051    0001    0800    80fa    80ca    53ea

This is a super useful utility!
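
If you want to see the ASCII representation alongside the hex, the -C (canonical) option is also handy:

$ dd if=/dev/hda count=1 | hexdump -C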