I’ve been crazy busy over the past few months. In addition to preparing for my RHCE exam in July, I have also been studying for the Microsoft Windows Server 2008 MCITP certification. This is a huge change for me, and I wouldn’t have thought in a million years that I would be so focused on learning everything there is to know about Windows. But the reality is almost EVERY company out there runs Microsoft software, and to truly solve problems you need to know what each OS and application is capable of.
The more I mess around with Windows 2008, Active Directory, Windows clustering and the various applications that run on top of Windows Server, the more I’m starting to like it. While I don’t expect to become a full-time Windows administrator (vSphere, Solaris, Linux, AIX and storage will continue to be #1 on my list of things to learn about), I have definitely found a new appreciation for Windows and hope to use it more in the future. If you are a heavy Windows Server user, please let me know what you like and dislike about it. I’ll share my list in a follow-up post.
I use numerous tools to perform my SysAdmin duties. One of my favorite tools is clusterit, a suite of programs that allows you to run commands across one or more machines in parallel. To begin using the awesomeness that is clusterit, you will first need to download and install the software. This is as easy as:
$ wget http://prdownloads.sourceforge.net/clusterit/clusterit-2.5.tar.gz
$ tar xfvz clusterit-2.5.tar.gz
$ cd clusterit-2.5 && ./configure --prefix=/usr/local/clusterit && make && make install
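If you want the new tools on your PATH (purely optional, and assuming a Bourne-style shell), you can append the bin directory under the install prefix:
$ export PATH=$PATH:/usr/local/clusterit/bin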
Once the software is installed, you should have a set of binaries and manual pages in /usr/local/clusterit. To use the various tools in the clusterit/bin directory, you will first need to create one or more cluster files. Each cluster file contains a list of hosts you want to manage as a group, and each host is separated by a newline. Here is an example:
$ cat servers
foo1
foo2
foo3
foo4
foo5
The cluster file listed above contains 5 servers named foo1 - foo5. To tell clusterit you want to use this list of hosts, you will need to point the CLUSTER environment variable at the file:
$ export CLUSTER=/home/matty/clusters/servers
Once you specify the list of hosts you want to use in the $CLUSTER variable, you can start using the various tools. One of the handiest tools is dsh, which allows you to run commands across the hosts in parallel:
$ dsh uptime
foo1 : 2:17pm up 8 day(s), 23:37, 1 user, load average: 0.06, 0.06, 0.06
foo2 : 2:17pm up 8 day(s), 23:56, 0 users, load average: 0.03, 0.03, 0.02
foo3 : 2:17pm up 7 day(s), 23:32, 1 user, load average: 0.27, 2.04, 3.21
foo4 : 2:17pm up 7 day(s), 23:33, 1 user, load average: 3.98, 2.07, 0.96
foo5 : 2:17pm up 5:06, 0 users, load average: 0.08, 0.09, 0.09
In the example above I ran the uptime command across all of the servers listed in the file referenced by the CLUSTER variable! You can also do more complex activities through dsh:
$ dsh 'if uname -a | grep SunOS >/dev/null; then echo Solaris; fi'
foo1 : Solaris
foo2 : Solaris
foo3 : Solaris
foo4 : Solaris
foo5 : Solaris
This example uses dsh to run uname across a batch of servers, and prints the string Solaris if the keyword “SunOS” is found in the uname output. Clusterit also comes with a distributed scp command called pcp, which you can use to copy a file to a number of hosts in parallel:
$ pcp /etc/services /tmp
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
$ openssl md5 /etc/services
MD5(/etc/services)= 14801984e8caa4ea3efb44358de3bb91
$ dsh openssl md5 /tmp/services
foo1 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo2 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo3 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo4 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo5 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
In this example I am using pcp to copy the file /etc/services to each host, and then using dsh to create a checksum of the file that was copied. Clusterit also comes with a distributed top (dtop) and a distributed df (pdf), as well as a number of job control tools! If you are currently performing management operations using the old for loop:
for i in `cat hosts`
do
    ssh $i 'run_some_command'
done
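With clusterit in place, that whole loop collapses into a one-liner (the path below is just an example; point CLUSTER at whatever hosts file you were feeding the loop):
$ export CLUSTER=/home/matty/clusters/hosts
$ dsh 'run_some_command'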
You really owe it to yourself to set up clusterit. You will be glad you did!
I was playing around with some new kernel bits a few weeks back, and needed to update my initrd image. Having encountered various situations where a box wouldn’t boot due to a botched initrd file, I have become overly protective of this file. Now each time I have to perform an update, I will first create a backup of the file:
$ cp /boot/initrd-2.6.30.10-105.2.23.fc11.x86_64.img \
     /boot/initrd-2.6.30.10-105.2.23.fc11.x86_64.img.bak.05122010
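If you find yourself doing this regularly, the backup step is easy to script. Here is a rough sketch that stamps the copy with today’s date, assuming the initrd for the running kernel follows the stock Fedora naming scheme:
$ KVER=$(uname -r)
$ cp /boot/initrd-${KVER}.img /boot/initrd-${KVER}.img.bak.$(date +%m%d%Y)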
Once I have a working backup, I like to add a menu.lst entry that allows me to restore to a known working state:
title Fedora 11 (2.6.30.10-105.2.23.fc11.x86_64.bak.05122010)
root (hd0,0)
kernel /vmlinuz-2.6.30.10-105.2.23.fc11.x86_64 ro root=LABEL=/
initrd /initrd-2.6.30.10-105.2.23.fc11.x86_64.img.bak.05122010
If my changes cause the machine to fail to boot, I can pick the backup menu entry and I’m off and running. If you don’t want to pollute your menu.lst, you can also specify the initrd manually from the GRUB command line. Backups are key, and not having to boot into rescue mode is huge. :)
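For reference, booting the backup image by hand from the GRUB (legacy) prompt looks roughly like this, reusing the same device and file names from the menu entry above:
grub> root (hd0,0)
grub> kernel /vmlinuz-2.6.30.10-105.2.23.fc11.x86_64 ro root=LABEL=/
grub> initrd /initrd-2.6.30.10-105.2.23.fc11.x86_64.img.bak.05122010
grub> boot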
I’ve been debugging an odd issue with an 8-link aggregation (nxge interfaces) on a Solaris 10 host. When this issue rears its ugly head, one or more of the interfaces in the aggregation go offline. I’m still trying to track down whether this is a Cisco or a Solaris issue, and have been patching the server to make sure all of the drivers are current. When I applied the latest aggr and nxge patches, I noticed that these patches required four additional patches:
Required Patches: 127127-11 137137-09 139555-08 141444-09 (or greater)
To see if these patches were applied, I fired up my old friend awk:
$ showrev -p | nawk '$2 ~ /127127/ { print $2 }'
127127-11
$ showrev -p | nawk '$2 ~ /137137/ { print $2 }'
137137-09
$ showrev -p | nawk '$2 ~ /139555/ { print $2 }'
139555-08
$ showrev -p | nawk '$2 ~ /141444/ { print $2 }'
141444
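If you would rather check all of the prerequisites in one pass, a short loop over the patch IDs works just as well (a sketch built on the same showrev / nawk pipeline shown above):
for p in 127127 137137 139555 141444
do
    showrev -p | nawk -v p="$p" '$2 ~ ("^" p "-") { print $2 }'
done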
All of the patches I needed were indeed applied, so I was able to apply the nxge and aggr patches without issue. For those who are interested, I learned that standalone interface patches are not included with the latest patch bundle. Driver patches are only included in the patch bundle when they are rolled into a kernel update. This was news to me.
Reading through my RSS feeds, I came across a blog post describing how one Linux administrator uses tune2fs to disable the “please run fsck on this file system after X days or Y mounts” check. I’ve got to admit, the forced fsck is kind of annoying. I’ve taken production critical Linux boxes down for some maintenance, only to have the downtime extended by 15-30 minutes because the file system was configured to run a fsck. Google searching this topic even shows other administrators trying some questionable tactics to avoid the fsck on reboot.
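For reference, the tune2fs knobs in question look something like this (run as root; /dev/sda1 is just a placeholder device). The first command displays the current mount-count and interval settings, and the second disables both the mount-count and time-based checks:
$ tune2fs -l /dev/sda1 | egrep -i 'mount count|check'
$ tune2fs -c 0 -i 0 /dev/sda1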
Is there really any value in having fsck run after some period of time? On Unix-based systems (and even on Windows), fsck (or chkdsk) only runs when the kernel notices that a file system is in an inconsistent state. So then I ask, why did the Linux community decide to run fsck on file systems in a consistent state? ZFS has a “scrub” operation that can be run against a pool, but even that is comparing block-level checksums. Ext2/3, ReiserFS and XFS don’t perform block-level checksums (btrfs does), so why the need to run fsck after some period of time? Does running fsck give folks the warm n’ fuzzies that their data is clean, or is there some deeper technical reason why this is scheduled? If you have any answers / historical data, please feel free to share. =)
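For comparison, kicking off a ZFS scrub and checking on its progress is as simple as the following (the pool name “tank” is just a placeholder):
$ zpool scrub tank
$ zpool status -v tank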