Managing 100s of Linux and Solaris machines with clusterit


I use numerous tools to perform my SysAdmin duties. One of my favorite tools is clusterit, a suite of programs that lets you run commands across one or more machines in parallel. To begin using the awesomeness that is clusterit, you will first need to download and install the software. This is as easy as:

$ wget http://prdownloads.sourceforge.net/clusterit/clusterit-2.5.tar.gz

$ tar xfvz clusterit-2.5.tar.gz

$ cd clusterit-2.5 && ./configure --prefix=/usr/local/clusterit && make && make install

Once the software is installed, you should have a set of binaries and manual pages in /usr/local/clusterit. To use the various tools in the clusterit/bin directory, you will first need to create one or more cluster files. Each cluster file contains a list of hosts you want to manage as a group, with one host per line. Here is an example:

$ cat servers
foo1
foo2
foo3
foo4
foo5

The cluster file listed above contains five servers named foo1 through foo5. To tell clusterit you want to use this list of hosts, you will need to export the path to the file via the $CLUSTER environment variable:

$ export CLUSTER=/home/matty/clusters/servers
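As a rough sketch of this step, the cluster file can be generated and sanity-checked straight from the shell. The temporary directory and the host names foo1 through foo5 are placeholders for illustration only; substitute your own path and hosts.

```shell
# Hypothetical location for the cluster file; any readable path works.
cluster_dir=$(mktemp -d)
cluster_file="$cluster_dir/servers"

# Write one host name per line.
printf '%s\n' foo1 foo2 foo3 foo4 foo5 > "$cluster_file"

# Point the clusterit tools at this list of hosts.
export CLUSTER="$cluster_file"

# Sanity check: count the non-empty lines in the file.
host_count=$(grep -c . "$CLUSTER")
echo "CLUSTER=$CLUSTER lists $host_count hosts"
```

Keeping one file per group of machines (for example, one for web servers and one for databases) makes it easy to switch targets by re-exporting CLUSTER.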

Once you specify the list of hosts you want to use in the $CLUSTER variable, you can start using the various tools. One of the handiest tools is dsh, which allows you to run commands across the hosts in parallel:

$ dsh uptime

foo1 : 2:17pm up 8 day(s), 23:37, 1 user, load average: 0.06, 0.06, 0.06
foo2 : 2:17pm up 8 day(s), 23:56, 0 users, load average: 0.03, 0.03, 0.02
foo3 : 2:17pm up 7 day(s), 23:32, 1 user, load average: 0.27, 2.04, 3.21
foo4 : 2:17pm up 7 day(s), 23:33, 1 user, load average: 3.98, 2.07, 0.96
foo5 : 2:17pm up 5:06, 0 users, load average: 0.08, 0.09, 0.09

In the example above I ran the uptime command across all the servers listed in the file referenced by the CLUSTER variable! You can also do more complex activities through dsh:

$ dsh 'if uname -a | grep SunOS >/dev/null; then echo Solaris; fi'

foo1 : Solaris
foo2 : Solaris
foo3 : Solaris
foo4 : Solaris
foo5 : Solaris

This example uses dsh to run uname across a batch of servers, and prints the string Solaris if the keyword “SunOS” is found in the uname output. Clusterit also comes with a distributed scp command called pcp, which you can use to copy a file to a number of hosts in parallel:

$ pcp /etc/services /tmp

services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00
services 100% 616KB 616.2KB/s 00:00

$ openssl md5 /etc/services

MD5(/etc/services)= 14801984e8caa4ea3efb44358de3bb91

$ dsh openssl md5 /tmp/services

foo1 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo2 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo3 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo4 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
foo5 : MD5(/tmp/services)= 14801984e8caa4ea3efb44358de3bb91
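With five hosts the hashes are easy to compare by eye, but with hundreds it helps to let the shell do it. The following is a minimal sketch, assuming dsh output lines of the form "host : MD5(path)= hash" as shown above. The function name find_checksum_mismatches is made up for this illustration and is not part of clusterit.

```shell
# Hypothetical helper, not part of clusterit: read "host : MD5(path)= hash"
# lines on stdin and print every host whose hash differs from the expected one.
# Exits non-zero if at least one host mismatches.
find_checksum_mismatches() {
    expected=$1
    awk -v want="$expected" '
        BEGIN { bad = 0 }
        # Skip blank or malformed lines; the hash is the last field.
        NF >= 4 && $NF != want { print $1; bad = 1 }
        END { exit bad }
    '
}

# Intended usage (left commented out because it needs a live cluster):
#   local_sum=$(openssl md5 /etc/services | awk '{ print $NF }')
#   dsh openssl md5 /tmp/services | find_checksum_mismatches "$local_sum"
```

An empty result and a zero exit status mean every listed host reported the same checksum as the local file.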

In the pcp example above, I copy /etc/services to /tmp on each host, and then use dsh to checksum the copy on every host so it can be compared against the local file. Clusterit also comes with a distributed top (dtop), a distributed df (pdf), and a number of job control tools! If you are currently performing management operations using the old for stanza:

for host in `cat hosts`
do
    ssh "$host" 'run_some_command'
done

You really owe it to yourself to set up clusterit. You will be glad you did!
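To give a feel for what the for stanza is missing, here is a rough sketch of how much it has to grow before it runs in parallel and labels its output the way the dsh examples do. This is not how clusterit is implemented. The function name run_on_hosts and the REMOTE_SHELL variable are made up for this illustration, and the hosts file is assumed to hold one host name per line.

```shell
# Hypothetical helper, not part of clusterit: run a command on every host
# listed (one per line) in a file, in parallel, and prefix each line of
# output with the host name.
REMOTE_SHELL=${REMOTE_SHELL:-ssh}

run_on_hosts() {
    hosts_file=$1
    shift
    while read -r host; do
        # Skip blank lines in the hosts file.
        [ -n "$host" ] || continue
        # </dev/null keeps the remote shell from consuming the host list.
        "$REMOTE_SHELL" "$host" "$@" </dev/null 2>&1 | sed "s/^/$host : /" &
    done < "$hosts_file"
    # Block until every background job has finished.
    wait
}

# Intended usage (left commented out because it needs reachable hosts):
#   run_on_hosts hosts uptime
```

Even this version has no timeouts, no limit on concurrent connections, and no ordering of the output, which is a good argument for reaching for dsh instead.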

This article was posted by Matty on 2010-05-28 10:11:00 -0400