How to ensure you can boot if your initrd image has problems

I was playing around with some new kernel bits a few weeks back, and needed to update my initrd image. Having encountered various situations where a box wouldn’t boot due to a botched initrd file, I have become overly protective of this file. Now each time I have to perform an update, I will first create a backup of the file:

$ cp /boot/initrd-2.6.30.10-105.2.23.fc11.x86_64.img /boot/initrd-2.6.30.10-105.2.23.fc11.x86_64.img.bak.05122010

Once I have a working backup, I like to add a menu.lst entry that allows me to restore to a known working state:

title Fedora 11 (2.6.30.10-105.2.23.fc11.x86_64.bak.5122010)
     root (hd0,0)
     kernel /vmlinuz-2.6.30.10-105.2.23.fc11.x86_64 ro root=LABEL=/
     initrd /initrd-2.6.30.10-105.2.23.fc11.x86_64.img.bak.05122010

If my changes cause the machine to fail to boot, I can pick the backup menu entry and I’m off and running. Backups are key, and not having to boot into rescue mode is huge. :) If you don’t want to pollute your menu.lst, you can also specify the kernel and initrd manually from the grub command line, as shown below.
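
Here is roughly what that looks like from the grub legacy shell (press "c" at the grub menu to get a prompt), assuming the same (hd0,0) layout as the menu.lst entry above:

grub> root (hd0,0)
grub> kernel /vmlinuz-2.6.30.10-105.2.23.fc11.x86_64 ro root=LABEL=/
grub> initrd /initrd-2.6.30.10-105.2.23.fc11.x86_64.img.bak.05122010
grub> boot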

2.6.32 Linux kernel virtualization memory de-duplication

This is pretty sweet.  The 2.6.32 Linux kernel, released yesterday, introduces KSM (Kernel Samepage Merging), a feature that de-duplicates memory across virtualized instances.

Modern operating systems already use memory sharing extensively: forked processes initially share all of their memory with the parent, shared libraries are mapped once, and so on. Virtualization, however, can’t benefit easily from memory sharing. Even when all of the VMs are running the same OS with the same kernel and libraries, the host kernel can’t know that many of those pages are identical and could be shared.

KSM allows those pages to be shared. The KSM kernel daemon, ksmd, periodically scans areas of user memory, looking for pages of identical content which can be replaced by a single write-protected page (which is automatically COW’ed if a process wants to update it). Not all memory is scanned; the areas to search for merge candidates are specified by userspace apps using madvise(2): madvise(addr, length, MADV_MERGEABLE).
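
To give a feel for the userspace side, here is a minimal sketch (my own illustration, not code from KVM or any real hypervisor) of how an application might mark a region as a merge candidate. It assumes a 2.6.32+ kernel built with CONFIG_KSM, and that ksmd has been enabled by writing 1 to /sys/kernel/mm/ksm/run:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* MADV_MERGEABLE was added in 2.6.32; define it for older headers. */
#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12
#endif

#define REGION_SIZE (16UL * 1024 * 1024)   /* hypothetical 16 MB region */

int main(void)
{
        /* Allocate an anonymous region, similar to what a hypervisor
         * would hand to a guest as RAM. */
        void *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (region == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /* Tell ksmd this region is a candidate for merging. Pages with
         * identical content may be replaced with a single write-protected
         * page, which is COW'ed again if the process writes to it. */
        if (madvise(region, REGION_SIZE, MADV_MERGEABLE) == -1) {
                perror("madvise(MADV_MERGEABLE)");
                exit(1);
        }

        /* ... use the memory; ksmd scans and merges in the background ... */
        return 0;
}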

The result is a dramatic decrease in memory usage in virtualization environments. On a virtualization server, Red Hat found that, thanks to KSM, KVM can run as many as 52 Windows XP VMs with 1 GB of RAM each on a server with just 16 GB of RAM. Because KSM works transparently to userspace apps, it can be adopted very easily, and provides huge memory savings for free to current production systems. It was originally developed for use with KVM, but it can also be used with any other virtualization system, or even in non-virtualization workloads, for example applications that have several processes using lots of memory that could be shared.
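
If you want to watch KSM do its thing, the ksmd knobs and counters live in sysfs on kernels built with CONFIG_KSM. Enabling the daemon and checking how many pages are currently being shared looks like this:

$ echo 1 > /sys/kernel/mm/ksm/run

$ cat /sys/kernel/mm/ksm/pages_sharing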

How the Linux OOM killer works

Most admins have probably experienced failures due to applications leaking memory, or worse yet consuming all of the virtual memory (physical memory + swap) on a host. The Linux kernel has an interesting way of dealing with memory exhaustion, and it comes in the form of the Linux OOM (out of memory) killer. When invoked, the OOM killer will begin terminating processes in order to free up enough memory to keep the system operational. I was curious how the OOM killer worked, so I decided to spend some time reading through the linux/mm/oom_kill.c Linux kernel source code file to see what it does.

The OOM killer uses a point system to pick which processes to kill. The points are assigned by the badness() function, which contains the following block comment:

/**
 * badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @uptime: current uptime in seconds
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */

The actual code in this function does the following:

– Processes that have the PF_SWAPOFF flag set are killed first

– Processes which fork a lot of child processes are next in line

– Niced processes score higher, since they are typically less important

– Superuser processes score lower, since they are usually more important
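
Putting those rules together, here is a simplified userspace sketch of the scoring logic. This is my own illustration of the heuristics above, not the actual kernel code, and the parameters are hypothetical stand-ins for the fields badness() pulls out of task_struct and mm_struct:

#include <limits.h>

/* A toy integer square root, standing in for the kernel's int_sqrt(). */
static unsigned long int_sqrt(unsigned long x)
{
        unsigned long r = 0;

        while ((r + 1) * (r + 1) <= x)
                r++;
        return r;
}

/* Hypothetical sketch of the badness() heuristics described above. */
static unsigned long badness_sketch(unsigned long total_vm,
                                    unsigned long children_vm,
                                    unsigned long cpu_time,
                                    unsigned long run_time,
                                    int nice, int superuser, int swapoff)
{
        unsigned long points;

        /* Tasks busy turning off swap are killed first. */
        if (swapoff)
                return ULONG_MAX;

        /* The baseline score is the size of the task's address space. */
        points = total_vm;

        /* Tasks that fork a lot inherit part of their children's
         * memory usage, moving them up the kill list. */
        points += children_vm / 2;

        /* Tasks that have accumulated CPU time or have been running a
         * long time lose points, since killing them wastes more work. */
        if (cpu_time)
                points /= int_sqrt(cpu_time);
        if (run_time)
                points /= int_sqrt(int_sqrt(run_time));

        /* Niced tasks are assumed to be less important. */
        if (nice > 0)
                points *= 2;

        /* Superuser tasks get a measure of protection. */
        if (superuser)
                points /= 4;

        /* The highest-scoring task is the one that gets killed. */
        return points;
}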

As the sketch hints, the code also takes into account the length of time the process has been running, which may or may not be a good thing. It’s interesting to see how technologies we take for granted actually work, and this experience really helped me understand what all the fields in the task_struct structure are used for. Now to dig into mm_struct. :)

Scanning SCSI controllers for new LUNs on Centos and Fedora Linux hosts

While building out a new ESX guest, I had to scan for a new SCSI device I added. To scan a SCSI controller for new LUNs, you can echo the "- - -" string to the SCSI controller's scan sysfs node:

$ echo "- - -" > /sys/class/scsi_host/host0/scan

Now you may be asking yourself, what do those three dashes mean? Well, here is the answer from the Linux 2.6.31 kernel source (I had to look this up to recall):

static int scsi_scan(struct Scsi_Host *shost, const char *str)
{
        char s1[15], s2[15], s3[15], junk;
        unsigned int channel, id, lun;
        int res;

        res = sscanf(str, "%10s %10s %10s %c", s1, s2, s3, &junk);
        if (res != 3)
                return -EINVAL;
        if (check_set(&channel, s1))
                return -EINVAL;
        if (check_set(&id, s2))
                return -EINVAL;
        if (check_set(&lun, s3))
                return -EINVAL;
        if (shost->transportt->user_scan)
                res = shost->transportt->user_scan(shost, channel, id, lun);
        else
                res = scsi_scan_host_selected(shost, channel, id, lun, 1);
        return res;
}

As you can see above, the three values passed to the scan node are the channel, id and lun you want to scan. The "-" acts as a wildcard, which causes all of the channels, ids and luns to be scanned. The more I dig into the Linux kernel source code, the more I realize just how cool the Linux kernel is. I think it’s about time to write a device driver. :)
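
If you only want to scan a specific device rather than everything on the controller, you can replace the wildcards with real values (the channel, id and lun numbers below are just an example):

$ echo "0 0 1" > /sys/class/scsi_host/host0/scan

This scans channel 0, id 0, lun 1 on host0, and nothing else.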

Managing /etc/sysctl.conf with the sysctl utility

The Linux kernel provides the sysctl interface to modify values that reside under the /proc/sys directory. Sysctl values are typically stored in /etc/sysctl.conf, and are applied using the sysctl utility. To set a sysctl variable to a specific value, you can run sysctl with the “-w” (change a specific sysctl variable) option:

$ sysctl -w net.ipv4.ip_forward=1
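
One thing to keep in mind is that "-w" only changes the value in the running kernel; to make a setting stick across reboots, it also needs to be added to /etc/sysctl.conf. You can read back the current value by passing the variable name without a value:

$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1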

To apply all of the settings in /etc/sysctl.conf to a system, you can run sysctl with the “-p” (apply the sysctl values in /etc/sysctl.conf to a running server) option:

$ sysctl -p

net.ipv4.ip_forward = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 1
kernel.core_uses_pid = 1
net.core.rmem_max = 16842752
net.core.wmem_max = 16842752
net.ipv4.tcp_rmem = 4096 65535 16842752
net.ipv4.tcp_wmem = 4096 65535 16842752

The sysctl interface is pretty powerful, and you can learn more about the individual sysctl variables by perusing the Documentation/sysctl/ directory that ships with the Linux kernel source code.

Viewing the contents of an initrd image

I was doing some research tonight, and needed to look inside my initrd image to see if a couple of device drivers were present. Initrd images are gzip-compressed cpio archives, which allows a pipeline like the following to be used to extract the contents of an image:

$ gunzip < initrd-2.6.29.4-167.fc11.x86_64.img | cpio -i --make-directories
14050 blocks

Once extracted, you can use standard commands like cd, ls, and cat to view the files and directories that are part of the image:

$ ls -l

drwxrwxrwt. 11 root root    4096 2009-07-14 23:30 .
drwxr-xr-x. 23 root root    4096 2009-07-07 14:55 ..
drwx------   2 root root    4096 2009-07-14 23:30 bin
drwx------   3 root root    4096 2009-07-14 23:30 dev
drwx------   4 root root    4096 2009-07-14 23:30 etc
-rwx------   1 root root    1938 2009-07-14 23:30 init
-rw-------   1 root root 3428816 2009-07-14 23:29 initrd-2.6.29.4-167.fc11.x86_64.img
drwx------   6 root root    4096 2009-07-14 23:30 lib
drwx------   2 root root    4096 2009-07-14 23:30 lib64
drwx------   2 root root    4096 2009-07-14 23:30 proc
lrwxrwxrwx   1 root root       3 2009-07-14 23:30 sbin -> bin
drwx------   2 root root    4096 2009-07-14 23:30 sys
drwx------   2 root root    4096 2009-07-14 23:30 sysroot
drwx------   4 root root    4096 2009-07-14 23:30 usr

$ ls -l etc

total 20
-rw-r--r-- 1 root root   29 2009-07-14 23:30 fedora-release
-rw-r--r-- 1 root root 1976 2009-07-14 23:30 ld.so.cache
-rw-r--r-- 1 root root   28 2009-07-14 23:30 ld.so.conf
drwx------ 2 root root 4096 2009-07-14 23:30 ld.so.conf.d
drwx------ 2 root root 4096 2009-07-14 23:30 sysconfig
lrwxrwxrwx 1 root root   14 2009-07-14 23:30 system-release -> fedora-release
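
Since my goal was to verify that a couple of device drivers made it into the image, a quick find for kernel modules in the extracted tree finishes the job:

$ find . -name '*.ko'

If the drivers you care about show up in that list, they will be available at boot time.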