Debugging kickstart issues

Kickstart is used by numerous organizations to automate Red Hat, Fedora, and CentOS installations, and it provides a slew of options to control the provisioning process. One thing I really dig about kickstart is the logging that occurs on tty2, tty3, and tty4. This information is useful for debugging build issues, since the last messages usually indicate where a problem resides. But like most admins, I multitask: I typically kick off installs and come back later expecting to log in and configure the new host. Periodically a host will fail to build, but will be in a state where the debug messages are no longer available. To make sure I can easily debug issues like this, I like to add the syslog and loglevel directives to the list of kickstart options:

kernel /centos53/vmlinuz ip=dhcp syslog=192.168.1.5 loglevel=debug ksdevice=eth0 \
load_ramdisk=1 prompt_ramdisk=0 ramdisk_size=32768 network \
ks=http://192.168.1.5/centos/kickstart.cfg

The syslog directive tells the anaconda installer to send its output to the specified syslog server IP, and the loglevel directive controls how much data is logged. Once both directives are enabled, you will see output similar to the following each time a host builds:

May 20 22:36:15 drfeelgood INFO     trying to mount sda1 on /
May 20 22:36:15 drfeelgood INFO     set SELinux context for mountpoint / to system_u:object_r:root_t:s0
May 20 22:36:17 drfeelgood INFO     trying to mount sys on /sys
May 20 22:36:17 drfeelgood INFO     set SELinux context for mountpoint /sys to None
May 20 22:36:17 drfeelgood INFO     trying to mount proc on /proc
May 20 22:36:17 drfeelgood INFO     set SELinux context for mountpoint /proc to None
May 20 22:36:17 drfeelgood INFO     moving (1) to step migratefilesystems
May 20 22:36:17 drfeelgood INFO     moving (1) to step setuptime
May 20 22:36:17 drfeelgood INFO     moving (1) to step preinstallconfig
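For this to work, the host at 192.168.1.5 needs a syslog daemon that is willing to accept remote messages over the network. The exact setup depends on which daemon you run; with rsyslog’s legacy configuration format it looks something like this:

```
# /etc/rsyslog.conf -- accept remote syslog messages on UDP port 514
$ModLoad imudp
$UDPServerRun 514
```

With the older sysklogd package, starting syslogd with the “-r” flag accomplishes the same thing.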



To locate errors or warnings, you can run egrep with the strings WARN and ERROR in the search pattern:

$ egrep "drfeelgood.*(WARN|ERROR)" /var/adm/messages
May 20 22:36:17 drfeelgood WARNING  no dev package, going to bind mount /dev
May 20 22:36:38 drfeelgood WARNING  Try 1/10 for http://192.168.1.5/centos/5.3/x86_64/os/CentOS/nspr-devel-4.7.3-2.el5.x86_64.rpm failed
May 20 22:38:47 drfeelgood WARNING  /usr/lib64/python2.4/site-packages/snack.py:250: DeprecationWarning: integer argument expected, got float\n  self.w = _snack.scale(width, total)



Kickstart is pretty sweet, and it is one of those technologies that makes our lives so much easier!

Exiting from a shell script when a failure occurs

While debugging an issue with one of my scripts, I wanted to abort execution and exit when a non-zero return code occurred. I recalled reading about a bash flag that provided this behaviour, and after a few minutes of reading through bash(1) I came across the following set option:

“-e Exit immediately if a simple command (see SHELL GRAMMAR above) exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of a && or || list, or if the command’s return value is being inverted via !. A trap on ERR, if set, is executed before the shell exits.”
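The exceptions listed in that description are easy to trip over, so here is a small sketch that exercises a couple of them (using /bin/false as a stand-in for a failing command):

```shell
# Run a child shell with -e so the parent survives the demonstration.
status=0
sh -e -c '
# A failing command used as the test of an if statement does NOT abort:
if /bin/false; then echo "then branch"; fi

# Neither does a failure on the left-hand side of a || list:
/bin/false || echo "handled the failure"

# But a bare failing command exits the shell right here:
/bin/false
echo "never printed"
' || status=$?
echo "child exited with status $status"
```

Only “handled the failure” and the final status line are printed; the child shell exits with the failing command’s return code before reaching the last echo.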

So given the following test script that calls a non-existent command /bin/bork:

$ cat test
/bin/bork
echo "Made it after the bork"

We can see that a normal run completes to the end:

$ sh test
test: line 1: /bin/bork: No such file or directory
Made it after the bork

But an invocation with the “-e” option exits when a non-zero return value occurs:

$ sh -e test
test: line 1: /bin/bork: No such file or directory
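If you would rather not depend on remembering the “-e” flag at invocation time, you can enable the option inside the script itself with set -e (and disable it again with set +e for sections that are allowed to fail). A quick sketch:

```shell
# Recreate the test script with set -e built in:
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
set -e
/bin/bork
echo "Made it after the bork"
EOF

status=0
out=$(sh /tmp/test.sh 2>/dev/null) || status=$?
echo "output: '$out' (exit status $status)"
```

Since /bin/bork cannot be found, the script exits with status 127 and the trailing echo never runs.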

The bash manual page is full of super interesting stuff, and I just printed it out to learn more! Nice!

Duplicate RPM names showing up in the rpm query output

I had to install some software last night on one of my 64-bit CentOS Linux hosts, and noticed that glibc was listed twice in my rpm query:

$ rpm -q -a | grep glibc-2.5-34
glibc-2.5-34
glibc-2.5-34

At first I thought my RPM package database was borked, but then it dawned on me that there are probably 32- and 64-bit packages installed. To verify this, I used a custom rpm query string that displayed the architecture in addition to the package name and version:

$ rpm -qa --qf "%{name}-%{version}-%{release}.%{arch}\n" | grep glibc-2.5-34
glibc-2.5-34.i686
glibc-2.5-34.x86_64

This was indeed the case, and a full listing of each package showed that 32-bit libraries went into /lib, and 64-bit libraries got stashed into /lib64. I’m not sure why the default rpm query output doesn’t contain the package architecture, but it appears adding the following entry to /etc/rpm/macros fixes this (credit to the CentOS mailing list for the macro):

$ grep query /etc/rpm/macros
%_query_all_fmt %%{name}-%%{version}-%%{release}.%%{arch}

$ rpm -q -a | grep glibc-2.5
glibc-2.5-34.i686
glibc-2.5-34.x86_64

Managing COMSTAR logical units with sbdadm

With project COMSTAR, you can present logical units to one or more hosts connected via Ethernet or Fibre Channel. Logical units are managed with the sbdadm utility, which has options to add, delete, list, import, and modify logical units. To create a logical unit backed by a ZFS volume, pass the path to the volume’s raw device to the “create-lu” option:

$ zfs create -V 18g bits/testvol

$ sbdadm create-lu /dev/zvol/rdsk/bits/testvol

Created the following LU:

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f02d65840000004a04c6b90004      19327287296      /dev/zvol/rdsk/bits/testvol



If you would prefer to use a file instead of a volume, you can do that as well:

$ touch /foo

$ sbdadm create-lu -s 100g /foo

Created the following LU:

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f02d65840000004a117a790002      107374182400     /foo



In the example above, I used the COMSTAR thin provisioning support to create a 100GB logical unit backed by the file /foo. After a logical unit is added, it can be listed with the sbdadm “list-lu” option:

$ sbdadm list-lu

Found 4 LU(s)

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f02d65840000004a117a790002      26843545600      /foo
600144f02d65840000004a04c6b90004      19327287296      /dev/zvol/rdsk/bits/testvol
600144f02d65840000004a049b880001      107374116864     /dev/zvol/rdsk/bits/nevada110disk1
600144f02d658400000049fba1bf0001      107374116864     /dev/zvol/rdsk/bits/centosnode1



If you decide that a logical unit is no longer needed, you can remove it with the sbdadm “delete-lu” option:

$ sbdadm delete-lu 600144f02d65840000004a04c6b90004

If you need to increase or decrease the size of a logical unit, you can use the sbdadm “modify-lu” option:

$ sbdadm modify-lu -s 25g 600144f02d65840000004a117a790002

LU modified Successfully.



There is also an “import-lu” option which can be used to import a volume or file into COMSTAR. Nice!

Getting sed to substitute a newline on Linux and Solaris hosts

While crafting an install script a week or two ago, I came across an annoying issue with the Solaris sed utility. When I tried to substitute the string ‘, ‘ with a newline, I got this:

$ grep foo gemlist | sed -e 's/foo.*(//' -e 's/)//' -e 's/, /\n/g'
2.4n2.3n2.2n2.1

But when I ran the same substitution with GNU sed, everything worked as expected:

$ grep foo gemlist | gsed -e 's/foo.*(//' -e 's/)//' -e 's/, /\n/g'
2.4
2.3
2.2
2.1

It turns out that the Solaris sed is extremely dated: it treats the ‘\n’ in the replacement as a literal ‘n’, and requires an embedded newline instead:

$ grep foo gemlist | sed -e 's/foo.*(//' -e 's/)//' -e 's/, /\
> /g'
2.4
2.3
2.2
2.1
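If embedding a literal newline in the script feels fragile, tr provides a portable alternative whenever the delimiter can be reduced to a single character (the gemlist contents below are made up to mimic the format above):

```shell
# Sample input line mimicking the gemlist format used above:
printf 'foo (2.4, 2.3, 2.2, 2.1)\n' > /tmp/gemlist

# Strip the wrapper with sed, drop the spaces, then translate
# commas into newlines -- no sed newline quirks involved:
grep foo /tmp/gemlist | sed -e 's/foo.*(//' -e 's/)//' | tr -d ' ' | tr ',' '\n'
```

This prints one version per line on both GNU and Solaris userlands, since single-character translations with tr behave the same everywhere.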

In the end I rewrote my installer in Python, which made life much, much easier.

Figuring out what a hung Solaris process is doing inside the kernel

I had a process hang last week on one of my Solaris hosts, and was curious what each thread was doing. The mdb utility is perfect for locating this information, since you can combine pid2proc with the walk and findstack dcmds to get the call stack of each thread in a process (in the example below, I am examining process id 48):

$ mdb -k

> 0t48::pid2proc |::walk thread |::findstack -v
stack pointer for thread ffffff02d6fd1080: ffffff000fac1df0
[ ffffff000fac1df0 _resume_from_idle+0xf1() ]
  ffffff000fac1e20 swtch+0x147()
  ffffff000fac1e80 cv_wait_sig_swap_core+0x170(ffffff02d6fd1256, ffffff02d6fd1258, 0)
  ffffff000fac1ea0 cv_wait_sig_swap+0x18(ffffff02d6fd1256, ffffff02d6fd1258)
  ffffff000fac1ec0 pause+0x48()
  ffffff000fac1f10 sys_syscall32+0x101()
stack pointer for thread ffffff02d6fccc00: ffffff0011184d10
[ ffffff0011184d10 _resume_from_idle+0xf1() ]
  ffffff0011184d50 swtch_to+0xe5(ffffff02eae948e0)
  ffffff0011184db0 shuttle_resume+0x328(ffffff02eae948e0, ffffffffc00c2ed0)
  ffffff0011184e50 door_return+0x21a(fedae9f8, 408, 0, 0, fedaee00, f5f00)
  ffffff0011184ec0 doorfs32+0x134(fedae9f8, 408, 0, fedaee00, f5f00, a)
  ffffff0011184f10 sys_syscall32+0x101()
stack pointer for thread ffffff02d6b18c20: ffffff000ff00d30
[ ffffff000ff00d30 _resume_from_idle+0xf1() ]
  ffffff000ff00d60 swtch+0x147()
  ffffff000ff00db0 shuttle_swtch+0x259(ffffffffc00c2ed0)
  ffffff000ff00e50 door_return+0x242(0, 0, 0, 0, fec9ee00, f5f00)
  ffffff000ff00ec0 doorfs32+0x134(0, 0, 0, fec9ee00, f5f00, a)
  ffffff000ff00f10 sys_syscall32+0x101()
stack pointer for thread ffffff02d6abb400: ffffff000f9dab30
[ ffffff000f9dab30 _resume_from_idle+0xf1() ]
  ffffff000f9dab60 swtch+0x147()
  ffffff000f9dabe0 cv_timedwait_sig_internal+0x1d6(ffffff02d60a30dc, ffffff02d60a30e0, fb24, 0)
  ffffff000f9dac10 cv_timedwait_sig+0x1f(ffffff02d60a30dc, ffffff02d60a30e0, fb24)
  ffffff000f9dac40 kcf_svc_wait+0x86(ffffff000f9dac5c)
  ffffff000f9dacc0 cryptoadm_ioctl+0xe0(9a00000000, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4)
  ffffff000f9dad00 cdev_ioctl+0x45(9a00000000, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4)
  ffffff000f9dad40 spec_ioctl+0x83(ffffff02d68eca00, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4, 0)                
  ffffff000f9dadc0 fop_ioctl+0x7b(ffffff02d68eca00, 790d, feb91fc0, 100003, ffffff02d5747248, 
  ffffff000f9dade4, 0)
  ffffff000f9daec0 ioctl+0x18e(3, 790d, feb91fc0)
  ffffff000f9daf10 sys_syscall32+0x101()
stack pointer for thread ffffff02d69fa540: ffffff000f9c1b30
[ ffffff000f9c1b30 _resume_from_idle+0xf1() ]
  ffffff000f9c1b60 swtch+0x147()
  ffffff000f9c1be0 cv_timedwait_sig_internal+0x1d6(ffffff02d616e5e0, ffffff02d616e5e8, fb24, 0)
  ffffff000f9c1c10 cv_timedwait_sig+0x1f(ffffff02d616e5e0, ffffff02d616e5e8, fb24)
  ffffff000f9c1c40 kcf_svc_do_run+0x7d()
  ffffff000f9c1cc0 cryptoadm_ioctl+0x9b(9a00000000, 790e, 8, 100003, ffffff02d5747248, 
  ffffff000f9c1de4)
  ffffff000f9c1d00 cdev_ioctl+0x45(9a00000000, 790e, 8, 100003, ffffff02d5747248, ffffff000f9c1de4)
  ffffff000f9c1d40 spec_ioctl+0x83(ffffff02d68eca00, 790e, 8, 100003, ffffff02d5747248, 
  ffffff000f9c1de4, 0)
  ffffff000f9c1dc0 fop_ioctl+0x7b(ffffff02d68eca00, 790e, 8, 100003, ffffff02d5747248, 
  ffffff000f9c1de4, 0)
  ffffff000f9c1ec0 ioctl+0x18e(3, 790e, 8)
  ffffff000f9c1f10 sys_syscall32+0x101()



It turns out the issue I encountered was due to a bug, which will hopefully be fixed in the near future.