While reviewing the performance on my desktop today, I noticed that one of my VirtualBox virtual machines was consuming 100% of one CPU:
Swap: 2097144k total, 59504k used, 2037640k free, 4004568k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1895 matty 20 0 2037m 1.0g 57m S 96.2 13.0 831:13.06 VirtualBox
This was somewhat perplexing, given that the virtual machine in question was sitting at a grub> prompt (I’m cleaning up my grub notes to post to my website, so stay tuned!). When I strace’ed the VirtualBox process (having an entire system encapsulated in a userland process makes debugging these types of issues super easy), I noticed that the process was issuing poll()s and read()s in a tight loop:
[pid 1895] poll([{fd=15, events=POLLIN}, {fd=22, events=POLLIN|POLLPRI}, {fd=24, events=POLLIN|POLLPRI}, {fd=25, events=POLLIN|POLLPRI}, {fd=26, events=POLLIN|POLLPRI}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=14, events=POLLIN}, {fd=29, events=POLLIN}, {fd=34, events=POLLIN}], 10, 0) = 0 (Timeout)
[pid 1895] read(14, 0x1492424, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 1895] read(14, 0x1492424, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 1895] read(27, 0x14fda94, 4096) = -1 EAGAIN (Resource temporarily unavailable)
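As an aside, if you want to quantify just how tight this loop is, strace’s “-c” (summary) option will tally up the syscall counts for you (1895 is the VirtualBox PID from the top output above):
$ # let this run for a few seconds, then press CTRL+C to print per-syscall counts
$ strace -c -f -p 1895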
At first this perplexed me, but upon further reflection it makes complete sense. The grub interpreter sits in a loop polling the keyboard I/O port for data, over and over again. Since most systems don’t stay at the grub prompt for extended periods of time, the grub developers didn’t bother using the HLT instruction to idle the CPU when no actual work is being performed. For some reason this made me extremely curious about microprocessor implementations, so I started reading through the AMD64 Architecture Programmer’s Manual volume 1. So far it has been a fantastic read, and I wish I had read through it years ago!
The Sun T5220 comes with a built-in RAID controller that supports hardware RAID 0 (striping) and RAID 1 (mirroring) volumes. Configuring one or more devices to participate in a RAID configuration is dead simple, since you can use the Solaris raidctl utility. The last T5220 I configured had a root file system that was going to reside on the built-in RAID controller, so I had to boot into single user mode to create my volume. To create a RAID1 volume using the devices c1t0d0 and c1t1d0 (you can get the device names via format or raidctl), you can run raidctl with the “-c” (create RAID volume) option and the names of the disks to mirror:
$ raidctl -c c1t0d0 c1t1d0
Creating RAID volume will destroy all data on spare space of member disks, proceed (yes/no)? yes
/pci@0/pci@0/pci@2/scsi@0 (mpt0):
Physical disk 0 created.
/pci@0/pci@0/pci@2/scsi@0 (mpt0):
Physical disk 1 created.
/pci@0/pci@0/pci@2/scsi@0 (mpt0):
Volume 0 created.
/pci@0/pci@0/pci@2/scsi@0 (mpt0):
Physical disk (target 1) is |out of sync||online|
/pci@0/pci@0/pci@2/scsi@0 (mpt0):
Volume 0 is |enabled||degraded|
/pci@0/pci@0/pci@2/scsi@0 (mpt0):
Volume 0 is |enabled||resyncing||degraded|
I also wanted to enable the write cache, which can be done using the raidctl “-p” (set property) option:
$ raidctl -p "wp=on" c1t0d0
Once I had a working RAID1 volume, I created a label on the device with format and proceeded to perform a Solaris 10 installation. After the volume synchronized and Solaris was re-installed, I was able to run raidctl with the “-l” option to display the state of the volume:
$ raidctl -l c1t0d0
Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk
----------------------------------------------------------------
c1t0d0                  136.6G  N/A     OPTIMAL  ON     RAID1
                0.0.0   136.6G          GOOD
                0.1.0   136.6G          GOOD
The raidctl utility is rather handy, and I created a checklsi script that can be run from cron to check the status of your RAID controllers (from some limited testing it appears FMA doesn’t detect disk faults).
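If you want to roll something similar yourself, the gist of it is a cron job that complains whenever raidctl reports anything other than OPTIMAL. Here is a minimal sketch along those lines (this is not the actual checklsi script; the volume name is the one from the example above, so adjust it to match your system):
#!/bin/sh
# Minimal sketch of a cron-driven RAID health check (not the actual checklsi
# script). Set VOLUME to the volume name reported by raidctl -l.
VOLUME=c1t0d0

STATUS=`raidctl -l $VOLUME 2>&1 | grep "^$VOLUME"`

# Anything printed here will be mailed to the cron job's owner.
echo "$STATUS" | grep OPTIMAL > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "RAID volume $VOLUME is not OPTIMAL:"
    echo "$STATUS"
fi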
While I was studying for the VCP4 certification, I watched a number of awesome educational vSphere videos from Mike Laverick. Mike did a killer job with the videos, and they clearly explain a number of vSphere technologies in detail! Thanks a ton Mike for making these available!!!
I passed the VMware Certified Professional 4 (VCP4) exam this past Monday. The exam was a bit more difficult than I expected, though I passed it with flying colors. If you are thinking about taking the exam, or are interested in learning more about vSphere, you will definitely want to start out by reading Scott Lowe’s Mastering VMware vSphere book. Scott did an excellent job putting the book together, and it’s concise and easy to follow (I’m curious how Mike Laverick’s vSphere implementation book compares to Scott’s).
After you read through Scott’s book, you should check out Simon Long’s study notes! Simon has links from the VCP4 blueprint to every piece of documentation you will need to pass the test, and you will learn a ton in the process! Now, there will come a morning or afternoon when you need to wander to the testing center to sign your life away and take the exam. Get there 30 minutes early, and look over the vreference card! This card contains a TON of ESX material in a single location, and will ensure that all of the information you studied is fresh in your mind. Best of luck to anyone taking the exam!
I ran into an issue last week where two nodes using shared storage lost the partition table on one of the storage devices they were accessing. This was extremely evident in the output from fdisk:
$ fdisk -l /dev/sdb
Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
Fortunately the system in question still had the /proc/partitions data intact, so I stashed that in a safe location and came up with a plan to re-create the partition table using this data. Given the following /proc/partitions data for the device listed above:
$ grep sdb /proc/partitions
8 16 104857600 sdb
8 17 99161148 sdb1
8 18 5695042 sdb2
I can see that the device had two partition table entries, and that the device was approximately 100GB in size. To re-create the partition table, I used the parted mkpart command, passing it the starting and ending sector numbers I wanted written to the partition table:
$ parted /dev/sdb
(parted) mkpart primary ext3 128s 198322424s
(parted) mkpart primary ext3 198322425s 209712509s
Now you may be asking yourself where I got the starting and ending sectors from. In my specific case I start partitions at sector 128 to ensure alignment with the storage arrays we use, and the ending sector was calculated by taking the partition size from /proc/partitions and multiplying it by 2 (the /proc/partitions sizes are in 1k blocks), then adding the 128 sector offset to that value.
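To make that arithmetic concrete, here is the sdb1 calculation spelled out with expr (this just reproduces the numbers above):
$ # sdb1 is 99161148 1k blocks, or 99161148 * 2 sectors
$ expr 99161148 \* 2
198322296
$ # add the 128 sector starting offset to get the ending sector
$ expr 128 + 198322296
198322424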
To get the values for sdb2, I added one to the ending sector produced above, and then multiplied the sdb2 size from /proc/partitions by two, which gives the following numbers:
198322424 + 1 = 198322425 start sector of sdb2
198322425 + 11390084 = 209712509 end sector of sdb2
Once I wrote out my changes, I verified that they were correct with sfdisk:
$ sfdisk -uS -l /dev/sdb
Disk /dev/sdb: 13054 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0
Device Boot Start End #sectors Id System
/dev/sdb1 128 198322424 198322297 83 Linux
/dev/sdb2 198322425 209712509 11390085 83 Linux
/dev/sdb3 0 - 0 0 Empty
/dev/sdb4 0 - 0 0 Empty
And verified that /proc/partitions matched up with what was previously there:
$ grep sdb /proc/partitions
8 16 104857600 sdb
8 17 99161148 sdb1
8 18 5695042 sdb2
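Since I had stashed a copy of /proc/partitions before touching anything, another easy sanity check is to diff the current contents against that saved copy (the path below is just an example of wherever you stashed it; no output means they match):
$ diff /var/tmp/partitions.before /proc/partitions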
Now that the partition table was restored, I used the fsck “-n” option to verify that the file systems were intact (make sure to read the fsck manual page before running it, as the “-n” option WILL modify some file systems!!):
$ fsck -n /dev/sdb1
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sdb1: clean, 11/12402688 files, 437281/24790287 blocks
$ fsck -n /dev/sdb2
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sdb2: clean, 11/712448 files, 57952/1423760 blocks
And then proceeded to mount them up:
$ mount /dev/sdb1 /a
$ mount /dev/sdb2 /b
$ df -h | grep sdb
/dev/sdb1 94G 188M 89G 1% /a
/dev/sdb2 5.4G 139M 5.0G 3% /b
I’m not a data recovery person / service (just a guy who reads a lot and understands the internals of how file systems and devices are structured), so use this information at your own risk!