I hope everyone got a chance to celebrate the International Day of Awesomeness yesterday! I was fortunate enough to celebrate it with a friend, and had a blast doing so! Awesome!

Posted by matty, filed under Uncategorized. Date: March 11, 2008, 5:01 pm | No Comments »

If you are running ZFS in production, you may have experienced a situation where your server panicked and rebooted when a ZFS file system was corrupted. With George Wilson’s recent putback of CR #6322646, this is no longer the case. George’s putback allows the file system administrator to set the “failmode” property to control what happens when a pool incurs a fault. Here is a description of the new property from the zpool(1m) manual page:

failmode=wait | continue | panic

    Controls the system behavior  in  the  event  of  catas-
    trophic  pool  failure.  This  condition  is typically a
    result of a  loss  of  connectivity  to  the  underlying
    storage device(s) or a failure of all devices within the
    pool. The behavior of such an  event  is  determined  as
    follows:

    wait        Blocks all I/O access until the device  con-
                nectivity  is  recovered  and the errors are
                cleared. This is the default behavior.

    continue    Returns EIO to any new  write  I/O  requests
                but  allows  reads  to  any of the remaining
                healthy devices.  Any  write  requests  that
                have  yet  to  be committed to disk would be
                blocked.

    panic       Prints out a message to the console and gen-
                erates a system crash dump.



To see just how well this feature worked, I decided to test out the new failmode property. To begin my tests, I created a new ZFS pool from two files:

$ cd / && mkfile 1g file1 file2

$ zpool create p1 /file1 /file2

$ zpool status

  pool: p1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        p1          ONLINE       0     0     0
          /file1    ONLINE       0     0     0
          /file2    ONLINE       0     0     0



After the pool was created, I checked the failmode property:

$ zpool get failmode p1

NAME  PROPERTY  VALUE     SOURCE
p1    failmode  wait      default
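
The property is changed with "zpool set". Here is a minimal sketch of a wrapper that checks the value against the three modes listed in the manual page before calling zpool. The set_failmode function and the ZPOOL variable are illustrative names of my own, not part of ZFS, and by default the sketch only prints the command it would run:

```shell
# Hypothetical helper: validate a failmode value, then run (or just print)
# the corresponding "zpool set" command.
# ZPOOL defaults to "echo zpool", so this is a dry run; export
# ZPOOL=/usr/sbin/zpool to actually change the property.
set_failmode() {
    pool=$1
    mode=$2
    case "$mode" in
        wait|continue|panic) ;;
        *) echo "invalid failmode: $mode" >&2; return 1 ;;
    esac
    ${ZPOOL:-echo zpool} set "failmode=$mode" "$pool"
}

set_failmode p1 continue
```

Running "zpool get failmode p1" again after a real (non dry-run) change should show the new value with a SOURCE of local instead of default.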



I then began writing zeros over one of the files to see what would happen:

$ dd if=/dev/zero of=/file1 bs=512 count=1024

$ zpool scrub p1

I was overjoyed to find that the box was still running, even though the pool showed up as faulted:

$ zpool status

  pool: p1
 state: FAULTED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Tue Feb 19 13:57:41 2008
config:

        NAME        STATE     READ WRITE CKSUM
        p1          FAULTED      0     0     0  insufficient replicas
          /file1    UNAVAIL      0     0     0  corrupted data
          /file2    ONLINE       0     0     0

errors: No known data errors



But my joy didn’t last long, since the box became unresponsive after a few minutes, and panicked with the following string:

Feb 19 13:57:47 nevadadev genunix: [ID 603766 kern.notice] assertion failed: vdev_config_sync(rvd->vdev_child, rvd->vdev_children, txg) == 0 (0x5 == 0x0), file: ../../common/fs/zfs/spa.c, line: 4130
Feb 19 13:57:47 nevadadev unix: [ID 100000 kern.notice]
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feab30 genunix:assfail3+b9 ()
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feabd0 zfs:spa_sync+5d2 ()
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feac60 zfs:txg_sync_thread+19a ()
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feac70 unix:thread_start+8 ()



Since the manual page states that the failmode property “controls the system behavior in the event of catastrophic pool failure,” it appears the box should have stayed up and operational when the pool became unusable. I filed a bug on the OpenSolaris website, so hopefully the ZFS team will get this issue addressed in the future.

Posted by matty, filed under Solaris ZFS. Date: March 1, 2008, 12:05 pm | 2 Comments »

I use jumpstart at home to update the hosts in my lab as new Nevada builds and Solaris updates are released. As part of the unattended installation / upgrade process, I create a couple of ZFS file systems on each system. Since jumpstart doesn’t have built-in support for creating ZFS file systems, I had to add the zpool and zfs commands to my finish script. After a bit of tinkering around, here is what I came up with:

# Locate the first local device on the system (entry "0." in the format disk list)
DISK1=`echo quit | /usr/sbin/format 2>/dev/null | /usr/bin/awk '$1 == "0." { print $2 }'`

# Create a ZFS pool using the disk device above
/usr/sbin/zpool create -f -R /a p0 ${DISK1}

# Create ZFS file systems
/usr/sbin/zfs create p0/home
/usr/sbin/zfs set mountpoint=/home p0/home
.....

This appears to work pretty well, and my boxes are now built and operational once the jumpstart process completes. Niiiiiiiice!
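
The disk lookup is the fragile part of a script like this, so it is worth checking the filter against canned output before trusting it in a finish script. The sketch below assumes the usual format(1M) disk list layout (an index such as "0." followed by the device name, with the device path on the next line). The first_disk function name and the sample disks are made up for illustration:

```shell
# Print the device name of entry "0." from format(1M)-style output on stdin.
# Comparing the first field avoids matching device-path lines that merely
# contain a "0" somewhere.
first_disk() {
    awk '$1 == "0." { print $2; exit }'
}

# Canned sample standing in for "echo quit | /usr/sbin/format" output.
printf '%s\n' \
    'AVAILABLE DISK SELECTIONS:' \
    '       0. c0t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>' \
    '          /pci@1f,0/pci@1/scsi@8/sd@0,0' \
    '       1. c0t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>' \
    '          /pci@1f,0/pci@1/scsi@8/sd@1,0' | first_disk
```

On a live system the same function would be fed by the format command instead of printf.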

Posted by matty, filed under Solaris ZFS. Date: February 25, 2008, 12:35 am | 3 Comments »

If you are an avid Solaris 10 user, you may have noticed that it takes a bit of time for global and non-global zones to initialize after they are installed. One of the issues that slows down the initialization process is the initial manifest import, which is a series of steps that takes place to populate the repository with one or more service manifests. The enhanced profiles project is currently tasked with permanently addressing this issue, but Steve Peng just putback CR #6351623 (Initial manifest-import is slow) to provide some temporary relief. This is good stuff, and I can’t wait to get my hands on Nevada build 84 to see just how well his tmpfs solution works! If anyone has tested Steve’s solution, or copied over their own repository during system initialization, please leave a comment or shoot me an email. I am curious to see just how speedy these solutions are!!

Posted by matty, filed under Solaris SMF. Date: February 20, 2008, 2:14 am | 2 Comments »

One of my friends recently purchased an iRobot Roomba, and he let me test it out while he was out of town. I thoroughly tested out my friend’s Roomba, and was amazed that it was able to do just as good a job as my existing vacuum at cleaning. The Roomba also has a key advantage over my upright vacuum cleaner in that it could wander under couches, beds and dressers to get dust and debris that had made its way there. I was also overjoyed when I found out that once I pushed the CLEAN button on the Roomba, it would begin vacuuming the room with no manual intervention.

These things made me realize that a Roomba was in my future, and I ordered one once my friend returned. My Roomba is now operational, and I have it programmed to vacuum my carpets every other day. The price tag for the unit was a bit high, but the following benefits quickly made me realize that this was the right thing for me:

  • Vacuuming an entire room is one button push away
  • Cleaning the Roomba is extremely easy
  • I have seen a dramatic decrease in the amount of dust
  • It can get under couches, beds and dressers that normal vacuums can’t
  • The anti-tangle technology works extremely well

Having now owned my Roomba for two months, I have only found two downsides. First, the replacement parts (brushes, filters, etc.) are not exactly cheap. Second, the time it takes to get them is somewhat lengthy. But even when I factor in the upkeep costs, this has to be one of the best purchases I have EVER made! Long live the Roomba (and Chris, you sparked my interest in the Scooba)!

Posted by matty, filed under Rants. Date: February 18, 2008, 8:09 pm | 2 Comments »

I recently spent some of my spare time assisting a friend with debugging some Java performance problems his company was experiencing. When I originally looked into the performance problem several weeks back, I used the mpstat and jstat utilities to observe CPU utilization and object allocations, and based on some jstat anomalies, I used the techniques described in my object allocation post to get a better understanding of how their Java application was allocating objects. After a bit of analysis and a couple of email exchanges with my friend and one of the developers he worked with, we were able to locate two application problems that the developer has since fixed.

But even with these changes (which resulted in some significant speedups!!), my friend noticed that request times would periodically shoot up to unacceptable levels. After a bit more analysis with jstat, I noticed that the rate of object allocation in the new generation was still relatively high, and started to wonder if the current size of the new generation was limiting throughput. To see if this was the case, I had my friend add the “PrintGCApplicationConcurrentTime” and “PrintGCApplicationStoppedTime” options to the Java command line:

$ java -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime …

These options will instruct the Java process to print the time an application is actually running, as well as the time the application was stopped due to GC events. Here are a few sample entries that were produced:

$ egrep '(Application|Total time)' gc.log |more
Application time: 3.3319318 seconds
Total time for which application threads were stopped: 0.7876304 seconds
Application time: 2.1039898 seconds
Total time for which application threads were stopped: 0.4100732 seconds
…..

To get a better idea of how much time the application was running vs. stopped, I created a script (gctimes) to summarize the log data. Gctimes takes a GC log as input, and prints the total execution time, the time the application ran, the time the application was stopped, as well as the percentage of time the application spent in the running and stopped states. Here is some sample output from a short run:

$ gctimes gc.log

Total execution time               : 66.30secs
Time application ran               : 55.47secs
Time application was stopped       : 10.84secs
% of time application ran          : 83.65%
% of time application was stopped  : 16.35%
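
A summary like the one above can be approximated with a few lines of awk. This is a rough sketch rather than the actual script: the gc_summary function name is made up for illustration, and it assumes the log lines look exactly like the "Application time" and "Total time for which application threads were stopped" entries shown earlier:

```shell
# Summarize application run time vs. GC stop time from a GC log produced with
# -XX:+PrintGCApplicationConcurrentTime and -XX:+PrintGCApplicationStoppedTime.
gc_summary() {
    awk '
        # "Application time: 3.3319318 seconds" -> seconds are in field 3
        /^Application time:/ { ran += $3 }
        # "Total time for which application threads were stopped: 0.78 seconds"
        # -> seconds are in field 9
        /^Total time for which application threads were stopped:/ { stopped += $9 }
        END {
            total = ran + stopped
            if (total == 0) { print "no timing entries found"; exit 1 }
            printf "Total execution time               : %.2fsecs\n", total
            printf "Time application ran               : %.2fsecs\n", ran
            printf "Time application was stopped       : %.2fsecs\n", stopped
            printf "%% of time application ran          : %.2f%%\n", ran / total * 100
            printf "%% of time application was stopped  : %.2f%%\n", stopped / total * 100
        }
    ' "$1"
}
```

It would be invoked as "gc_summary gc.log". Summing both counters in a single pass keeps it usable on large logs.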



Based on the results above, the fact that objects were “spilling” into the old generation, and an observation that the tenuring threshold for most objects was extremely low, it appeared that increasing the size of the new generation (they were using the default size) would help decrease the time the application was paused. I asked my friend to double the values of the “NewSize” and “MaxNewSize” runtime options (the size of each Java generation should be chosen carefully based on empirical testing), and that appears to have fixed their latency problem. As I research the area of Java performance more and more, I am starting to realize that a myriad of factors can lead to poor performance. I hope to share some additional Java performance monitoring tools I have written in future posts.
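
The doubling step can be sketched as a small helper that builds the two options from a starting size. The new_gen_opts name is made up, and the 64 MB starting point is only a placeholder, since the sizes actually in use are not given here:

```shell
# Hypothetical helper: given the current new generation size in megabytes,
# print NewSize/MaxNewSize options for twice that size.
new_gen_opts() {
    current_mb=$1
    doubled_mb=$((current_mb * 2))
    echo "-XX:NewSize=${doubled_mb}m -XX:MaxNewSize=${doubled_mb}m"
}

# 64 is a placeholder value, not a measured or recommended size.
echo java $(new_gen_opts 64) ...
```

Setting both options to the same value pins the new generation at a fixed size, which makes before and after comparisons of the stopped-time numbers easier to interpret.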

Posted by matty, filed under Java. Date: February 18, 2008, 7:37 pm | 1 Comment »

Most Linux and BSD distributions ship with the locate utility, which allows you to quickly find files on a system:

$ locate pvcreate
/usr/sbin/pvcreate
/usr/share/man/man8/pvcreate.8.gz

While not quite as thorough as locate, the Solaris pkgchk utility has a “-P” option that provides similar capabilities:

$ pkgchk -l -P metastat | grep Pathname
Pathname: /sbin/metastat
Pathname: /usr/sbin/metastat
Pathname: /usr/share/man/man1m/metastat.1m

Nice!
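
The full "pkgchk -l" listing also names the package that delivered each file, so a little awk can turn it into a "which package owns this path" lookup. The sketch below assumes each entry contains a "Pathname:" line and a "Referenced by the following packages:" line followed by indented package names. The path_owners function name is made up for illustration:

```shell
# Read "pkgchk -l" style output on stdin and print "<pathname> <package>" pairs.
path_owners() {
    awk '
        /^Pathname:/ { path = $2; inpkgs = 0; next }
        /^Referenced by the following packages:/ { inpkgs = 1; next }
        # Any other non-indented line ends the package list.
        /^[^ \t]/ { inpkgs = 0; next }
        inpkgs && NF { for (i = 1; i <= NF; i++) print path, $i }
    '
}
```

It would be used as "pkgchk -l -P metastat | path_owners".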

Posted by matty, filed under Solaris Patching. Date: February 18, 2008, 7:20 pm | 2 Comments »

I have read a number of documents on correctly using CSS and XHTML over the past month, and have learned about a number of common mistakes people make when creating content that uses these technologies. Most of the articles discussed ways to structure web content to avoid these pitfalls, which got me wondering if anyone had taken these recommendations and created a tool to analyze content for errors. After a bit of googling, I came across the W3C content validation site, as well as the tidy utility.

The W3C website is super easy to use, and it provides extremely useful feedback that you can use to improve your content. The tidy utility provides similar capabilities, but has options to actually correct errors it finds in the files it analyzes. Tidy can be downloaded from sourceforge, or installed with your favorite package utility (the CentOS repositories contain tidy, so it’s a yum install away). Once tidy is installed, you can pass the name of one or more files to analyze as arguments:

$ tidy -indent index.html
line 8 column 1 - Warning: <link> isn't allowed in <html> elements
line 3 column 1 - Info: <html> previously mentioned
line 74 column 28 - Warning: unescaped & which should be written as &amp;
line 74 column 29 - Warning: unescaped & which should be written as &amp;
line 191 column 15 - Warning: discarding unexpected </h2>
line 181 column 9 - Warning: <a> escaping malformed URI reference
Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
Info: Document content looks like XHTML 1.0 Transitional
5 warnings, 0 errors were found!

<HTML FILE CONTENTS WITH FIXES APPLIED>

The tidy output will contain the list of errors it detected as well as the corrected HTML code. This is amazingly cool, and it has tipped me off to a few issues with some of the XHTML files that I am using to support my website. Tidy and the W3C validation site are incredibly useful tools, and they will hopefully improve the experience for individuals who access validated content.

Posted by matty, filed under Linux Utilities. Date: February 16, 2008, 4:23 pm | No Comments »
