Getting alerts when Java processes crash

When bugs occur in the Java runtime environment, most administrators want to get notified so they can take corrective action. These actions can range from restarting a Java process, collecting postmortem data or calling in application support personnel to debug the situation further. The Java runtime has a number of useful options that can be used for this purpose. The first option is “-XX:OnOutOfMemoryError”, which allows a command to be run when the runtime environment incurs an out of memory condition. When this option is combined with the logger command line utility:

$ java -XX:OnOutOfMemoryError=”logger Java process %p encountered an OOM condition” …

Syslog entries similar to the following will be generated each time an OOM event occurs:

Jan 21 19:59:17 nevadadev root: [ID 702911 daemon.notice] Java process 19001 encountered an OOM condition

Another super useful option is “-XX:OnError”, which allows a command to be run when the runtime environment incurs a fatal error (i.e., a hard crash). When this option is combined with the logger utility:

$ java -XX:OnError=”logger -p Java process %p encountered a fatal condition” …

Syslog entries similar to the following will be generated when a fatal event occurs:

Jan 21 19:52:17 nevadadev root: [ID 702911 daemon.notice] Java process 19004 encountered a fatal condition

The options above allow you to run one or more commands when these errors are encountered, so you could chain together a postmortem debugging tool, a utility (logger or mail) to generate alerts, and a restarter script to start a new Java process (this assumes you aren’t using SMF). Nice!

Monitoring Java garbage collection with jstat

Java memory management revolves around the garbage collector, which is the entity responsible for traversing the heap and freeing space that is being taken up by unreferenced objects. Garbage collection makes life easier for Java programmers, since it frees them from having to explicitly manage memory resources (this isn’t 100% true, but close enough). In the Java runtime environment, there are two types of collections that can occur. The first type of collection is referred to as minor collection. Minor collections are responsible for locating live objects in the young generation (eden), copying these objects to the inactive survivor space, and moving tenured objects from the active survivor space to the old (tenured) generation (this assumes that a generational collector is being used). The second form of collection is the major collection. This type of collection frees unreferenced objects in in the tenured generation, and optionally compacts the heap to reduce fragmentation.

When debugging performance problems, it is extremely useful to be able to monitor object allocations and frees in the new and old generations. The Java development kit comes with the jstat utility, which provides a ton of visibility into what the garbage collector is doing, as well as a slew of information on how each generation is being utilized. To use jstat to display garbage collection statistics for the new, old and permanent generations, jstat can be invoked with the “-gc” (print garbage collection heap statistics) option, the “-t” (print the total number of seconds the JVM has been up) option, the process id to retrieve statistics from, and an optional interval to control how often statistics are printed:

$ jstat -gc -t `pgrep java` 5000

Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
        98772.0 1600.0 1600.0  0.0   1599.8 13184.0   5561.6   245760.0   201671.9  16384.0 6443.0 166683 2402.690 32411  110.564 2513.255
        98777.0 1600.0 1600.0 1599.4  0.0   13184.0   9533.7   245760.0   156797.1  16384.0 6443.0 166690 2402.785 32414  110.573 2513.359
        98782.0 1600.0 1600.0 1599.7  0.0   13184.0  10328.6   245760.0   166402.2  16384.0 6443.0 166698 2402.889 32416  110.580 2513.469
        98787.0 1600.0 1600.0  0.0   1599.9 13184.0   2383.5   245760.0   195366.0  16384.0 6443.0 166707 2403.016 32416  110.580 2513.595


The output above contains the size of each survivor space (S0C && S1C), the utilization of each survivor space (S0U && S1U), the capacity of eden (EC), the utilization of eden (EU), the capacity of the old generation (OC), the utilization of the old generation (OU), the permanent generation capacity (PC), the permanent generation utilization (PU), the total number of young generation garbage collection events (YGC), the total amount of time spent collecting objects in the new generation (YGCT), the total number of old generation garbage collection events that have occurred (FGC), the total amount of time spent collecting objects in the old generation (FGCT), and the total time spent performing garbage collection.

If you prefer to view garbage collection events as percentages, you can use the “-gcutil” option:

$ jstat -gcutil -t -h5 `pgrep java` 5000

Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT   
        99814.1   0.00  99.99  18.08  63.77  39.32 168551 2427.512 32800  111.800 2539.313
        99819.1  99.96   0.00  66.29  78.18  39.32 168562 2427.649 32800  111.800 2539.449
        99824.1 100.00   0.00  94.40  62.46  39.32 168572 2427.795 32803  111.815 2539.610
        99829.2 100.00   0.00  60.25  65.08  39.32 168580 2427.888 32806  111.824 2539.713



The output above contains the utilization of each survivor space as a percentage of the total survivor space capacity (S0 && S1), the utilization of eden as a percentage of the total eden capacity (E), the utilization of the tenured generation as a percentage of the total tenured generation capacity (O), the utilization of the permanent generation as a percentage of the total permanent generation capacity (P), the total number of young generation garbage collection events (YGC), the total time spent collection objects in the young generation (YGCT), the total number of of old generation garbage collection events (FGC), the total amount of time spent collecting objects in the old generation (FGCT), and the total garbage collection time.

To get the time spent in garbage collection along with the reason the collection occurred, jstat can be run with the “-gccause” option:

$ jstat -gccause -t `pgrep java` 1000

Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC                 
       100157.3  99.96   0.00  66.27  63.82  39.32 169160 2435.394 32925  112.202 2547.595 CMS Initial Mark     No GC               
       100158.3   0.00  99.99  32.14  67.72  39.32 169163 2435.430 32925  112.202 2547.631 unknown GCCause      No GC               
       100159.3   0.00  99.97  50.22  65.10  39.32 169165 2435.454 32927  112.208 2547.662 CMS Initial Mark     No GC               
       100160.3  99.97   0.00   6.02  62.46  39.32 169168 2435.493 32928  112.211 2547.704 unknown GCCause      No GC               
       100161.3  99.97   0.00  32.14  62.46  39.32 169168 2435.493 32928  112.211 2547.704 unknown GCCause      No GC      



There are also options to print class loader activity and hotspot compiler statistics, and to break down utilization by generation (this is extremely useful when your trying to profile a specific memory pool). There are a number of incredibly useful opensource tools for visualizing garbage collection data, and I hope to talk about these in the near future.

Zone update on attach functionality

If you’ve used the zone migration features (e.g., attach and detach) in Solaris, you may have bumped into issues when you tried to migrate a zone from one machine to another, and the servers didn’t have the same set of patches or packages installed. With Jerry’s putback of PSARC 2007/621 into opensolaris, this should be a thing of the past. Here are the bugs that were addressed by PSARC 2007/621:

PSARC 2007/621 zone update on attach
6480464 RFE: zoneadm attach should patch/update the zone to the new hosts level
6576592 RFE: zoneadm detach/attach should work between sun4u and sun4v architecture
6637869 zone attach doesn’t handle obsolete patches correctly

Thanks Jerry for this super super useful feature!

Sun tech days recap

This week I was fortunate to attend Sun tech days. I had a great time attending the technical sessions, and discussing a wide variety of technology topics with other admins and Java developers. Some of the highlights from tech days included a 4-hour training session on web 2.0 technologies (AJAX, REST, CSS, MAKI, etc.), a session on the AMD Opteron processor, an interesting talk on new features in Netbeans 6, and a thought provoking session on using the Netbeans profiling tools to gain greater visibility into how Java applications are using memory and CPU resources (I came out of this session with a number of new ideas on how to profile applications, which I will share with others once I have working code). I was also able to attend a Q&A with James Gosling, but was disappointed that the presenter didn’t ask some of the questions I had submitted.

But my favorite part of tech days had to be the opensolaris users group meeting, which included a presentation by Ian Murdock. Ian is a recent addition to the opensolaris community, and his thoughts on how to make Solaris and opensolaris more accessible to the common user are spot on! Ian has a bumpy road ahead of him to achieve his vision, and to get the opensolaris community to work together towards a common goal. Hopefully people will take the time to listen to Ian before judging him. Not only does he have the virtues (e.g., proven leadership, good listener, open minded, etc.) required to lead the monumental task of restructuring the opensolaris community for the better, but he has a number of awesome ideas for expanding it and making it more accessible to folks who have written off opensolaris. Welcome to the opensolaris community Ian!

Jumpstarting VMWare fusion clients

I updated my jumpstart server last weekend, and wanted to test out the new bits I added. All of the systems I use for testing were tied up, so I decided to jumpstart a VMWare fusion client. By default, VMWare fusion will create and present an Intel pro series adapter to the guest. There appears to be a bug in VMWare fusion that slows PXE booting on clients that use this adapter to a craaaaawl. For a reason yet undetermined, switching to the vlance driver seems to work around whatever issue is present in VMWare fusion. To switch drivers, you can edit your guest’s VMX file and replace the existing Ethernet adapter settings with the following:

ethernet0 = “vlance”

After the adapter setting is updated, things should be significantly faster (at least they were for me!).