Measuring the time an application was stopped due to garbage collection


I recently spent some of my spare time assisting a friend with debugging some Java performance problems his company was experiencing. When I originally looked into the performance problem several weeks back, I used the mpstat and jstat utilities to observe CPU utilization and object allocations, and based on some jstat anomalies, I used the techniques described in my object allocation post to get a better understanding of how their Java application was allocating objects. After a bit of analysis and a couple of email exchanges with my friend and one of the developers he worked with, we were able to locate two application problems that the developer has since fixed.

But even with these changes (which resulted in some significant speedups!), my friend noticed that request times would periodically shoot up to unacceptable levels. After a bit more analysis with jstat, I noticed that the rate of object allocation in the new generation was still relatively high, and started to wonder if the current size of the new generation was limiting throughput. To see if this was the case, I had my friend add the “PrintGCApplicationConcurrentTime” and “PrintGCApplicationStoppedTime” options to the Java command line:

$ java -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime ...

These options will instruct the Java process to print the time an application is actually running, as well as the time the application was stopped due to GC events. Here are a few sample entries that were produced:
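The original command line above doesn't show how this output was captured, but assuming these messages are written to standard output (no -Xloggc destination was specified), one simple way to collect them into a file like gc.log is to redirect stdout:

$ java -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime ... > gc.log 2>&1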

$ egrep '(Application|Total time)' gc.log |more

Application time: 3.3319318 seconds
Total time for which application threads were stopped: 0.7876304 seconds
Application time: 2.1039898 seconds
Total time for which application threads were stopped: 0.4100732 seconds
.....

To get a better idea of how much time the application was running vs. stopped, I created a script (gctimes) to summarize the log data. Gctimes takes a GC log as input and prints the total execution time, the time the application ran, the time the application was stopped, as well as the percentage of time the application spent in the running and stopped states. Here is some sample output from a short run:

$ gctimes gc.log

Total execution time : 66.30secs
Time application ran : 55.47secs
Time application was stopped : 10.84secs
% of time application ran : 83.65%
% of time application was stopped : 16.35%
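
The gctimes script isn't reproduced here, but a minimal sketch of the same calculation can be done with awk (the field positions below assume the log format shown above, and the output formatting is approximate):

$ awk '
  /^Application time:/    { ran += $3 }          # running time entries
  /threads were stopped:/ { stopped += $(NF-1) } # stopped time entries
  END {
    total = ran + stopped
    printf "Total execution time              : %.2fsecs\n", total
    printf "Time application ran              : %.2fsecs\n", ran
    printf "Time application was stopped      : %.2fsecs\n", stopped
    printf "%% of time application ran         : %.2f%%\n", ran / total * 100
    printf "%% of time application was stopped : %.2f%%\n", stopped / total * 100
  }' gc.log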

Based on the results above, the fact that objects were “spilling” into the old generation, as well as an observation that the tenuring threshold for most objects was extremely low, it appeared that increasing the size of the new generation (they were using the default size) would help decrease the time the application was paused. I asked my friend to double the values of the “NewSize” and “MaxNewSize” runtime options (the size of each Java generation should be chosen carefully based on the results of empirical testing), and that appears to have fixed their latency problem. As I research the area of Java performance more and more, I am starting to realize that a myriad of factors can lead to poor performance. I hope to share some additional Java performance monitoring tools I have written in future posts.
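
For reference, the change amounted to setting the new generation size explicitly on the command line. The values below are purely illustrative (the right sizes depend on the overall heap size and the allocation behavior observed during testing), but the options look like this:

$ java -XX:NewSize=512m -XX:MaxNewSize=512m ...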

This article was posted by Matty on 2008-02-18 19:37:00 -0400