Investigating "Mysterious" HotSpot Behavior

tech

java openjdk internals gc

Yo! I am preparing a new talk, to be first presented at the upcoming Joker Conference in Saint Petersburg, Russia. With it, I intend to make everyone capable of digging deep into HotSpot to get to the bottom of their issues.

This blog entry is a teaser which examines a sample problem encountered by Nikita of Plumbr.

What does GC cause _no_gc mean? and _last_ditch_collection too. #JVM /cc @shipilev ?

Last Ditch Collection

One of the most awesome things about HotSpot is that it is almost completely built from OpenJDK. So, we have virtually no restrictions in our investigation. Let’s fetch the latest version and do a grep:

$ grep -rn "_last_ditch_collection" .
...
./src/share/vm/gc_implementation/shared/vmGCOperations.cpp:286:        heap->collect_as_vm_thread(GCCause::_last_ditch_collection);
...

There we are! Reading the enclosing method gives us an understanding of what’s happening:

// Don't clear the soft refs yet.
heap->collect_as_vm_thread(GCCause::_metadata_GC_threshold);
// After a GC try to allocate without expanding.  Could fail
// and expansion will be tried below.
// If still failing, allow the Metaspace to expand.
// This should work unless there really is no more space
// or a MaxMetaspaceSize has been specified on the command line.
_result = _loader_data->metaspace_non_null()->expand_and_allocate(_size, _mdtype);
// If expansion failed, do a last-ditch collection and try allocating
// again.  A last-ditch collection will clear softrefs.
heap->collect_as_vm_thread(GCCause::_last_ditch_collection);

When HotSpot finds itself unable to allocate memory for something that should be stored in metaspace (e.g., a class), it first runs a GC (GCCause::_metadata_GC_threshold). If that doesn’t help, an attempt to expand metaspace is made. If that still doesn’t help, a full GC is triggered with reason GCCause::_last_ditch_collection. Soft references are cleaned up at this point. If there is still insufficient room, then an OOM is thrown.

Sounds good, but let us write some actual code that gets this behavior consistently. Leaking a lot of stuff to metaspace should, by this logic, result in at least one last ditch collection:

ClassPool pool = ClassPool.getDefault();
for(long l = 0; l < Long.MAX_VALUE; l++) {
    pool.makeClass("com.example.Kitty" + l).toClass();
}

Running this with default params will take quite a lot of time before shit hits the fan, but we can reduce the metaspace size to speed up the process:

$ java -cp build/libs/labs-8-all.jar -XX:+PrintGCDetails -XX:MaxMetaspaceSize=10m ru.gvsmirnov.perv.labs.gc.NoGcTrigger
...
[Full GC (Last ditch collection) [...] 14470K->14425K(32768K), [Metaspace: 8971K->8971K(1056768K)], 0.0481030 secs]
...
Exception in thread "main" javassist.CannotCompileException: by java.lang.OutOfMemoryError: Metaspace
	at javassist.ClassPool.toClass(ClassPool.java:1099)
	...

Note that usually, more than one last ditch collection is observed. It is expected, since… Well, otherwise there wouldn’t be much point doing it in the first place.

Also, we were using a Java 9 bundle, but the logic of last ditch collection has not changed for quite a long while, so we’re safe in that regard.

Another fun thing to note is that we did not really need to leak to metaspace all by ourselves. JVM does a good job of loading a lot of classes on its own, so if we just run java -XX:+PrintGCDetails -XX:MaxMetaspaceSize=2m X, we will get some last ditch collections and an OOM before JVM even notices that class X does not exist.

No GC

Now, “No GC” is a bit more tricky. Obviously, the HotSpot developers would not in their sane minds intentionally make this a valid GC cause. Actually, the only two reasonable places this could appear is if we run jstat before any GC has occurred:

$ jstat -gccause -t `jps | grep NoGc | cut -d' ' -f1` 100

Timestamp     S0     S1      E      O      P    YGC     YGCT   FGC     FGCT      GCT     LGCC     GCC
      0.7   0.00   0.00   4.00   0.00  40.34      0    0.000     0    0.000    0.000    No GC   No GC
      0.8   0.00   0.00   6.00   0.00  45.82      0    0.000     0    0.000    0.000    No GC   No GC
      0.9   0.00   0.00   6.00   0.00  51.44      0    0.000     0    0.000    0.000    No GC   No GC

Or when it samples at a time when there is no garbage collection activity:

Timestamp     S0     S1      E      O      P    YGC     YGCT   FGC     FGCT      GCT                 LGCC     GCC
      2.8   0.00   0.00   0.00   0.78  31.00      5    0.080     5    1.051    1.131   Allocation Failure   No GC

So, if Nikita saw this in a different situation, this is very likely a bug. Seeing as we did not observe this issue while investigating the last ditch collection, it is possible that the bug has been fixed already. Let’s be optimistic and look for all the relevant changesets:

$ hg grep --all '::_no_gc'
...
src/share/vm/gc_implementation/shared/vmGCOperations.cpp:2097:+:  assert(((_gc_cause != GCCause::_no_gc) &&
...

Apparently, the bug was fixed in changeset 2097 on February the 3rd in 2011. According to make/hotspot_version, the latest HotSpot build that is affected is 21.0-b01.

I had a suitable Java 6 installation at hand:

$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

Which immediately produced the following result:

Timestamp     S0     S1      E      O      P    YGC     YGCT   FGC     FGCT      GCT                 LGCC                     GCC
      2.7   0.00   0.00   0.00   0.78 100.00      5    0.080     5    1.051    1.131      unknown GCCause      Allocation Failure
      2.8   0.00   0.00   0.00   0.78 100.00      5    0.080     5    1.051    1.131      unknown GCCause      Allocation Failure

This is close, but not quite what we were looking for. However, upon a closer examination of the changeset, it will become apparent that setting -XX:+UseG1GC should give us the result we need:

Timestamp     S0     S1      E      O      P    YGC     YGCT    FGC    FGCT      GCT                  LGCC                        GCC
      1.6      ?      ?      ?      ?  96.78      ?        ?     ?        ?        ?   G1 Evacuation Pause                      No GC
      1.7      ?      ?      ?      ? 100.00      ?        ?     ?        ?        ?   No GC                Permanent Generation Full
      1.8      ?      ?      ?      ? 100.00      ?        ?     ?        ?        ?   No GC                Permanent Generation Full
      1.9      ?      ?      ?      ? 100.00      ?        ?     ?        ?        ?   No GC                Permanent Generation Full
      2.0      ?      ?      ?      ? 100.00      ?        ?     ?        ?        ?   No GC                Permanent Generation Full

Let the question marks and what gave G1 away remain an exercise for the reader ;)

Summary

We have just seen that, given just a short tweet to go on and using just the basic command line tools, we can identify a HotSpot bug hidden in the depths of time, put and upper bound on the JVM version that the issue has arisen at, and even learn what GC algorithm was in use at the time.

The bottom line is that confusion exists in our minds, not in reality.

Call For Action

This is all nice and cool, but I need more real life examples of puzzling HotSpot behavior. You are welcome to send them over to me by email, twitter or in the comments below. It is quite probable that I will investigate your issue, free of charge!


P.S. Thanks to Nikita for giving this sample and proof-reading the post! (Any mistakes are my own, though)

comments powered by Disqus

CC0 Freely use anything from this website in any way you can imagine