Performance

A close look at Java’s JIT: Don’t Waste Your Time on Local Optimizations

At this year’s JAX I attended a talk by Charles Nutter. He explained some of the optimizations Java’s just-in-time compiler (JIT) performs. After hearing the talk, I began to experiment with the JIT myself, running some benchmarks and playing with the VM’s parameters.

Before explaining what the JIT does, it is interesting to do a little experiment. Consider the following class:

public class PerformanceTest2 {
    public static void main(String[] args) {
        for (int outer=1;outer<=100;outer++)
        {
            long start = System.nanoTime();
            testPerformance();
            long duration = System.nanoTime()-start;
            System.out.println("Loop # " + outer + " took " + ((duration)/1000.0d) + " µs");
        }
    }

    private static void testPerformance() {
        long sum = 0;
        for (int i = 1; i <= 5000; i++)
        {
            sum = sum + random(i);
        }
    }

    private static int random(int i) {
        int x = (int)(i*2.3d/2.7d); // This is a simulation
        int y = (int)(i*2.36d);     // of time-consuming
        return x%y;                 // business logic.
    }
}

This class consists of two nested loops and measures the time each iteration of the outer loop takes. The result is an impressive demonstration of the JVM’s optimizations.

Our particular example shows the JIT’s behavior very clearly. In most cases the first two iterations are comparatively slow. Beginning with the third iteration, there is a tremendous speedup. Up to the tenth repetition (or, in some runs, the ten-thousandth repetition) each repetition is faster than the previous one. After that, things don’t change much. Each iteration is up to 50 times as fast as the first iteration [1]. Nonetheless, the JIT never stops trying to improve the code. After playing around a lot with the benchmark code and the JVM parameters [2], I was able to generate a smoother chart showing the ongoing optimization.
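If you want to watch the JIT make these decisions yourself, HotSpot can log them. A minimal invocation, assuming a HotSpot-based JVM (the exact output format varies between JVM versions):

java -XX:+PrintCompilation PerformanceTest2

Each line of output names a method at the moment it is compiled; on-stack-replacement compilations (explained below) are marked with a % sign.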

Let’s optimize the Java code manually. An optimization that comes to mind immediately is inlining: when you call a method, the parameters have to be put on the stack (to name just one extra operation), so it seems evident that getting rid of the method call improves performance. The optimized code might look like this:

public class PerformanceTest3 {
    public static void main(String[] args) {
        for (int outer=1;outer<=100;outer++)
        {
            long start = System.nanoTime();
            long sum = 0;
            for (int i = 1; i <= 5000; i++)
            {
                int x = (int)(i*2.3d/2.7d); // This is a simulation
                int y = (int)(i*2.36d);     // of time-consuming
                sum = sum + x%y;            // business logic.
            }
            long duration = System.nanoTime()-start;
            System.out.println("Loop # " + outer + " took " + ((duration)/1000.0d) + " µs");
        }
    }
}

What about the performance charts?

[Charts: optimized version (left) vs. naive version (right)]

Well, this is not exactly what I hoped to see when I prepared the benchmarks. I wasn’t aware that Java’s On-Stack Replacement (OSR) mechanism, explained below, works that well. I expected the left-hand chart to be all blue. Nonetheless, there are enough differences between the two charts to illustrate the point:

  • In the long run, there is hardly any performance difference between the naive and the optimized version; in many cases, the optimized version even performs worse. Just compare the size of the blue areas.
  • The optimized version needs more time to speed up than the naive version.
  • During the first few iterations, the optimized version is indeed faster than the naive version.
  • There are at least two major speedups in the right-hand chart. The left-hand chart stays at an almost constant performance level until the algorithm suddenly gets 10 to 20 times faster.

The reason for the mediocre success of our optimization is that the virtual machine performs the same steps we did. By default, the JIT optimizes method calls:

  • If a method is called frequently and is short enough, it is inlined.
  • If a method is called about 10,000 times (the default compile threshold), it is compiled to machine code. Subsequent calls execute the compiled version instead of the interpreted one.
  • If there is no method call, but the program spends a lot of time inside a method (typically in a long-running loop), the method is compiled anyway. The running program is stopped, the local variables are transferred to the machine-code version of the method, and execution resumes there. This is a difficult operation called On-Stack Replacement (OSR). Usually, the JIT tries every other optimization before resorting to OSR. You can observe these decisions yourself, as sketched below.
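You can ask HotSpot to justify its inlining choices. A minimal sketch, assuming a HotSpot JVM; -XX:+PrintInlining is a diagnostic flag, so it has to be unlocked first:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining PerformanceTest2

The output lists each call site together with the reason why it was or wasn’t inlined.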

My conclusion is not to waste time on local optimizations. In most cases, we can’t beat the JIT. A sensible (yet counter-intuitive) optimization is the opposite: split long methods into several short methods that the JIT can optimize individually, as sketched below.
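To illustrate the idea, here is a minimal sketch; the class and its methods are made up for this example and are not part of the benchmark:

public class SplitExample {
    public static void main(String[] args) {
        System.out.println(checksum(new int[] { 1, 2, 3, 4, 5 }));
    }

    // Instead of one long method doing everything, each step lives in its
    // own short method. Short methods stay below the JIT's inlining
    // thresholds and can be compiled and optimized individually.
    private static long checksum(int[] data) {
        validate(data);
        return sum(data);
    }

    private static void validate(int[] data) {
        if (data == null || data.length == 0) {
            throw new IllegalArgumentException("data must not be empty");
        }
    }

    private static long sum(int[] data) {
        long result = 0;
        for (int value : data) {
            result += value;
        }
        return result;
    }
}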

Instead, focus on global optimizations. In most cases, they have much more impact. You can rewrite our example like this:

public class PerformanceTest4 {
    public static void main(String[] args) {
        for (int outer = 1; outer <= 100; outer++) {
            long start = System.nanoTime();
            long sum = 10647704;
            long duration = System.nanoTime() - start;
            System.out.println("Loop # " + outer + " took "    + ((duration) / 1000.0d) + " µs");
        }
    }
}

The program delivers exactly the same result, but the iterations are so fast I can hardly measure them on my PC. Each iteration takes less than 0.3 µs; that is, it is at least 20 times as fast as the original JIT-optimized version. And the JIT didn’t even have to compile anything!
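In case you wonder where the magic number comes from: the inner loop depends only on constants, so its result can be computed once, ahead of time. A small helper (my own addition, not part of the original benchmark) that performs the same arithmetic and prints the constant used in PerformanceTest4:

public class PrecomputeSum {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 1; i <= 5000; i++) {
            int x = (int) (i * 2.3d / 2.7d); // same "business logic"
            int y = (int) (i * 2.36d);       // as in PerformanceTest2
            sum = sum + x % y;
        }
        System.out.println(sum); // prints the precomputed constant
    }
}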


Further reading:

  • Charles Nutter uploaded his slides here.
  • You can watch a video of another nice explanation of the JIT here if you’ve got a spare hour (57 minutes, to be precise).

  1. Notice that even the first iteration is usually twice as fast as in purely interpreted mode. You can verify this by running the benchmark with the VM parameter -Xint, which disables the JIT entirely.
  2. -XX:CompileThreshold=120000 instead of the default 10000, forcing the VM to wait twelve times as long as usual before it starts compiling.

7 thoughts on “A close look at Java’s JIT: Don’t Waste Your Time on Local Optimizations”

  1. I guess it is important to first know some basic optimization techniques before performing local optimizations. Knowing what Java can do for you lets you decide when to go ahead.

    Also, all the benchmarks are based on System.nanoTime(), which cannot be blindly trusted (http://shipilev.net/blog/2014/nanotrusting-nanotime/).

    Also, the inlined example can easily be optimized further using loop-invariant code motion. Then you can apply loop unrolling, because the loop bounds are constant (https://en.wikipedia.org/wiki/Loop_optimization).

    Even more: what is done with the value of the sum variable? The whole ‘for (int i = 1; i <= 5000; i++)’ loop can be marked as dead code.

    1. You’re right. In particular, I highly recommend the article “Nanotrusting the Nanotime” you linked to. That article is just great.

      Actually, I don’t remember if I checked whether the JIT eliminates the dead code. I don’t think the JIT is that clever. Be that as it may, even if the JIT eliminates the loop, the idea of the article still holds: don’t waste your time on local optimizations in Java, unless you’re really desperate. The JIT is pretty proficient at low-level optimizations, and it performs them only when needed. It doesn’t pay to optimize code that’s rarely called.

      What does pay off is high-level optimization. Eliminating the for loop is a classic example of such a high-level optimization. That’s something we humans are really good at, and usually high-level optimizations speed up a program much more than low-level optimizations.

      1. The JIT does eliminate dead code. So basically, in your PerformanceTest4 you are measuring the time taken to call System.nanoTime() twice 😉

        Now, regarding your comment on whether the JIT is clever or not: it may be worth reading a bit more about aggressive or speculative optimisations. It’ll surprise you for sure 🙂 Have fun!

        1. Thanks for correcting me. In the meantime, I’ve started to investigate what the JITs of the Java JVM and the Google V8 engine do. The idea is to give a talk or two at a conference. If time allows, I’ll write an article about what I learned. Stay tuned!

  2. Have you considered that the JIT compiler decided that the call to testPerformance() has no side effects and no return value, so the call and the work done inside it eventually got eliminated, and you end up with an empty loop?

    I would have testPerformance() return the sum; then you can add up the sums and print the total at the end, so the computation cannot be eliminated.
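    A sketch of that change (untested):

    private static long testPerformance() {
        long sum = 0;
        for (int i = 1; i <= 5000; i++) {
            sum = sum + random(i);
        }
        return sum; // the result is now used, so the loop is not dead code
    }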

    How can you convince me that the highly optimized JIT version didn’t simply reduce the looped computation to a no-op?

    1. I can’t. Your question points in the right direction. Modern JVMs are sophisticated beasts. They do escape analysis, so it’s perfectly reasonable to assume they eliminate code without side effects (or rather: code without any effect).

      There’s a simple way to answer the question: have a look at the assembly code generated by the JIT compiler.
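      With HotSpot, that’s something along these lines (assuming the hsdis disassembler library is installed, because -XX:+PrintAssembly depends on it):

      java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PerformanceTest2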

      In any case, your theory sort of supports the general moral of my article: don’t waste your time on local optimizations, because the JIT compiler already does them. Although I have to admit that eliminating code without any effect is one of the bigger local optimizations :).

      1. You’ve piqued my curiosity. You’re right: if the method is short enough, the JVM recognizes that it doesn’t have any side effects, so the method is removed completely. You can observe this by removing the division from random(). The measured times drop almost to zero after the first few iterations. I’ve also checked the assembly code: after the final optimization, the remaining machine code is very short.

        In our case, I’ve put enough code into the random() method to prevent that. But the method is inlined, and the loop is unrolled, so testPerformance() and random() are translated to roughly 300 lines of assembly code. This code is remarkably complicated, so I don’t understand all of it, but it’s easy to spot the division. There are four consecutive, almost identical repetitions of the body of the inner loop:

        0x00000001103b8d75: vmulsd -0x2f5(%rip),%xmm0,%xmm0        # 0x00000001103b8a88
                                                                   ;   {section_word}
        0x00000001103b8d7d: vdivsd -0x2f5(%rip),%xmm0,%xmm0        # 0x00000001103b8a90
                                                                   ;   {section_word}
        0x00000001103b8d85: vcvttsd2si %xmm0,%ebp
        0x00000001103b8d89: cmp    $0x80000000,%ebp
        0x00000001103b8d8f: jne    0x00000001103b8da0

        So I’m almost sure that the benchmark is complicated enough to prevent your scenario, at least in Java 8u144.
