Concepts of programming languagesJIT

A Child’s Garden of Cache Effects

One of these days Joshua Bloch twittered a link pointing to “a child’s garden of cache effects”. Igor Ostrovsky demonstrates the effect of CPU caches using a high-level programming language, such as C#. Highly recommended!

When I read his article, I wondered if we can demonstrate the same effects in Java. At first glance, the odds are against us: Java isn’t compiled to native machine code, but to an intermediate byte code, which in turn may or may not be compiled to native machine code (depending on the circumstances). So, can we show cache effects in a Java program?

Yes, we can!

Beware of local optimizations

Before continuing, I’d like to warn you about local optimizations. Articles like this one show you how to optimize your program at a very elementary level. That’s interesting, and instructive, too, but most of the time it’s a waste of time to apply the knowledge to real-world applications. For one, real-world applications usually can be optimized much better by global optimizations. Most of the time you can replace the algorithm by a better algorithm. Second, local optimizations are the kind of optimization that can be done automatically be compilers. Let them do it. Finally, the next version of Java may change the rules of the game, rendering your optimization useless, sometimes even harmful.

However, this article gives you a better understanding of how Java runs on your processor.

A simple question

Igor Ostrovsky starts his article with a simple question: which of the following methods runs faster?

public class CacheLines {
  private static int[] array = new int[64 * 1024 * 10];

  private static void loop1() {
    int length = array.length;
    for (int i = 0; i < length; i=i+1)
      array[i] --;

  private static void loop2() {
    int length = array.length;
    for (int i = 0; i < length; i += 2)
      array[i] --;

To my surprise, the first method is the fast one. It does twice as much work, but it’s faster nonetheless.

I also implemented method with greater increment steps, all the way until loop128 which increments by 128. Here’s the result of my desktop PC:

Step   1/10000 took 1250.627 ms  that's  100% of the expected time
Step   2/10000 took 1668.458 ms  that's  266% of the expected time
Step   4/10000 took 1159.842 ms  that's  370% of the expected time
Step   8/10000 took 1039.095 ms  that's  664% of the expected time
Step  16/10000 took 1022.674 ms  that's 1308% of the expected time
Step  32/10000 took  530.95  ms  that's 1358% of the expected time
Step  64/10000 took  272.518 ms  that's 1394% of the expected time
Step 128/10000 took  138.968 ms  that's 1422% of the expected time
Step 256/10000 took  100.921 ms  that's 2065% of the expected time

Until step 16, the program runs more or less as fast as the step 1 version. That’s the effect of cache lines. I recommend reading Igor’s explanation – I couldn’t explain it better. Instead, I’d like to focus on the peculiarities of the JVM.

Inspecting the assembler code

One of the more interesting – and fairly unknown – features of the JVM is you can watch the optimizer at work, all the way down to machine code. Adding the JVM options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly rewards you with the source code of the method:

  0x00000000027f0f69: dec     dword ptr [r10+r13*4+10h]
                                                ; - CacheLines::loop1@18 (line 199)
  0x00000000027f0f6e: inc     r13d              ;*iinc
                                                ; - CacheLines::loop1@19 (line 198)
  0x00000000027f0f71: cmp     r13d,r9d
  0x00000000027f0f74: jl      27f0f69h          ;*if_icmplt
                                                ; - CacheLines::loop1@24 (line 198)

Truth to tell, the compiler emits a lot more code – 72 instructions in total (as of JDK 1.7.0 U55). There’s code to synchronize multiple threads, code to deoptimize the machine code if the assumptions of the JVM prove to be wrong and probably much more. However, these four instructions are all that counts. They are the for loop:

    for (int i = 0; i < length; i += 2)
      array[i] --;

Let’s analyze the code line by line. There are some preparations I didn’t show. When the CPU reaches the code snippet, it’s register1 r10 point to the start of the integer array. r13 is the counter, i.e. the variable i in the Java code. Its initial value is 0. r9d contains the length of the array.

  1. dec dword ptr [r10+r13*4+10h] That’s the most sophisticated line, and the line that surprised me most. The entire line array[i]-- can be expressed by a single opcode. r10+10h is the pointer to the byte array, r13*4 is the offset within the array. By convention, the square brackets refer to the memory cell indicate by the number within the brackets. So dec [r10+r13+10h] takes a value of the integer array, decrements it and writes it back to memory.
  2. inc r13d increments the loop counter (the variable i).
  3. cmp r13d,r9d checks if the end of the loops has been reached.
  4. jl 27f0f69h is a conditional goto. Depending on the previous check, the instruction pointer is set four instructions back, running the loop again.
  5. It’s a bit off that loop2() runs slower than loop1(), so we should check the machine code, too. Maybe there’s a weird optimization?

    0x00000000027f1508: dec     dword ptr [r11+rbx*4+10h]
                                                  ; - CacheLines::loop2@18 (line 205)
    0x00000000027f150d: add     ebx,2h            ;*iinc
                                                  ; - CacheLines::loop2@19 (line 204)
    0x00000000027f1510: cmp     ebx,r10d
    0x00000000027f1513: jl      27f1508h          ;*if_icmplt
                                                  ; - CacheLines::loop2@24 (line 204)

    The code is almost identical. The only difference is the loop counter isn’t incremented, but added. This requires us to us a different opcode, add, which is slightly less efficient than inc because it takes a parameter.

    So we can conclude there’s nothing wrong with the generated code. Quite the contrary, it’s as good as can be. Maybe a machine code programmer who’s diligently counting CPU cycles can optimize the code even better, but it’s already pretty good.

    Variable step counts

    A surprise expects us when we don’t use fixed step sizes, but put them in a variable:

      private static void testVariableStep(final int step) {
        long start3 = System.nanoTime();
        long time3 = System.nanoTime() - start3;
        cumulated3 += time3;
      private static void loopVariableStep(final int step) {
        int length = array.length;
        for (int i = 0; i < length; i += step)

    The program slows down considerably, at least for low step numbers. The first loop is four times slower than its static counterpart. Plus, the first three loops don’t take the same time, but they scale linearly. After that, there seems to be a cache effect before the times start to scale linearly again:

    Step   1/10000 took 5363.778 ms  that's 100% of the expected time
    Step   2/10000 took 2703.319 ms  that's 100% of the expected time
    Step   4/10000 took 1428.127 ms  that's 106% of the expected time
    Step   8/10000 took 1024.124 ms  that's 152% of the expected time
    Step  16/10000 took  985.169 ms  that's 293% of the expected time
    Step  32/10000 took  492.46  ms  that's 293% of the expected time
    Step  64/10000 took  250.679 ms  that's 299% of the expected time
    Step 128/10000 took  122.265 ms  that's 291% of the expected time
    Step 256/10000 took   69.627 ms  that's 332% of the expected time

    So my first guess was something’s wrong with the step parameter. Obviously the step variable needs to be read from main memory over and over again.

    Wrong. The step variable is stored in a CPU register (edx), so the algorithm still is blazing fast. The problem is something completely unexpected:

    0x00000000027404b0: cmp     r9d,r10d
    0x00000000027404b3: jnb     27404d4h          ;*iaload
                                                  ; - CacheLines::loopVariableStep@15 (line 295)
    0x00000000027404b5: dec     dword ptr [r11+r9*4+10h]
                                                  ; - CacheLines::loopVariableStep@25 (line 294)
    0x00000000027404ba: add     r9d,edx           ; OopMap{r11=Oop rbp=NarrowOop off=61}
                                                  ; - CacheLines::loopVariableStep@25 (line 294)
    0x00000000027404bd: test    dword ptr [120000h],eax  ;   {poll}
    0x00000000027404c3: cmp     r9d,r10d
    0x00000000027404c6: jl      27404b0h          ;*synchronization entry
                                                  ; - CacheLines::loopVariableStep@-1 (line 293)

    Two things are odd:

    • The comparison i < length is performed twice (cmp r9,r10d, followed by a conditional jump in lines 1-2 and 11-12). The first check raises an IndexOutOfBoundsException if necessary.
      The duplicate check doesn't explain the difference. It only adds a few CPU cycles. Advanced CPU techniques like out-of-order-execution and branch prediction might eliminate the performance penalty almost completely2.
    • There's a test that appears to do nothing.

    test compares the parameters and sets a number of CPU flags that can be used by a conditional branch. However, those flags are overwritten in the next line: cmp r9, r10d (aka i < length).

    The purpose of the test instruction is to add a so-called safe-point. Safe-points allow the JVM to perform tasks like garbage collection. So every once in a while the JVM adds instruction to call the VM thread responsible for garbage collection. Oracle's JVM does this in a cunning way: the test instruction accesses a memory cell in a virtual memory page that can be unmapped by the JVM. When the JVM thinks it's time to do same garbage collection, it unmaps the virtual memory page and waits until every thread has run into the exception. Read the full details in Alexey Ragozin's article on Safepoints in HotSpot JVM.

    Checking the safe-point flag doesn't make too much sense if the memory cell is cached. So each loop causes a cache miss, i.e. an access to real memory3, which in turn brings the performance down.

    Loop unrolling

    Loop unrolling is one of the ideas you should never implement yourself. That's a classical JVM task. However, in this case the JVM doesn't unroll the loop by itself, so it might be interesting to see what happens if we do it ourselves.

    private static void loopVariableStepWithLoopUnrolling(final int step) {
      int length = array.length;
      for (int i = 0; i < length;) {
        i += step;
        i += step;

    The result isn't all that convincing:

    Step 1/10000 took 3216.271 ms  that's 100% of the expected time
    Step 2/10000 took 1828.833 ms  that's 113% of the expected time
    Step 4/10000 took 1234.583 ms  that's 153% of the expected time
    Step 8/10000 took 992.342 ms  that's 246% of the expected time
    Step 16/10000 took 982.522 ms  that's 488% of the expected time
    Step 32/10000 took 495.244 ms  that's 492% of the expected time
    Step 64/10000 took 251.165 ms  that's 499% of the expected time
    Step 128/10000 took 122.601 ms  that's 487% of the expected time
    Step 256/10000 took 86.526 ms  that's 688% of the expected time

    There are fewer safe-point tests, so the loop runs faster for small step numbers, but the method gets slower for higher increments. That's probably due to the internal CPU architecture: it can optimize small loops better than larger ones4. Plus, the JIT doesn't know your intention. Looking at the code, I'm pretty sure an assembler programmer could optimize it considerably, making loop unrolling profitable. But Java programmers hardly ever succeed.

    What about an optimizing compiler?

    To my surprise neither SUN nor Oracle ever implemented an optimizing byte code compiler. They leave it to the JIT. JIT does a couple of optimizations - e.g. method inlining, escape analysis, loop unrolling, just name a few. JIT optimizes the code remarkably well. Unfortunately, it's limited by having to do the analysis at run time. A human programmer intuitively notices that the loop will never raise an ArrayIndexOutOfBoundsException. So they could omit the check cmp r9,r10d. I guess it's possible to implement an optimizing byte code compiler that does the same. A manually optimized version using loop unrolling looks considerably different from what the JIT generates:

                   mov     r12, 100h
    START_OF_LOOP: dec     dword ptr [r11+r9*4+10h]
                   add     r9d,edx           
                   ; we know there's no array bounds check needed!
                   dec     dword ptr [r11+r9*4+10h]
                   add     r9d,edx  
                   dec     r12
                   jnz     CHECK_END_OF_LOOP         
                   test    dword ptr [120000h],eax  ;   {poll}
                   mov     r12, 100h
                   cmp     r9d,r10d
                   jl      START_OF_LOOP

    Reality check

    Can you apply this knowledge in everyday programming? Hardly.

    The example I showed above is a very special case that rarely occurs in business code. I don't know about your work, but I for one usually deal with objects. The example doesn't work for object arrays: Java uses a slightly uncanny memory model for object arrays.

    An integer array is simply a chunk of memory that contains the array. The first integer uses the first free memory cell (bytes 16-19), the second integer the second free memory cell (byte 20-23) and so on.

    An object array looks similar. The difference is the array doesn't contain the actual objects, but pointers to the objects. The objects themselves are allocated when the programmer calls the new statement. In other words: there's no predictable memory layout of the objects. More likely than not, objects are distributed randomly across the heap memory. In other words: loops touching every object in an array most likely suffer from a lot of cache misses.

    By the way, that's one of the reasons why Brian Goetz and his team want to add value types to Java. Arrays of value types have a cache-friendly linear memory layout.

    Other cache effects

    Did you notice I covered only one cache effect? It's an interesting effect because cache lines are widely unknown, but there's much more to discover. Igor Ostrovsky also demonstrates the effect of cache size and of cache associativity. I couldn't reproduce his example in Java yet, so I didn't include them in this article. However, if you manage to demonstrate one of the other effects, drop me a comment so we can write a follow-up article.

    Wrapping it up

    One of the key points of Java is to keep the programmer away from the hardware (or is it the other way round? :)). It runs on a virtual machine, it compiles to byte code, and once upon a time it used to run as an interpreter language only.

    How times have changed! These days, it's possible to follow the processor all the way down to the machine programming level. While that doesn't make you a better programmer, it gives you some background information that might be useful in future.

    Dig deeper

    A child's garden of cache effects
    Safepoints in HotSpot JVM

    The other articles of this series:

    A Java Programmer’s Guide to Byte Code
    A quick guide on writing byte code with ASM
    A Java Programmer's Guide to Assembler Language
    A Child’s Garden of Cache Effects

    1. a high-performance memory cell within the CPU
    2. Correct me if I'm wrong - that's merely an educated guess
    3. or a higher-level cache - I didn't test that
    4. again, that's an educated guess: correct me if I'm wrong

Leave a Reply

Your email address will not be published.