
Spectre and Meltdown came as a shock. They showed that low-level CPU optimizations have an impact on our lives. They also shattered an illusion many programmers hold dear: yes, you're writing in a high-level language, but all those abstraction layers don't shield you completely from the CPU. Some time ago, I demonstrated the effect of the CPU cache on a Java program.

Spectre and Meltdown took this to another level. If my sources are right (I can still hardly believe it!), it's possible to observe the effect of speculative execution from JavaScript. Speculative execution is a very low-level CPU optimization. JavaScript is a very high-level language - it doesn't even use bytecode, and its design makes compiling it a challenge. But that's a story for another day.

The link between my click-bait introduction and the topic of the day is the RAM of your PC. CPUs have evolved much faster than main memory. Nowadays, CPUs spend most of their time waiting. A modern CPU can perform several instructions per CPU cycle. But if it has to access main memory because the data is in none of the three cache levels, it has to wait for roughly a hundred cycles. See this article on Extremetech.com for more details.
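
If you want to see the gap for yourself, here's a minimal, admittedly unscientific Java sketch of my own (not the measurement from my earlier article): it sums the same array twice, once sequentially and once with a large stride. The strided run defeats the caches and is noticeably slower on typical hardware; for serious measurements you'd use a harness like JMH.

public class CacheEffectDemo {
    public static void main(String[] args) {
        int[] data = new int[16 * 1024 * 1024];    // 64 MB, far bigger than any CPU cache
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < data.length; i++) {
            sum += data[i];                         // sequential access: mostly cache hits
        }
        long t1 = System.nanoTime();
        for (int start = 0; start < 4096; start++) {
            for (int i = start; i < data.length; i += 4096) {
                sum += data[i];                     // 16 KB stride: mostly cache misses
            }
        }
        long t2 = System.nanoTime();

        System.out.println("sequential: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("strided:    " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println(sum);                    // keep the loops from being optimized away
    }
}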

Escape analysis in Java

Since Java 6, the JVM uses a downright spectacular optimization technique to reduce the memory footprint. The funny thing is that there are many urban myths around this technique. They say escape analysis is a tool that moves memory from the heap to the stack. That's nonsense, but it's interesting nonsense, so I'll cover the idea in this article nonetheless.

Some time ago, I wrote on this blog that escape analysis allows getting rid of the synchronized blocks. As far as I know, that's correct (although some people on StackOverflow.com say the opposite). But it's only half of the story.

As it turns out, escape analysis isn't an optimization technique in itself. It's "just" an analysis allowing for interesting optimizations. Escape analysis examines a variable, looks at where it's used and detects whether it's used outside a certain scope. If it doesn't escape that scope, we know it's effectively local to that scope, and we can examine it more closely. Variables used only in a limited scope allow for many optimizations.

Lock elision

One of these optimizations is lock elision. Have you ever had trouble with multiple threads accessing a common variable? If so, you probably surround many parts of your program with synchronized blocks. Sometimes you even have to do so because you just don't know how your method is going to be used. A classical example is the StringBuffer implementation. The methods of StringBuffer use synchronized a lot, just in case:

public synchronized int length() { return count; }

Thing is, almost every application uses the StringBuffer in a single-threaded way. In other words, acquiring a lock and protecting the StringBuffer from simultaneous accesses from other threads is an expensive waste of time.

Escape analysis solves this problem. In many cases, it recognizes that the StringBuffer never leaves a single method. So the JVM knows for sure there's one and only one thread accessing it. In other words: the synchronized statement is superfluous. It can safely be omitted.
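
Here's a minimal sketch of the kind of code that benefits (the method name is mine, not from the JDK): the StringBuffer never leaves the method, so the JIT can prove that no other thread will ever see it and may elide the locks taken by append() and toString().

static String join(String[] parts) {
    StringBuffer sb = new StringBuffer();   // purely local, never escapes this method
    for (String part : parts) {
        sb.append(part);                    // synchronized, but the lock can be elided
    }
    return sb.toString();                   // only the immutable String escapes
}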

In the meantime, Aleksey Shipilёv has published an article demonstrating the effect of this optimization with both performance measurements and the resulting assembly code. Check out his article for all the gory details.

Moving objects from the heap to the stack

Many sources report that escape analysis moves Java objects from the heap to the stack. As Aleksey Shipilёv points out in his article about scalar replacement, the JVM does no such thing. It's just a misconception. But it's an interesting one, and I wonder why the JVM doesn't implement it.

What does escape analysis tell us? It tells us which variables are used exclusively within a certain scope. For example, it tells us that a variable is used only as a local variable of a method and that this variable is not going to be used by another method. In particular, it's not a return value.

Among other things, that means that the variable is not used by any other thread. So we can safely store it in thread-local memory.
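
A minimal sketch of the distinction, with class and method names of my own invention:

static class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

// Does not escape: the Point lives and dies inside this method,
// so it's a candidate for the optimizations described below.
static int lengthSquared(int x, int y) {
    Point p = new Point(x, y);
    return p.x * p.x + p.y * p.y;
}

// Does escape: the Point is handed to the caller as the return value,
// so the JVM has to allocate it on the heap for real.
static Point origin() {
    return new Point(0, 0);
}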

Apart from the CPU registers, the fastest thread-local memory is the stack.

What is the stack and why is it faster than the heap?

Basically, the stack is where the return addresses of function calls are stored. Before calling a method or a function, the CPU stores the address of the next instruction - the return address - in a special memory region, the stack. If the called function in turn calls a nested function, the new return address is piled on top of the old one. So you've got a stack of return addresses. At the end of the function, the CPU pops the topmost address from the stack, stores it in the instruction pointer and uses it to determine where to continue the program execution.

Obviously, the stack plays an important role in any program. So it's not surprising that stack operations are heavily optimized. The top of the stack is almost always in the fast L1 cache, so storing data on the stack and retrieving it again is very fast. Plus, the pointer to the top of the stack is held in a dedicated CPU register (the stack pointer), which speeds up stack access even further.

Putting local variables on the stack

At the same time, every CPU I know of allows programmers to store user data on the stack. That's a very convenient way to implement local variables. I've briefly described the idea in my articles about Java bytecode and about assembly language. Modern compilers do their best to keep values in registers instead of on the stack, but the basic idea is simple: put every local variable on the stack. Before returning from the function, remove every local variable from the stack again. Basically, that amounts to adding a number to the stack pointer, so it costs just a single CPU cycle. After that, the local variables are inaccessible - just the way it should be.

As a side effect, you don't have to deal with garbage collection. Garbage collection is needed for objects whose lifetime isn't known in advance. A local variable on the stack, by contrast, dies when the function returns, so you can remove it without further ado. That's a major performance benefit.

Object destructuring in Java

However, that's precisely what Java does not do (at least not Java 8, and if I'm not mistaken, Java 9 doesn't, either). More precisely: Java doesn't store any object on the stack. It does store local variables of primitive types like int or boolean on the stack, and it also stores references to objects there. But it doesn't store the objects themselves on the stack. Anything you create with new is allocated on the heap.
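
In code, that distinction looks like this (a minimal sketch of my own):

void example() {
    int count = 42;                // primitive local: lives in the stack frame (or a register)
    Object box = new Object();     // the reference 'box' lives on the stack,
                                   // but the Object it points to is allocated on the heap
}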

Instead, since Java 6, it does something much more exciting: scalar replacement.

Consider this code (shamelessly copied from Aleksey Shipilёv's great article):

public static void main(String... args) {
    int x = args.length;              // some integer input
    MyObject o = new MyObject(x);
    System.out.println(o.x);
}

static class MyObject {
    final int x;

    public MyObject(int x) {
        this.x = x;
    }
}

A close look at this code reveals that the MyObject instance itself is never really needed. All that's really needed is its field x. That's one of the things the escape analysis algorithm of the JVM can detect.

Actually, it won't detect it in my version of the code, because I used the main method for the sake of simplicity. In other words, the method is called only a single time. Aleksey's original code used a regular method and called it thousands of times. That's important because the JVM's JIT compiler and optimizer kick in only after a certain threshold. As a rule of thumb, it takes up to ten thousand invocations until the code is fully optimized. The details depend on how you've configured the JVM and on your Java code.
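
So if you want to watch the optimizer at work, you have to warm the method up first. A sketch of the idea, assuming a hypothetical compute() method that wraps the allocation (the exact threshold depends on the JVM version and flags such as -XX:CompileThreshold):

static int compute(int x) {
    MyObject o = new MyObject(x);
    return o.x;
}

public static void main(String... args) {
    int sum = 0;
    for (int i = 0; i < 100_000; i++) {
        sum += compute(i);          // after enough calls, the JIT compiles and optimizes compute()
    }
    System.out.println(sum);        // use the result so the loop isn't eliminated
}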

Scalar replacement

Let's assume the code is fully optimized. In that case, the JVM detects that the MyObject instance never escapes the method that creates it. Next, it detects that the object itself is never needed - the only thing that's really used is its field x.

So the JVM eliminates the object from the equation. It stops creating new objects. It just works with the integer x. Awesome! Mind you: without this optimization, the object would have to be created on the heap. We've seen above that it would suffice to create it on the stack, but that's something Java doesn't do. Even if it did, replacing a full-blown object with a simple 32-bit integer is a huge performance boost.
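
Conceptually - this is just a sketch of the effect, not what the JIT literally emits - the optimized compute() method from above behaves as if it had been written like this:

static int compute(int x) {
    int o_x = x;    // the field MyObject.x, promoted to a plain local (or a register)
    return o_x;     // no allocation, no constructor call, no garbage to collect
}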

Better than using the stack

In this particular case, scalar replacement is even better than just putting the object on the stack. First, the object isn't created at all - it's just as if it had never been defined by the programmer. Second, all that's left is a single integer, and the JVM can easily keep that value in a CPU register. So neither the object nor the variable x ever materializes in RAM. The value stays inside the CPU.

Limitations to scalar replacement

Too bad Aleksey also shows the limits of the JVM optimizer: a simple if statement suffices to confuse it. So I wonder how much good this particular optimization does in real-world applications.
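
The problematic pattern looks roughly like this (adapted from the shape Aleksey demonstrates, not copied verbatim): the object is allocated on two different branches, and after the branches merge, the JIT can no longer track the allocation precisely, so scalar replacement gives up even though the object still doesn't escape.

static int branchy(int x) {
    MyObject o;
    if (x % 2 == 0) {
        o = new MyObject(x);        // allocation on one branch...
    } else {
        o = new MyObject(x + 1);    // ...and a second one on the other branch
    }
    // After the control flow merges, the optimizer loses track of which allocation
    // 'o' refers to, so the object ends up on the heap even though it never escapes.
    return o.x;
}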

Wrapping it up

Be that as it may, I'm deeply impressed by escape analysis and the optimizations it enables. I guess there's more to come in the future. Researchers are actively investigating the topic. For example, have a look at Partial Escape Analysis and Scalar Replacement for Java by Lukas Stadler.

Kudos

This article was inspired by Aleksey Shipilёv and his article about scalar replacement. I've done my best to add some extra value and background information to it. But that doesn't stop me from recommending that you read Aleksey's article yourself. Truth be told, I recommend the entire series of articles he's written. Thanks, Aleksey!


Dig deeper

Synchronization optimizations in Mustang (aka Java 6)

Escape analysis

Java™ HotSpot Virtual Machine Performance Enhancements

Java theory and practice: Urban performance legends, revisited by Brian Goetz (2005)

c++ - Which is faster: Stack allocation or Heap allocation - Stack Overflow

