
GraalVM promises to be the Swiss army knife among JVMs. It runs a wide range of languages and allows you to use them side-by-side in the same application. But at its heart, GraalVM is still a virtual machine for Java. It's a plug-in replacement for the JVM. The question remains why you should make the switch. Which advantages does GraalVM have over OpenJDK? If we're using GraalVM to run our good old Java application, we can't expect any killer feature. So the most convincing feature of GraalVM is speed. Today we examine how fast GraalVM really is, and - much more interesting - what fuels the performance of both GraalVM and OpenJDK.

GraalVM as a plug-in replacement to the traditional HotSpot compiler

A high-level perspective on the GraalVM architecture looks like so:

Today, we'll concentrate on the gray box. You can use the GraalVM to run every language compiling to Java bytecode. Actually, that's where the GraalVM story started. The AOT compiler and Truffle are more recent additions. Nowadays, GraalVM even ships with tools like VisualVM and Flight Recorder. You won't miss anything.

Java compilers as a plugin

The magic keyword in the architecture sketch is the small box called "JVMCI". You know, the name "GraalVM" is sort of a misnomer. The vast majority of your GraalVM installation is identical to your good old JDK. The chief difference is the compiler. More to the point, one of the two compilers, the C2 compiler.

If you've never heard about the C2 compiler before, have a look at our previous article about bytecode and the two JVM compilers. It provides you with all the background knowledge you need to read on.

That settled, let's examine the differences between the HotSpot C2 compiler and the Graal compiler. In particular, the optimization strategies caught our eye. That's where performance is gained and lost.

Speculative optimization

Let's get our hands dirty. And we're doing it all wrong - at least according to the rules of proper benchmarking. Violating these rules can be instructive: it shows how the compiler works internally.

We've found an example in an article warning about the perils of micro-benchmarking, written by Angelika Langer and Klaus Kreft. Usually, when you're running a benchmark, you don't want the JIT compiler to mess up your results. The article describes how to use the JMH framework to avoid such traps. Today, we'll head in the opposite direction. We'll do without JMH.

That said, please have a look at our repository on GitHub:

    interface Counter {
        int inc();
    }

    class Counter1 implements Counter {
        private int x;
        public int inc() { return x++; }
    }

    class Counter2 implements Counter {
        private int x;
        public int inc() { return x++; }
    }

    class Counter3 implements Counter {
        private int x;
        public int inc() { return x++; }
    }

    public class Measure {
        public static void main(String[] args) {
            measure(new Counter1());
            measure(new Counter2());
            measure(new Counter3());
        }

        public static void measure(...) {...}
    }

The measure method calls Counter.inc() one hundred million times. The important bit is that measure calls the counter through the interface, not through the class containing the implementation. So the JVM has to look up the actual implementation on each call.
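The body of measure is elided in the listing above, so here's a minimal sketch of what such a method could look like. The class name MeasureSketch, the calls parameter, and the checksum are our assumptions for illustration, not the code from the repository. The checksum merely keeps the result of inc() alive so the loop can't be dropped as dead code.

```java
class MeasureSketch {
    interface Counter { int inc(); }

    static final class Counter1 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    static final class Counter2 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    static final class Counter3 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    // Hypothetical stand-in for the elided measure(...) method.
    static int measure(Counter counter, int calls) {
        long start = System.nanoTime();
        int checksum = 0;
        for (int i = 0; i < calls; i++) {
            checksum += counter.inc(); // call through the interface, not the class
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(counter.getClass().getSimpleName()
                + ": " + millis + " ms, checksum " + checksum);
        return checksum;
    }

    public static void main(String[] args) {
        int calls = 100_000_000;
        measure(new Counter1(), calls);
        measure(new Counter2(), calls);
        measure(new Counter3(), calls);
    }
}
```

The absolute numbers depend on your machine and JVM, so don't expect to see exactly the timings reported in this article.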

The first line of main does precisely the same as the second and the third: it creates a new CounterX and calls its inc() method one hundred million times. So how long do you expect the first, the second, and the third call to take? It stands to reason that identical algorithms run at the same speed, doesn't it?

We could hardly believe the result. measure(new Counter1()) runs roughly four to five times faster than measure(new Counter2()). Not to mention measure(new Counter3()), which is even slower. What's going on?

Iteration    Counter 1    Counter 2    Counter 3
1            48 ms        185 ms       288 ms
2            45 ms        172 ms       290 ms
5            33 ms        177 ms       286 ms

Measured with OpenJDK 64-Bit Server VM, AdoptOpenJDK 15.0.1+9

By the way, we ran the benchmark with OpenJDK 15 first. We'll come back to GraalVM in a minute. It's also important to know the order of the measurements: we took the times for counter 1 first, after that we ran counter 2 five times, and finally we finished with five iterations of counter 3.

The speedup within each column shows the effect of the optimizing compiler. The longer the program runs, the more it's optimized. The first row also suffers from the slow interpreter that's employed first. However, we've double-checked that the compiler kicks in very fast. The benchmark is small enough to be compiled during the first hundred milliseconds. So let's focus on the difference between the columns.

That difference shows the program slows down tremendously when Counter2 is used. It slows down even more when we start using Counter3.

Speculative method call optimization explained

What we're seeing is the effect of a speculative optimization.

Like many other modern programming languages, Java uses virtual method calls by default, allowing you to override the method. From a compiler's perspective, that's bad news. In assembly language, there's an opcode for calling a function. It's pretty fast. According to an instruction table we've found, it takes 2 to 22 CPU cycles. In more familiar terms, a "near call" takes roughly one nanosecond on a fast CPU. "Far calls" take up to ten nanoseconds. Simplifying things a bit, "near" and "far" refer to the distance of the methods in your source code. That's an interesting insight, by the way: moving a frequently-called method closer to the line calling it may have an impact on performance.[1]

Thing is, the operand of the assembly-language opcode is a constant. In other words, the compiler decides at compile time which method in which class is called. Virtual method calls aren't constant at all. The JVM has to look up the correct method at runtime. Which method is called depends on the class hierarchy. measure() always calls Counter::inc, but the method that Counter::inc refers to changes over time. It's a moving target. In our case, it's Counter1::inc at first. Later it's Counter2::inc, until Counter3::inc takes over a short time later.

Here's the catch. Granted, interfaces and class hierarchies are popular. Modern software development relies heavily on these tools. But that doesn't mean there are many implementations to an interface. In real life, that rarely happens. If there's a second implementation of an interface in your project, it's most likely a test class. It won't make it into production.

So the C2 compiler is allowed to cheat. It uses the fast opcode to call the method of the single implementing class that's really used. From a general perspective, that's wrong. In our example, that works half a billion times. But when we call measure(new Counter2()), it turns out the assumption was wrong. The C2 compiler knows that may happen, so it adds a guard to the method call. If the assumption "there's no counter but Counter1" breaks, the guard immediately de-optimizes the code.
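To make the idea of a guarded call more tangible, here's a hand-written Java picture of it. It's only a sketch under our own assumptions: the real compiler emits machine code, not Java, and the names GuardSketch, callSpeculated, and the deoptimizations counter are hypothetical stand-ins.

```java
class GuardSketch {
    interface Counter { int inc(); }

    static final class Counter1 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    static final class Counter2 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    // Stand-in for the real de-optimization: we only count how often the guard fails.
    static int deoptimizations = 0;

    // What the source code says: a virtual call through the interface.
    static int callVirtual(Counter c) {
        return c.inc();
    }

    // Rough picture of the speculative version: "there's no counter but Counter1".
    static int callSpeculated(Counter c) {
        if (c.getClass() == Counter1.class) { // the guard
            Counter1 c1 = (Counter1) c;
            return c1.x++;                    // body of Counter1.inc(), inlined
        }
        deoptimizations++;                    // assumption broken
        return c.inc();                       // fall back to the general call
    }
}
```

As long as only Counter1 instances arrive, the guard is a single cheap comparison. The first Counter2 instance takes the slow path.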

De-optimization means that the machine code is thrown away. For a short time, the JVM uses interpreted mode or the C1 compiler again. A background thread starts the C2 compiler again, this time with updated assumptions. The compiler takes into account there are two different implementations to choose from.

When the program reaches measure(new Counter3()), the assumption turns out to be wrong again. Back to start. This time, the C2 compiler generates the code for the general case, working for an arbitrary number of implementations of the interface. As a result, the program slows down again.

Come to think of it, that's pretty surprising. The JIT compiler deliberately does something that might be wrong. It doesn't want to do the wrong thing, so it has to check the assumptions before every single call. This guard alone takes some time. But it's still faster than going the extra mile of implementing an actual polymorphic call. De-optimization is even more painful. But again, risking a de-optimization here and there almost always pays off. Our benchmark is the exception: it has carefully been designed to break the assumptions of the JIT compiler.

Watching the compiler at work

We can even see this. The Ideal Graph Visualizer tool allows us to visualize the log files GraalVM generates when adding the parameter -Dgraal.Dump. Strictly speaking we're still talking about OpenJDK, but the basic idea is similar, so the graph shows the idea for both compilers - at least at a high level. The graphs tend to get large, so we show only a small part of them. You can see the full-size graphs in our GitHub repository.

As long as only Counter1 is used, the graph looks like so:

You can see an if-statement added by the C2 compiler. It checks if it's still true that Counter1 is the one and only implementation of Counter. If it's not, the red box is executed. It's a de-optimization. In other words, the compiled code is thrown away and replaced by a more refined version. That takes a while. In the meantime, the program is run by the C1 compiler or even by the interpreter.

Running the benchmark on GraalVM

When we ran the benchmark on GraalVM, we were disappointed. It ran the code faster, but not by much. Plus, the results were confusing. But when we stopped using the community edition in favor of the enterprise edition of GraalVM, the results became very convincing. Unfortunately, we aren't allowed to publish any benchmark results of the enterprise edition, so we'll stick with the community edition results.

At this point we'd like to send kudos to the people populating the GraalVM Slack channel. They helped us with tons of information and pointed out several flaws of our benchmark. Awesome!

Speculative execution allows for a super-fast Counter1. The other two counters are slower, but there's not the heavy penalty we've observed before.

If you're the curious one, you can use JITWatch written by Chris Newland to examine the assembly code. That's what Ionuț Baloșin did in his remarkable analysis. The Ideal Graph Visualizer also shows what's going on. The if-statement has become a type switch, and there's a direct call for all three implementations:
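In source-level terms, such a type switch can be pictured roughly like the following sketch. Again, this is our hand-written illustration, not compiler output, and the names TypeSwitchSketch and callWithTypeSwitch are hypothetical.

```java
class TypeSwitchSketch {
    interface Counter { int inc(); }

    static final class Counter1 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    static final class Counter2 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    static final class Counter3 implements Counter {
        int x;
        public int inc() { return x++; }
    }

    static int callWithTypeSwitch(Counter c) {
        Class<?> type = c.getClass();
        // One exact type check per known implementation. After each check the
        // target method is known, so no lookup is needed for the call.
        if (type == Counter1.class) {
            return ((Counter1) c).inc();
        }
        if (type == Counter2.class) {
            return ((Counter2) c).inc();
        }
        if (type == Counter3.class) {
            return ((Counter3) c).inc();
        }
        // Unknown implementation: fall back to the ordinary virtual call.
        return c.inc();
    }
}
```

The price is up to three comparisons per call, which fits the observation that the second and third counter are slower than the first, but not dramatically so.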

Data flow analysis

A quick glance at our benchmark shows it's a big waste of time. It spends a lot of time incrementing counters, but the result is always the same. It's just a bit confusing because of the integer overflows. You can easily replace the loops by returning a constant. GraalVM can often detect such a situation. Martijn Dwars shows this using an example that's not trivial at all.
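A scaled-down illustration of the idea: the two methods below return the same value, but the second one does no work at all. We use 1,000 iterations instead of one hundred million to stay clear of integer overflow. The class and method names are our own, and whether a particular compiler actually performs this transformation is not something this sketch proves.

```java
class ConstantFoldSketch {
    // What the programmer wrote: increment a counter and add up the results.
    static int loopVersion() {
        int x = 0;
        int sum = 0;
        for (int i = 0; i < 1_000; i++) {
            sum += x++; // adds 0, 1, 2, ..., 999
        }
        return sum;
    }

    // What data flow analysis could reduce it to: 0 + 1 + ... + 999 = 999 * 1000 / 2.
    static int foldedVersion() {
        return 499_500;
    }
}
```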

More generally speaking, data flow analysis gathers information about the data used in an application. The compiler can use this information to generate optimized code. The beauty of a JIT compiler is it's even able to generate different code depending on the user input.

Interprocedural optimization

Aleksandar Prokopec reports that GraalVM optimizes across methods. Traditionally, JVM compilers only optimize individual methods. That makes sense because it keeps compilation fast. However, this also amounts to ignoring knowledge you can use to optimize the code. Aleksandar demonstrates this with this code snippet:

    // shortened version of the source code at https://medium.com/graalvm/stream-api-performance-with-graalvm-be6cfe7fbb52
    public double averageAge(Person[] people) {
        return Arrays.stream(people)
            .filter(p -> p.age >= 18 && p.age <= 21)
            .mapToInt(p -> p.age)
            .average()
            .getAsDouble();
    }

    class Person {
        public final int age;

        // constructor added so the final field is initialized
        Person(int age) { this.age = age; }
    }

    /** this is an excerpt of a JVM class */
    public class OptionalDouble {
        public double getAsDouble() {
            if (!isPresent) {
                throw new NoSuchElementException("No value present");
            }
            return value;
        }
    }

Have a look at the method getAsDouble(). Chances are that's a method you're using frequently without thinking twice. However, it's not simply a getter. It may throw an exception.

To human readers, it's obvious this won't happen as long as at least one person passes the filter. The field age is a primitive int, so there are no null values to worry about, and the only way to end up with an empty Optional is an empty stream. It's just an unlucky fact that IntStream::average returns the wrapper class OptionalDouble instead of the primitive type double. So there's an extra check that's hardly ever needed.

Interprocedural optimization allows GraalVM to get rid of it. It doesn't stop at method boundaries, but optimizes the entire AST. That makes it easy to see the Optional is practically never empty, so GraalVM optimizes the check away. It just adds a guard, so it can de-optimize the code if the assumption turns out to be wrong.
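Here's a hand-written picture of what the code boils down to once the stream pipeline and getAsDouble() are inlined into one method. It's a sketch under our own assumptions: we pass the ages as an int[] to keep the example self-contained, and the names AverageAgeSketch, viaStream, and inlinedByHand are hypothetical. The remaining check on count plays the role of the guard.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

class AverageAgeSketch {
    // Stream version, following the snippet above.
    static double viaStream(int[] ages) {
        return Arrays.stream(ages)
                .filter(age -> age >= 18 && age <= 21)
                .average()
                .getAsDouble();
    }

    // Rough picture of the code after inlining everything into one method.
    static double inlinedByHand(int[] ages) {
        long sum = 0;
        long count = 0;
        for (int age : ages) {
            if (age >= 18 && age <= 21) {
                sum += age;
                count++;
            }
        }
        if (count == 0) {
            // Rare path: the only case in which the Optional would be empty.
            // A speculating compiler may replace this branch with a de-optimization.
            throw new NoSuchElementException("No value present");
        }
        return (double) sum / count;
    }
}
```

Both methods return the same value for the same input. They also fail in the same way if nobody passes the filter.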

Optimizing chains of Lambdas

Ionuț Baloșin reports another interesting optimization. Chaining Lambda functions has become popular in the Java world. In most cases, it's easy for a human programmer to replace them with their much faster procedural counterparts, but it seems to be quite a challenge for JVMs. GraalVM does a remarkably good job at optimizing long chains of Lambdas. In Ionuț's example, they even come for free, which is truly the holy grail of optimization. Again, we can only speculate: interprocedural optimization allows the compiler to inline the function calls, replace the Lambda functions with ordinary function calls, and get rid of unnecessary boilerplate code using escape analysis. When the optimizer has performed its magic, we end up with the equivalent of the good old for loop. It's just a little slower because of the occasional de-optimization guard.
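To illustrate what "the equivalent of the good old for loop" means, here's a small sketch. It is not Ionuț's benchmark. The three Lambdas and the names LambdaChainSketch, viaLambdas, and viaLoop are made up for this example.

```java
import java.util.function.IntUnaryOperator;

class LambdaChainSketch {
    // Three Lambdas chained with andThen: x -> ((x + 1) * 2) - 3
    static int viaLambdas(int n) {
        IntUnaryOperator addOne = x -> x + 1;
        IntUnaryOperator chain = addOne
                .andThen(x -> x * 2)
                .andThen(x -> x - 3);
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += chain.applyAsInt(i); // three nested interface calls per iteration
        }
        return sum;
    }

    // The procedural counterpart an optimizer would ideally arrive at.
    static int viaLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (i + 1) * 2 - 3; // all three Lambda bodies inlined
        }
        return sum;
    }
}
```

Both methods compute the same sum. The question the benchmarks try to answer is how close the compiled code of the first gets to the second.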

That's such a nice theory we had to repeat the test ourselves. The first result was devastating: GraalVM ran the average-age demo slower than AdoptOpenJDK. A lot slower. So we took the benchmark written by Ionuț Baloșin and ran it. That was disappointing, too. This time GraalVM 20.3.0 was faster than AdoptOpenJDK 15, but we couldn't reproduce the massive slowdown Ionuț reports. That showed only when we tried AdoptOpenJDK 11.

OpenJDK 64-Bit Server VM GraalVM CE 20.3.0 (build 11.0.9+10-jvmci-20.3-b06, mixed mode, sharing)

Iteration    Chaining Lambdas    for loop
1            362 ms              123 ms
2            272 ms              92 ms
3            277 ms              89 ms
4            295 ms              89 ms
5            275 ms              90 ms

Cutting a long story short, GraalVM optimizes expertly, but that doesn't mean the development of OpenJDK has come to an end. Quite the contrary. OpenJDK 15 copes a lot better with Lambdas than earlier versions. Sometimes GraalVM is faster, sometimes OpenJDK is faster. It's hard to predict which one has an edge in your particular use case. So when a whitepaper boasts of a 20%-30% performance boost, take it with a grain of salt. In this particular case, the baseline was Java SE 8.

Other optimizations

Both the HotSpot compiler and the GraalVM compiler ship with many more optimizations. Let's concentrate on the differences. According to Ionuț Baloșin, there are several distinguishing optimizations of the GraalVM:

  • Partial escape analysis. This idea has been around since 2014. If the compiler detects that an object is used only locally, at least on the frequently executed paths, it can replace the object's fields by local variables. This usually means there's no need to allocate memory for the object, and often it suffices to hold its values in CPU registers, eliminating memory access entirely.
  • Improved inlining.
  • Guard optimizations enabling highly speculative code. This is another idea, dating from 2013. It has its biggest impact not in the Java language, but in languages like JavaScript or Ruby. For instance, there's no need to compile the else part of an if statement as long as the condition is always true. If you're worried about the cost of the ubiquitous null-checks, this optimization is your friend. The expensive error handling code you've implemented in the else branch is never called, so why bother compiling it? Just pretend it's not there.
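The first item of this list, partial escape analysis, can be pictured with a small hand-written sketch. The Point class, the lastOutlier field, and both method names are hypothetical. The second method shows by hand what the optimization aims for. It is not actual compiler output.

```java
class EscapeSketch {
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Storing an object here makes it "escape" the method that created it.
    static Point lastOutlier;

    // As written: a Point is allocated on every call.
    static int asWritten(int x, int y) {
        Point p = new Point(x, y);
        if (p.x < 0) {
            lastOutlier = p; // the object escapes only on this rare branch
        }
        return p.x + p.y;
    }

    // Rough picture after partial escape analysis: the allocation moves into the
    // rare branch, and the common path works on plain local values.
    static int afterPartialEscapeAnalysis(int x, int y) {
        if (x < 0) {
            lastOutlier = new Point(x, y);
        }
        return x + y;
    }
}
```

On the common path, no object is allocated at all, so there's nothing for the garbage collector to clean up later.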

However, there are also several optimizations missing in GraalVM, according to the analysis of Ionuț Baloșin in early 2019:

  • The HotSpot compiler optimizes loops nicely using vectorization. The classic example is multiplying two mathematical vectors, but I suppose it also optimizes business code. If the loop iterations are independent of each other, several of them can be performed in parallel. According to an Oracle whitepaper, vectorization in GraalVM is an exclusive feature of the enterprise edition. But this is an open-source blog, so the enterprise edition of GraalVM is out of our scope.
  • Loop unrolling is a similar technique. A for loop running n times can be optimized by simply copying the loop body n times. That's more efficient because checking the loop variable and jumping to the start of the loop costs time. The long pipelines of modern CPUs add to the cost. A wrong branch prediction can easily cost as much time as 10 to 20 instructions.
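Loop unrolling is easy to demonstrate by hand. The following sketch unrolls a summation loop by a factor of four, which is a partial unrolling rather than copying the body n times. The factor and the names UnrollSketch, sumPlain, and sumUnrolledBy4 are our own choices. A real compiler picks the factor itself.

```java
class UnrollSketch {
    // Plain loop: one loop check and one jump per element.
    static long sumPlain(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Unrolled by four: one loop check and one jump per four elements.
    static long sumUnrolledBy4(int[] a) {
        long sum = 0;
        int i = 0;
        int limit = a.length - (a.length % 4); // largest multiple of 4 <= length
        for (; i < limit; i += 4) {
            sum += (long) a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) {            // remainder loop for the leftovers
            sum += a[i];
        }
        return sum;
    }
}
```

As the footnote of this article says about low-level tricks in general, that's a job better left to the JIT compiler.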

There are many more optimizations (just have a look at the talk "Down the rabbit hole"), but by now, you've got the gist of it. Both GraalVM and HotSpot are highly optimizing compilers. Sometimes GraalVM generates better code; sometimes, it generates worse code.

Digression: what about the J9 compiler? What about Java 13?

There's more to the Java universe than Java 11 and GraalVM. There's Azul, there's Corretto, there's Alibaba Dragonwell, just to name a few. Too many choices to cover in a single article. We just looked at three other alternatives:

  • The Graal compiler is part of OpenJDK 11. You can activate it with two JVM parameters: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI. When we did so, we were utterly disappointed. We didn't see any difference. No advantage, no disadvantage, no incompatibility. That's great, but we expected more. One possible reason: JEP 317 also lists -XX:+UseJVMCICompiler, which is the switch that selects Graal as the JIT compiler. After some research, we also discovered that the commits to the repository of AdoptOpenJDK stop in summer 2018. This means the improvements of the last two years are missing. If you're concerned about performance, download GraalVM directly.
  • If Java 11 is good, surely Java 13 is greater? That may be true, feature-wise. However, we were surprised to observe a small performance penalty after upgrading to Java 13, whatever that means. Maybe the long-term support (LTS) versions are optimized better. Maybe we just had bad luck, choosing a less-optimized version of Java. As we've noticed above, the performance of Java 15 has improved again. It runs the test two or three percent faster than Java 11. Truth be told, that's below the measurement accuracy of our setup. Remember, we deliberately didn't follow the well-established best practices of benchmarking, so there are many things messing up our results. Repeated test runs vary by up to five percent.
  • AdoptOpenJDK lets you choose between two different compilers. Apart from using the HotSpot compiler, you can also choose the J9 compiler. It originated in the realm of Eclipse and IBM. J9 ran the microbenchmark of Angelika slower than its contenders, but it started the BootsFaces showcase 10% faster than GraalVM and 20% faster than HotSpot.

Wrapping it up

GraalVM is a new compiler, based on modern concepts, written in Java, and published as an open source project on GitHub. All this opens a window of opportunity to deliver superior performance in the future. As for today, our tests show it's more or less on par with the traditional HotSpot compiler. In particular, we didn't encounter any compatibility problems - with the exception of Eclipse on macOS. Try as we might, it wouldn't start because of a rogue compatibility check.

The beautiful thing about GraalVM is that it's merely a plug-in. The JVMCI API introduced with Java 9 allows us to switch between the HotSpot compiler and GraalVM easily.

We're curious about what the future has in store for us. GraalVM is a fresh, new start, shedding the burden of 20+ years. Everybody familiar with Java and with compilers can add their ideas and improvements to GraalVM. That's a considerable number of developers, so chances are we'll see exciting advances in the future.

About the co-author

Karine Vardanyan is pursuing her master's degree at the Technical University of Darmstadt, Germany. Until recently, she worked at OPITZ CONSULTING, where she met Stephan. She's interested in artificial intelligence, chatbots, and all things Java.

Dig deeper

Ionuț Baloșin's talk comparing the optimization strategies of GraalVM and HotSpot (October 2019)

Detailed results of Ionuț Baloșin's performance tests (November 2019)

Down the rabbit hole - Charles Nutter's talk about "what the JVM does when you're looking away"

GraalVM: Run Programs Faster Everywhere, a talk by Alina Yurenko

Oracle whitepaper on GraalVM Enterprise Edition

Practical Partial Evaluation for High-Performance Dynamic Language Runtimes

Partial Escape Analysis and Scalar Replacement for Java

An Optimization-Driven Incremental Inline Substitution Algorithm for Just-in-Time Compilers

An Intermediate Representation for Speculative Optimizations in a Dynamic Compiler

Ionuț Baloșin on chaining Lambda optimizations

Aleksandar Prokopec on Stream API performance with GraalVM

Angelika Langer and Klaus Kreft about the perils of micro-benchmarking

Our repository on GitHub containing the demos of this series of articles

  1. Note that this is an extremely low-level optimization. Use it only after trying everything else. In the long run, it's better to keep your code clean than to try to do the job of the JVM optimizer yourself.