
Contemporary Java shows a couple of unique traits and mannerisms. When Lambda functions became part of Java after many years of heated discussion, the Java community used them to do surprising stuff. I can't escape the impression that programmers felt obliged to embrace functional programming wholeheartedly once it came within reach.

And they do. Chaining functions is strangely attractive to many programmers. Sometimes I agree, sometimes I don't. I'm always a bit puzzled when people replace an if statement with a combination of Optional.of(), filter(), and orElseGet(). The rationale is that nested if statements tend to get messy, while chains of functional operators look a lot cleaner. This example shows the idea - and it also shows that the idea doesn't always work:

```java
Optional.of(LocalDate.now().getDayOfWeek())
        .filter(currentDay -> EnumSet.of(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY)
                                     .contains(currentDay))
        .ifPresentOrElse(
            currentDay -> System.out.println("Enjoy life!"),
            () -> System.out.println("Endure your work day!"));
```
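For comparison, here's a sketch of the procedural counterpart (my reconstruction - the if-based version isn't part of this excerpt, and the class and method names are placeholders):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.EnumSet;

public class WeekendCheck {
    // Same decision as the functional chain above, written as a plain if statement.
    static String message(DayOfWeek currentDay) {
        if (EnumSet.of(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY).contains(currentDay)) {
            return "Enjoy life!";
        }
        return "Endure your work day!";
    }

    public static void main(String[] args) {
        System.out.println(message(LocalDate.now().getDayOfWeek()));
    }
}
```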

But beauty is in the eye of the beholder, so it's not up to me to judge. Nonetheless, I do recommend listening to Stuart Marks's talk. His rules on when to use functional programming - and when not to - resonate with me.

That leaves the topic of performance. When I talk to Java programmers, I often observe that they aren't aware of the performance penalty of Java's approach to functional programming. Both code snippets above do the same thing, so many programmers expect the compiler to emit identical machine code. I've even met programmers who believe functional programming is faster. But it isn't.

Benchmarking performance

The next example is a real-world example, stripped down to its essence. Of course, the original source code doesn't do anything with integers; it's a cache dealing with potentially large objects. Just ignore how artificial this demo looks. I've chosen numbers because the CPU is optimized for dealing with integers, so the difference between the functional and the procedural approach becomes more pronounced.[1] We'll cover a more realistic scenario in a minute.

```java
private static HashMap<Integer, Integer> map = new HashMap<>();
...
private int max100(int i) {
    return Optional.ofNullable(map.get(i))
                   .orElseGet(() -> {
                       map.put(i, 100);
                       return 100;
                   });
}
```
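For reference, a sketch of the benchmark's procedural counterpart (my reconstruction based on the description; the published code on GitHub is authoritative, and the class name is a placeholder):

```java
import java.util.HashMap;

public class ProceduralCache {
    private static HashMap<Integer, Integer> map = new HashMap<>();

    // Same cache lookup as above, with a plain null check instead of Optional.
    static int max100(int i) {
        Integer cached = map.get(i);
        if (cached == null) {
            map.put(i, 100);
            return 100;
        }
        return cached;
    }
}
```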

I've published the source code on GitHub, so you can run the tests yourself and verify my test results or prove them wrong. Compiler technology is a moving target, so it's likely the results will look different in a year or two. We can see this already: there's a considerable difference between GraalVM and OpenJDK, especially during the cold-start phase.

Bird's-eye view

The functional example uses a Lambda function and an object (i.e. Optional). Creating a new object is always expensive. Lambda functions are basically anonymous inner functions. So using a functional algorithm always creates zillions of objects, adding a lot of pressure on the garbage collector.

That's what I thought when I started running the benchmarks. As it turns out, I was wrong. Not entirely wrong, but contemporary Java has pretty neat optimizations. The benchmark results surprised me.

The fact remains that functional programming adds a level of complexity to your algorithm. Probably more than one layer, if you look at how the functional libraries are implemented. That's bad because the just-in-time compiler of the JVM kicks in when a method runs hot. The longer your method is, the later it's compiled. The more nested method calls you've got, the more time it takes to optimize the code. The JIT compiler usually starts with the bottom-most method and works its way up to the top. But it doesn't compile every method at once. Instead, it observes the compiled code and kicks in again when it notices another method running hot. We'll see that in a minute.

At first glance, our example doesn't have many nested method calls. But they are hidden: every Lambda function is a method, too, so calling a Lambda in a stream adds another layer to the call stack.

Note that I didn't test the garbage pressure bit. Strictly speaking, all I can say is that the functional approach is surprisingly fast if you've got plenty of free main memory.

What does the bytecode look like?

The bytecode reveals the nested calls more clearly. The functional approach is a bit longer (54 bytes vs. 42 bytes), and it has a second method called lambda$max100$0. As mentioned above, that's good news because every method is an opportunity to compile native machine code, but it's also bad news because there's more code to compile.
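The listings below can be reproduced with javap, the disassembler that ships with the JDK (the class name is a placeholder for the benchmark class):

```shell
# -c disassembles the bytecode, -p includes private methods such as max100()
javap -c -p Max100Benchmark
```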

```
private int max100(int);
  Code:
     0: getstatic     #20   // Field map:Ljava/util/HashMap;
     3: iload_1
     4: invokestatic  #21   // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
     7: invokevirtual #22   // Method java/util/HashMap.get:(Ljava/lang/Object;)Ljava/lang/Object;
    10: checkcast     #23   // class java/lang/Integer
    13: invokestatic  #24   // Method java/util/Optional.ofNullable:(Ljava/lang/Object;)Ljava/util/Optional;
    16: iload_1
    17: invokedynamic #25,  0  // InvokeDynamic #1:get:(I)Ljava/util/function/Supplier;
    22: invokevirtual #26   // Method java/util/Optional.orElseGet:(Ljava/util/function/Supplier;)Ljava/lang/Object;
    25: checkcast     #23   // class java/lang/Integer
    28: invokevirtual #27   // Method java/lang/Integer.intValue:()I
    31: ireturn

private static java.lang.Integer lambda$max100$0(int);
  Code:
     0: getstatic     #20   // Field map:Ljava/util/HashMap;
     3: iload_0
     4: invokestatic  #21   // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
     7: bipush        100
     9: invokestatic  #21   // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    12: invokevirtual #28   // Method java/util/HashMap.put:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
    15: pop
    16: bipush        100
    18: invokestatic  #21   // Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
    21: areturn
```

There's an interesting bit at offset 17 of the functional code. It uses invokedynamic to generate the Lambda's implementation class at runtime. Each time you define a Lambda, you're effectively creating an inner class, just as we did in the old days before Java 8. In a nutshell, Lambda functions are just syntactic sugar (apart from the invokedynamic bit, which is a more efficient implementation). The inner class is a moderate performance penalty, but only when it's called for the first time. Once it's there, it's reused. Spoiler: the JIT compiler even eliminates it, as we'll see in a minute.
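To illustrate the "old days" remark: before Java 8, the Supplier passed to orElseGet() would have been a hand-written anonymous inner class. A minimal sketch (names are mine, and the supplier is simplified to a constant):

```java
import java.util.function.Supplier;

public class OldSchoolSupplier {
    // Pre-Java-8 equivalent of the Lambda () -> 100: an anonymous inner class
    // spelled out by hand. The compiler emitted a class file for it; with
    // Lambdas, invokedynamic defers the class generation to runtime.
    static Supplier<Integer> hundred = new Supplier<Integer>() {
        @Override
        public Integer get() {
            return 100;
        }
    };
}
```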

If you're curious about how Lambdas are implemented and why the Java team chose this solution, have a look at this talk by Brian Goetz.

Running the benchmark

Let's have a look at the test results. I've used AdoptOpenJDK 11.0.5. For some reason, JDK 11 is faster than JDK 14, so I deliberately chose the LTS version instead of the bleeding-edge JDK. For the sake of completeness, I've also added the test results for GraalVM.

```
Running the benchmark with a cold JVM
Time elapsed: 10.539 ms
Warmup phase
Test phase
Time elapsed: 0.011 ms
```

The benchmark has two phases. First, it runs the algorithm 1000 times immediately. That's the "cold start" test. The JVM has just powered up, so it hasn't had an opportunity to compile the source code yet. One thousand repetitions are enough to activate the JIT compiler, so it's not a 100% cold-start test. Rather, it shows the performance of an algorithm that runs every once in a while, but not frequently enough for the JIT compiler to optimize it fully.

The cold start test shows a remarkable performance difference. The functional approach is roughly 20 times slower than the procedural approach.

The warm-up phase runs the benchmark for 10 to 20 seconds. That's plenty of time for the JIT compiler. A longer warm-up period doesn't change much, so it's probably enough time to run every available optimization.
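The phases can be sketched like this (my reconstruction - the published benchmark on GitHub is more elaborate, and the class name and repetition counts are placeholders):

```java
import java.util.HashMap;
import java.util.Optional;

public class ColdWarmBenchmark {
    private static HashMap<Integer, Integer> map = new HashMap<>();

    static int max100(int i) {
        return Optional.ofNullable(map.get(i))
                       .orElseGet(() -> { map.put(i, 100); return 100; });
    }

    // Times one batch of 1000 calls, in milliseconds.
    static double elapsedMillis() {
        long start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            max100(i % 100);
        }
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Phase 1: the JVM has just powered up.
        System.out.printf("cold JVM: %.3f ms%n", elapsedMillis());

        // Phase 2: hammer the method for ~20 seconds so the JIT compiler kicks in.
        long warmupEnd = System.currentTimeMillis() + 20_000;
        while (System.currentTimeMillis() < warmupEnd) {
            elapsedMillis();
        }

        // Phase 3: measure again on a hot JVM.
        System.out.printf("after warm-up: %.3f ms%n", elapsedMillis());
    }
}
```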

The speedup is impressive. The functional approach runs 1000 times faster. In contrast, the procedural approach gets merely 40 times faster - but only because it's so much quicker when the JVM is cold. After warm-up, the performance of the functional algorithm and the procedural algorithm is almost on par. The numbers vary between runs, but usually, the functional approach is between 20% and 50% slower.

GraalVM manages to run the Optional.of() version of the code a lot faster during the cold-start phase, but it still suffers a factor-of-five performance penalty. In the long run, the results are pretty similar, so I'll focus on AdoptOpenJDK for the rest of this article.

Early recommendations

At this point, we can already draw some conclusions.

It hardly ever pays to optimize the last 20% of performance. Usually, you can gain much more by adopting a different algorithm. Even if you need the performance boost, beware. Compiler technology continues to evolve. The 20% performance boost of today may become a performance penalty tomorrow.

Does this mean you can choose freely between if and Optional.of()?

Probably not. Real-world examples are usually much more convoluted than our benchmark. There are more nested calls, the methods are larger, and so on. As a result, the JIT compiler needs more time to kick in. Depending on your code, the cold-start phase may become dominant. And we've seen the vast performance penalty there.

Watching the JIT compiler performing its magic

To back up this claim, I've collected the output of the compiler. You can access the raw data on GitHub. If you're interested in the gory details, have a look at a short description of the log file format, or read about it in much more detail here.
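The relevant HotSpot flags, in case you want to collect the same data yourself (the jar name is a placeholder; the flags are standard HotSpot options):

```shell
# -XX:+PrintCompilation logs one line per compilation event to stdout;
# -XX:+LogCompilation writes the detailed XML log that JITWatch can read.
java -XX:+PrintCompilation \
     -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation \
     -jar benchmark.jar > compilation.txt
```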

The compiler protocol reveals that most of the compilation occurs during the first test, which was intended to be the cold-start test. Obviously, I fell short of that goal: at the end of the cold-start test, the JVM is already hot.

However, you can see that the optimization goes on during the warm-up phase. The functional log counts 20 compilations, as opposed to 12 compilations of the procedural approach. You can also see that the functional code compiles 65 times during the cold-start test, four compilations more than its procedural counterpart.

JITWatch shows in a nice graphic when the compiler kicks in. The two graphics are remarkably similar. Still, the procedural example counts 20 fewer compilation events, and more importantly, its optimizer stops roughly two seconds earlier. After roughly twenty seconds of busy optimization, the code has reached a decent level, and the JIT compiler becomes mostly idle:

The diagram shows all the compilation events, no matter what is compiled. Most of these events compile internal Java methods, such as the implementation of the HashMap. Most methods are compiled multiple times: the C1 compiler generates moderately efficient code quickly, while the C2 compiler emits highly optimized code slowly, and each time the optimizer kicks in, the code is recompiled. If you're interested, read my article about the two compilers or the article on HotSpot JVM optimization by Christophe Hesters.

The green and yellow labels specifically refer to our max100() method. The procedural algorithm is compiled twice; the functional algorithm, three times. The label of the second C1 compilation is almost hidden behind the C2 compilation.

For some reason, JITWatch counts more than 300 compilations. It seems that the log file emitted by -XX:+PrintCompilation contains only a selection of the compilation events. This also shows in the ids of the log entries (the numbers in the first column).

Assembly code analysis

JITWatch also allows us to have a look at the Assembly code. That's a lot of information, so I'll pick just a few interesting details. The first such detail is the size of the Assembly code: the Optional.of() approach requires 800 bytes - or 25% - more than the if statement.

I've mentioned the invokedynamic statement above. JITWatch displays it crossed out. Unfortunately, the tooltip didn't make it into the screenshot, which is a pity: it explains that the bytecode has been removed during inlining, backing my claim that using an anonymous inner class to implement the Lambda doesn't have much of a performance impact in the long run.

The second tab shows the Lambda itself. There's bytecode, but there's no Assembly code. That's probably because the bytecode has been inlined before being compiled to Assembly code. In other words: the Lambda function is eliminated, and only the core implementation remains. The "compile chain" view of JITWatch confirms this theory:

I had to squash the graph to fit everything on a single page - sorry about that, now it really looks ugly! - so you'll want to view the images in full-screen mode. The graph shows that everything is inlined and compiled as a single method. That's why our tiny method ends up with 3200 to 4000 bytes of Assembly code: it contains the entire implementation of HashMap.get() and HashMap.put().

In particular, the overhead of calling a Lambda function is eliminated. After 20 seconds of optimization, there's no Lambda function left to call because it has been inlined.

The procedural algorithm is still a bit faster because it's more straightforward. The compilation chain graph shows that, too: less code to inline amounts to fewer instructions to run.

A more realistic example

I've prepared a second example for you. At first glance, the results are much less conclusive. It doesn't seem to matter whether you use an if statement to check if the string at hand is a valid zip code, or whether you prefer the Optional road:

```java
private boolean isZipCodeValid(String zipCode) {
    return Optional.ofNullable(zipCode)
                   .filter(Predicate.not(Strings::isNullOrEmpty))
                   .filter(s -> s.length() == 5)
                   .filter(s -> s.matches("[0-9]+"))
                   .or(() -> {
                       log.error("invalid zip code: {}", zipCode);
                       return Optional.empty();
                   })
                   .isPresent();
}
```
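The if-based counterpart mentioned above might look like this (a sketch; the original logs through SLF4J, so System.err stands in to keep the example self-contained):

```java
public class ZipCodeValidator {
    static boolean isZipCodeValid(String zipCode) {
        // The null/empty and length checks of the functional version
        // collapse into a single condition here.
        if (zipCode == null || zipCode.length() != 5 || !zipCode.matches("[0-9]+")) {
            System.err.println("invalid zip code: " + zipCode);
            return false;
        }
        return true;
    }
}
```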

The procedural version starts a bit faster, but it even seems to be a bit slower in the long run. The JIT compiler never ceases to surprise me.

Here's the catch: if you're really concerned about performance, you can optimize the procedural approach easily. This version gives you a factor ten performance boost:

```java
private boolean isZipCodeValid(String zipCode) {
    boolean valid = zipCode != null;
    valid = valid && zipCode.length() == 5;
    valid = valid && Character.isDigit(zipCode.charAt(0));
    valid = valid && Character.isDigit(zipCode.charAt(1));
    valid = valid && Character.isDigit(zipCode.charAt(2));
    valid = valid && Character.isDigit(zipCode.charAt(3));
    valid = valid && Character.isDigit(zipCode.charAt(4));
    if (!valid) {
        log.error("invalid zip code: {}", zipCode);
    }
    return valid;
}
```

But they said functional programming is faster!

So why do so many developers even think functional programming is faster than procedural programming?

I can only guess, but there are two potential reasons. First, when Mark Reinhold and Brian Goetz announced that Lambdas would be added to Java, they mentioned the performance boost of running a forEach() in parallel on multiple CPU cores.

Second, functional programming avoids side effects. That, in turn, usually makes it easier to optimize for performance: multi-threading loses a lot of its sting when you embrace functional programming. Basically, it's using the immutability pattern in a much more natural way.
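A sketch of what they meant: a side-effect-free computation can be spread across cores just by switching a stream to parallel mode (the numbers and the method chain are my illustration, not from the announcement):

```java
import java.util.stream.LongStream;

public class ParallelDemo {
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                         .parallel()        // fan out over the common ForkJoinPool
                         .map(i -> i * i)   // pure function: no shared mutable state
                         .sum();
    }
}
```

Because the mapping function has no side effects, the runtime is free to split the range across threads and combine the partial sums in any order.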

Wrapping it up

To us human beings, adopting functional programming is climbing one step up the ladder of abstractions. More often than not, that helps us to understand the source code. Of course, you can overdo it, but that's another day's story.

From your CPU's point of view, things look a lot worse. Replacing a for or an if with its functional counterpart means climbing the abstraction ladder several steps at once. More to the point, functional programming - and the Java implementation of functional programming in particular - means that method calls are nested more deeply.

The JIT compiler optimizes at the method level, and it only optimizes the most promising methods: those that have run hot by being called time and again. Short methods are compiled earlier; the longer a method gets, the less likely it is to be optimized.

So I expected a performance impact of functional programming. To my surprise, it's there, but almost exclusively during the cold start phase. The JIT compiler optimizes most of the overhead introduced by Lambda functions away. There's always a little performance penalty because the algorithm is usually a bit more convoluted, but in the long run, it's a close result.

Real-world programs usually involve slow stuff like database or file access, so you're free to choose your preferred approach without worrying about performance. But you get the idea: there are many good reasons to embrace functional programming, but performance isn't one of them.


Dig deeper

Optional - The Mother of All Bikesheds by Stuart Marks

HotSpot JVM optimization by Christophe Hesters

Hotspot profiling with JITWatch by Chris Newland

Detailed description of the Hotspot compilation log

Understanding Java JIT with JITWatch

Implementing lambda expressions in Java by Brian Goetz


  1. Repeating the same test with strings and a deliberately convoluted caching algorithm revealed pretty much the same result.↩
