Isn't it funny that I start writing a series of articles about Java 8 Lambdas in 2018? Mind you: my first article dates back to April 2012, six years ago. I fell in love with functional programming even earlier, roughly in 2008, ten years ago. From this perspective, functional programming is nothing new to me. I'm using it on a daily basis when I'm writing TypeScript code. As things go, I work in a very conservative environment, so I know a lot of projects that adopted Java 8 only last year. And I'm bewildered by what programmers make of it - and why. It's high time for a reality check. Java's Lambdas and method handles are great, no doubt about that. Streams are great, too. Sometimes. More often than you might think, they are not.
If you convert a for loop to a Java 8 stream just because your IntelliJ IDE suggests it ("and they know what they're doing!") - well, that's the wrong reason, as Lukas Eder pointed out three years ago. If you do it because it improves the coding style, please abandon Java in favor of a programming language truly supporting functional programming. There are plenty of such languages in the Java universe. Java itself forces you to convert arrays to streams just to make functional programming available, and it also forces you to unwrap Optionals and to convert streams back to Lists all the time. There's a lot of boiler-plate code you just have to accept. There are good reasons for the boiler-plate code, but even so, other languages achieve the same goal with a lot less ceremony. Maybe they sacrifice some power or flexibility, but they gain a lot of simplicity and maintainability in return.
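To give a concrete idea of that ceremony, here's a minimal sketch (the class name is mine, not part of the benchmark) showing the conversions Java demands for even a trivial sum:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class CeremonyDemo {
    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3 };

        // An int[] can't be used as a Stream<Integer> directly:
        // every primitive has to be boxed first.
        List<Integer> asList = Arrays.stream(numbers)
                                     .boxed()
                                     .collect(Collectors.toList());

        // reduce() returns an Optional, which we have to unwrap
        // even if we know the list is never empty.
        Optional<Integer> sum = asList.stream().reduce((a, b) -> a + b);
        System.out.println(sum.orElse(0)); // prints 6
    }
}
```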
Mind you: most people advocate functional programming because it allows you to get rid of the boiler-plate code of procedural programming. Java, in turn, encourages a functional programming style that comes with a lot of ceremony.
'Nuff said. This article is about performance. Maintainability and conciseness were the topic of an earlier article of mine.
So I wrote a small benchmark simulating what happens when programmers convert their for loops to functional programming, sometimes for the sole reason that their IDE suggests it.
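This is the kind of conversion I mean - a sketch (class name invented for illustration) of the procedural loop next to the one-liner the IDE offers as a "modern" replacement:

```java
import java.util.Arrays;
import java.util.List;

public class ConversionDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // The classic procedural version...
        int sum = 0;
        for (int i = 0; i < numbers.size(); i++) {
            sum += numbers.get(i);
        }

        // ...and the stream version the IDE suggests instead.
        int streamSum = numbers.stream().mapToInt(Integer::intValue).sum();

        System.out.println(sum == streamSum); // prints true
    }
}
```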
This benchmark uses the JMH framework, which is a great choice to run small benchmarks without falling into the many pitfalls of microbenchmarks. One of these pitfalls is that the JVM optimizer recognizes we're not really doing anything with the data and that the data never changes. More often than not, the JVM simply eliminates the code we'd like to test because it considers it dead code.
To avoid this, we put the numbers we want to add to a state object. Basically, that's just boiler-plate code preparing the real test:
[sourcecode Java]
@State(Scope.Thread)
public static class MyState {
    public int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };
    public List<Integer> numberList = new ArrayList<Integer>() {{
        // rangeClosed includes the upper bound, so the list matches the array
        IntStream.rangeClosed(1, 20).forEach(n -> add(n));
    }};
}
[/sourcecode]
As you can see, there's an array and an ArrayList, both containing the same set of numbers. In Java, object streams are usually created from a Collection, so I implemented the two algorithms using an ArrayList in order to compare like with like. However, I was also interested in how much the algorithm can be optimized by using a simple array.
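As an aside, a primitive array can be summed via IntStream without any boxing at all - a minimal sketch (class name invented for illustration), using the same twenty numbers as the state object:

```java
import java.util.Arrays;

public class ArraySumDemo {
    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                          11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };

        // Arrays.stream(int[]) yields an IntStream, so no Integer
        // boxing is involved - unlike the ArrayList variant.
        int sum = Arrays.stream(numbers).sum();

        System.out.println(sum); // prints 210
    }
}
```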
Now for the first two attempts to calculate the sum. The first method uses the procedural style; the second method employs a fancy functional reduce method:
[sourcecode Java]
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public int testMethodProcedurally(MyState state) {
    int sum = 0;
    for (int i = 0; i < state.numberList.size(); i++) {
        sum += state.numberList.get(i);
    }
    return sum;
}

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public int testMethodWithStreams(MyState state) {
    return state.numberList.stream().reduce((a, b) -> a + b).get();
}
[/sourcecode]
Under the hood, Lambda expressions rely on the invokedynamic instruction introduced with Java 8, on method handles, and even on creating classes at runtime. To verify the latter claim, I deliberately caused an exception in the Lambda expression:
[sourcecode Java]
public int testMethodWithStreams(MyState state) {
    return state.numberList.stream()
                .reduce((a, b) -> { int crash = 0 / 0; return a + b; })
                .get();
}
[/sourcecode]
The stack trace of the resulting exception looks suspiciously as if it were caused by an inner class:
[sourcecode]
Exception in thread "main" java.lang.ArithmeticException: / by zero
at de.beyondJava.performance.MyBenchmark.lambda$0(MyBenchmark.java:78)
at java.util.stream.ReduceOps$2ReducingSink.accept(ReduceOps.java:123)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1380)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:479)
at de.beyondJava.performance.MyBenchmark.testMethodWithStreams(MyBenchmark.java:78)
at de.beyondJava.performance.MyBenchmark.mainTest(MyBenchmark.java:113)
at de.beyondJava.performance.MyBenchmark.main(MyBenchmark.java:100)
[/sourcecode]
Looking from that angle, the performance penalty doesn't come as a surprise. The JVM generates a new class on the fly when the Lambda is called for the first time; after that, this class can be reused. That's why so many resources on the internet report a giant performance penalty if Lambdas are only used a few times: using a Lambda once is expensive, indeed.
As soon as the JIT compiler kicks in, the performance penalty caused by Lambdas themselves mostly vanishes.
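A naive way to observe that one-time cost (class name invented, and timings are illustrative only - this is exactly the kind of hand-rolled measurement JMH exists to replace):

```java
import java.util.function.IntBinaryOperator;

public class FirstCallDemo {
    public static void main(String[] args) {
        long before = System.nanoTime();
        // Evaluating the first Lambda of this shape triggers
        // class generation and linkage at runtime.
        IntBinaryOperator add = (a, b) -> a + b;
        int first = add.applyAsInt(1, 2);
        long firstCall = System.nanoTime() - before;

        before = System.nanoTime();
        // Subsequent calls reuse the generated class.
        int second = add.applyAsInt(3, 4);
        long secondCall = System.nanoTime() - before;

        System.out.println("first: " + firstCall + " ns, second: " + secondCall + " ns");
    }
}
```

On a typical run, the first call is orders of magnitude slower than the second one.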
I watched the JIT compiler at work using three JVM options. -Xbatch switches off a great feature of the JVM: by default, the JVM runs the JIT compiler in a separate thread. Unfortunately, this causes log messages to arrive in parallel, making the output hard to read.

-XX:-TieredCompilation forces the JIT compiler to optimize the assembly code as much as possible right from the start. Usually, the JVM compiles a method in up to five tiers. In general, this is a great compromise between quick startup time of your application and maximum performance of the parts of the application that are really used frequently. However, in this article we're interested in the final result of all the optimizations, so we opt to use the C2 compiler directly.
Finally, -XX:+PrintCompilation prints a line each time a method is compiled to assembly code. I've also examined the assembly code itself, but I'll spare you that because it's surprisingly long. Suffice it to say that the functional approach takes roughly 900 lines of assembly code, the procedural algorithm using the ArrayList takes roughly 333 lines, and the algorithm using a simple array takes less than 120 lines.
In short, I ran the benchmark with -Xbatch -XX:-TieredCompilation -XX:+PrintCompilation.
Among other things, this reveals which methods are used frequently. Omitting the methods compiled before our method under test is compiled, the functional approach compiles these methods:
[sourcecode]
231 68 b de.beyondJava.performance.MyBenchmark::testMethodWithStreams (29 bytes)
237 69 b java.util.Collection::stream (11 bytes)
238 70 b java.util.ArrayList::spliterator (12 bytes)
239 71 b java.util.ArrayList$ArrayListSpliterator::<init> (26 bytes)
239 72 b java.util.stream.StreamSupport::stream (19 bytes)
241 73 b java.util.stream.ReferencePipeline$Head::<init> (8 bytes)
241 74 b java.util.stream.ReferencePipeline::<init> (8 bytes)
241 75 b java.lang.invoke.LambdaForm$MH/250421012::linkToTargetMethod (8 bytes)
242 76 b java.lang.invoke.LambdaForm$MH/455659002::identity_L (8 bytes)
242 77 b java.util.stream.ReferencePipeline::reduce (12 bytes)
247 78 b java.util.stream.ReduceOps::makeRef (17 bytes)
248 79 b java.util.stream.ReduceOps$2::<init> (11 bytes)
248 80 b java.util.stream.ReduceOps$ReduceOp::<init> (10 bytes)
249 81 b java.util.stream.AbstractPipeline::evaluate (94 bytes)
252 82 b java.util.stream.TerminalOp::getOpFlags (2 bytes)
252 83 b java.util.stream.AbstractPipeline::sourceSpliterator (265 bytes)
253 84 b java.util.stream.ReduceOps$ReduceOp::evaluateSequential (18 bytes)
256 85 b java.util.stream.ReduceOps$2::makeSink (5 bytes)
256 86 b java.util.stream.ReduceOps$2::makeSink (12 bytes)
257 87 b java.util.stream.ReduceOps$2ReducingSink::<init> (10 bytes)
257 88 b java.util.stream.AbstractPipeline::wrapAndCopyInto (18 bytes)
259 89 b java.util.stream.AbstractPipeline::wrapSink (37 bytes)
259 90 b java.util.stream.AbstractPipeline::copyInto (53 bytes)
261 91 b java.util.stream.AbstractPipeline::getStreamAndOpFlags (5 bytes)
261 92 b java.util.stream.StreamOpFlag::isKnown (19 bytes)
262 93 b java.util.Spliterator::getExactSizeIfKnown (25 bytes)
262 94 b java.util.ArrayList$ArrayListSpliterator::estimateSize (11 bytes)
263 95 b java.util.ArrayList$ArrayListSpliterator::getFence (48 bytes)
263 96 b java.util.stream.ReduceOps$2ReducingSink::begin (11 bytes)
263 97 b java.util.stream.Sink::end (1 bytes)
264 98 b java.util.stream.ReduceOps$2ReducingSink::get (5 bytes)
264 99 b java.util.stream.ReduceOps$2ReducingSink::get (21 bytes)
265 100 b java.util.Optional::of (9 bytes)
266 101 b java.util.Optional::<init> (13 bytes)
266 102 b java.util.Optional::get (22 bytes)
[/sourcecode]
In the case of the procedural approach, only a single method is compiled:
[sourcecode]
183 60 b de.beyondJava.performance.MyBenchmark::testMethodProcedurally (44 bytes)
[/sourcecode]
I wonder why ArrayList.get() is not part of the list. I can only guess that this method was inlined before the JIT compiler kicked in. Be that as it may, the list of methods to be compiled is a lot shorter with the procedural approach than with the functional one.
The JIT compiler even manages to unroll the loop of the procedural ArrayList version (although a lot more code is required, so it's not the ideal choice for showing it in a blog). It's much more difficult to do the same trick for a stream: I didn't find any trace of loop unrolling in the assembly code of the stream algorithm.
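To illustrate what loop unrolling means, here's a hand-unrolled version of the sum in Java (class and method names invented for illustration) - roughly the transformation the JIT compiler performs at the machine-code level:

```java
public class UnrollDemo {
    static int sumUnrolled(int[] numbers) {
        // Four independent accumulators let the CPU execute
        // the additions in parallel (out-of-order execution).
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < numbers.length; i += 4) {
            s0 += numbers[i];
            s1 += numbers[i + 1];
            s2 += numbers[i + 2];
            s3 += numbers[i + 3];
        }
        // Remainder loop for lengths that aren't a multiple of four.
        for (; i < numbers.length; i++) {
            s0 += numbers[i];
        }
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        System.out.println(sumUnrolled(numbers)); // prints 55
    }
}
```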
By the way, you may be puzzled by the movslq instruction, which sign-extends the 32-bit content of register r8 into r9; the loop then continues adding using r9 instead of r8. I can only guess, but I think it has something to do with out-of-order execution: r8 isn't used afterwards, so the CPU can execute the line add $0x4,%r8d without having to wait for the three other additions to finish. Branch prediction even allows it to prepare the next loop iteration.