Isn't it funny that I start writing a series of articles about Java 8 Lambdas in 2018? Mind you: my first article dates back to April 2012, six years ago. I fell in love with functional programming even earlier, roughly in 2008, ten years ago. From this perspective, functional programming is nothing new to me. I'm using it on a daily basis when I'm writing TypeScript code. As things go, I work in a very conservative environment, so I know a lot of projects that adopted Java 8 only last year. And I'm bewildered by what programmers make of it - and why. It's high time for a reality check. Java's Lambdas and method handles are great, no doubt about that. Streams are great, too. Sometimes. More often than you might think, they are not.
If you convert a for loop to a Java 8 stream just because your IntelliJ IDE suggests it ("and they know what they're doing!") - well, that's the wrong reason, as Lukas Eder pointed out three years ago. If you do it because it improves the coding style, please abandon Java in favor of a programming language truly supporting functional programming. There are plenty of such languages in the Java universe. Java itself forces you to convert arrays to streams just to make functional programming available, and it also forces you to unwrap Optionals and to convert streams back to Lists all the time. There's a lot of boiler-plate code you just have to accept. There are good reasons for the boiler-plate code, but even so, other languages achieve the same goal with a lot less ceremony. Maybe they sacrifice some power or flexibility, but they gain a lot of simplicity and maintainability in return.
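To give a concrete idea of that ceremony, here's a minimal sketch (the class name is mine, not part of the benchmark) showing the conversions Java demands for even a trivial sum:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class CeremonyDemo {
    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3 };

        // An int[] can't be used as a Stream<Integer> directly:
        // every primitive has to be boxed first.
        List<Integer> asList = Arrays.stream(numbers)
                                     .boxed()
                                     .collect(Collectors.toList());

        // reduce() returns an Optional, which we have to unwrap
        // even if we know the list is never empty.
        Optional<Integer> sum = asList.stream().reduce((a, b) -> a + b);
        System.out.println(sum.orElse(0)); // prints 6
    }
}
```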
Mind you: most people advocate functional programming because it allows you to get rid of the boiler-plate code of procedural programming. Java, in turn, encourages a functional programming style that comes with a lot of ceremony.
'Nuff said. This article is about performance. Maintainability and conciseness were the topic of an earlier article of mine.
So I wrote a small benchmark simulating what happens when programmers convert their for loops to functional programming, sometimes for the sole reason that their IDE suggests it.
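This is the kind of conversion I mean - a sketch (class name invented for illustration) of the procedural loop next to the one-liner the IDE offers as a "modern" replacement:

```java
import java.util.Arrays;
import java.util.List;

public class ConversionDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // The classic procedural version...
        int sum = 0;
        for (int i = 0; i < numbers.size(); i++) {
            sum += numbers.get(i);
        }

        // ...and the stream version the IDE suggests instead.
        int streamSum = numbers.stream().mapToInt(Integer::intValue).sum();

        System.out.println(sum == streamSum); // prints true
    }
}
```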
This benchmark uses the JMH framework, which is a great choice to run small benchmarks without falling into the many pitfalls of microbenchmarks. One of these pitfalls is that the JVM optimizer recognizes we're not really doing anything with the data and that the data never changes. More often than not, the JVM simply eliminates the code we'd like to test because it considers it dead code.
To avoid this, we put the numbers we want to add to a state object. Basically, that's just boiler-plate code preparing the real test:
[sourcecode Java]
@State(Scope.Thread)
public static class MyState {
    public int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };
    public List<Integer> numberList = new ArrayList<Integer>() {{
        // rangeClosed includes the upper bound, so the list matches the array
        IntStream.rangeClosed(1, 20).forEach(n -> add(n));
    }};
}
[/sourcecode]
As you can see, there's an array and an ArrayList, both containing the same set of numbers. In Java, object streams are usually created from a Collection, so I implemented the two algorithms using an ArrayList in order to compare like with like. However, I was also interested in how much the algorithm can be optimized by using a simple array.
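As an aside, a primitive array can be summed via IntStream without any boxing at all - a minimal sketch (class name invented for illustration), using the same twenty numbers as the state object:

```java
import java.util.Arrays;

public class ArraySumDemo {
    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                          11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };

        // Arrays.stream(int[]) yields an IntStream, so no Integer
        // boxing is involved - unlike the ArrayList variant.
        int sum = Arrays.stream(numbers).sum();

        System.out.println(sum); // prints 210
    }
}
```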
Now for the first two attempts to calculate the sum. The first method uses the procedural style; the second method employs a fancy functional reduce method:
[sourcecode Java]
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public int testMethodProcedurally(MyState state) {
    int sum = 0;
    for (int i = 0; i < state.numberList.size(); i++) {
        sum += state.numberList.get(i);
    }
    return sum;
}

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public int testMethodWithStreams(MyState state) {
    return state.numberList.stream().reduce((a, b) -> a + b).get();
}
[/sourcecode]
Under the hood, Lambda expressions rely on the invokedynamic instruction introduced with Java 8, on method handles, and even on creating classes at runtime. To verify the latter claim, I deliberately caused an exception in the Lambda expression:
[sourcecode Java]
public int testMethodWithStreams(MyState state) {
    return state.numberList.stream()
                .reduce((a, b) -> { int crash = 0 / 0; return a + b; })
                .get();
}
[/sourcecode]
The stack trace of the resulting exception looks suspiciously as if it were caused by an inner class:
[sourcecode]
Exception in thread "main" java.lang.ArithmeticException: / by zero
at de.beyondJava.performance.MyBenchmark.lambda$0(MyBenchmark.java:78)
at java.util.stream.ReduceOps$2ReducingSink.accept(ReduceOps.java:123)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1380)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:479)
at de.beyondJava.performance.MyBenchmark.testMethodWithStreams(MyBenchmark.java:78)
at de.beyondJava.performance.MyBenchmark.mainTest(MyBenchmark.java:113)
at de.beyondJava.performance.MyBenchmark.main(MyBenchmark.java:100)
[/sourcecode]
Looking from that angle, the performance penalty doesn't come as a surprise. The JVM generates a new class on the fly when the Lambda is called for the first time; after that, this class can be reused. That's why so many resources on the internet report a giant performance penalty if Lambdas are only used a few times: using a Lambda once is expensive, indeed.
As soon as the JIT compiler kicks in, the performance penalty caused by Lambdas themselves mostly vanishes.
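A naive way to observe that one-time cost (class name invented, and timings are illustrative only - this is exactly the kind of hand-rolled measurement JMH exists to replace):

```java
import java.util.function.IntBinaryOperator;

public class FirstCallDemo {
    public static void main(String[] args) {
        long before = System.nanoTime();
        // Evaluating the first Lambda of this shape triggers
        // class generation and linkage at runtime.
        IntBinaryOperator add = (a, b) -> a + b;
        int first = add.applyAsInt(1, 2);
        long firstCall = System.nanoTime() - before;

        before = System.nanoTime();
        // Subsequent calls reuse the generated class.
        int second = add.applyAsInt(3, 4);
        long secondCall = System.nanoTime() - before;

        System.out.println("first: " + firstCall + " ns, second: " + secondCall + " ns");
    }
}
```

On a typical run, the first call is orders of magnitude slower than the second one.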
I watched the JIT compiler at work using three JVM options. -Xbatch switches off a great feature of the JVM: by default, the JVM runs the JIT compiler in a separate thread. Unfortunately, this causes log messages to arrive in parallel, making the output hard to read.

-XX:-TieredCompilation forces the JIT compiler to optimize the assembly code as much as possible right from the start. Usually, the JVM compiles a method in up to five tiers. In general, this is a great compromise between quick startup time of your application and maximum performance of the parts of the application that are really used frequently. However, in this article we're interested in the final result of all the optimizations, so we opt to use the C2 compiler directly.
Finally, -XX:+PrintCompilation prints a line each time a method is compiled to assembly code. I've also examined the assembly code itself, but I'll spare you that because it's surprisingly long. Suffice it to say that the functional approach takes roughly 900 lines of assembly code, the procedural algorithm using the ArrayList takes roughly 333 lines, and the algorithm using a simple array takes less than 120 lines.
In short, I ran the benchmark with -Xbatch -XX:-TieredCompilation -XX:+PrintCompilation.
Among other things, this reveals which methods are used frequently. Omitting the methods compiled before our method under test is compiled, the functional approach compiles these methods:
[sourcecode]
231 68 b de.beyondJava.performance.MyBenchmark::testMethodWithStreams (29 bytes)
237 69 b java.util.Collection::stream (11 bytes)
238 70 b java.util.ArrayList::spliterator (12 bytes)
239 71 b java.util.ArrayList$ArrayListSpliterator::<init> (26 bytes)
239 72 b java.util.stream.StreamSupport::stream (19 bytes)
241 73 b java.util.stream.ReferencePipeline$Head::<init> (8 bytes)
241 74 b java.util.stream.ReferencePipeline::<init> (8 bytes)
241 75 b java.lang.invoke.LambdaForm$MH/250421012::linkToTargetMethod (8 bytes)
242 76 b java.lang.invoke.LambdaForm$MH/455659002::identity_L (8 bytes)
242 77 b java.util.stream.ReferencePipeline::reduce (12 bytes)
247 78 b java.util.stream.ReduceOps::makeRef (17 bytes)
248 79 b java.util.stream.ReduceOps$2::<init> (11 bytes)
248 80 b java.util.stream.ReduceOps$ReduceOp::<init> (10 bytes)
249 81 b java.util.stream.AbstractPipeline::evaluate (94 bytes)
252 82 b java.util.stream.TerminalOp::getOpFlags (2 bytes)
252 83 b java.util.stream.AbstractPipeline::sourceSpliterator (265 bytes)
253 84 b java.util.stream.ReduceOps$ReduceOp::evaluateSequential (18 bytes)
256 85 b java.util.stream.ReduceOps$2::makeSink (5 bytes)
256 86 b java.util.stream.ReduceOps$2::makeSink (12 bytes)
257 87 b java.util.stream.ReduceOps$2ReducingSink::<init> (10 bytes)
257 88 b java.util.stream.AbstractPipeline::wrapAndCopyInto (18 bytes)
259 89 b java.util.stream.AbstractPipeline::wrapSink (37 bytes)
259 90 b java.util.stream.AbstractPipeline::copyInto (53 bytes)
261 91 b java.util.stream.AbstractPipeline::getStreamAndOpFlags (5 bytes)
261 92 b java.util.stream.StreamOpFlag::isKnown (19 bytes)
262 93 b java.util.Spliterator::getExactSizeIfKnown (25 bytes)
262 94 b java.util.ArrayList$ArrayListSpliterator::estimateSize (11 bytes)
263 95 b java.util.ArrayList$ArrayListSpliterator::getFence (48 bytes)
263 96 b java.util.stream.ReduceOps$2ReducingSink::begin (11 bytes)
263 97 b java.util.stream.Sink::end (1 bytes)
264 98 b java.util.stream.ReduceOps$2ReducingSink::get (5 bytes)
264 99 b java.util.stream.ReduceOps$2ReducingSink::get (21 bytes)
265 100 b java.util.Optional::of (9 bytes)
266 101 b java.util.Optional::<init> (13 bytes)
266 102 b java.util.Optional::get (22 bytes)
[/sourcecode]
In the case of the procedural approach, only a single method is compiled:
[sourcecode]
183 60 b de.beyondJava.performance.MyBenchmark::testMethodProcedurally (44 bytes)
[/sourcecode]
I wonder why ArrayList.get() is not part of the list. I can only guess that this method was inlined before the JIT compiler kicked in. Be that as it may, the list of methods to be compiled is a lot shorter with the procedural approach than with the functional one.
The JIT compiler even manages to unroll the loop of the procedural ArrayList version (although a lot more code is required, so it's not the ideal choice for showing it in a blog). It's much more difficult to do the same trick for a stream: I didn't find any trace of loop unrolling in the assembly code of the stream algorithm.
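To illustrate what loop unrolling means, here's a hand-unrolled version of the sum in Java (class and method names invented for illustration) - roughly the transformation the JIT compiler performs at the machine-code level:

```java
public class UnrollDemo {
    static int sumUnrolled(int[] numbers) {
        // Four independent accumulators let the CPU execute
        // the additions in parallel (out-of-order execution).
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < numbers.length; i += 4) {
            s0 += numbers[i];
            s1 += numbers[i + 1];
            s2 += numbers[i + 2];
            s3 += numbers[i + 3];
        }
        // Remainder loop for lengths that aren't a multiple of four.
        for (; i < numbers.length; i++) {
            s0 += numbers[i];
        }
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        System.out.println(sumUnrolled(numbers)); // prints 55
    }
}
```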
By the way, you may be puzzled by the movslq instruction, which sign-extends the 32-bit content of register r8 into r9; the loop then continues adding using r9 instead of r8. I can only guess, but I think it has something to do with out-of-order execution: r8 isn't used afterwards, so the CPU can execute the line add $0x4,%r8d without having to wait for the three other additions to finish. Branch prediction even allows it to prepare the next loop iteration.