Java 8: Major Speed Boost by Overhauled String API

Posted on April 1st, 2014 [1] [2] in fun, Java 8, Performance

Java 8 has been out for a week or two now, and we have all been puzzled by the tremendous speed improvements of Oracle’s newest coup. String operations in particular have improved a lot. It’s time to look into the guts of Java and analyze what’s going on.

Along the way, I also found out why Pattern.compile() has been deprecated in Java 8. Please stop using it. It carries a big performance penalty.

It took me a while to find out. Only when I looked into the byte code generated by Java 8 did I realize what had happened. Oracle has overhauled its String implementation. Actually, they have rewritten it from scratch. They needed to retain backward compatibility, so it wasn’t that easy to see the trick: they abandoned 16-bit Unicode, replacing it with a more modern encoding. A major step towards a more memory-efficient design! The Java community has been longing for this move for ages.

The currently valid version of Unicode dates from 1996. It clearly shows signs of age. 1996 was the era of Pentium processors. The majority of desktop PCs already ran on 32-bit systems. However, the history of Unicode reaches back to 1991, and the first draft of a universal character set was written in 1988. 16-bit processors were still abundant in many branches of the industry at the time. The Unicode standardization group simply couldn’t afford to ignore such a big section of the market, so they opted for backward compatibility and defined a 16-bit charset.

In a way, the decision looked a bit odd even back in 1996, as 16-bit processors were starting to become extinct. Nonetheless, the changes brought by Unicode 2.0 were widely considered an evolution rather than a revolution. So the Unicode team decided to stick to 16-bit encoding, and so did the first version of Java, which was also released in 1996.

Of course, using Unicode appears a little old-fashioned in 2014. AMD has been offering 64-bit processing since 2003, quickly followed by Intel in 2004. So almost every PC in the industry runs on 64 bits. Sticking to Unicode amounts to wasting 48 bits of your precious processor. Obviously, the complaints of ecological lobby groups have started to make an impact. Can you imagine how many power plants can’t be shut down just to support millions of those idle 48 bits? The ecological footprint of Unicode isn’t acceptable in the age of global warming.

So the Java language team decided to adopt a modern, memory-efficient 64-bit character encoding in Java 8. Needless to say, this encoding speeds up your programs by a factor of four. However, Java 8 programs – especially those programs dealing principally with texts – are a lot faster than “merely” four times the speed of Java 7. How come?

Well, there’s more. Careful inspection of the byte code revealed a big surprise. Every Java String is represented by a single 64-bit number. A member of the Java team told me “there are only so many Strings a guy can think of. All we had to do was to count them and to assign a number to each of them.” The effect is impressive: Java class files compiled by Java 8 are significantly smaller than programs compiled by Java 7. Plus, many String operations can be processed extremely efficiently. Concatenating Strings boils down to adding the two numbers representing the Strings, String.substring() becomes a simple subtraction, and so on.

Even pattern matching can now be implemented efficiently. Regular expressions are converted to a simple formula that can be executed by your processor’s floating point unit in just a few cycles. In fact, it’s so fast that Pattern.compile() had to be declared deprecated: depending on whether the formula is stored in the L1 cache, the L2 cache or even in main memory, there’s frequently a huge performance penalty attached to storing a pattern. It’s way cheaper to recompile it in the FPU. Remember, no matter how complex your regular expression is, it’s stored in a single 64-bit number, allowing for very fast compilation.
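The two call styles at stake can be sketched as follows: a `Pattern` that is compiled once and reused, and `String.matches`, which compiles its regular expression again on every call. The class name, the sample regex and the inputs are assumptions made up for illustration, and the sketch makes no timing claim of its own; it only shows the shape of the code you would put into a benchmark.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class PatternStyles {
    // Style 1: compile the regular expression once and reuse the Pattern.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    static boolean matchesPrecompiled(String input) {
        return DIGITS.matcher(input).matches();
    }

    // Style 2: String.matches compiles the regular expression on each call.
    static boolean matchesOnTheFly(String input) {
        return input.matches("\\d+");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"2014", "Java 8", ""}) {
            System.out.println("\"" + s + "\": "
                    + matchesPrecompiled(s) + " / " + matchesOnTheFly(s));
        }
    }
}
```

Both methods return the same answer for every input; they differ only in when the regular expression is compiled.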

Let’s finish the article with an example. I’ve written a small Java class assigning this blog article to a Java String, compiled it and inspected it with a hex editor. The article is represented by 4715394352932679532. Quite a huge number, but remember, the basic word width of your processor is 64 bits, and 4715394352932679532 even fits in 63 bits. Granted, that’s still wasting one bit that remains idle, but it’s major progress over the outdated Unicode encoding. I marked the String representing the article in the screenshot of the hex dump of the class file in order to make it easier to find (mind you, Java 8 makes Strings so compact that they are easy to miss):

[Image: Hex dump of a Java 8 class using efficient 64-bit Strings]
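If you don’t have a hex editor at hand, the number quoted above can also be inspected with a few lines of plain Java. The following is a minimal sketch: the class and method names are invented for illustration, and reading the eight bytes as big-endian ASCII is an assumption about how to interpret the value, not something the class file format prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
class ArticleNumber {
    // The value quoted in the article.
    static final long ARTICLE = 4715394352932679532L;

    // Number of bits needed to represent a non-negative long.
    static int bitLength(long value) {
        return Long.SIZE - Long.numberOfLeadingZeros(value);
    }

    // Read the eight bytes of the long (big-endian) as ASCII characters.
    static String asAscii(long value) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(ARTICLE)); // hex form of the value
        System.out.println(bitLength(ARTICLE));        // bits actually used
        System.out.println(asAscii(ARTICLE));          // bytes read as text
    }
}
```

Run as written, it prints the hex form 4170722e466f6f6c, a bit length of 63, and the text Apr.Fool.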

Further reading:

The museum of April hoaxes
April hoaxes and easter eggs by Google
Google Maps treasure hunt

  1. Yes, this was this year’s April Fools’ hoax.
  2. But read my follow-up article to learn about the real improvements of the string implementation.
