Remember my last April prank? I spun a yarn about the great way Oracle managed to squeeze every conceivable String into a 64-bit number. The funny thing is that my vision has sort of come true. Of course, not the way I claimed. That was nonsense meant to be easily seen through ("There are only so many Strings a gal or a guy can think of - so all you have to do is assign each of them a 64-bit number"). But it's true that a lot of work at Oracle is dedicated to optimizing String management.

Java 7 Update 6 improved the speed of String.substring() by sacrificing a little memory efficiency (see my article Recent Improvements of Java’s String Implementation). Java 8 Update 20 takes the opposite approach: it sacrifices a little CPU efficiency in order to reduce the memory footprint. In the long run this should reduce the strain on the CPU, too. In other words: most real-world programs should run faster.
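If you want to see the substring change for yourself, here's a minimal sketch of my own (not from the article): it peeks at String's private value field via reflection, so it only works on a JVM where that field is a char[], such as Oracle's Java 8.

    import java.lang.reflect.Field;

    // Shows that substring() copies characters since Java 7u6:
    // the substring gets its own, trimmed backing array instead of
    // sharing the parent's.
    public class SubstringDemo {
        public static void main(String[] args) throws Exception {
            String parent = "Hello, String deduplication!";
            String child = parent.substring(0, 5);

            Field value = String.class.getDeclaredField("value");
            value.setAccessible(true);

            // false: the arrays are distinct since Java 7u6
            System.out.println(value.get(parent) == value.get(child));
            // 5: the child's array is trimmed to the substring's length
            System.out.println(((char[]) value.get(child)).length);
        }
    }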

Everybody who analyzes a Java program in a profiler can't help noticing: Java programs create incredible quantities of character arrays. Most of them are part of String objects. Java represents its Strings as objects, meaning there's a pointer to a character array. That's not exactly the most efficient way to represent a sequence of characters. It's just what you get if you want to represent a String as an object instead of defining it as a native primitive type. However, the developers of Java became aware of the problem a long time ago, so they invented the String.intern() method. In general, it's a bad idea to call this method yourself (because you're trying to outsmart the JVM's optimization), but sometimes it reduces your application's memory footprint tremendously.
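To illustrate (a minimal sketch of my own): intern() returns a canonical instance from the JVM's string pool, so equal Strings can be collapsed into a single object.

    // String.intern() in action: equal Strings constructed at runtime are
    // distinct objects until they are interned into the JVM's string pool.
    public class InternDemo {
        public static void main(String[] args) {
            String a = new String("duplicate");
            String b = new String("duplicate");

            System.out.println(a == b);                   // false: two objects
            System.out.println(a.intern() == b.intern()); // true: one pooled object
        }
    }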

To put it in a nutshell, String deduplication is a clever way of calling String.intern() as part of garbage collection. Since Java 8u20, the garbage collector recognizes duplicate Strings and merges them. Or rather, it merges the underlying character arrays, as our reader LukeU kindly remarks in the comment section. Replacing the String objects themselves bears the risk of unwanted side effects, so calling String.intern() manually still frees a bit more memory. This approach costs some CPU power, but that shouldn't be much of a concern because the garbage collector runs in its own thread. Plus, in the long run the reduced memory footprint makes the garbage collector run faster.

There's a discussion on reddit indicating that String deduplication really works, but sometimes you have to adjust the JVM parameters. I suppose that's one of the reasons why the feature isn't activated by default yet. You have to activate it manually by starting the JVM with the parameters

    -XX:+UseG1GC -XX:+UseStringDeduplication
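If you'd like to watch it happen, here's a rough sketch of my own (the exact behavior depends on heap size and GC timing, so the result is not deterministic): it creates lots of equal-but-distinct Strings, provokes some garbage collections and then checks via reflection whether the backing arrays have been merged. Adding -XX:+PrintStringDeduplicationStatistics should make the collector log its own numbers.

    import java.lang.reflect.Field;
    import java.util.ArrayList;
    import java.util.List;

    // Run on Java 8u20+ with:
    //   java -XX:+UseG1GC -XX:+UseStringDeduplication DeduplicationDemo
    public class DeduplicationDemo {
        public static void main(String[] args) throws Exception {
            // new String(char[]) copies the array, so every one of these
            // 100,000 equal Strings gets its own backing array.
            char[] template = "a fairly long duplicate string".toCharArray();
            List<String> strings = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                strings.add(new String(template));
            }

            // Deduplication runs in a background GC thread and by default
            // only considers Strings that survived a few collections, so
            // provoke some garbage collections and wait a little.
            for (int i = 0; i < 100; i++) {
                byte[] garbage = new byte[1024 * 1024];
                garbage[0] = 1;
            }
            Thread.sleep(5_000);

            Field value = String.class.getDeclaredField("value");
            value.setAccessible(true);

            String a = strings.get(0);
            String b = strings.get(strings.size() - 1);
            System.out.println(a == b);                       // false: distinct objects
            System.out.println(value.get(a) == value.get(b)); // true once deduplicated
        }
    }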

Still, I sometimes wonder why a Java String has to be an object. Back in the seventies or the eighties, when the first languages became powerful enough to represent a String by means of the language itself instead of defining it as a primitive type, it quickly became fashionable to do so. There are probably other reasons, but I always had the impression language designers considered this a sign of their language's maturity. But being able to express a String as a library object doesn't mean you have to do so. If Java were to use zero-terminated character arrays (the way C does) or a character array with a preceding length byte[1] (the way BASIC does), hardly any developer would notice. But such an implementation gets rid of a pointer, simplifying memory management and garbage collection. The only convincing advantage of making Strings part of the language libraries is the ability to derive custom classes. Sadly, the Java developers prohibited this very early in Java's history by making String a final class. They did it for a good reason - but still, it's a pity. Groovy's GStrings show what you can do when you allow deriving from Strings.

That said, I'd like to point you to the Java Performance Tuning Guide. They've written an in-depth article about String deduplication.


Java Performance Tuning Guide on String deduplication

discussion on reddit on the topic (well, one of them - possibly the most interesting one)


  1. or a 64-bit word in the age of gigabyte memories - and the characters should be at least 16 bits wide to support UTF-16. I used "byte" and "character" as figures of speech. Is there a catchy word for 16- or 32-bit integers?
