Remember my last April prank? I spun a yarn about the great way Oracle managed to squeeze every conceivable String into a 64-bit number. Funny thing is, my vision has sort of come true. Of course, not the way I claimed. That was nonsense meant to be easily seen through (“There are only so many Strings a gal or a guy can think of – so everything you have to do is to assign them a 64-bit number”). But it’s true that a lot of work at Oracle is dedicated to optimizing Strings.
Java 7 Update 6 improved the speed of String.substring() by sacrificing a little memory efficiency (see my article Recent Improvements of Java’s String Implementation). Java 8 Update 20 takes the opposite approach: it sacrifices a little CPU efficiency in order to reduce the memory footprint. In the long run this should reduce the strain on the CPU, too. In other words: most real-world programs should run faster.
Everybody who’s analyzing a Java program in a profiler can’t avoid noticing: Java programs create incredible quantities of character arrays. Most of them are part of String objects. Java represents its Strings as objects, meaning there’s a pointer to a character array. That’s not exactly the most efficient way to represent a sequence of characters. It’s just what you get if you want to represent a String as an object instead of defining it as a native primitive type. However, the developers of Java became aware of the problem a long time ago, so they invented the String.intern() method. In general, it’s a bad idea to call this method yourself (because you’re trying to outperform the JVM’s optimization), but sometimes it reduces your application’s memory footprint tremendously.
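To make the effect of String.intern() concrete, here is a minimal sketch. The class name InternDemo and the helper build() are made up for this illustration. It builds the same text twice at run time, so the JVM ends up with two separate String objects, and then shows that intern() hands back one shared instance for both.

```java
class InternDemo {
    // Builds the String at run time so the compiler cannot fold it into a constant.
    static String build(String prefix, int n) {
        return new StringBuilder(prefix).append(n).toString();
    }

    public static void main(String[] args) {
        String a = build("customer-", 42);
        String b = build("customer-", 42);

        System.out.println(a.equals(b));              // true: same characters
        System.out.println(a == b);                   // false: two separate objects
        System.out.println(a.intern() == b.intern()); // true: one pooled instance
    }
}
```

The memory is only saved if the program keeps the references returned by intern() and lets go of the original objects, which is part of the reason why calling it by hand is rarely worth the trouble.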
In a nutshell, String deduplication is a clever way of calling String.intern() as part of the garbage collection. Since Java 8 Update 20 the garbage collector recognizes duplicate Strings and merges them. As mentioned above, this costs some CPU power, but it shouldn’t be much of a concern because the garbage collector runs in its own thread. Plus, in the long run the reduced memory footprint makes the garbage collector run faster.
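The comparison with String.intern() is a simplification. As far as I understand the feature, the String objects themselves stay separate and only the character arrays behind them are shared. The following toy model sketches that idea. It is not the JVM’s implementation: Text, DedupSketch and deduplicate() are hypothetical names, and a plain HashMap stands in for whatever lookup structure the garbage collector really uses.

```java
import java.util.HashMap;
import java.util.Map;

class DedupSketch {
    // Hypothetical stand-in for a String: an object pointing to a character array.
    static final class Text {
        char[] value;
        Text(String s) { value = s.toCharArray(); }
    }

    // Maps array contents to the first array seen with those contents.
    private final Map<String, char[]> canonical = new HashMap<>();

    // Lets t share an already known array with equal contents.
    // Returns true if t's array was swapped for a shared one.
    boolean deduplicate(Text t) {
        char[] known = canonical.putIfAbsent(new String(t.value), t.value);
        if (known != null && known != t.value) {
            t.value = known; // the old array becomes garbage
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        DedupSketch table = new DedupSketch();
        Text a = new Text("hello");
        Text b = new Text("hello");
        table.deduplicate(a);
        table.deduplicate(b);

        System.out.println(a != b);             // true: still two objects
        System.out.println(a.value == b.value); // true: one shared array
    }
}
```

Because object identity is untouched in this model, code comparing references with == keeps behaving as before. Only the duplicate arrays disappear.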
There’s a discussion on reddit indicating String deduplication really works, but sometimes you have to adjust the JVM parameters. I suppose that’s one of the reasons why the feature isn’t activated by default yet. You have to activate it manually by passing a parameter when starting the JVM.
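For reference, the HotSpot switch for this feature is -XX:+UseStringDeduplication. It only takes effect together with the G1 collector, which on Java 8 has to be selected explicitly with -XX:+UseG1GC. The small sketch below checks whether the running JVM was started with the switch. The class name DedupFlagCheck and the helper dedupRequested() are made up for this example. It only inspects the command-line arguments and does not prove that deduplication is actually taking place.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class DedupFlagCheck {
    static final String DEDUP_FLAG = "-XX:+UseStringDeduplication";

    // True if the given JVM arguments contain the deduplication switch.
    static boolean dedupRequested(List<String> jvmArgs) {
        return jvmArgs.contains(DEDUP_FLAG);
    }

    public static void main(String[] args) {
        // Launch with, for example:
        //   java -XX:+UseG1GC -XX:+UseStringDeduplication DedupFlagCheck
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("String deduplication requested: "
                + dedupRequested(jvmArgs));
    }
}
```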
Still, I sometimes wonder why a Java String has to be an object. Back in the seventies or the eighties, when the first languages became powerful enough to represent a String by means of the language instead of defining it as a primitive type, it quickly became fashionable to do so. Probably there are other reasons, but I always had the impression language designers consider this a sign of their language’s maturity. But being able to express a String as a library object doesn’t mean you have to do so. If Java were to use zero-terminated character arrays (the way BASIC does) or if it were to use a character array with a preceding length byte, hardly any developer would notice. But such an implementation would get rid of a pointer, simplifying memory management and garbage collection. The only convincing advantage of making Strings part of the language libraries is the ability to derive custom classes. Sadly, the Java developers prohibited this very early in Java’s history by making String a final class. They did it for a good reason, but still, it’s a pity. Groovy’s GStrings show what you can do when you allow deriving from String.
That said, I’d like to point you to the Java Performance Tuning Guide. They’ve written an in-depth article about String deduplication.
Java Performance Tuning Guide on String deduplication
discussion on reddit on the topic (well, one of them – possibly the most interesting one)