Did you know characters don’t always fit into chars?
Did you know String::length() does not always return the number of letters of the string?
Did you know you can’t reliably read a text using a
Did you know many characters have more than one Unicode encoding?
Before leading you to McDowell’s great “guide to character encoding”, I’d like to show you a couple of interesting characters:
- The letter A is encoded in UTF-8 as 0x41. That’s simply good old 7-bit ASCII code.
- The French letter é (as in écouter) is part of many 8-bit extensions of ASCII. In UTF-8 it’s encoded as C3 A9.
- However, it can also be constructed as a combination of “e” and the combining acute accent (U+0301). That amounts to 65 CC 81 in UTF-8. That’s our first example of a single visible character taking three bytes.
- Characters can become as complicated as क्तु. That’s a 12-byte encoding in UTF-8: E0A495 E0A58D E0A4A4 E0A581. As far as I can tell, that’s the consonant ka (U+0915), a virama (U+094D) joining it to the next consonant, the consonant ta (U+0924), and the vowel sign u (U+0941).
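The byte sequences in the list above can be checked with a few lines of Java. This is only a minimal sketch: the class name Utf8Demo and the helper utf8Hex are made up for illustration, and the strings are written as Unicode escapes so the result doesn't depend on the encoding of the source file. It also shows that the two spellings of é are different strings until you normalize them.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class Utf8Demo {
    // Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";                 // é as one code point
        String decomposed = "e\u0301";                 // e + combining acute accent
        String conjunct = "\u0915\u094D\u0924\u0941";  // the Devanagari example

        System.out.println(utf8Hex("A"));          // 41
        System.out.println(utf8Hex(precomposed));  // C3 A9
        System.out.println(utf8Hex(decomposed));   // 65 CC 81
        System.out.println(utf8Hex(conjunct));     // E0 A4 95 E0 A5 8D E0 A4 A4 E0 A5 81

        // Both strings look like é, but they are not equal as Java strings.
        System.out.println(precomposed.equals(decomposed)); // false

        // Normalizing to NFC composes e + accent into the single code point.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(precomposed.equals(nfc)); // true
    }
}
```

If you compare user input against stored text, normalizing both sides to the same form first (NFC is a common choice) avoids this kind of surprise.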
The last character doesn’t fit into a single Java char: "क्तु".length() yields 4, although it’s considered a single Devanagari letter.
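Here is a small sketch of the different ways to count “characters” in Java. The class name LengthDemo and its helpers are hypothetical. The emoji U+1F600 is not from the examples above; I picked it as an illustration of a code point that needs two chars. What BreakIterator reports for the Devanagari example depends on the JDK version, so I’m not stating an expected value for it.

```java
import java.text.BreakIterator;

class LengthDemo {
    // Number of Unicode code points, as opposed to UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of user-perceived characters according to BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String conjunct = "\u0915\u094D\u0924\u0941"; // the Devanagari example
        String decomposed = "e\u0301";                // e + combining acute accent
        String emoji = "\uD83D\uDE00";                // U+1F600 as a surrogate pair

        System.out.println(conjunct.length());      // 4
        System.out.println(codePoints(conjunct));   // 4

        System.out.println(decomposed.length());    // 2
        System.out.println(graphemes(decomposed));  // 1

        System.out.println(emoji.length());         // 2
        System.out.println(codePoints(emoji));      // 1

        // JDK-dependent: newer JDKs follow the Unicode grapheme cluster rules
        // more closely than older ones.
        System.out.println(graphemes(conjunct));
    }
}
```

So length() counts UTF-16 code units, codePointCount() counts code points, and neither is guaranteed to match what a reader would call a letter.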
There’s a lot more information on Java character encoding in McDowell’s “exhausting, but not exhaustive” article Java: a rough guide to character encoding.