Did you know characters don't always fit into chars? Did you know String::length() does not always return the number of letters in the String? Did you know you can't reliably read a text using a ByteArrayInputStream? Did you know many characters have more than one Unicode encoding?
You should know the last one, at least if you care about writing a secure web application. Hackers sometimes hide their malicious code by using ambiguous UTF-8 encodings. If the request is interpreted correctly, it simply contains an odd character. But if your application interprets the request as plain ASCII, it contains executable JavaScript code. The "<" sign can be hidden as a modifier to a Unicode character, so a (careless?) firewall doesn't recognize it as the start of malicious code.
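One well-known kind of ambiguous encoding is the "overlong" form, where a character is padded into more bytes than it needs. Whether this is exactly the attack meant above is an assumption on my part. The sketch below shows how a strict decoder treats the plain "<" (3C) and an overlong variant (C0 BC); the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    /** Decodes UTF-8 and throws instead of silently replacing malformed bytes. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Hypothetical helper: true if the bytes are well-formed UTF-8. */
    static boolean isWellFormed(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] plain = {0x3C};                       // "<" in its only valid form
        byte[] overlong = {(byte) 0xC0, (byte) 0xBC}; // "<" padded into two bytes

        System.out.println(isWellFormed(plain));    // true
        System.out.println(isWellFormed(overlong)); // false
    }
}
```

Rejecting malformed input before any filtering takes place is the design choice here: a filter that only looks for the byte 3C never sees the overlong variant, so it's safer not to accept such bytes at all.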
Before pointing you to McDowell's great "guide to character encoding", I'd like to show you a couple of interesting characters:
- The letter A is encoded in UTF-8 as 0x41. That's simply good old 7-bit ASCII code.
- The French letter é (as in écouter) is part of many 8-bit extensions of ASCII. In UTF-8 it's encoded as C3 A9.
- However, it can also be constructed as a combination of "e" and the combining accent "´". That amounts to 65 CC 81 in UTF-8. That's the first example of a three-byte representation of a character.
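- Characters can become as complicated as क्तु. That's a 12-byte encoding in UTF-8: E0A495 E0A58D E0A4A4 E0A581. As far as I can tell, that's the consonant ka (U+0915), a virama (U+094D) that fuses it with the following consonant, the consonant ta (U+0924), and the vowel sign u (U+0941).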
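If you'd like to check the byte sequences listed above yourself, a small sketch like the following prints them. The class name and the helper utf8Hex are made up for this example. It also shows that the two spellings of é are different strings until you normalize them.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class Utf8Bytes {
    /** Returns the UTF-8 bytes of the text as space-separated upper-case hex. */
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // é as a single code point
        String decomposed = "e\u0301"; // e followed by the combining acute accent

        System.out.println(utf8Hex("A"));         // 41
        System.out.println(utf8Hex(precomposed)); // C3 A9
        System.out.println(utf8Hex(decomposed));  // 65 CC 81
        // ka, virama, ta, vowel sign u
        System.out.println(utf8Hex("\u0915\u094D\u0924\u0941"));
        // E0 A4 95 E0 A5 8D E0 A4 A4 E0 A5 81

        // Both strings look the same on screen, but they are not equal ...
        System.out.println(precomposed.equals(decomposed)); // false
        // ... until the decomposed form is normalized to the composed form (NFC).
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(precomposed.equals(nfc)); // true
    }
}
```

Normalizing user input to one form (NFC in this sketch) before comparing or storing it is one way to deal with characters that have more than one encoding.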
The last of these characters doesn't fit into a Java char, and "क्तु".length() yields 4, although it's considered a single Devanagari letter.
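Here's a minimal sketch of what String::length() actually counts. It returns the number of UTF-16 chars, which matches neither the number of Unicode code points nor the number of letters a reader sees. The musical G clef (U+1D11E) is an example I've added to show a single code point that needs two chars; the class and method names are made up.

```java
class CharCounts {
    /** Number of UTF-16 code units, which is what String::length() returns. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String conjunct = "\u0915\u094D\u0924\u0941"; // the Devanagari example
        String clef = "\uD834\uDD1E";                 // U+1D11E as a surrogate pair

        System.out.println(utf16Units(conjunct)); // 4
        System.out.println(codePoints(conjunct)); // 4

        System.out.println(utf16Units(clef));     // 2
        System.out.println(codePoints(clef));     // 1
    }
}
```

So the Devanagari letter is four code points that each fit into a char, while the clef is one code point that doesn't. Neither count tells you how many letters the user perceives.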
There's a lot more information on Java character encoding in McDowell's "exhausting, but not exhaustive" article Java: a rough guide to character encoding.