
Did you know characters don't always fit into chars? Did you know String::length() does not always return the number of letters in the String? Did you know you can't reliably read text from a ByteArrayInputStream (or any other byte stream) unless you know which encoding the bytes use? Did you know many characters have more than one Unicode encoding?

The latter you should know, at least if you're concerned about writing a secure web application. Hackers sometimes hide their malicious code by using ambiguous UTF-8 encodings. If the request is interpreted correctly, it simply contains an odd character. But if your application interprets the request as plain ASCII, it contains executable JavaScript code. The "<" sign can be hidden as a modifier to a Unicode character, so a (careless?) firewall doesn't recognize it as the start of malicious code.
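One well-known kind of ambiguous byte sequence is the "overlong" form, where a code point is padded into more bytes than it needs. Whether that is exactly the trick meant above is my assumption. The following is a minimal sketch of how a strict decoder can be used to reject such input before it reaches the application; the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class StrictUtf8Check {

    // Returns true only if the bytes are well-formed UTF-8.
    // The decoder is told to report malformed input instead of silently replacing it.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] plain = { 0x3C };                        // "<" as ordinary one-byte UTF-8
        byte[] overlong = { (byte) 0xC0, (byte) 0xBC }; // the same code point padded to two bytes
        System.out.println(isWellFormedUtf8(plain));    // true
        System.out.println(isWellFormedUtf8(overlong)); // false
    }
}
```

Note that new String(bytes, StandardCharsets.UTF_8) does not throw on malformed input; it substitutes replacement characters, which is why the sketch uses a CharsetDecoder instead.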

Before pointing you to McDowell's great "guide to character encoding", I'd like to show you a couple of interesting characters:

  • The letter A is encoded in UTF-8 as 0x41. That's simply good old 7-bit ASCII code.
  • The French letter é (as in écouter) is part of many 8-bit extensions of ASCII. In UTF-8 it's encoded as C3 A9.
  • However, it can also be constructed as a combination of "e" and the combining acute accent "´" (U+0301). That amounts to 65 CC 81 in UTF-8. That's the first example of a three-byte representation of a character.
  • Characters can become as complicated as क्तु. That's a 12-byte encoding in UTF-8: E0 A4 95, E0 A5 8D, E0 A4 A4, E0 A5 81. These are four code points: the consonant क, a virama that joins it to the next consonant, the consonant त, and a vowel sign.
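The byte sequences in the list above can be checked with a few lines of Java. This is only a sketch: the class name and the utf8Hex helper are invented for the example, and the strings are written with Unicode escapes so the result doesn't depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class, not part of any library.
class Utf8Bytes {

    // Hex dump of the UTF-8 encoding of a string, bytes separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00E9";                     // e with acute accent as one code point
        String decomposed = "e\u0301";                  // plain e followed by the combining accent
        String conjunct = "\u0915\u094D\u0924\u0941";   // the Devanagari example from the list

        System.out.println(utf8Hex("A"));        // 41
        System.out.println(utf8Hex(composed));   // C3 A9
        System.out.println(utf8Hex(decomposed)); // 65 CC 81
        System.out.println(utf8Hex(conjunct));   // E0 A4 95 E0 A5 8D E0 A4 A4 E0 A5 81

        // The two spellings look identical on screen but are different strings...
        System.out.println(composed.equals(decomposed)); // false
        // ...until both are brought into the same normalization form.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(nfc));        // true
    }
}
```

Normalizing both sides before comparing (NFC here) is one common way to treat the two spellings of é as equal.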

No single Java char can hold the last character: each of its four code points occupies a char of its own, so "क्तु".length() yields 4, although it's considered a single Devanagari letter.
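A small sketch makes the different ways of counting visible. The class and method names are invented for this example, and the emoji (U+1F600) is my own addition, not one of the characters discussed above; it shows the other way a character can fail to fit into a char, namely a single code point that needs two of them.

```java
// Hypothetical demo class, not part of any library.
class CharCounts {

    // Number of UTF-16 code units, which is what String::length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The Devanagari example: four code points, each fitting into one char.
        String conjunct = "\u0915\u094D\u0924\u0941";
        System.out.println(charCount(conjunct));      // 4
        System.out.println(codePointCount(conjunct)); // 4

        // A code point outside the Basic Multilingual Plane needs a surrogate pair.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(charCount(emoji));         // 2
        System.out.println(codePointCount(emoji));    // 1
    }
}
```

Neither number is the count of letters a reader would see: the Devanagari example is one letter to a human, yet both methods report 4.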

There's a lot more information on Java character encoding in McDowell's "exhausting, but not exhaustive" article Java: a rough guide to character encoding.

