Uncategorized

Newsflash: Java, UTF-8 Surprises and Character Encodings

Did you know characters don’t always fit into chars?
Did you know String::length() does not always return the number of letters of the String?
Did you know you can’t reliably read a text using a ByteArrayInputStreamReader?
Did you know many characters have more than one Unicode encoding?

The latter you should know, at least if you’re concerned about writing a secure web application. Hackers sometimes hide their malicious code by using ambiguous UTF-8 encodings. If the request is interpreted correctly, it simply contains an odd character. But if your application interprets the request as simple ASCII code, it contains executable Javascript code. The “<” sign can be hidden as a modifier to a unicode character, so a (careless?) firewall doesn’t recognize it as the start of malicious code.

Before leading you to McDowells great “guide to character encoding”, I’d like to show a couple of interesting characters to you:

  • The letter A is encoded in UTF-8 as 0x41. That’s simply good old 7-bit ASCII code.
  • The french letter é (as in écouter) is part of many 8-bit extensions of ASCII. In UTF-8 it’s encoded as C3 A9.
  • However, it also can be constructed as a combination of “e” and the accent “´”. That amounts to 0x65CC81 in Unicode. That’s the first example of the three byte representation of a character.
  • Characters can become as complicated as क्तु. That’s a 12 byte encoding in UTF-8: E0A495 E0A58D E0A4A4 E0A581. As far as I can tell, that’s a base character followed by three modifiers.

The last character doesn’t fit into a Java char, and "क्तु".length() yields 4 – albeit it’s considered a single Devenagari letter.

There’s a lot more information on Java character encoding on McDowells “exhausting, but not exhaustive” article Java: a rough guide to character encoding.

2 thoughts on “Newsflash: Java, UTF-8 Surprises and Character Encodings

  1. Re: your second bullet point, “In UTF-8 [é is] encoded as 0xE9.” That’s incorrect. The Unicode code point for “é” is 0xE9. In UTF-8 it’s encoding is C3 A9.

    1. Thanks! When I corrected the article, I found a second typo :). Obviously, there’s no such thing as a finished article. Even if it’s only a newsflash…

Leave a Reply

Your email address will not be published.