Monday, May 5, 2014

Do you know Unicode?

Sometimes we work with things that look simple and assume they really are simple. In fact they are horribly difficult, and their apparent simplicity keeps us in the dark. Do you know Unicode? Or do you really understand what it means for a developer that Java strings use UTF-16?

Here is a simple test:
1) Can a Java char represent any Unicode symbol (code point)?
2) Do you know what a surrogate character is?
3) What will this code print?
          String s = new String(Character.toChars(0x10FFFF));
          System.out.println( s.length() );

If you answered Yes, No and "1" respectively, then you definitely should keep reading.

Our world is imperfect, and many models are ideal only in our minds. I guess you know the famous law of leaky abstractions. Unicode is just such an abstraction: its encodings, UTF-8 and UTF-16, despite being used all over the world, do not provide a solid shield over implementation details. We have to keep those details in mind.

First of all, let's distinguish the representation of a string as a byte stream, which may use any encoding, from the internal representation of String data inside the JVM and the Java application. Here we are considering the latter and what it means for the developer.
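To make this distinction concrete, here is a minimal sketch (plain JDK, charset constants from java.nio.charset.StandardCharsets) showing that the same in-memory string produces byte streams of different sizes depending on the encoding, while the internal char count stays the same:

import java.nio.charset.StandardCharsets;

String s = "héllo";                                                // 'é' is U+00E9, still inside the BMP
System.out.println(s.length());                                    // 5 UTF-16 code units inside the JVM
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);     // 6 bytes: 'é' takes 2 bytes in UTF-8
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 10 bytes: 2 bytes per BMP character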

We all know how UTF-8 works: a Unicode symbol, or rather a code point, may occupy a variable number of bytes. However, many developers do not know that UTF-16 has the same issue. They are lulled by the fact that almost all well-known Unicode symbols can be placed inside 16 bits (the Basic Multilingual Plane, or BMP).

Almost all is not all! There are Unicode symbols that require two 16-bit characters. That is why the example above returns 2. Compare it with this one, which represents an ordinary symbol:

        System.out.println((new String(Character.toChars(0x0061))).length()); // latin 'a'
This example returns 1.

What does this mean for the developer? It means that, strictly speaking, Java UTF-16 strings are not directly indexable in the general case. So almost all typical text-processing applications, from character counters to word processors, are wrong, or rather might be wrong on some exotic input text.
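As a quick illustration of why naive indexing breaks, compare charAt with the code point API on the same supplementary character used above (only standard String and Character methods are involved):

String s = new String(Character.toChars(0x10FFFF));          // one supplementary code point
System.out.println(Integer.toHexString(s.charAt(0)));        // dbff -- just the high surrogate, not a symbol
System.out.println(Integer.toHexString(s.codePointAt(0)));   // 10ffff -- the real code point
System.out.println(s.length());                              // 2 char units
System.out.println(s.codePointCount(0, s.length()));         // 1 actual symbol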
So when developing Java apps that must handle any possible text, we have to keep the following in mind:
  • there is an extended Java API for working with code points (see the sketch after the snippet below)
  • some Unicode characters may be supplementary (i.e. greater than U+FFFF). In the UTF-16 representation, supplementary characters are represented as a pair of char values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
  • it is important to respect surrogate characters and check for them at runtime via Character.isHighSurrogate/Character.isLowSurrogate:
System.out.println(Character.isHighSurrogate((new String(Character.toChars(0x0061))).charAt(0)));  // char 'a', returns false
String str = new String(Character.toChars(0x10FFFF));
System.out.println(Character.isHighSurrogate(str.charAt(0))); // returns true
System.out.println(Character.isLowSurrogate(str.charAt(1)));  // returns true 
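
Here is a minimal sketch of that extended code point API; codePointCount, codePointAt and Character.charCount exist since Java 5, while the codePoints() stream requires Java 8 (an illustration, not a complete survey of the API):

String text = "a" + new String(Character.toChars(0x10FFFF));   // one BMP symbol plus one supplementary symbol

System.out.println(text.length());                             // 3 char units
System.out.println(text.codePointCount(0, text.length()));     // 2 real code points

// walk the string by code points, not by chars
int i = 0;
while (i < text.length()) {
    int cp = text.codePointAt(i);
    System.out.println(Integer.toHexString(cp));                // 61, then 10ffff
    i += Character.charCount(cp);                               // 1 for BMP, 2 for supplementary
}

// or, on Java 8, simply:
text.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));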

There is also UTF-32, which truly allows representing any code point. Another benefit is that code points become directly indexable. The downside is that it takes a lot of memory: ordinary text needs twice as much as in UTF-16. Unfortunately, memory is not the biggest problem. The Java language was designed back when trees were taller and people thought 16 bits would be enough to store everything. UTF-16 is integrated quite deeply into Java, so we are its hostages.
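To put a rough number on the memory cost, here is a small sketch comparing the byte size of the same BMP-only text in UTF-16 and UTF-32; the UTF-32 charsets are not in StandardCharsets but are shipped with the usual JDKs, and the BE variants are used to avoid byte order marks:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = "plain ASCII text";                                       // 16 BMP-only characters
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);    // 32 bytes: 2 per character
System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length);  // 64 bytes: 4 per code point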

Like many others, I expect that some day UTF-32 will be used everywhere... unless we join the Galactic Federation and new code points arrive.
