Notes on software development: May 2014

A nice definition of MonolithicDesign, taken from http://c2.com/cgi/wiki?MonolithicDesign:

Characteristics of MonolithicDesign:

Functionality implemented by part of the system cannot be reused without using the entire system.
To make one part of the system work, other parts must be "tricked" by using them, even if they aren't logically needed. (For example, you might need to "pump" a file reader, even if your data is coming from another source.)
Initialization of the system may be tricky or laborious.
Change to the control flow is impossible.
The only escape from MonolithicDesign is to spend months refactoring and rewriting the system into independent modules.

Factors leading to MonolithicDesign:

A bad sense of aesthetics. (This above all.)
Procrastination of refactoring.
Premature optimization, especially a tendency to performance perfectionism or Puritanism.
Not writing for reuse.
Tunnel vision or attachment that limits your vision to one architecture, one flow paradigm, one memory management technique, etc.

How to prevent MonolithicDesign:

Code for survivability, not optimal fit. The more perfectly something is adapted to its environment, the less it can tolerate change in that environment. When you find yourself expending insane effort to maintain a perfectly static environment for your perfectly adapted code, you are probably dealing with MonolithicDesign. When you write modules that can be used independently from each other in varying architectural contexts, you are protected from MonolithicDesign.
Refactor often, and focus on eliminating dependencies. Examine the relevance of every module that you are forced to use.
Take advantage of opportunities to work with a variety of paradigms and techniques, so that you learn to recognize and eliminate unnecessary limitations in module functionality.
Practice proactive laziness; i.e., expand your vocabulary, not just your repertoire. Developer 1 writes a program that must perform task X. Developer 1 writes the program and says, "Now I know how to write programs that do X;" he has expanded his repertoire. Developer 2 writes a program that must perform task X. Developer 2 writes a module to do task X, uses it in his program, and says, "Now I have a module that does X." Developer 2 has expanded his vocabulary, because now he can accomplish X by invoking the name of his module. When developer 1 needs to write a new program that does X, he will be tempted to tack the functionality onto his first program, bloating and complicating that program and starting the trend toward MonolithicDesign.
ReduceCoupling
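
To make the "vocabulary" idea concrete, here is a minimal sketch of what Developer 2's module might look like in Java. The class name WordCounter and the word-counting task are hypothetical stand-ins for "task X"; they are not taken from the wiki page. The module accepts any Readable, so a caller is not forced to "pump" a file reader when the data comes from another source.

```java
import java.io.StringReader;
import java.util.Scanner;

// Hypothetical stand-in for "a module that does X": counting words.
final class WordCounter {
    /**
     * Counts whitespace-separated words from any character source.
     * Taking a Readable (file, socket, in-memory string) instead of a
     * file path keeps the module independent of where the data lives.
     */
    static int countWords(Readable source) {
        Scanner scanner = new Scanner(source);
        int count = 0;
        while (scanner.hasNext()) {
            scanner.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The data comes from memory here; no file is needed.
        System.out.println(countWords(new StringReader("to be or not to be"))); // prints 6
    }
}
```

A second program that needs word counts can call WordCounter.countWords directly instead of growing the first program.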

Sometimes we work with seemingly simple things, assuming that they really are simple. In fact they can be horribly difficult, and their apparent simplicity keeps us in the dark. Do you know Unicode? Do you really understand what it means for a developer that Java strings use UTF-16?

Here is a simple test:
1) Can a Java Character (char) represent any Unicode symbol (code point)?
2) Do you know what a surrogate character is?
3) What will this code print?

          String s = new String(Character.toChars(0x10FFFF));
          System.out.println( s.length() );

If you answered Yes, No and "1" respectively, then you should definitely read the following as soon as possible.

Our world is imperfect, and many models are ideal only in our minds. I guess you know the famous law of leaky abstractions. Unicode is just an abstraction, and its encodings UTF-8 and UTF-16, despite being widespread around the world, do not provide a solid shield over implementation details. We have to keep those details in mind.

First of all, let's distinguish between a string's representation as a byte stream, which might use any encoding, and the String's internal data representation inside the JVM and a Java application. Here we are considering the latter and what it means for a developer.

We all know how UTF-8 works: a Unicode symbol, or rather a code point, is encoded with a variable number of bytes. However, many developers do not know that UTF-16 has the same issue. They are lulled by the fact that almost all commonly used Unicode symbols fit into 16 bits (the Basic Multilingual Plane, or BMP).

Almost all is not all! There are Unicode symbols that require two 16-bit char values. That is why the example above prints 2. Compare it with this one, which uses an ordinary symbol:

        System.out.println((new String(Character.toChars(0x0061))).length()); // latin 'a'

This example prints 1.
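
The difference between UTF-16 code units and code points can be made visible with String.codePointCount. The sketch below is only an illustration; the class and method names (CodePointCount, codePoints) are made up for this example.

```java
// Illustrative helper: contrasts String.length() (UTF-16 code units)
// with the number of Unicode code points in the same string.
final class CodePointCount {
    /** Number of Unicode code points in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x10FFFF));
        String latin = "a";

        System.out.println(supplementary.length());    // prints 2 (two char values)
        System.out.println(codePoints(supplementary)); // prints 1 (one code point)
        System.out.println(latin.length());            // prints 1
        System.out.println(codePoints(latin));         // prints 1
    }
}
```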

What does this mean for a developer? It means that, strictly speaking, Java's UTF-16 strings are not directly indexable by code point in the general case. Almost all ordinary applications that process text, from character counters to word processors, are wrong, or rather might be wrong, on some exotic input text.
So when developing Java apps that must process any possible text, we have to keep the following in mind:

There is an extended Java API for working with code points (for example String.codePointAt and String.codePointCount).
Some Unicode characters are supplementary, i.e. greater than U+FFFF. In UTF-16, a supplementary character is represented as a pair of char values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
It is important to respect surrogate characters and check for them at runtime with Character.isHighSurrogate/Character.isLowSurrogate:

System.out.println(Character.isHighSurrogate((new String(Character.toChars(0x0061))).charAt(0)));  // char 'a', returns false
String str = (new String(Character.toChars(0x10FFFF)));
System.out.println(Character.isHighSurrogate(str.charAt(0))); // returns true
System.out.println(Character.isLowSurrogate(str.charAt(1)));  // returns true
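
In practice it is often easier to let the code point API skip over surrogate pairs than to test for them by hand. The following is a minimal sketch of that approach; the class and method names (CodePointWalk, codePointsOf, codePointAtIndex) are invented for this illustration, while codePointAt, offsetByCodePoints and Character.charCount are standard JDK methods.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helpers for walking a String code point by code point.
final class CodePointWalk {
    /** Returns the code points of s in order, treating a surrogate pair as one item. */
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            // Advance by 1 char for BMP code points, by 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return result;
    }

    /** Returns the n-th code point (0-based) rather than the n-th char. */
    static int codePointAtIndex(String s, int n) {
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(s.length());                                 // prints 4
        System.out.println(codePointsOf(s).size());                     // prints 3
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1))); // prints 1f600
        System.out.println((char) codePointAtIndex(s, 2));              // prints b
    }
}
```

Note that offsetByCodePoints has to scan the string, so indexing by code point this way takes time proportional to the index rather than constant time.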

There is also UTF-32, which truly can represent any code point in a single unit. Another benefit is that code points are directly indexable. The downside is memory: text that fits in the BMP takes twice as much space as in UTF-16. Unfortunately, memory is not the biggest problem. Java was designed in times when trees were large and people thought that 16 bits would be enough to store everything. UTF-16 is integrated quite deeply into Java, so we are its hostages.
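
Within Java, an int[] of code points behaves much like UTF-32 text: every element is one code point and indexing is direct. The sketch below assumes that this is an acceptable stand-in for UTF-32; the names Utf32Sketch, toUtf32 and fromUtf32 are made up for the illustration, and the byte counts are raw payload only (4 bytes per int, 2 bytes per char), ignoring object headers and any JVM-internal string compaction.

```java
// Illustrative sketch: an int[] of code points used as a UTF-32-like representation.
final class Utf32Sketch {
    /** Converts a String to an array with one int per code point. */
    static int[] toUtf32(String s) {
        return s.codePoints().toArray();
    }

    /** Converts an array of code points back to a String. */
    static String fromUtf32(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = "hello";
        int[] cps = toUtf32(s);

        System.out.println((char) cps[1]);   // prints e (direct indexing by code point)
        System.out.println(s.length() * 2);  // prints 10 (UTF-16 payload in bytes)
        System.out.println(cps.length * 4);  // prints 20 (UTF-32 payload in bytes)
        System.out.println(fromUtf32(cps));  // prints hello
    }
}
```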

Like many people, I expect that some day UTF-32 will be used everywhere, unless... we join the Galactic Federation and new code points arrive.

Friday, May 23, 2014

Monolithic Design?

Monday, May 5, 2014

Do you know Unicode?