Friday, June 27, 2014

Helpful findings on Scalding

Many of us are excited by such a bold DSL as Scalding/Cascading. It helps to develop clear and understandable Hadoop applications in a few lines! That is really awesome. A truly stream-driven approach.

Like any DSL, if you use straightforward features, everything goes well. However, once you need something extraordinary [somehow I need this quite often], the DSL is your enemy, and you waste hours fighting the framework and looking for workarounds. I believe many of us have faced this in the past.

So Scalding has the same sickness. Here is a list of my findings that were born in the throes.

1. Accessing the filename of the data chunk being processed from Scalding functions
Sometimes there is a need to access the filename of the data chunk being processed from a map function, in order to retrieve additional useful information. My personal example was this: I had to read huge Hive data directly from the filesystem, avoiding the Hive or HCatalog interfaces. In that map function I needed the values of the partition keys, which usually are not stored inside the data. Fortunately, Hive stores data hierarchically, keeping each partition key in a separate directory, so my idea was to retrieve all partition keys from the file path. That was my scenario. However, in some cases it is quite useful to have direct access to Cascading's FlowProcess.

Unfortunately, there is no simple way to access that object; at least I have not found any official reference in the API.
The good news is that Scalding has a kind of loophole through which we can pull out a reference to the FlowProcess object, and quite legitimately, I must say :). Here is an example:


//configuring source and simple map
TSV(files).read
  .map('line ->('cookieId,'segName,'year,'month,'day,'hour)) { line =>

    // create a fake Stat object; the values it is constructed with do not matter
    val hfp = Stat("123","123").flowProcess.asInstanceOf[HadoopFlowProcess]
    val mis =  hfp.getReporter().getInputSplit.asInstanceOf[MultiInputSplit]
    val fs = mis.getWrappedInputSplit.asInstanceOf[FileSplit]
      
    // my case requires the filename, but FlowProcess exposes dozens of other useful methods
    val fileName = fs.getPath.toString
    ...


An important note: this approach is applicable only to file-based sources!

2. Distributed Cache in Scalding
Needless to say, sometimes there is specific data we would like to access from MapReduce jobs. It might be an external config file or just a big input parameter. Accessing such data through the distributed cache dramatically improves performance.


//somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
    
// this value can be passed to any Scalding job, for instance via the Args object
val fileName = fl.path
...  
class MyJob(args: Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
   ...
   TSV(args("input")).read.map('line -> 'word ) {
                 line =>  ... /* using data json object*/ ... }
}


3. Hadoop Counters
Another important thing is accessing Hadoop counters. Counters are a very nice way to collect business statistics from data; the developer just has to place the counters in the code.
Counter usage in Scalding is simple if you know what it looks like. Aha, the Stat object might be used to update counters:



Stat("jdbc.call.counter","myapp").incBy(1)


P.S.
Aside from all that, I just can't get over it: today the Ukrainian President delivered a historic speech at the signing of the EU Association Agreement (06/27/2014). It was awesome; he is the first president I am proud of. Slava Ukraini!

Friday, May 23, 2014

Monolithic Design?

A nice definition of MonolithicDesign, taken from http://c2.com/cgi/wiki?MonolithicDesign:

Characteristics of MonolithicDesign:
  • Functionality implemented by part of the system cannot be reused without using the entire system.
  • To make one part of the system work, other parts must be "tricked" by using them, even if they aren't logically needed. (For example, you might need to "pump" a file reader, even if your data is coming from another source.)
  • Initialization of the system may be tricky or laborious.
  • Change to the control flow is impossible.
  • The only escape from MonolithicDesign is to spend months refactoring and rewriting the system into independent modules.

Factors leading to MonolithicDesign:
  • A bad sense of aesthetics. (This above all.)
  • Procrastination of refactoring.
  • Premature optimization, especially a tendency to performance perfectionism or Puritanism.
  • Not writing for reuse.
  • Tunnel vision or attachment that limits your vision to one architecture, one flow paradigm, one memory management technique, etc.
How to prevent MonolithicDesign:
  • Code for survivability, not optimal fit. The more perfectly something is adapted to its environment, the less it can tolerate change in that environment. When you find yourself expending insane effort to maintain a perfectly static environment for your perfectly adapted code, you are probably dealing with MonolithicDesign. When you write modules that can be used independently from each other in varying architectural contexts, you are protected from MonolithicDesign.
  • Refactor often, and focus on eliminating dependencies. Examine the relevance of every module that you are forced to use.
  • Take advantage of opportunities to work with a variety of paradigms and techniques, so that you learn to recognize and eliminate unnecessary limitations in module functionality.
  • Practice proactive laziness; i.e., expand your vocabulary, not just your repertoire. Developer 1 writes a program that must perform task X. Developer 1 writes the program and says, "Now I know how to write programs that do X;" he has expanded his repertoire. Developer 2 writes a program that must perform task X. Developer 2 writes a module to do task X, uses it in his program, and says, "Now I have a module that does X." Developer 2 has expanded his vocabulary, because now he can accomplish X by invoking the name of his module. When developer 1 needs to write a new program that does X, he will be tempted to tack the functionality onto his first program, bloating and complicating that program and starting the trend toward MonolithicDesign.
  • ReduceCoupling

Monday, May 5, 2014

Do you know Unicode?

Sometimes we work with simple things thinking that they are simple. However, they are horribly difficult, and their seeming simplicity keeps us in the dark. Do you know Unicode? Or do you really understand what it means for a developer that Java strings support UTF-16?

Here is a simple test:
1) Can a Java Character represent any Unicode symbol (code point)?
2) Do you know what surrogate character is?
3) What will this app print?
          String s = new String(Character.toChars(0x10FFFF));
          System.out.println( s.length() );

If you answered, respectively, Yes, No and "1", then you definitely should read up on this topic ASAP.

Our world is imperfect and many models are ideal only in our minds. I guess you know the famous law of leaky abstractions. Unicode is just an abstraction whose implementations, UTF-8 and UTF-16, despite being spread broadly around the world, do not provide a solid shield over implementation details. We have to keep them in mind.

First of all, let's distinguish between the string representation as a byte stream, which might use any encoding, and the internal representation of String data inside the JVM and a Java application. Here we are considering the latter and what it means for the developer.

We all know how UTF-8 works: a Unicode symbol, or rather a code point, may occupy a variable number of bytes. However, many developers do not know that UTF-16 has the same issue. They are lulled by the fact that almost all well-known Unicode symbols fit inside 16 bits (the Basic Multilingual Plane, or BMP).
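To make the variable-width point concrete, here is a small sketch (the class and method names are mine) that prints how many UTF-8 bytes and UTF-16 chars a few symbols occupy:

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {
    // prints the UTF-8 byte count and UTF-16 char count of a single symbol
    static void inspect(String symbol) {
        int bytes = symbol.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("UTF-8 bytes: " + bytes + ", UTF-16 chars: " + symbol.length());
    }

    public static void main(String[] args) {
        inspect("a");                                    // U+0061: 1 byte,  1 char
        inspect("\u00E9");                               // U+00E9 'é': 2 bytes, 1 char
        inspect("\u20AC");                               // U+20AC '€': 3 bytes, 1 char
        inspect(new String(Character.toChars(0x1F600))); // U+1F600 emoji: 4 bytes, 2 chars
    }
}
```

Note the last line: the emoji is variable-width in UTF-8 *and* in UTF-16, which is exactly the trap discussed next.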

Almost all is not all! There are Unicode symbols that require two 16-bit characters. That is why the example above returns 2. Compare it with this one, which represents an ordinary symbol:

        System.out.println((new String(Character.toChars(0x0061))).length()); // latin 'a'
This example returns 1.

What does this mean for the developer? It means that, strictly speaking, Java UTF-16 strings are not directly indexable in the general case. Well, almost all usual applications that process text, from character counters to word processors, are... wrong, or rather might be wrong on some exotic input text.
So when developing Java apps to process arbitrary text, we have to keep the following in mind:
  • there is an extended Java API to work with code points;
  • some Unicode characters are supplementary (i.e. greater than U+FFFF). In UTF-16 a supplementary character is represented as a pair of char values: the first from the high-surrogate range (\uD800-\uDBFF), the second from the low-surrogate range (\uDC00-\uDFFF);
  • it is important to respect surrogate characters and check for them at runtime with Character.isHighSurrogate/Character.isLowSurrogate:
System.out.println(Character.isHighSurrogate((new String(Character.toChars(0x0061))).charAt(0)));  // char 'a', returns false
String str = (new String(Character.toChars(0x10FFFF)));
System.out.println(Character.isHighSurrogate(str.charAt(0))); // returns true
System.out.println(Character.isLowSurrogate(str.charAt(1)));  // returns true 
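Building on the surrogate checks above, here is a minimal sketch of the code-point API mentioned in the first bullet (the class name is mine):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+10FFFF is the highest valid code point; in UTF-16 it needs a surrogate pair
        String s = new String(Character.toChars(0x10FFFF));

        System.out.println(s.length());                      // 2 (UTF-16 chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code points)
        System.out.println(s.codePointAt(0) == 0x10FFFF);    // true: reads the whole pair

        // iterating by code points instead of chars never splits a surrogate pair
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(i)).toUpperCase());
        }
    }
}
```

The `i += Character.charCount(...)` step is the key trick: it advances by 2 for supplementary characters and by 1 for everything else.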

There is also UTF-32, which truly allows representing any code point. Another benefit is that code points are directly indexable. The downside is that it requires a lot of memory: indeed, twice the memory of ordinary UTF-16 text. Unfortunately, the memory issue is not the biggest problem. The Java language was developed in times when trees were taller and people thought that 16 bits would be enough to store everything. UTF-16 is integrated quite deeply into Java, so we are hostages.

Like many people around, I expect that some day UTF-32 will be used everywhere... unless we join the Galactic Federation and new code points arrive.

Monday, March 10, 2014

Why I like Scala (part 2)

It sometimes happens that the nature of an ordinary feature is more interesting and trickier than it appears at first glance.
Let's get back to the using function example from the previous post and reconsider it:

val resource : Closable = ... // take closable resource
using(resource){
  // return something meaningful
}

Many of you might have noticed that the implementation based on the Closable trait did not support sugar like accessing the resource variable inside using's code block without an external variable. The problem lies in the fact that using is an ordinary curried function, and there is no obvious and simple way to share arguments between curried calls. Well, let's add some sugar to the using implementation.

Attempt 1
The first idea that comes up, and the right one, is to use a lambda like this:
def using[B]( a : Closable)(code : Closable => B) = {
    try{
      code(a)
    }
    finally{
      a.close
    }
  }

Then we might use it this way:
using(getResourceFromSomewhere()){ resource =>
  // doing something with resource
}
The problem is that the type of resource is Closable, so close is the only method we can call there. Verdict: unacceptable.

Attempt 2
Let's expose some type information to the using function.

def using[A <: Closable, B]( a : A)(code :A => B) = {
    try{
      code(a)
    }
    finally{
      a.close
    }
  }
Generics add some fog here; this solution works nicely for objects that already implement the Closable trait. But what if we try to "protect" foreign objects that know nothing about Closable?
The previous post used implicit boxing. What if we used it again?

implicit def transform(r : InputStream) = new Closable{
  override def close(){
    r.close();
  }
}

...
using(new FileInputStream("/myfile")){ res =>
  // read from the resource
}
Alas, the variable res will have type Closable again.
Verdict: unacceptable.

Attempt 3
Eventually the idea is to provide the entire type information to our using function without narrowing, and... to add a dedicated implicit conversion function.
def using[A, B]( a : A)(code :A => B)(implicit convertor :A => Closable) = {
    try{
      code(a)
    }
    finally{
      convertor(a).close
    }
  }

Aha, now it works:

implicit def transform(r : InputStream) = new Closable{
  override def close(){
    r.close();
  }
}

...
using(new FileInputStream("/myfile")){ res =>
  // read from resource
  res.read()
}


The last thing is to add an implicit conversion from Closable to Closable. Indeed, calling using with a Closable instance will not work because there is no implicit conversion in the current scope. So we have to add a default implicit implementation like this:

implicit def trivialConvertor(a: Closable) = a


Tadam. There are many ideas for further improving the using feature, but let's move on to something else. Stay tuned!

Monday, February 3, 2014

A Tale of a Server and Its Friends Rackspace and Hazelcast

Once upon a time there lived a Java server. A simple little server, it loved people dearly and every day gave them joy and sessions. One day evil people decided to beat the server up and agreed to attack it exactly at midnight. Many were the evil people, but the server had even more friends. The evil people did not know that the server was no simple server but a cloud one. Exactly at midnight the mages from Rackspace cast a cloning spell, and our server got 50 clones. Then the good fairy Hazelcast shared state among all the clones with the powerful multicast spell. The spell, by the way, was so powerful that state was shared among the clones without the help of any additional servers. And the hefty knight Load Balancer flexed his muscles, fairly spreading the load across all the nodes, sparing the weak and loading the strong.

Help arrived just in time, and as soon as the last spell was cast, the faces of a thousand evil users became visible. They created sessions like mad, tearing and gnawing at connections with their teeth. But their teeth proved powerless, their strength proved weak, and they retreated in disgrace.
And there was no server more content that day than ours. The clones threw a party for the ages. They caroused for several days, and the load balancer, having drunk his fill, handed out sessions to everyone for free.

In the end the friends left the server their cloning spells from Rackspace and the lightweight data sharing from Hazelcast, and from that time on the server lived on in peace, keeping its enemies at bay.

Friday, January 31, 2014

AutoMapper for Java

In a nutshell, an ordinary web service takes entities from a datasource (our domain model) and transforms them into DTO objects (our view models), or vice versa. This is a quite popular approach because it allows evolving the domain and view models separately. It is also flexible, because one entity might have more than one narrow view model depending on needs. This is a direct way to the CQRS approach, where the read and write models are different, at least from the consumer's perspective.

I do not consider here the alternative approach where domain entities are mapped directly into JSON or XML. It is inextensible and ugly from my perspective, however quite attractive and simple.

Well, as mentioned, I prefer the first approach, and if you have used it as well, then you have definitely had to develop a converters layer. This is the most annoying and disappointing part, and the place where bugs live in prosperity.
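To show where the pain comes from, here is a hand-written converter sketch (the classes are hypothetical): even two fields already mean mechanical copying code, and a real project has dozens of such entity/DTO pairs.

```java
// A hypothetical domain entity and its view model
class User {
    String firstName;
    String email;
    String creditCard; // must NOT leak into the DTO
}

class UserDTO {
    String firstName;
    String email;
}

public class UserConverter {
    // the boring, repetitive layer that mapping libraries aim to eliminate
    static UserDTO toDto(User user) {
        UserDTO dto = new UserDTO();
        dto.firstName = user.firstName;
        dto.email = user.email;
        return dto;
    }

    public static void main(String[] args) {
        User user = new User();
        user.firstName = "John";
        user.email = "john@example.com";
        user.creditCard = "1111-2222-3333-4444";

        UserDTO dto = toDto(user);
        System.out.println(dto.firstName + " " + dto.email);
    }
}
```

Forget to update `toDto` after adding a field to both classes, and nothing complains at compile time; that silence is exactly where the bugs prosper.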

Back in .NET I liked the well-known library AutoMapper: a lightweight, smart converter which allows transparent and nearly codeless converting between domain and view models. For many cases it was the Holy Grail.
I expected a similar lib for Java as well, and yes, I found it: ModelMapper.

Consider this example:

class User { 
  String firstName;
  String email;
  String creditCard;
}
 
class UserDTO{  
  String firstName;
  String email;
}

And let's see how it might be transformed:
User user = ...
ModelMapper modelMapper = new ModelMapper();
UserDTO userDTO = modelMapper.map(user, UserDTO.class);


Think it is trivial? Well, how about this?
class User {
  Date birthday;
  String firstName;
  Address address;
}
class Address {
  String country;
  String street;
}
 
class UserDTO{
  long birthday;
  String firstName;
  String addressStreet;
}

And solution:
Converter<Date, Long> timeConvertor = new AbstractConverter<Date, Long>() {
  protected Long convert(Date dt) {
    return dt.getTime();
  }
};
 
ModelMapper modelMapper = new ModelMapper();
 
modelMapper.addConverter(timeConvertor);
 
UserDTO userDTO = modelMapper.map(user, UserDTO.class);

Going further, this library is smart enough to handle ambiguity, skipping, different naming strategies, mapping in advance, and so on.

Well, the library is attractive; however, there is no magic: reflection is used under the hood, so be aware.


Saturday, January 25, 2014

Why I like scala (part 1)

I bet many developers hate the way Java has been evolving over the years. Take, for instance, the resource-safe cleanup feature: the Java committee was still discussing whether to add it or not while C# already had the nice language keyword using. You might say: hey, Java 7 already has it. Right, however there are many other pending features that still wait to be added.
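For reference, this is roughly what the feature finally looks like in Java 7's try-with-resources: any AutoCloseable resource is closed automatically when the block exits, even on exception.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TryWithResourcesDemo {
    public static void main(String[] args) throws IOException {
        // the reader is closed automatically when the block exits,
        // even if readLine() throws
        try (BufferedReader reader = new BufferedReader(new StringReader("hello"))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Keep this shape in mind; the Scala versions built below reproduce the same guarantee with plain library code instead of a keyword.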

However, I think the problem is not only in the Java committee. The problem is in the language's evolution and key grammar rules. Java is not a scalable language: almost every feature requires changes to the Java grammar. So many keywords, and the number may yet grow. Just take a look at C#. I love C#, however the number of keywords makes me disappointed.

Well, there is another set of languages where there is no need to modify the grammar to add something interesting; moreover, there are languages which can build abstractions so high-level that even their creators could not have expected such usage. These languages include C++, Haskell and Scala (this list is not complete).

Scala has no resource-safe cleanup feature and never will, because you can build it on your own.
Let's develop this feature ourselves.

Version 1. (through a dedicated trait)
Let's use some trait Closable with just one operation, close. The implementation might look like this:


trait Closable {
  def close(): Unit
}

object Using {
  def apply[A <: Closable, B](closeable: A)(task: => B) = {
    try {
      task
    } finally { closeable.close() }
  }
}

This version is simple and straightforward (it just uses currying and by-name parameters), however it requires inheritance from the Closable trait.

Here is usage example:

val resource : Closable = ... // take closable resource

Using(resource){
  // doing something with resource 
  // being confident that it will be closed
}

// ha-ha, you can even use it this way
val resource2 : Closable = ...

val ret = Using(resource2){
  // return something meaningful
}

This implementation is simple, however it requires all involved resources to support the Closable contract. You might think that this is a serious limitation. Well, it is not. Thanks to implicit boxing we can substitute any external resource with a proper boxing object:

implicit def transform(r : InputStream) = new Closable{
  override def close(){
    r.close();
  }
}

val resource = new FileInputStream("/myfile")

Using(resource ){
  // read from the resource
}

Version 2. (through duck typing)
Scala has a quite interesting feature: anonymous structural types. A new implementation based on structural types can accept a resource of any type (!) with only one requirement: it should have a close method.
object Using {

  def apply[A <: { def close(): Unit }, B](closeable: A)(task: => B) = {
    try {
      task
    } finally { closeable.close() }
  }
}
However, despite this approach being pretty attractive, please be aware that structural types are implemented via reflection, so they might affect performance a bit.

One more amazing example is synchronized sections. Java has the special keyword synchronized; in Scala it is just an ordinary method of the AnyRef class.
God bless functions and by-name parameters.
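For contrast, here is a minimal sketch of the Java side (the class and field names are mine), where synchronized is a grammar-level keyword rather than a library method:

```java
public class SyncDemo {
    int counter = 0;
    private final Object lock = new Object();

    void increment() {
        // the keyword-based critical section Scala replaces with a plain method call
        synchronized (lock) {
            counter++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SyncDemo demo = new SyncDemo();
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) demo.increment();
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(demo.counter); // 20000: no increments are lost
    }
}
```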




Friday, January 24, 2014

Clustering Java applications

An attempt to explain two quite different approaches, based on Terracotta and Hazelcast. Enjoy!

MongoDB Distilled

OMG, it was about one year ago.

Spring AOP Introduction

My presentation from Summer Odessa JUG


Continuous DB Migration

My presentation from the latest Odessa Java group meeting.

Useful Reading. Issue 2.

  • CQRS Journey - http://msdn.microsoft.com/en-us/library/jj554200.aspx
A wonderful journey with a clear explanation of what Domain-Driven Design, CQRS and Event Sourcing are. The book shows a clean and understandable way to develop an application with complex business logic and a composite domain.
  • Developing Multi-tenant Applications for the Cloud
The topic of multi-tenant applications is becoming quite popular, and many customers require tenant support in their applications. Given how cloudy things have been lately, working with tenants can indeed be made elegant and scalable.

  • Microsoft Application Architecture Guide - http://msdn.microsoft.com/en-us/library/ff650706.aspx
An excellent description of what software architecture documents should talk about, and which key decisions may be on the table before a project starts. Despite the Microsoft prefix, I recommend the book to Java infidels as well: it deals with fundamental things, and analogous technologies exist in Java too.
  • UI patterns - http://ui-patterns.com/
A good and systematic collection of various UI patterns.

  • Writing Great Unit Tests: Best and Worst Practices - http://blog.stevensanderson.com/2009/08/24/writing-great-unit-tests-best-and-worst-practises/
The difference between unit and integration testing is explained in quite plain language. Check yourself: do you know what good unit tests should look like?