Some thoughts on OpenRefine and Akka integration

Following my article on enabling parallel processing for OpenRefine: Spark vs Akka, I drafted a roadmap to integrate OpenRefine with Akka.

Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. In this article I explore the possibility of integrating Akka into OpenRefine to enhance its data processing capability.

1. Separate the application layers so we can “divide and conquer”.

As a standalone web application designed to run locally, OpenRefine has a compact architecture that tightly couples the front end (Butterfly + jQuery + utility JavaScript libraries + servlets) with the back end (the core engine and the components that interact with third-party services). To enable parallel processing of large data sets, the first thing we need to do is separate the layers so that the engine itself can be moved around and serve as a unit of work.
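To make the idea of separating the layers concrete, here is a minimal sketch in plain Java. It is not OpenRefine code: `TransformEngine`, `LocalEngine` and `FrontEnd` are hypothetical names invented for illustration, and rows are simplified to strings. The point is only that the web layer talks to the engine through a narrow interface, so the implementation behind it could later be swapped for a remote or clustered one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class LayerSeparation {
    /** Hypothetical boundary: the only thing the front end knows about the engine. */
    interface TransformEngine {
        List<String> apply(List<String> rows, UnaryOperator<String> operation);
    }

    /** In-process implementation, standing in for a single-instance engine. */
    static final class LocalEngine implements TransformEngine {
        @Override
        public List<String> apply(List<String> rows, UnaryOperator<String> operation) {
            List<String> out = new ArrayList<>(rows.size());
            for (String row : rows) {
                out.add(operation.apply(row));
            }
            return out;
        }
    }

    /** Stand-in for the web layer: it holds an engine reference, not engine internals. */
    static final class FrontEnd {
        private final TransformEngine engine;

        FrontEnd(TransformEngine engine) {
            this.engine = engine;
        }

        /** Handles a hypothetical "trim every cell" request from the UI. */
        List<String> handleTrimRequest(List<String> rows) {
            return engine.apply(rows, String::trim);
        }
    }

    public static void main(String[] args) {
        FrontEnd ui = new FrontEnd(new LocalEngine());
        System.out.println(ui.handleTrimRequest(Arrays.asList(" a ", "b ")));
    }
}
```

With this shape, moving the engine to another node would mean writing a second `TransformEngine` implementation rather than touching the front end.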

2. Add an extra layer, such as a request broker, to handle more complex web service requests using Akka routing?

 

3. Replace embedded Jetty with Spray (based on Akka)

OpenRefine uses Jetty as its web server and servlet container. This is convenient, but it limits how far the application can be extended toward parallel processing and clustering across servers. Introducing Spray would streamline the transition and bring the servers into the Akka ecosystem.

Other options are possible:

  • We can have the HTTP request interact with an actor through the Akka Camel module (Jetty component).
  • Alternatively, a Mist layer can be introduced. Mist was developed to provide a direct connection between the servlet container and Akka actors, with the goal of handling incoming HTTP requests as quickly as possible in an asynchronous manner.

4. Introduce Apache Camel and the Akka Camel module

First, we propose integrating Apache Camel with OpenRefine to broaden its raw data retrieval capability through the many components Apache Camel provides. More on this soon in a separate post.

Second, we will take this a step further to fit the concurrent paradigm. The main problem to solve is finding an efficient way to split the data set without breaking data integrity, because the format, size and separator are unpredictable when the raw data is retrieved. The Akka Camel module will help here.
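As an illustration of the splitting problem, here is a minimal sketch in plain Java. It does not use Camel or Akka, and `RecordChunker` is a hypothetical name. It assumes the simplest possible case: records are separated by a newline and no field contains an embedded newline (quoted multi-line CSV fields would break this). Each chunk aims for a target size in characters but is always extended to the next record boundary, so no record is cut in half.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordChunker {
    /**
     * Splits raw text into chunks of roughly targetSize characters.
     * A chunk never ends in the middle of a record: it is extended
     * up to and including the next '\n' (assumed record separator).
     */
    static List<String> split(String raw, int targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < raw.length()) {
            int end = Math.min(start + targetSize, raw.length());
            if (end < raw.length()) {
                // Search from end - 1 so a chunk that already ends on a
                // separator is not extended into the following record.
                int separator = raw.indexOf('\n', end - 1);
                end = (separator == -1) ? raw.length() : separator + 1;
            }
            chunks.add(raw.substring(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = split("a,1\nb,2\nc,3\n", 5);
        System.out.println(chunks.size() + " chunks");
    }
}
```

Concatenating the chunks in order gives back the original input, which is one simple integrity check a real splitter would want to keep.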

5. Migrate from a single-instance transformation engine to a cluster engine built on top of Akka

The main goal is to convert the OpenRefine core engine into Akka actors. Since the front end is separated out (at stage 1), the end user can interact with the UI without thinking about what is happening at the back end. Every operation defined by the user is communicated to the other nodes in the form of a message, and each node applies the operation when it receives it.
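The message-passing idea can be sketched without Akka at all. The plain-Java example below is a stand-in, not Akka's API: `ApplyOperation`, `WorkerNode` and `broadcast` are hypothetical names, and a single-threaded executor plays the role of an actor's mailbox so that each node handles one message at a time, in order. Each node owns its own partition of rows and applies whatever operation it receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

final class OperationMessaging {
    /** Immutable message describing one user-defined operation (hypothetical shape). */
    static final class ApplyOperation {
        final String name;
        final UnaryOperator<String> transform;

        ApplyOperation(String name, UnaryOperator<String> transform) {
            this.name = name;
            this.transform = transform;
        }
    }

    /** Stand-in for a worker node: owns one partition, processes messages one by one. */
    static final class WorkerNode {
        private final List<String> partition;
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

        WorkerNode(List<String> rows) {
            this.partition = new ArrayList<>(rows);
        }

        /** Fire-and-forget send; the operation is applied later on the mailbox thread. */
        void tell(ApplyOperation op) {
            mailbox.execute(() -> partition.replaceAll(op.transform));
        }

        /** Stops accepting messages, waits for queued ones, then returns a copy of the partition. */
        List<String> stopAndCollect() {
            mailbox.shutdown();
            try {
                if (!mailbox.awaitTermination(5, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("worker did not finish in time");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for worker", e);
            }
            return new ArrayList<>(partition);
        }
    }

    /** Sends the same operation to every node. */
    static void broadcast(List<WorkerNode> nodes, ApplyOperation op) {
        for (WorkerNode node : nodes) {
            node.tell(op);
        }
    }

    public static void main(String[] args) {
        List<WorkerNode> nodes = Arrays.asList(
                new WorkerNode(Arrays.asList(" x ", "y")),
                new WorkerNode(Arrays.asList(" z")));
        broadcast(nodes, new ApplyOperation("trim", String::trim));
        broadcast(nodes, new ApplyOperation("upper", String::toUpperCase));
        for (WorkerNode node : nodes) {
            System.out.println(node.stopAndCollect());
        }
    }
}
```

In an actual Akka version, each `WorkerNode` would presumably become an actor and `ApplyOperation` a serializable message, which is what would let the nodes live on different machines.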

 

Do you have experience with Akka? Does this roadmap make sense? I am looking forward to reading your comments!