Enabling parallel processing for OpenRefine: Spark vs Akka

Qi Cui | 14 April 2015

Andrey from SpazioDati developed Refine on Spark in an attempt to process larger datasets, which is good. However, it fell short in some areas, and I wanted to benchmark it against another parallelism engine such as Akka.

Spark supports Akka in its core module, so the two can interact with each other, and Akka provides a Spark template. But it makes more sense to choose only one. If we want to enable parallel processing for OpenRefine, each has its pros and cons (IMO). See also a proposed road-map to integrate OpenRefine with Akka.

OpenRefine Dataset and Operation Mapping

Akka: Use customized Actors.
Spark: Use RDDs and operations over RDDs.
Note: Spark fits the abstractions better. But Akka's actors could be more flexible and more natural for transforming the architecture from standalone to parallel.
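
To make the two mappings a bit more concrete, here is a minimal sketch in plain Python. It uses only the standard library, so it is an analogy and not Akka or Spark code. The names `RowActor`, `trim_cells`, `run_with_actors` and `run_with_partitions` are all hypothetical, and the whitespace-trimming function merely stands in for a Refine operation. The actor version sends each row as a message to a worker that owns a mailbox; the dataset version splits the rows into partitions and maps the operation over them, which is roughly the shape an RDD-based design would take.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def trim_cells(row):
    """Hypothetical stand-in for a Refine operation: strip every cell."""
    return [cell.strip() for cell in row]


class RowActor(threading.Thread):
    """Actor-style worker: owns a mailbox, handles one message at a time."""

    def __init__(self, operation):
        super().__init__()
        self.mailbox = queue.Queue()
        self.operation = operation
        self.results = []

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:  # stop signal
                break
            index, row = message
            self.results.append((index, self.operation(row)))


def run_with_actors(rows, operation, n_actors=2):
    """Send rows round-robin to the actors, then merge results by row index."""
    actors = [RowActor(operation) for _ in range(n_actors)]
    for actor in actors:
        actor.start()
    for index, row in enumerate(rows):
        actors[index % n_actors].mailbox.put((index, row))
    for actor in actors:
        actor.mailbox.put(None)
        actor.join()
    merged = sorted(pair for actor in actors for pair in actor.results)
    return [row for _, row in merged]


def run_with_partitions(rows, operation, n_partitions=2):
    """Dataset-style: split into contiguous partitions and map over each."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        mapped = pool.map(lambda part: [operation(r) for r in part], partitions)
        return [row for part in mapped for row in part]
```

In the actor version the existing operation is only wrapped and called from inside the worker, which is why I consider that route less intrusive. The partition version is shorter, but it assumes the whole dataset is already expressed as something that can be split and mapped over.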

Real time data Processing

Akka: No easy way to do this in OpenRefine.
Spark: Works with the Kinesis connector or other streaming components.
Note: Most data cleansing / wrangling is done in batch mode. Do we want to make Refine a real-time processing engine?

OpenRefine Community friendly

Akka: Better. Akka integration is less intrusive, and community users can more easily opt in to or out of it. The UI will also stay the same.
Spark: A different data processing workflow, which means it is more intrusive.

Apache Camel Integration

Akka: Has an Apache Camel module, so it allows data ingestion and delivery to run concurrently.
Spark: N/A
Note: Splitting and merging data files should be managed in the OpenRefine Akka integration.
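
As a rough illustration of the point above, the sketch below splits a file's lines into chunks, ingests them on one thread while another thread processes and delivers them, and merges the output back in the original order. It is plain Python with the standard library only and does not use Akka or Camel; `split_into_chunks`, `ingest`, `deliver` and `run_pipeline` are hypothetical names, and the chunk size and queue size are arbitrary assumptions.

```python
import queue
import threading


def split_into_chunks(lines, chunk_size):
    """Yield (chunk_id, chunk) pairs so chunks can be merged back in order."""
    for start in range(0, len(lines), chunk_size):
        yield start // chunk_size, lines[start:start + chunk_size]


def ingest(lines, chunk_size, channel):
    """Ingestion side: push chunks onto the channel, then a stop signal."""
    for chunk in split_into_chunks(lines, chunk_size):
        channel.put(chunk)
    channel.put(None)


def deliver(channel, transform, out):
    """Delivery side: transform each chunk as soon as it arrives."""
    while True:
        item = channel.get()
        if item is None:  # stop signal from the ingestion side
            break
        chunk_id, chunk = item
        out[chunk_id] = [transform(line) for line in chunk]


def run_pipeline(lines, transform, chunk_size=2):
    """Run ingestion and delivery concurrently, then merge by chunk id."""
    channel = queue.Queue(maxsize=4)  # bounded, so ingestion cannot race ahead
    out = {}
    producer = threading.Thread(target=ingest, args=(lines, chunk_size, channel))
    consumer = threading.Thread(target=deliver, args=(channel, transform, out))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return [line for chunk_id in sorted(out) for line in out[chunk_id]]
```

For example, `run_pipeline(["a", "b", "c"], str.upper)` returns `["A", "B", "C"]`. Keeping the chunk id next to each chunk is what makes the merge step trivial, and that bookkeeping is the part I would expect the integration to own.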

Ability to consume other third-party services

Akka: The ways to interact with other services will be the same as before.
Spark: Needs some extra work to integrate.

Easily expose REST services to third-party services

Akka: Spray provides this ability.
Spark: To be investigated.

Scalability for big file processing

Akka: Scale up and scale out. The work unit is the server.
Spark: The work unit can be a Spark cluster or a streaming component.

Can you think of another parallelism engine for Refine? Did I miss a key point in my analysis? The comment section is yours!
