Enabling parallel processing for OpenRefine: Spark vs Akka

Andrey from SpazioDati developed Refine on Spark in an attempt to process larger datasets, which is a good step. However, it fell short in some areas, and I wanted to benchmark it against another parallelism engine such as Akka.

Spark supports Akka in its core module, so Spark and Akka can interact with each other, and Akka provides a Spark template. Still, it makes more sense to choose only one. Each has its pros and cons for enabling parallel processing in OpenRefine (IMO). See also a proposed road-map to integrate OpenRefine with Akka.

OpenRefine Dataset and Operation Mapping

  • Akka: Use customized actors.
  • Spark: Use RDDs and operations over RDDs.
  • Note: Spark fits the abstractions better, but Akka's actors could be more flexible and a more natural fit for transforming the architecture from standalone to parallel.
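
To make the two mapping styles concrete, here is a minimal JDK-only sketch. It uses neither Akka nor Spark: a parallel stream stands in for a transformation declared over a whole dataset (the shape an RDD encourages), and a thread with a mailbox queue stands in for an actor. The class name, the method names and the clean() operation are all hypothetical and are not part of OpenRefine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// JDK-only analogy (assumption): no Akka or Spark API is used here.
class MappingSketch {

    // Hypothetical stand-in for a Refine cell operation: trim and lower-case.
    static String clean(String cell) {
        return cell.trim().toLowerCase(Locale.ROOT);
    }

    // Dataset style: one transformation is declared over the whole
    // collection and the runtime decides how to split the work.
    static List<String> datasetStyle(List<String> rows) {
        return rows.parallelStream()
                   .map(MappingSketch::clean)
                   .collect(Collectors.toList());
    }

    // Actor style: a worker owns a mailbox and handles one message at a
    // time; callers only ever send it messages.
    static List<String> actorStyle(List<String> rows) {
        final String stop = new String("STOP"); // sentinel, compared by identity
        BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
        List<String> results = new ArrayList<>();
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String msg = mailbox.take();
                    if (msg == stop) {
                        break;
                    }
                    results.add(clean(msg));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            for (String row : rows) {
                mailbox.put(row);
            }
            mailbox.put(stop);
            worker.join(); // results are safe to read once the worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

Both methods return the same cleaned rows in input order. The difference is in who drives the work: in the dataset style the existing operation has to be re-expressed as a transformation over the collection, while in the actor style the existing standalone code can stay mostly as-is behind a mailbox.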

 

Real time data Processing

  • Akka: No easy way to do this in OpenRefine.
  • Spark: Works with the Kinesis connector or other streaming components.
  • Note: Most data cleansing / wrangling runs in batch mode. Do we want to make Refine a real-time processing engine?

 

OpenRefine Community friendly

  • Akka: Better. The Akka integration is less intrusive, and community users can more easily opt in to or out of it. The UI also stays the same.
  • Spark: A different data processing workflow, which means it is more intrusive.

 

Apache Camel Integration

  • Akka: Has an Apache Camel module, so data ingestion and delivery can run concurrently.
  • Spark: N/A
  • Note: Splitting and merging data files should be managed in the OpenRefine Akka integration.
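
As a rough illustration of the split and merge step mentioned in the note, the following JDK-only sketch splits rows into chunks, processes the chunks concurrently on a thread pool, and merges the results back in the original order. It does not use Akka or Camel; the class, the method names, the chunk size and the upper-casing operation are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// JDK-only sketch (assumption): a thread pool stands in for a set of workers.
class SplitMergeSketch {

    // Split the rows into consecutive chunks of at most chunkSize rows.
    static <T> List<List<T>> split(List<T> rows, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(i, end)));
        }
        return chunks;
    }

    // Hypothetical per-chunk operation: trim and upper-case every cell.
    static List<String> processChunk(List<String> chunk) {
        List<String> out = new ArrayList<>(chunk.size());
        for (String cell : chunk) {
            out.add(cell.trim().toUpperCase(Locale.ROOT));
        }
        return out;
    }

    // Submit every chunk to the pool, then merge results in submission
    // order so the output rows line up with the input rows.
    static List<String> run(List<String> rows, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (List<String> chunk : split(rows, chunkSize)) {
                pending.add(pool.submit(() -> processChunk(chunk)));
            }
            List<String> merged = new ArrayList<>(rows.size());
            for (Future<List<String>> future : pending) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, SplitMergeSketch.run(rows, 1000, 4) would process the rows in chunks of 1000 on four threads. Merging by submission order rather than completion order is what keeps row positions stable, which matters if the merged output has to match the rows of the original project.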

 

Ability to consume other third-party services

  • Akka: The ways to interact with other services will be the same as before.
  • Spark: Needs some extra work to integrate these services with Spark.

 

Easily expose REST services to third-party services

  • Akka: Spray provides this ability.
  • Spark: To be investigated.

 

Scalability for big file processing

  • Akka: Scale up and scale out. The work unit is the server.
  • Spark: The work unit can be a Spark cluster or a streaming component.

 

 

Can you think of another parallelism engine for Refine? Did I miss a key point in my analysis? The comment section is yours!
