Improve OpenRefine’s extensibility with data repository and processing services.
Martin Magdinier | 26 January 2016
In September 2015, we submitted the following application to the Knight News Challenge: How might we make data work for individuals and communities? We are cross-posting it here for the archive. You can also consult it directly on the Knight News Challenge website.
Improve OpenRefine’s extensibility with data repository and processing services.
In one sentence, describe your idea as simply as possible.
Because data cleaning and preparation is a critical part of the data-management process, we want to improve OpenRefine’s connectivity by offering non-technical users a seamless interface to data repositories and processing services (often API-based).
The Proposition
Over the past few years, with the proliferation of available data, data has become the centerpiece of analytics, system migration, and reconciliation processes. However, data cleaning and preparation (which includes data normalization, duplicate removal, pivoting, joining, and splitting data) is still a major hurdle in the process, and the tools available to end users haven’t fully caught up. Spreadsheets offer an entry-level interface to the data but are time-consuming and don’t scale, while programming languages offer flexibility but have a steep learning curve for non-technical people.
OpenRefine addresses the growing data literacy gap by lowering the technical skills needed to normalize and prepare data. OpenRefine empowers those who understand the context in which the data are generated or used by offering them the best of both worlds with an iterative interface for data discovery and preparation and an easy-to-learn scripting language.
We believe OpenRefine can be the WordPress of data processing. Over the last 15 years, WordPress has democratised website building, moving the web industry from web developers crafting pages to website assemblers and editors who write content and build a website by extending a core base with plugins and extensions in a WYSIWYG interface. OpenRefine is the platform that will power the same shift in the data industry, from developers crafting custom code for data cleaning and processing to builders of data processes who use connectors in a point-and-click interface.
Thanks to OpenRefine, citizens with in-depth knowledge of, or interest in, a specific issue can:
- explore open data related to the topic, drill down to get a sense of the information available, and find nuggets of information, inconsistencies, or gaps.
- clean and export the data to a format useful for their needs by normalizing data, removing duplicates and typos, and pivoting, joining, and splitting columns.
- enrich the project by joining data sets together, processing data via an API, or working with a reconciliation service.
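To make the cleaning step above concrete, here is a minimal Python sketch of value normalization followed by duplicate removal. It is only an illustration of the idea, not OpenRefine's own implementation; the function names `normalize` and `remove_duplicates` and the sample rows are hypothetical.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, collapse whitespace, trim, and lowercase a cell value."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    collapsed = re.sub(r"\s+", " ", without_accents).strip()
    return collapsed.lower()


def remove_duplicates(rows, key):
    """Keep the first row for each normalized value found under `key`."""
    seen = set()
    unique = []
    for row in rows:
        normalized = normalize(row[key])
        if normalized not in seen:
            seen.add(normalized)
            unique.append(row)
    return unique


# Hypothetical sample rows: the first two differ only by case and spacing.
rows = [
    {"name": "Le Monde"},
    {"name": "  le   monde "},
    {"name": "The Guardian"},
]
cleaned = remove_duplicates(rows, "name")
```

In this sketch the first spelling encountered is kept as-is; a real workflow would usually let the user choose which variant becomes the canonical value.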
OpenRefine offers a solid core on which other developers can build innovative modules, much like WordPress plugins, enabling non-technical users to integrate OpenRefine into their data workflow. Through this application we are addressing three challenges facing the OpenRefine community:
- Continue to support and develop OpenRefine
- Improve OpenRefine’s extensibility
- Grow OpenRefine’s user base
Over the last five years, OpenRefine has become a major piece of software used by:
- Librarians: DST4L – LODLAM
- Journalists: NYT, Chicago Tribune, Le Monde, The Guardian
- Open Data Communities: Sunlight Foundation, OKFN
- Educators: School of Data
A growing ecosystem
In addition to the core data refining software, OpenRefine’s extension-friendly architecture allows the user community – journalists, librarians, researchers – to customize and contextualize OpenRefine within their data environment. Extensions can be broken down into four types:
- Importing data from a system
- Exporting data to a system
- Querying remote data processing services via their APIs
- Using a reconciliation service to extend your data by doing a fuzzy join with a remote master data source. Reconciliation helps to align taxonomies and import new information into your project.
The following map lists the 16 reconciliation services and 10 community-contributed plugins that work with OpenRefine, as well as projects that have heavily customized OpenRefine to fit it into their data manipulation processes.
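As a rough illustration of the fuzzy join behind reconciliation, the Python sketch below matches local values against a small in-memory master list using the standard library's `difflib`. It is an assumption-laden stand-in: a real reconciliation service is remote and uses its own matching logic, and the names `reconcile`, `similarity`, the 0.8 threshold, and the `org-…` identifiers are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def reconcile(values, master, threshold=0.8):
    """Map each value to its best-scoring master record, or None if too dissimilar."""
    results = {}
    for value in values:
        best = max(master, key=lambda record: similarity(value, record["name"]))
        score = similarity(value, best["name"])
        results[value] = best if score >= threshold else None
    return results


# Hypothetical master data source with made-up identifiers.
master = [
    {"id": "org-1", "name": "Sunlight Foundation"},
    {"id": "org-2", "name": "School of Data"},
]
matches = reconcile(
    ["Sunlight Fundation", "school of data", "Unknown Org"], master
)
```

Once a value is matched to a master record, the record's identifier and any other fields can be pulled into the project, which is how reconciliation both aligns names and imports new information.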
… which needs support to scale
Through this application we want to continue to make it easier for non-technical users to integrate OpenRefine into their data workflow by:
- making more functionality available via point-and-click menus;
- making it easier for other developers to extend the OpenRefine core and build new extensions to:
  - work with remote data processing services available today only via API,
  - connect to data repositories for data import and export;
- continuing to improve resources for users (documentation, tutorials, training programs).
1. Continue to support and develop OpenRefine
From inception until late 2012, the project had full-time engineers working on it while it was incubated by Metaweb Technologies, Inc. and Google. In October 2012, Google decided to move its engineers to other internal projects, leaving it up to the community to carry on with the project. At first, things went well. The 100% volunteer-based community got together to migrate the project from Google Code to GitHub and prepare the 2.6 release. After 10 months, the community released the OpenRefine 2.6 beta. Since then, the level of commitment has fallen, preventing a final release from being made. On the one hand, OpenRefine has a steadily growing user base and ecosystem; on the other, the stable core could benefit from an upgrade. With Qi Cui joining the project in January 2015, the 2.6 final release is now scheduled to be launched by the end of 2015.
Following on RefinePro’s current contribution, we would like to commit more developer time to:
- support the code base and take care of bug fixes,
- work with the developer community to define the roadmap and integrate proposed contributions,
- package and release new versions, and
- develop new functionality requested by the community.
2. Improve OpenRefine’s overall extensibility and create connectors with APIs and data repositories
OpenRefine is currently a stand-alone application with file-based input and output. However, data cleaning and preparation is always part of a larger process, so improving OpenRefine’s connection with applications, data repositories and processing services will help non-technical users to integrate OpenRefine into their data workflow.
OpenRefine’s extensible architecture has proven to be a real value-added feature. We think the current 10 extensions and 16 reconciliation services are only a start, and we would like to see that number grow. To this end we want to work with the community to increase the number of connectors available to enable non-technical users to:
- export or load data to a specific repository or application such as Socrata, CiviCRM, or Mapbox;
- use data processing services currently only available via API (like geocoding, translation or sentiment analysis services);
- easily set up and leverage reconciliation end points with their own data.
Connectors will offer a point-and-click experience for non-developers when integrating OpenRefine into their data projects for:
- analytics (cleaning the data before using them in machine learning or data visualization);
- migration (putting the data in the right format before loading them into a system); and
- reconciliation (joining different data sets and validating their consistency).
Our plan is to create one of each type of extension and, based on our experience and other community members’ feedback, improve the overall process (documentation, code, hooks …) for other developers.
3. Improve resources for end users and grow OpenRefine’s user base
OpenRefine has proven to be a great tool for introducing non-technical users to data projects. To help more people explore and leverage data to make a change in their community, we want to increase general awareness of OpenRefine and make it easier for new users to learn it. We want to make this happen by working along three axes:
- give presentations at conferences and webinars, and publish tutorials, to introduce OpenRefine to new audiences;
- organize more workshops and training sessions (online and in person) on OpenRefine, both ourselves and with organizations already working to improve data literacy;
- continue to develop the existing user documentation.