RefinePro Technology Stack

There is no silver bullet for building an efficient data acquisition system, and we always adapt our approach and architecture to your needs. RefinePro selected the following technologies because they are flexible enough to support the variety and ever-changing formats of data sources while lowering the effort needed to maintain our processes. We strive to limit how much technical consultants need to be involved in supporting and maintaining our workflows. Our stack encompasses three stages:

  1. Data Collection and Acquisition
  2. Data Integration and Normalization
  3. Data Enrichment

Data Collection and Acquisition

Web scraping and harvesting represent a significant portion of any third-party data acquisition system. RefinePro selected Visual Web Ripper and Content Grabber (both published by Sequentum) as our go-to solutions. Both programs provide a point-and-click interface to build and maintain scrapers while supporting any website format, including search forms and highly dynamic websites that use AJAX or CAPTCHAs.

Visual Web Ripper is the first web scraping solution developed by Sequentum and suits simple, low-volume projects for small businesses and individual users. Content Grabber, released in 2017, offers advanced debugging, logging, agent management, error handling, and error recovery.

RefinePro has experience building, supporting and orchestrating hundreds of web scraping jobs using both platforms.

Extracting data from PDFs can be challenging. RefinePro uses Docparser and its API to integrate the PDF extraction task directly into our workflow. When a file format changes, we use the Docparser user interface to quickly and easily update a parser's settings without the need for coding skills. For irregularly structured PDF files, we use custom libraries to extract the information needed.

Open data and subscription data are often made available in CSV, XML or JSON format via public pages, secure servers or APIs, and do not present any particular challenge to retrieve. In those cases, RefinePro uses its data integration tool suite.
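
To illustrate why these formats are straightforward to handle, here is a minimal Python sketch that turns CSV, JSON or XML text into the same list-of-records shape. It is not RefinePro's tooling: the function name `parse_records` is hypothetical, and the XML branch assumes that each child of the root element is one record whose sub-elements are its fields.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_records(payload: str, fmt: str) -> list:
    """Parse CSV, JSON, or XML text into a list of flat dictionaries."""
    if fmt == "csv":
        # The first line is treated as the header row.
        return [dict(row) for row in csv.DictReader(io.StringIO(payload))]
    if fmt == "json":
        data = json.loads(payload)
        # Accept either a single object or an array of objects.
        return data if isinstance(data, list) else [data]
    if fmt == "xml":
        root = ET.fromstring(payload)
        # Assumption: each child of the root is one record and
        # each of its sub-elements is one field.
        return [
            {field.tag: (field.text or "").strip() for field in record}
            for record in root
        ]
    raise ValueError(f"unsupported format: {fmt}")
```

Once every source yields the same record shape, the downstream integration and normalization steps do not need to know which format the data arrived in.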

Data Integration and Normalization

OpenRefine is a free, open source tool for working with messy data. Its intuitive interface for data discovery and preparation empowers those who understand the context in which the data are generated to explore and normalize them. OpenRefine is used by

OpenRefine lets us model a complex data normalization project quickly. With a working prototype, we can demonstrate the value and identify future challenges before starting the project. Many of our customers reached out to us to improve a data integration project initially built with OpenRefine. RefinePro also uses OpenRefine for data profiling and non-repeatable data cleaning projects, including one-time migrations.
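
As an illustration of the kind of normalization involved, the Python sketch below approximates key-collision clustering, a technique OpenRefine documents for grouping spelling variants of the same value. It is a simplified approximation rather than OpenRefine's own code, and the function names `fingerprint` and `cluster` are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that spelling variants collide."""
    # Fold accented characters to their plain ASCII equivalents.
    folded = unicodedata.normalize("NFKD", value)
    ascii_value = folded.encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase, and strip punctuation.
    cleaned = ascii_value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Deduplicate and sort tokens so word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]
```

For example, `cluster(["Acme, Inc.", "inc ACME", "Globex"])` groups the first two values together, because both reduce to the key `acme inc`.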

Talend Open Studio is the industry leader in open integration solutions and democratizes application integration by providing open source solutions to address any integration challenge, from simple departmental projects to complex, heterogeneous IT environments. Talend’s open source products and open architecture create unmatched flexibility for solving integration challenges.

Talend Open Studio is RefinePro’s backbone for automation. We developed custom components and routines to integrate other technologies into our workflow, and we can assemble a custom, cost-effective ETL system using standard components.

Data Enrichment

Address and Person Data Enrichment

We leverage paid third-party sources to normalize and enrich location data. We have experience working with everythinglocation to perform address validation and geocoding. We previously worked with Industrial Info Resources and FullContact to identify customers' social profiles and enrich prospect information.

Crowdsourcing and Microtasks

Some data cleaning tasks cannot be automated and can only be solved by a human. We have experience setting up crowdsourcing projects, distributing the workload into microtasks and assigning them to remote workers. We have worked with public platforms (Amazon Mechanical Turk, CrowdFlower) and with private crowdsourcing platforms built on PyBossa.
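
The idea of splitting a workload into microtasks can be sketched in a few lines of Python. This is an illustration only and does not call any crowdsourcing platform's API. The function names, the batch size, and the choice to send each batch to several workers and keep the majority answer are all assumptions made for the example, not a description of RefinePro's actual setup.

```python
from collections import Counter
from itertools import islice


def make_microtasks(records, batch_size=5, redundancy=3):
    """Split records into batches and emit each batch `redundancy` times."""
    iterator = iter(records)
    tasks = []
    batch_id = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # One copy of the batch per worker expected to review it.
        for copy in range(redundancy):
            tasks.append({"batch_id": batch_id, "copy": copy, "records": batch})
        batch_id += 1
    return tasks


def majority_answer(answers, min_agreement=2):
    """Return the most common answer, or None when agreement is too low."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= min_agreement else None
```

In this sketch, a record whose answers fall below `min_agreement` gets `None`, which signals that it needs another round of review instead of being accepted automatically.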