Data Quality & Cleansing

Gain confidence in your data.

Trust in data quality and integrity is key to the success of any data initiative. Any operation on a dataset, from adding and editing records to integrating data sources, risks corrupting the data. Data profiling, cleaning and validation are the three pillars that build confidence in data.

RefinePro guides organizations through the entire data quality process.


Data Profiling

Without well-defined goals, data cleaning can be an endless task. Data quality is subjective, as expectations vary from one business to another. Even within the same organization, different departments may have competing definitions of a clean address. Data profiling is the opportunity to understand business needs, assess the current quality of the data and identify any gaps. With that information, RefinePro proposes a data quality strategy and an implementation plan that fit the client’s budget.
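As an illustration of what a first profiling pass might measure, here is a minimal Python sketch that summarizes one column: row count, missing values, distinct values and the most frequent values. The function name `profile_column` and the sample records are hypothetical and chosen for this example only; this is not RefinePro's actual tooling.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: rows, missing values, distinct values, top values."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing (an assumption for this sketch).
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }


# Hypothetical sample records
records = [
    {"city": "Montreal"},
    {"city": "montreal"},
    {"city": ""},
    {"city": "Toronto"},
    {"city": "Montreal"},
]
city_profile = profile_column(records, "city")
```

In this sample the profile reports "Montreal" and "montreal" as two distinct values. Whether that counts as a defect depends on the goals agreed with the business.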


Data Validation

Before starting the actual cleansing, RefinePro recommends automating data quality enforcement with a script. The script lets all parties agree on a common standard before the project begins. It also ensures that developers clean the data accordingly, as they can continuously check their results against the defined rules. RefinePro has identified four types of data validation rules:

  • Schema and format compliance

  • Validation against custom business rules

  • Validation against master dataset

  • Data sampling with acceptance threshold
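The four rule types above can be sketched in a small Python validation script. This is a minimal illustration under stated assumptions: the field names (`id`, `signup_date`, `country`, `age`), the ISO date format, the age range and the `KNOWN_COUNTRIES` set standing in for a master dataset are all hypothetical.

```python
import random
import re

DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed ISO date format
KNOWN_COUNTRIES = {"CA", "US", "FR"}  # hypothetical stand-in for a master dataset


def check_schema(record):
    """Schema and format compliance: required fields exist, date looks like ISO."""
    required = ("id", "signup_date", "country", "age")
    if any(field not in record for field in required):
        return False
    return bool(DATE_FORMAT.match(str(record["signup_date"])))


def check_business_rule(record):
    """Custom business rule (assumed): age is an integer between 18 and 120."""
    age = record.get("age")
    return isinstance(age, int) and 18 <= age <= 120


def check_master(record):
    """Validation against a master dataset: country must be a known code."""
    return record.get("country") in KNOWN_COUNTRIES


RULES = {
    "schema": check_schema,
    "business_rule": check_business_rule,
    "master": check_master,
}


def validate(record):
    """Return the names of the rules the record fails (empty list if valid)."""
    return [name for name, rule in RULES.items() if not rule(record)]


def sample_passes(records, sample_size, threshold, seed=0):
    """Data sampling: accept the batch if enough sampled records are valid."""
    if not records:
        return True  # nothing to reject in an empty batch
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    valid = sum(1 for record in sample if not validate(record))
    return valid / len(sample) >= threshold


good = {"id": 1, "signup_date": "2023-01-15", "country": "CA", "age": 30}
bad = {"id": 2, "signup_date": "15/01/2023", "country": "ZZ", "age": 12}
```

Because each rule is a plain function, business stakeholders and developers can review the rule list together and rerun it after every cleaning step.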



Data Cleaning

RefinePro has extensive experience working with dirty data and can perform the following normalization steps:

  • Field Mapping: Pivot, transform, split or merge fields from multiple sources to match different application and file schemas.
  • Field Conversion: Standardize and change the field type and format between date, number, text or choice list.
  • Duplicate Removal: RefinePro leverages multiple techniques, including clustering and entity resolution, to detect and merge duplicate records.
  • Missing Data: Missing data may be derived from other values, retrieved from a third-party source or filtered out.
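To make some of these steps concrete, here is a minimal Python sketch of field conversion and duplicate removal. The duplicate detection uses a simplified key-collision fingerprint (lowercase, strip punctuation, sort unique tokens), which is only one of many possible clustering techniques and is not necessarily the method RefinePro applies. The function names, the accepted date formats (day-first for slashed dates) and the sample records are assumptions made for this example.

```python
import string
from datetime import datetime


def fingerprint(value):
    """Simplified key-collision key: lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


def to_iso_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Field conversion: try each assumed input format, return ISO date or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def merge_duplicates(records, key_field):
    """Group records by fingerprint and keep the first non-empty value per field."""
    merged = {}
    for record in records:
        key = fingerprint(record[key_field])
        target = merged.setdefault(key, dict(record))
        for field, value in record.items():
            # Fill gaps in the kept record from its duplicates.
            if target.get(field) in (None, ""):
                target[field] = value
    return list(merged.values())


# Hypothetical sample records
companies = [
    {"name": "Acme, Inc.", "founded": "", "city": "Montreal"},
    {"name": "inc acme", "founded": "01/02/1999", "city": ""},
    {"name": "Globex", "founded": "2001-03-03", "city": "Toronto"},
]
deduplicated = merge_duplicates(companies, "name")
for company in deduplicated:
    company["founded"] = to_iso_date(company["founded"])
```

Here the two Acme records collapse into one, and the missing founding date is filled from the duplicate, which is one way of deriving missing data from other values.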


Ongoing Validation and Monitoring

RefinePro suggests implementing data validation steps early in the data flow to isolate bad records before they propagate to other systems. Using the rules defined previously, the data validation script raises alerts when erroneous records are identified or when the error rate exceeds a threshold. For more sensitive workflows, RefinePro implements a circuit breaker that stops the process and prevents poor data from corrupting downstream systems. RefinePro also provides managed services to monitor alerts and update the data cleaning scripts accordingly.
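The circuit breaker idea can be sketched in a few lines of Python. This is an illustrative sketch, not RefinePro's implementation: the names `run_with_circuit_breaker` and `DataQualityCircuitOpen`, the default 10% error-rate threshold and the minimum of 20 records before the breaker can trip are all assumptions.

```python
class DataQualityCircuitOpen(Exception):
    """Raised when the error rate exceeds the threshold and the flow must stop."""


def run_with_circuit_breaker(records, is_valid, max_error_rate=0.1, min_records=20):
    """Isolate bad records; stop once the running error rate exceeds the threshold."""
    good, quarantined = [], []
    for seen, record in enumerate(records, start=1):
        if is_valid(record):
            good.append(record)
        else:
            quarantined.append(record)  # isolated, never sent downstream
        # Wait for min_records so a single early failure does not trip the breaker.
        if seen >= min_records and len(quarantined) / seen > max_error_rate:
            raise DataQualityCircuitOpen(
                f"{len(quarantined)} of {seen} records failed validation"
            )
    return good, quarantined


# Hypothetical usage: integers stand in for records, multiples of 50 are "bad".
passed, isolated = run_with_circuit_breaker(range(100), lambda n: n % 50 != 0)
```

In a real data flow, the quarantined list would feed the alerts, and the exception would halt the job before anything is written to downstream systems.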


Data Cleansing Toolbox

RefinePro adapts the tool to the project type and complexity.

  • Simple One Time Clean-up
    OpenRefine enables non-technical users to review and perform one-time cleaning steps without coding skills.
  • Complex or Recurring Normalization
    With Talend Open Studio and Python, RefinePro can implement complex cleaning rules directly in the data flow.

  • OpenRefine
  • Talend
  • Python

How can RefinePro's expertise enable your project?


  • Access on-demand data quality & cleansing experts.
    Team Augmentation
  • Schedule, run and monitor data quality & cleansing scripts.
    Platform
  • Turn your strategic vision into a streamlined process.
    Team + Platform