Data extraction projects are complex and often require significant time and effort. To make sure your organization is creating value and that your money and time are well spent, the first logical step is to choose your sources carefully. To help you do just that, we created a list of 10 questions you need to ask before you set your sights on a dataset. The goal is to collect and analyze all the existing information about a dataset in order to clarify its ownership, publication, structure, content, quality, relationships, and so on. Only by going through this process can you guarantee the suitability of your sources and identify potential problems and particularities.
This checklist will help you assess all the elements you need to know before proceeding with your data project. Most of all, once you have all the answers, you will have everything you need to define your game plan for transforming and manipulating the datasets you chose.
So, without further ado, here are ten questions to ask before using new data.
- Question 1: Who owns the data?
- Question 2: Who publishes the data?
- Question 3: Is the dataset documented?
- Question 4: How is the data collected?
- Question 5: How is the data maintained and updated?
- Question 6: What are the format and granularity?
- Question 7: Does the data follow standards?
- Question 8: Can you link your data to another dataset?
- Question 9: Under what license is the data released?
- Question 10: Are there data privacy issues?
Question 1. Who owns the data?
And by “own,” we don’t necessarily mean “publish” (see question 2). You need to know where the data originally comes from, and whom to contact if you ever have questions or issues that need solving. Also, if you ever need to attribute ownership (see question 9) when reusing the data, this owner is the one you will refer to. Basically, you need to put a human face and a name to the data you wish to extract and use.
Question 2. Who publishes the data?
There are a lot of platforms out there that offer huge datasets, like Quandl and data.world. But that doesn’t mean they own the data they share with you. It is, therefore, imperative that you know who owns the data and who publishes and shares it. This distinction will help you better answer the following questions.
Question 3. Is the dataset documented?
You need to gather all the information on how the data was collected (see questions 4 and 5) and how it should be interpreted (see question 7). This includes the schema of the data, with data types and validation rules. The goal here is to make sure you can answer most questions without having to go back to the data owner.
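A documented schema also lets you check incoming records mechanically instead of emailing the owner every time something looks off. As a minimal sketch (the field names and rules below are hypothetical examples, not from any real dataset), a schema can be expressed as a plain dictionary and enforced with a small validation function:

```python
# Hypothetical schema for a sensor-readings dataset: each field has a type,
# a required flag, and optional validation rules (here, a minimum value).
schema = {
    "station_id": {"type": int, "required": True},
    "reading": {"type": float, "required": True, "min": 0.0},
    "unit": {"type": str, "required": False},
}

def validate(record, schema):
    """Return a list of problems found in `record`; an empty list means valid."""
    problems = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required"):
                problems.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            problems.append(f"{field}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            problems.append(f"{field}: below minimum {rules['min']}")
    return problems

print(validate({"station_id": 42, "reading": -1.5}, schema))
# → ['reading: below minimum 0.0']
```

If the publisher provides a formal schema (JSON Schema, for instance), prefer validating against that directly rather than re-encoding it by hand.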
Question 4. How is the data collected?
The answer to this question will help you identify any potential biases in the way data is collected. It will also give you some extremely important information concerning the data itself: is the data complete or partial? Has it been pre-processed before its publication? You want to know what the original state of the data was and how much it has changed (or not) before reaching you.
Question 5. How is the data maintained and updated?
Now that you know what the data looked like originally, you want to know what processes it goes through. For example, you want to ensure your data will still be reliable in the long term. You also want to know how often it’s updated and whether each release contains all the records or only the updated ones. You also need to know if there’s ever a change in the collection methodology, or if the dataset stops being available.
Without these answers, you might end up building a script for something that won’t be available in two days’ time, or not in the format you expected.
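One way to catch these surprises early is to compare successive pulls of the same dataset. The sketch below (the CSV snippets are invented examples) flags a changed column set and a suspicious drop in row count, which can signal a switch from full dumps to updated-records-only releases:

```python
import csv
import io

def snapshot_info(csv_text):
    """Return (header, data_row_count) for one pull of a CSV dataset."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[0], len(rows) - 1

def compare_pulls(old_text, new_text):
    """Flag schema changes and row-count drops between two pulls."""
    old_header, old_n = snapshot_info(old_text)
    new_header, new_n = snapshot_info(new_text)
    notes = []
    if old_header != new_header:
        changed = set(old_header) ^ set(new_header)
        notes.append("schema changed: " + ", ".join(sorted(changed)))
    if new_n < old_n:
        notes.append("row count dropped: pull may contain only updated records")
    return notes

old = "id,value\n1,a\n2,b\n3,c\n"
new = "id,value,flag\n1,a,x\n"
for note in compare_pulls(old, new):
    print(note)
```

Running a check like this on every pull turns a silent methodology change into an immediate, visible alert.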
Question 6. What are the format and granularity?
You must identify the formats in which your data is made available. Formats can usually be categorized as follows:
- Non-friendly formats (PDF, web page, Word document, image)
- Flat files (CSV, XLS)
- Structured files (JSON, XML)
- APIs and web services, provided by the source or by a third party (Quandl, data.world)
- Maps (KML, Shapefile, GeoJSON)
Once you know what format you’ll need to deal with, you can better choose the tools and solutions you’ll need to extract and transform your data. It will also help you identify the data granularity, or the lowest data point available. If, for example, your information concerns time, you want to know whether the smallest possible data point is a second, a minute, an hour, a day, a month, or a year. The same goes for maps: an address, a postal code, a city, or a state?
Knowing what your data looks like and what shape it takes will guarantee that you have all the information you need to extract it using the right tools and solutions.
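Granularity also determines how much aggregation your pipeline will need. For instance, if a feed publishes one reading per second but you only need per-minute values, you can downsample it. A small stdlib-only sketch (the readings below are invented):

```python
from collections import defaultdict
from datetime import datetime

def downsample_to_minutes(readings):
    """Average (timestamp, value) pairs from second-level down to minute-level."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate each timestamp to its minute to form the aggregation bucket.
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in buckets.items()}

readings = [
    (datetime(2024, 1, 1, 9, 30, 5), 10.0),
    (datetime(2024, 1, 1, 9, 30, 45), 20.0),
    (datetime(2024, 1, 1, 9, 31, 10), 30.0),
]
print(downsample_to_minutes(readings))
# → {datetime(2024, 1, 1, 9, 30): 15.0, datetime(2024, 1, 1, 9, 31): 30.0}
```

Note that the reverse is impossible: you can always coarsen granularity, but you can never recover detail the source didn’t publish.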
Question 7. Were specific standards applied to the dataset?
When data is collected and published according to certain standards, it helps remove ambiguity on the collection, aggregation, and preparation methods. Standardization also allows us to compare and combine data according to jurisdiction or period. Data regarding elections, 311 calls, census or transit information, for example, are all standardized.
Question 8. Can you link the data to another dataset?
When profiling your data, you need to make sure you understand all its relationships with other datasets. Is it isolated, or could it be combined with other internal or external data? What new insight could you build from it? How would you merge them? Do they share a common key? Basically, the goal here is to profile your data as it relates to other data.
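When two datasets do share a common key, linking them can be as simple as an inner join on that key. A minimal stdlib sketch (the column names and values are hypothetical):

```python
def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key column.

    Rows from `left` without a match in `right` are dropped; on duplicate
    column names, the value from `right` wins.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

stations = [{"station_id": "S1", "city": "Montreal"},
            {"station_id": "S2", "city": "Quebec"}]
readings = [{"station_id": "S1", "pm25": 12.0}]
print(join_on_key(stations, readings, "station_id"))
# → [{'station_id': 'S1', 'city': 'Montreal', 'pm25': 12.0}]
```

If your profiling shows no usable shared key, that is worth discovering now, before you commit to a dataset on the assumption that it can be enriched later.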
Question 9. Was the data published under a license? And if so, which one?
Some organizations will choose to publish their data under a license, which then defines how the data should be collected, shared, and used. You need to be aware of these licenses and understand how they work. The most common are:
ODC Public Domain Dedication and Licence (PDDL)
With PDDL, users can share, create, and adapt the document. There are no restrictions, and the dataset is in the public domain.
Open Data Commons Attribution License (ODC-By)
With ODC-By, users can share, create, and adapt the document. The only restriction is the attribution, which means that users need to cite the source.
Open Data Commons Open Database License (ODC-ODbL)
With ODC-ODbL, users can share, create, and adapt the document, but they must cite the source and share under the same license.
Unfortunately, custom licenses are extremely popular. Each one requires that you read it, understand it, and make sure you respect its specific requirements. This could impact how you can collect and transform your data, as well as how you can use it.
Question 10. Are there data privacy issues?
Privacy is protected differently from state to state, and important differences exist even between Canada, the United States, and Europe. You need to know whether your datasets contain Personally Identifiable Information (PII), or whether it would be possible to re-identify individuals from anonymized data. This is especially true with healthcare data.
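A first, rough screening step is simply scanning free-text fields for obvious PII patterns. The sketch below checks only two hypothetical patterns (email addresses and US-style phone numbers); it is nowhere near a complete PII audit, and it says nothing about re-identification risk, but it can flag datasets that need a closer legal look:

```python
import re

# Illustrative patterns only; real PII screening needs far broader coverage
# (names, addresses, national IDs, health identifiers, ...).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_for_pii(text):
    """Return the sorted names of PII types whose patterns match in `text`."""
    return sorted(name for name, pattern in PII_PATTERNS.items()
                  if pattern.search(text))

print(scan_for_pii("contact: jane@example.com, 555-867-5309"))
# → ['email', 'us_phone']
```

A positive hit here is a signal to involve whoever handles compliance, not a green light to anonymize and move on.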
AND WITH THAT
Knowing what data you need is not enough to start a data extraction project. More than “what,” you need to know “who” your data is. Knowing your data is the only way to know for sure you’re using the right tool, at the right schedule, with the right script, to get the right data and transform it correctly.
This list of ten questions should be your first step in deciding whether a dataset is worth all the effort you’re ready to put into it. Data projects are complex on their own, and they require that you plan them well.
Choosing your sources is only the first step of a long story. Depending on your sources and needs (are you dealing with unfriendly formats like PDFs?), you’ll need to define the best tools for web scraping, the best way to maintain data quality throughout the whole process, the best way to build a solid ETL process, and the best architecture for your data extraction pipeline.
We have the experts to make it happen.