Data Quality Validation for Pandas Dataframes

TLDR

In this blog post I review some interesting libraries for checking the quality of data held in Pandas DataFrames (and similar implementations).

Intro - Why Data Quality?

Data quality might be one of the areas of Data Science that Data Scientists tend to overlook the most.

Reason: it is boring and most of the time cumbersome to perform data validation. Furthermore, you sometimes do not know whether your effort is going to pay off.

Luckily, there are some libraries that can help with this laborious task and standardize the process in a Data Science team or even across the organization.

But first things first: why would I want to spend my time on data quality checks when I could be writing some amazing code that trains a bleeding-edge logistic regression? Here are a couple of reasons:

  • It is hard to enforce data constraints in the source system. This is particularly true for legacy systems.
  • Companies rely on data to guide business decisions (forecasting, buying decisions), and missing or incorrect data affects those decisions.
  • There is a trend to feed ML systems with this data, and these systems are often highly sensitive to their input, since the deployed model relies on assumptions about the characteristics of the inputs.
  • Subtle errors introduced by changes in the data can be hard to detect (see the sketch below).
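To make this concrete, here is a minimal sketch of the kind of hand-rolled checks a team typically starts with before adopting one of these libraries. The orders DataFrame, its column names, and the rules are purely illustrative assumptions, not taken from any particular library:

    import pandas as pd

    # Hypothetical orders data; column names and values are illustrative only.
    orders = pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "amount": [10.5, 99.9, -3.0, None],
        "status": ["paid", "paid", "refunded", "unknown"],
    })

    # A few hand-rolled checks of the kind the libraries below standardize.
    checks = {
        "order_id is unique": orders["order_id"].is_unique,
        "amount has no nulls": orders["amount"].notna().all(),
        "amount is non-negative": (orders["amount"].dropna() >= 0).all(),
        "status in allowed set": orders["status"].isin({"paid", "refunded", "cancelled"}).all(),
    }

    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")

Writing such checks by hand works for a handful of rules, but it does not scale or standardize well across a team, which is exactly the gap the libraries reviewed in this post aim to fill.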

Data Quality Dimensions

These definitions come from [batini09].

Bibliography

  • [batini09] Carlo Batini, Cinzia Cappiello, Chiara Francalanci & Andrea Maurino, Methodologies for Data Quality Assessment and Improvement, ACM Computing Surveys, 41(3), 1-52 (2009).
