The State of Pandas Data Validation

Intro

Lately I have been very concerned about validating the input data (mostly tabular) of my ML algorithms. In many companies (and, I would say, in most real-world use cases) data is noisy and incomplete. For older companies running antiquated legacy systems the situation can be severe, affecting downstream ML systems and data products.

Data Validation - What are we talking about?

What do I mean by data validation? I am talking about checking the assumptions made about the data that is ultimately fed into a machine learning system or into other systems such as dashboards. From experience, I have noticed that many data scientists do not invest enough time in checking and logging the properties of their data, and this causes errors or weird behaviour in production systems.

The standard types of validation that you can think of include:

  • Type validation
  • Existence of null / NaN / empty fields
  • Range validation (e.g. positive integers less than 100)
  • Pattern validation (e.g. the fields of a particular column must follow a certain format)
  • Date validation (a date can be well formed and still not make sense, e.g. "2019-02-30")
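As a rough illustration of the list above, here is a minimal sketch of one check per validation type using plain pandas. The DataFrame, its column names and the deliberately loose e-mail pattern are all assumptions made up for this example, not part of any real dataset.

```python
import pandas as pd

# Hypothetical sample data; column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "age": [25, 40, 130],
        "email": ["a@example.com", None, "not-an-email"],
        "signup_date": ["2019-02-28", "2019-02-30", "2019-03-01"],
    }
)

# Type validation: is the column stored with an integer dtype?
age_is_integer = pd.api.types.is_integer_dtype(df["age"])

# Null validation: count the missing values in each column.
null_counts = df.isna().sum()

# Range validation: positive integers less than 100 (both bounds inclusive).
age_in_range = df["age"].between(1, 99)

# Pattern validation: a loose e-mail shape; missing values count as failures.
email_ok = df["email"].str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", na=False)

# Date validation: well-formed but impossible dates are coerced to NaT.
parsed_dates = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
date_ok = parsed_dates.notna()
```

Each result is either a single boolean or a boolean Series with one entry per row, so the failing rows can be inspected with something like `df[~date_ok]`.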

Generally, some of these validations can be declared in some sort of schema, and a particular DataFrame instance is then checked against the constraints that the schema describes.
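To make the schema idea concrete, here is a minimal hand-rolled sketch. The `SCHEMA` layout, its keys (`type_check`, `nullable`, `min`, `max`) and the `validate` helper are hypothetical names invented for this example; they are not the API of any existing library.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical schema: column name -> constraints the column must satisfy.
SCHEMA = {
    "age": {"type_check": ptypes.is_integer_dtype, "nullable": False, "min": 1, "max": 99},
    "score": {"type_check": ptypes.is_float_dtype, "nullable": True, "min": 0.0, "max": 1.0},
}


def validate(df, schema):
    """Return a list of human-readable violations (empty if df conforms)."""
    errors = []
    for column, rules in schema.items():
        if column not in df.columns:
            errors.append(f"{column}: missing column")
            continue
        series = df[column]
        if not rules["type_check"](series):
            errors.append(f"{column}: unexpected dtype {series.dtype}")
            continue  # range checks are meaningless on the wrong dtype
        if not rules.get("nullable", True) and series.isna().any():
            errors.append(f"{column}: contains nulls")
        values = series.dropna()
        if "min" in rules and (values < rules["min"]).any():
            errors.append(f"{column}: values below {rules['min']}")
        if "max" in rules and (values > rules["max"]).any():
            errors.append(f"{column}: values above {rules['max']}")
    return errors


good = pd.DataFrame({"age": [30, 45], "score": [0.2, None]})
bad = pd.DataFrame({"age": [30, 120], "score": [0.2, 1.5]})

good_errors = validate(good, SCHEMA)
bad_errors = validate(bad, SCHEMA)
```

Returning a list of violations instead of raising on the first one is a design choice for this sketch: it lets all problems be logged in one pass, which fits the goal of checking and logging data properties.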

Validating the Data Frames using Pandas expressions

Pandas itself offers a rich set of expressions that can be used not only to select rows but also to check DataFrame columns against certain properties.

df[some_column] == some_value # True for each row where the column equals the value
df[some_column].isin(some_list_of_values) # Checks whether each value of the column is one of the values in the list
df[some_column].str.contains(some_pattern) # Checks whether each string contains the pattern (treated as a regex by default)
df[some_column].str.isdigit() # Same as Python's str.isdigit(): checks whether each string is all digits; the column must hold strings
df[some_column].str.len() == 4 # True for each string of length 4
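Each of the expressions above returns a boolean Series with one entry per row. The following sketch, using a made-up DataFrame with hypothetical `country` and `zip_code` columns, shows one way to turn such masks into a pass/fail answer or into a list of offending rows.

```python
import pandas as pd

# Hypothetical sample data; the third zip code is one character short.
df = pd.DataFrame(
    {
        "country": ["ES", "FR", "ES"],
        "zip_code": ["28001", "75001", "0801"],
    }
)

# Each expression yields a boolean Series aligned with the rows of df.
country_ok = df["country"].isin(["ES", "FR", "DE"])
zip_is_digits = df["zip_code"].str.isdigit()
zip_has_five_chars = df["zip_code"].str.len() == 5

# .all() collapses a mask into a single pass/fail answer.
all_countries_known = bool(country_ok.all())

# Negating the combined mask selects the rows that violate the rule.
bad_zip_rows = df[~(zip_is_digits & zip_has_five_chars)]
```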

However, although this approach is easy to implement, it can be hard to centralize and specify in one place all the properties that a particular DataFrame needs to have.

Data Validation for Machine Learning

Existing frameworks

Links
