Data Verification for Machine Learning

Intro

Data is the necessary ingredient for Machine Learning projects, and more and more companies rely on such systems for important decision processes. Missing or incorrect data can therefore have a negative impact on downstream business processes.

Data Quality: Intension vs Extension

  • Extension: the data values themselves
  • Intension: the schema

Frameworks that check the schema therefore perform an intensional data check, whereas frameworks like Deequ and Hooqu check the extension of the data, i.e. the values.

Data Quality Dimensions

  • Completeness
  • Consistency
  • Intra-relation constraints
  • Accuracy (semantic and syntactic)
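
Some of these dimensions can be measured directly on a DataFrame. The following is only a minimal sketch: the example data, the column names and the crude email pattern are assumptions made for illustration, and semantic accuracy is not covered because it usually needs an external source of truth.

```python
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", "not-an-email"],
        "start_date": pd.to_datetime(
            ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
        ),
        "end_date": pd.to_datetime(
            ["2019-01-31", "2019-01-15", "2019-03-31", "2019-04-30"]
        ),
    }
)

# Completeness: fraction of non-null values per column.
completeness = df.notna().mean()

# Intra-relation constraint: within a row, start_date must not be after end_date.
violates_date_order = df["start_date"] > df["end_date"]

# Syntactic accuracy: does the value look like an email address at all?
# The pattern is deliberately crude (an assumption), not a full email validator.
looks_like_email = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
```

Here `completeness` holds one ratio per column, while the other two results are boolean Series that flag individual rows.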

Common Causes of Data Issues

  • Bugs in program code
  • Data loss
  • Changes in the semantics or columns of the data
  • New columns added

Technical Debt Paper.

Data Validation - What are we talking about?

What do I refer to when I say data validation? I am talking about checking the assumptions about the data that is ultimately fed into a machine learning system or other systems such as dashboards. From experience, I have noticed that many Data Scientists do not invest enough time in checking and logging the properties of the data, and this causes errors or weird behaviour in production systems.

The standard types of validation that you can think of include:

  • Type validation
  • Existence of null / NaN / empty fields
  • Range validation (e.g. positive integers less than 100)
  • Pattern validation (e.g. the fields of a particular column contain a certain type of data)
  • Date validation (e.g. a date can be well formed and still not make sense, such as "2019-02-30")
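
Each of these validation types can be expressed with a few lines of pandas. The sketch below is illustrative only: the example data, the column names, the age limits and the five-digit pattern are assumptions, not requirements from any particular system.

```python
import pandas as pd

# Hypothetical example data; column names and limits are assumptions.
df = pd.DataFrame(
    {
        "age": [25, 130, 42],
        "zip_code": ["10115", "1011", "80331"],
        "signup_date": ["2019-02-28", "2019-02-30", None],
    }
)

# Type validation: is the column stored with an integer dtype?
age_is_integer = pd.api.types.is_integer_dtype(df["age"])

# Existence of null / NaN values, counted per column.
null_counts = df.isna().sum()

# Range validation: positive integers less than 100 (both bounds inclusive).
age_in_range = df["age"].between(1, 99)

# Pattern validation: the whole string must consist of exactly five digits.
zip_ok = df["zip_code"].str.fullmatch(r"\d{5}")

# Date validation: "2019-02-30" is well formed but is not a real date,
# so errors="coerce" turns it into NaT instead of raising an exception.
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
# A value is an invalid date if it was present but could not be parsed.
invalid_date = parsed.isna() & df["signup_date"].notna()
```

Note that the missing `signup_date` is reported by the null check rather than by the date check, which keeps the two kinds of problem separate.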

Generally, some of these validations can be specified using some sort of schema, which is then used to check a particular DataFrame instance against the described constraints.
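
As a rough illustration of the schema idea, the sketch below describes the constraints in a plain dictionary and compares a DataFrame against it. The schema format (the keys `kind`, `nullable`, `min` and `max`) and the `validate` function are hypothetical and invented for this example; they are not the API of any existing framework.

```python
import pandas as pd

# Hypothetical schema format: column name -> constraints.
# The keys "kind", "nullable", "min" and "max" are assumptions of this sketch.
SCHEMA = {
    "age": {"kind": "integer", "nullable": False, "min": 0, "max": 120},
    "country": {"nullable": True},
}

# Maps a "kind" name to the pandas function that checks the dtype.
KIND_CHECKS = {
    "integer": pd.api.types.is_integer_dtype,
    "float": pd.api.types.is_float_dtype,
}


def validate(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of readable violations (empty if the frame conforms)."""
    errors = []
    for column, rules in schema.items():
        if column not in df.columns:
            errors.append(f"{column}: column is missing")
            continue
        series = df[column]
        if "kind" in rules and not KIND_CHECKS[rules["kind"]](series):
            errors.append(f"{column}: expected {rules['kind']}, got {series.dtype}")
        if not rules.get("nullable", True) and series.isna().any():
            errors.append(f"{column}: contains null values")
        if "min" in rules and (series.dropna() < rules["min"]).any():
            errors.append(f"{column}: values below {rules['min']}")
        if "max" in rules and (series.dropna() > rules["max"]).any():
            errors.append(f"{column}: values above {rules['max']}")
    return errors


good = pd.DataFrame({"age": [30, 45], "country": ["DE", None]})
bad = pd.DataFrame({"age": [30, 200]})

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

Returning a list of violations instead of raising on the first one makes it possible to log every broken assumption in a single pass.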

Validating DataFrames using Pandas expressions

Pandas itself offers a rich set of expressions that can be used not only to select but also to check DataFrame columns with respect to certain properties.

df[some_column] == some_value # True for each row where the column equals the value
df[some_column].isin(some_list_of_values) # Checks whether the value of the column is one of the values in the list
df[some_column].str.contains(some_pattern) # Checks whether each string contains the pattern (interpreted as a regular expression by default)
df[some_column].str.isdigit() # Same usage as str.isdigit(): checks whether each string is all digits; the column must already be of string type
df[some_column].str.len() == 4 # True for strings of length 4

However, although this is easy to implement, it can be hard to centralize and specify all the properties that a particular DataFrame needs to have.
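
One possible way to keep such expressions in a single place, shown here only as a sketch, is to give each check a name and evaluate them together. The `CHECKS` dictionary, the `run_checks` helper, the example data and the allowed values are all hypothetical.

```python
import pandas as pd

# Hypothetical example data; column names and allowed values are assumptions.
df = pd.DataFrame(
    {
        "status": ["open", "closed", "unknown"],
        "year": ["2019", "2020", "20x1"],
    }
)

# Each check is a named function returning a boolean Series (True = row passes).
CHECKS = {
    "status_is_known": lambda d: d["status"].isin(["open", "closed"]),
    "year_is_digits": lambda d: d["year"].str.isdigit(),
    "year_has_four_chars": lambda d: d["year"].str.len() == 4,
}


def run_checks(d: pd.DataFrame, checks: dict) -> pd.DataFrame:
    """Evaluate every check and return one boolean column per check."""
    return pd.DataFrame(
        {name: check(d).astype(bool) for name, check in checks.items()}
    )


report = run_checks(df, CHECKS)

# Number of failing rows per check.
failures_per_check = (~report).sum()
print(failures_per_check)
```

This keeps the individual expressions as simple as before, but it still leaves the checks as code rather than as a declarative specification.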

Data Validation for Machine Learning

Existing frameworks

Links
