<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Miguel Cabrera's Blog (Posts about spark)</title><link>http://mfcabrera.com/</link><description></description><atom:link href="http://mfcabrera.com/categories/spark.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Sat, 23 Oct 2021 09:35:28 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Data Quality Validation for Python Dataframes</title><link>http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html</link><dc:creator>mfcabrera</dc:creator><description>&lt;div id="table-of-contents"&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;div id="text-table-of-contents"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org9871012"&gt;TLDR&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org7a8c770"&gt;Intro - Why Data Quality?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org0974d71"&gt;Data Quality Dimensions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#orgc571079"&gt;Libraries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#orgbd84ff2"&gt;Great Expectations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org6f0e539"&gt;Pandera&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org312ed38"&gt;Deequ/PyDeequ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org85bc633"&gt;Comparison table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org0e93ee6"&gt;Final Notes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org9871012" class="outline-2"&gt;
&lt;h2 id="org9871012"&gt;TLDR&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org9871012"&gt;
&lt;p&gt;
In this blog post, I review some interesting libraries for checking the quality of data in pandas DataFrames (and similar implementations). This is not a tutorial (I was actually trying out some of the tools while I wrote), but rather a review of sorts, so expect to find some opinions along the way.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org7a8c770" class="outline-2"&gt;
&lt;h2 id="org7a8c770"&gt;Intro - Why Data Quality?&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org7a8c770"&gt;
&lt;p&gt;
Data quality might be one of the areas data scientists tend to overlook the most. Why? Well, let's face it: data validation is boring, and most of the time it is cumbersome to perform. Furthermore, sometimes you do not know if your effort is going to pay off. Luckily, some libraries can help with this laborious task and standardize the process within a Data Science team or even across an organization.
&lt;/p&gt;

&lt;p&gt;
But first things first. Why would I choose to spend my time doing data quality checks when I could be writing some amazing code that trains a bleeding-edge deep convolutional logistic regression? Here are a couple of reasons:
&lt;/p&gt;

&lt;ul class="org-ul"&gt;
&lt;li&gt;It is hard to ensure data constraints in the source system. This is particularly true for legacy systems.&lt;/li&gt;

&lt;li&gt;Companies rely on data to guide business decisions (forecasting, buying decisions), and missing or incorrect data affects those decisions.&lt;/li&gt;

&lt;li&gt;There is a trend to feed ML systems with this data, and these systems are often highly sensitive to their inputs, as the deployed model relies on assumptions about the characteristics of the input data.&lt;/li&gt;

&lt;li&gt;Subtle errors introduced by changes in the data can be &lt;b&gt;hard&lt;/b&gt; to detect.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org0974d71" class="outline-2"&gt;
&lt;h2 id="org0974d71"&gt;Data Quality Dimensions&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org0974d71"&gt;
&lt;p&gt;
The quality of the data can refer to the &lt;b&gt;extension&lt;/b&gt; of the data (data values) or to the
&lt;b&gt;intension&lt;/b&gt; (not a typo) of the data (schema) [&lt;a class="org-ref-reference" href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#batini09"&gt;batini09&lt;/a&gt;].
&lt;/p&gt;
&lt;/div&gt;

&lt;div id="outline-container-org46d3672" class="outline-3"&gt;
&lt;h3 id="org46d3672"&gt;Extension Dimension&lt;/h3&gt;
&lt;div class="outline-text-3" id="text-org46d3672"&gt;
&lt;p&gt;
Extracted from [&lt;a class="org-ref-reference" href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#Schelter2018"&gt;Schelter2018&lt;/a&gt;]:
&lt;/p&gt;

&lt;dl class="org-dl"&gt;
&lt;dt&gt;Completeness&lt;/dt&gt;&lt;dd&gt;The degree to which an entity includes the data
required to describe a real-world object, e.g. the presence of null (missing) values. Depends on context.&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;
&lt;b&gt;Example&lt;/b&gt;:
   Notebooks might not have the &lt;code&gt;shirt_size&lt;/code&gt; property.
&lt;/p&gt;


&lt;dl class="org-dl"&gt;
&lt;dt&gt;Consistency&lt;/dt&gt;&lt;dd&gt;The degree to which a set of semantic rules is violated.

&lt;ul class="org-ul"&gt;
&lt;li&gt;Valid range of values (e.g. sizes &lt;code&gt;{S, M, L}&lt;/code&gt;)&lt;/li&gt;

&lt;li&gt;There might be &lt;i&gt;intra-relation constraints&lt;/i&gt;, e.g. if the category is
"shoes" then the sizes should be in the range 30-50.&lt;/li&gt;

&lt;li&gt;&lt;i&gt;Inter-relation&lt;/i&gt; constraints may involve multiple tables and columns.
&lt;code&gt;product_id&lt;/code&gt; may only contain entries from the &lt;code&gt;product&lt;/code&gt; table.&lt;/li&gt;
&lt;/ul&gt;&lt;/dd&gt;

&lt;dt&gt;Accuracy&lt;/dt&gt;&lt;dd&gt;The correctness of the data; it can be measured in
two ways, semantic and syntactic.

&lt;dl class="org-dl"&gt;
&lt;dt&gt;Syntactic&lt;/dt&gt;&lt;dd&gt;Compares the representation of a value
with a corresponding definition domain.&lt;/dd&gt;

&lt;dt&gt;Semantic&lt;/dt&gt;&lt;dd&gt;Compares a value with its real
world representation.&lt;/dd&gt;
&lt;/dl&gt;&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;
&lt;b&gt;Example&lt;/b&gt;:
 &lt;i&gt;blue&lt;/i&gt; is a syntactically valid value for the column &lt;i&gt;color&lt;/i&gt; (even
 if a product is actually red). &lt;i&gt;XL&lt;/i&gt; would be neither semantically nor syntactically accurate.
&lt;/p&gt;
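&lt;p&gt;
These extension checks can be sketched with plain pandas. The following minimal illustration (the product data, column names, and valid ranges are made up for this example, not taken from any of the libraries below) shows a value-set consistency check and an intra-relation constraint:
&lt;/p&gt;

```python
import pandas as pd

# Hypothetical product data, made up for this illustration
products = pd.DataFrame(
    {"category": ["shirts", "shoes", "shoes"], "size": ["M", "42", "85"]}
)

# Consistency: shirt sizes must come from a valid set of values
valid_shirt_sizes = {"S", "M", "L"}
shirts = products[products["category"] == "shirts"]
assert shirts["size"].isin(valid_shirt_sizes).all()

# Intra-relation constraint: if the category is "shoes",
# the size should be in the range 30-50
shoes = products[products["category"] == "shoes"]
shoe_sizes = pd.to_numeric(shoes["size"], errors="coerce")
violations = shoes[~shoe_sizes.between(30, 50)]  # the row with size "85"
```

&lt;p&gt;
A real pipeline would collect and report such violations rather than assert on them; that is precisely the kind of plumbing the libraries reviewed below provide.
&lt;/p&gt;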


&lt;p&gt;
Most of the data quality libraries I am going to explore deal with the &lt;b&gt;extension dimension&lt;/b&gt;. This is particularly important when the ingested data comes from semi-structured or non-curated sources. It is on the &lt;i&gt;intension&lt;/i&gt; of the data that the richest set of checks can be done (i.e. checking the schema would only verify that a field is of a certain type, but not additional logic such as which values are valid for a string field).
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;


&lt;div id="outline-container-orgc571079" class="outline-2"&gt;
&lt;h2 id="orgc571079"&gt;Libraries&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-orgc571079"&gt;
&lt;p&gt;
The following are the libraries I will quickly evaluate. The idea is to show what writing quality checks looks like with each one and to describe a bit of the workflow. I selected these libraries as they are the ones I have either been using, reading about, or seeing at conferences. If there is a library that you think should make the list, please let me know in the comment section.
&lt;/p&gt;

&lt;ul class="org-ul"&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#orgbd84ff2"&gt;Great Expectations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org6f0e539"&gt;Pandera&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#org312ed38"&gt;Deequ/PyDeequ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id="outline-container-org91fd79a" class="outline-3"&gt;
&lt;h3 id="org91fd79a"&gt;Sample Data&lt;/h3&gt;
&lt;div class="outline-text-3" id="text-org91fd79a"&gt;
&lt;p&gt;
I will use a sample dataset to illustrate how the different libraries check similar properties:
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="p"&gt;[&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"awesome thing."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"available at http://thingb.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy D"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"checkout https://thingd.ca"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy E"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="p"&gt;],&lt;/span&gt;
       &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"productName"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;table border="0" class="dataframe table table-striped table-bordered table-hover"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;productName&lt;/th&gt;
      &lt;th&gt;description&lt;/th&gt;
      &lt;th&gt;priority&lt;/th&gt;
      &lt;th&gt;numViews&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;Thingy A&lt;/td&gt;
      &lt;td&gt;awesome thing.&lt;/td&gt;
      &lt;td&gt;high&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;Thingy B&lt;/td&gt;
      &lt;td&gt;available at http://thingb.com&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;low&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;Thingy D&lt;/td&gt;
      &lt;td&gt;checkout https://thingd.ca&lt;/td&gt;
      &lt;td&gt;low&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;Thingy E&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;high&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;


&lt;p&gt;
Things that I will check in this toy data:
&lt;/p&gt;

&lt;ul class="org-ul"&gt;
&lt;li&gt;There are 5 rows in total.&lt;/li&gt;
&lt;li&gt;The values of the &lt;code&gt;id&lt;/code&gt; attribute are never null/None and are unique.&lt;/li&gt;
&lt;li&gt;The values of the &lt;code&gt;productName&lt;/code&gt; attribute are never null/None.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;priority&lt;/code&gt; attribute can only contain "high" or "low" as a value.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;numViews&lt;/code&gt; should not contain negative values.&lt;/li&gt;
&lt;li&gt;At least half of the values in &lt;code&gt;description&lt;/code&gt; should contain a URL.&lt;/li&gt;
&lt;li&gt;The median of &lt;code&gt;numViews&lt;/code&gt; should be less than or equal to 10.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;productName&lt;/code&gt; column contents match the regex &lt;code&gt;r'Thingy [A-Z]+'&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
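&lt;p&gt;
Before looking at the libraries, it is worth seeing roughly how some of these checks would look in plain pandas. This sketch (mine, not from any of the libraries, and restricted to the checks this toy data actually satisfies) shows the kind of boilerplate the libraries help standardize:
&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame(
    [
        (1, "Thingy A", "awesome thing.", "high", 0),
        (2, "Thingy B", "available at http://thingb.com", None, 0),
        (3, None, None, "low", 5),
        (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
        (5, "Thingy E", None, "high", 12),
    ],
    columns=["id", "productName", "description", "priority", "numViews"],
)

# there are 5 rows in total
assert len(df) == 5
# id is never null/None and is unique
assert df["id"].notnull().all() and df["id"].is_unique
# priority only contains "high" or "low" (ignoring missing values)
assert df["priority"].dropna().isin({"high", "low"}).all()
# numViews contains no negative values
assert (df["numViews"] >= 0).all()
# at least half of the non-null descriptions contain a URL
has_url = df["description"].dropna().str.contains(r"https?://")
assert has_url.mean() >= 0.5
# the median of numViews is less than or equal to 10
assert df["numViews"].median() <= 10
```

&lt;p&gt;
The &lt;code&gt;productName&lt;/code&gt; checks are left out on purpose: this toy data violates them (row 3 has a null name), which is exactly what the libraries below should report.
&lt;/p&gt;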
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;


&lt;div id="outline-container-orgbd84ff2" class="outline-2"&gt;
&lt;h2 id="orgbd84ff2"&gt;Great Expectations&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-orgbd84ff2"&gt;

&lt;p&gt;
Calling Great Expectations (GE) a library is a bit of an understatement. It is a full-fledged framework for data validation, leveraging existing tools like Jupyter Notebook and integrating with several data stores, both for validating the data originating from them and for storing the validation results.
&lt;/p&gt;

&lt;p&gt;
The main concept of Great Expectations (GE) is, well, the &lt;code&gt;expectation&lt;/code&gt;, which, as the name indicates, runs assertions on the expected values of a particular column.
&lt;/p&gt;

&lt;p&gt;
The simplest way to use GE is to wrap the dataframe or data source with a GE &lt;code&gt;DataSet&lt;/code&gt; and quickly check individual conditions. This is useful for exploring the data and refining the data quality checks.
&lt;/p&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;great_expectations&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;ge&lt;/span&gt;
&lt;span class="n"&gt;ge_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ge_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_table_row_count_to_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ge_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_not_be_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ge_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_not_be_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ge_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_be_in_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;ge_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_values_to_be_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ge_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expect_column_median_to_be_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;
If run interactively in a Notebook, for each expectation we get a JSON representation of the expectation, as well as some metadata regarding the values and whether the expectation failed:
&lt;/p&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"expectation_config"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"meta"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="s2"&gt;"expectation_type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"expect_column_median_to_be_between"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"kwargs"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"column"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"min_value"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"max_value"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"result_format"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"BASIC"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"exception_info"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"raised_exception"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"exception_traceback"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"exception_message"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s2"&gt;"meta"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="s2"&gt;"result"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"observed_value"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"element_count"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"missing_count"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"missing_percent"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;
However, this is not the optimal way to use GE. The documentation states that it is better to properly configure the data sources and generate a standard directory structure. This is done through a &lt;i&gt;Data Context&lt;/i&gt; and requires some scaffolding and generating some files using the command line:
&lt;/p&gt;


&lt;pre class="console"&gt;
[miguelc@machine]$ great_expectations --v3-api init

Using v3 (Batch Request) API

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-&amp;lt;
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~

Let's configure a new Data Context.

First, Great Expectations will create a new directory:

    great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- notebooks
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- documentation
 (...)
&lt;/pre&gt;

&lt;p&gt;
Basically, the process goes as follows:
&lt;/p&gt;

&lt;ol class="org-ol"&gt;
&lt;li&gt;Generate the directory structure (using for example the command above)&lt;/li&gt;
&lt;li&gt;Generate a new data source. This opens a Jupyter notebook where you configure the data source and store the configuration under &lt;code&gt;great_expectations.yml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Create the expectation suite, using the &lt;a href="https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html#expectation-glossary"&gt;built-in expectations&lt;/a&gt;, again via Jupyter Notebooks. The expectations are stored as &lt;code&gt;json&lt;/code&gt; in the &lt;code&gt;expectations&lt;/code&gt; directory. A nice way to get started is to use the automated data profiler, which examines the data source and generates the expectations.&lt;/li&gt;
&lt;li&gt;Once you execute the notebook, the data docs are shown. &lt;a href="https://docs.greatexpectations.io/en/latest/reference/core_concepts.html#data-docs"&gt;Data docs&lt;/a&gt; show the result of the expectations and other metadata in a nice HTML format that can be useful for finding out more about the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
Once you have created the initial set of expectations, you can edit them using the command &lt;code&gt;great_expectations --v3-api suite edit articles.warning&lt;/code&gt;. You will have to choose whether you want to interact with a batch (sample) of the data or not. This will also open a Notebook where, depending on your choice, you will be able to edit the existing expectations in &lt;a href="https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_and_editing_expectations/how_to_create_a_new_expectation_suite_without_a_sample_batch.html"&gt;slightly different ways&lt;/a&gt;.
&lt;/p&gt;


&lt;p&gt;
Now that you have your expectations set up, you can use them to validate a new batch of data. For that, you need to learn an additional concept called &lt;a href="https://docs.greatexpectations.io/en/latest/reference/core_concepts/checkpoints_and_actions.html#checkpoints-and-actions"&gt;Checkpoints&lt;/a&gt;. A Checkpoint bundles Batches of data with corresponding Expectation Suites for validation. To create a Checkpoint you need, you guessed it, another command line invocation and another Jupyter Notebook.
&lt;/p&gt;

&lt;pre class="console"&gt;
[miguelc@machine]$ great_expectations --v3-api checkpoint new my_checkpoint
&lt;/pre&gt;

&lt;p&gt;
When you execute the above command, it opens a Jupyter Notebook where you can configure a bunch of things using YAML. The key idea is that with this Checkpoint you link an &lt;code&gt;expectation_suite&lt;/code&gt; with a particular data asset coming from a data source.
&lt;/p&gt;

&lt;p&gt;
Optionally, you can run the Checkpoint (the full expectation suite on the data source) and see the results in the already familiar data_docs interface.
&lt;/p&gt;

&lt;p&gt;
As for deployment, one pattern would be to run the Checkpoint as a task in some sort of workflow manager (such as &lt;a href="https://legacy.docs.greatexpectations.io/en/latest/guides/how_to_guides/validation/how_to_run_a_checkpoint_in_airflow.html#how-to-guides-validation-how-to-run-a-checkpoint-in-airflow"&gt;Airflow&lt;/a&gt; or Luigi). You can also run Checkpoints programmatically using Python or straight from the &lt;a href="https://legacy.docs.greatexpectations.io/en/latest/guides/how_to_guides/validation/how_to_run_a_checkpoint_in_terminal.html#how-to-guides-validation-how-to-run-a-checkpoint-in-terminal"&gt;terminal&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
I recently found out that if you use &lt;a href="https://www.getdbt.com/"&gt;dbt&lt;/a&gt;, you get GE installed by default, and it can be used to extend the unit tests of the SQL queries you write.
&lt;/p&gt;
&lt;/div&gt;


&lt;div id="outline-container-orge1969a0" class="outline-4"&gt;
&lt;h4 id="orge1969a0"&gt;The Good&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orge1969a0"&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Interactive validation and expectation testing. The instant feedback helps to refine and add checks for data.&lt;/li&gt;
&lt;li&gt;When an expectation fails, you get a sample of the data that makes the expectation fail. This is useful for debugging.&lt;/li&gt;
&lt;li&gt;It is not limited to pandas dataframes; it comes with support for many data sources, including SQL databases (via SQLAlchemy) and Spark dataframes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-orgdc125f8" class="outline-4"&gt;
&lt;h4 id="orgdc125f8"&gt;The not so good&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orgdc125f8"&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;It feels heavy. Getting started is not easy, as there are many concepts to master.&lt;/li&gt;
&lt;li&gt;Although it might seem natural for many potential users, the coupling with Jupyter Notebook/Lab might make some uncomfortable.&lt;/li&gt;
&lt;li&gt;Expectations are stored as JSON instead of code.&lt;/li&gt;
&lt;li&gt;They received some funding recently and are changing many of the already existing (and already numerous) concepts and APIs, making the learning process even more challenging.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org6f0e539" class="outline-2"&gt;
&lt;h2 id="org6f0e539"&gt;Pandera&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org6f0e539"&gt;
&lt;p&gt;
&lt;a href="https://pandera.readthedocs.io/en/stable/"&gt;Pandera&lt;/a&gt; [&lt;a class="org-ref-reference" href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#niels_bantilan-proc-scipy-2020"&gt;niels_bantilan-proc-scipy-2020&lt;/a&gt;] is a "statistical data validation for pandas" library. Using Pandera is simple: after installing the package, you define a Schema object where each column has a set of checks. Columns might optionally be nullable; that is, checking for nulls is not a check per se but a quality/characteristic of a column.
&lt;/p&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandera&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pa&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="p"&gt;[&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"awesome thing."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"available at http://thingb.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy D"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"checkout https://thingd.ca"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Thingy E"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="p"&gt;],&lt;/span&gt;
       &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"productName"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrameSchema&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
	&lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;greater_than_or_equal_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
	&lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
	&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s2"&gt;"productName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;validated_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validated_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;
If you run the validation, an exception will be raised:
&lt;/p&gt;

&lt;pre class="example" id="orga50d9f4"&gt;
Traceback (most recent call last):
  File "&amp;lt;stdin&amp;gt;", line 26, in &amp;lt;module&amp;gt;
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 648, in __call__
    return self.validate(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 594, in validate
    error_handler.collect_error("schema_component_check", err)
  File ".../lib/python3.9/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 586, in validate
    result = schema_component(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 1826, in __call__
    return self.validate(
  File ".../lib/python3.9/site-packages/pandera/schema_components.py", line 214, in validate
    validate_column(check_obj, column_name)
  File ".../lib/python3.9/site-packages/pandera/schema_components.py", line 187, in validate_column
    super(Column, copy(self).set_name(column_name)).validate(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 1720, in validate
    error_handler.collect_error(
  File ".../lib/python3.9/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: non-nullable series 'description' contains null values: {2: None, 4: None}
&lt;/pre&gt;


&lt;p&gt;
The code looks similar to that of other data validation libraries (e.g. &lt;a href="https://marshmallow.readthedocs.io/en/stable/"&gt;Marshmallow&lt;/a&gt;). Compared to GE, Pandera also offers a schema abstraction, which you may or may not like.
&lt;/p&gt;


&lt;p&gt;
With Pandera, if a check fails, it raises a proper exception (you can disable this and turn it into a &lt;code&gt;RuntimeWarning&lt;/code&gt;). Depending on how you want to integrate the checks into a larger pipeline, this might be useful or plainly annoying. Furthermore, if you look closely, Pandera reports only the first failing column as the cause of the validation error, even though more than one column violates the specification.
&lt;/p&gt;
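&lt;p&gt;
The underlying idea of collecting every failure before raising, instead of stopping at the first offending column, can be sketched with plain Python (the rows and checks below are illustrative only, not Pandera's API):
&lt;/p&gt;

```python
# Sketch: run all column checks, collect every failure, report them together.
rows = {
    "description": ["awesome thing.", None, None, "checkout https://thingd.ca", None],
    "numViews": [0, 0, 5, 10, 12],
}

checks = {
    "description": lambda v: v is not None,           # non-nullable
    "numViews": lambda v: v is not None and v >= 0,   # non-negative
}

failures = []
for column, check in checks.items():
    for idx, value in enumerate(rows[column]):
        if not check(value):
            failures.append((column, idx, value))

# Report *all* offending columns/rows, not just the first one found.
print(len(failures), "values failed validation:", failures)
```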

&lt;p&gt;
Given that this is a Python library, it is relatively easy to integrate into any existing pipeline: it can be a task in Luigi/Airflow, for example, or part of a larger job.
&lt;/p&gt;
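&lt;p&gt;
For illustration, such a task boils down to validate-then-process. The &lt;code&gt;ToySchema&lt;/code&gt; below is a hypothetical stand-in for a Pandera &lt;code&gt;DataFrameSchema&lt;/code&gt; whose &lt;code&gt;validate&lt;/code&gt; raises on bad data; the real exception would be &lt;code&gt;pandera.errors.SchemaError&lt;/code&gt;:
&lt;/p&gt;

```python
# Sketch of a validate-then-process pipeline step. ToySchema is a
# hypothetical stand-in for a Pandera DataFrameSchema: validate()
# raises an exception when the data does not comply.
class ToySchema:
    def validate(self, records):
        for record in records:
            if record.get("id") is None:
                raise ValueError("non-nullable column 'id' contains null")
        return records

def pipeline_step(records, schema):
    validated = schema.validate(records)           # fail fast on bad data
    return [record["id"] for record in validated]  # the actual processing

print(pipeline_step([{"id": 1}, {"id": 2}], ToySchema()))
```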
&lt;/div&gt;

&lt;div id="outline-container-org79fa728" class="outline-4"&gt;
&lt;h4 id="org79fa728"&gt;The Good&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-org79fa728"&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Familiar API based on schema checking that makes the library easy to get started with.&lt;/li&gt;
&lt;li&gt;Support for hypothesis testing on the columns.&lt;/li&gt;
&lt;li&gt;Data profiling and recommendation of checks that could be relevant.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="outline-container-orgc3b4b22" class="outline-4"&gt;
&lt;h4 id="orgc3b4b22"&gt;The not so good&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orgc3b4b22"&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Very few checks are included under the &lt;code&gt;pa.Check&lt;/code&gt; class.&lt;/li&gt;
&lt;li&gt;The error message is not very informative if the check is done through a lambda function.&lt;/li&gt;
&lt;li&gt;Errors during the checking procedure will raise a run-time exception by default.&lt;/li&gt;
&lt;li&gt;It apparently only works with Pandas; it is not clear whether it would work with other dataframe implementations or with Spark.&lt;/li&gt;
&lt;li&gt;I did not find a way to test for properties on the size of the dataframe, or to do comparisons across different runs (e.g. the number of rows should not decrease between runs of the check).&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org312ed38" class="outline-2"&gt;
&lt;h2 id="org312ed38"&gt;Deequ/PyDeequ&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org312ed38"&gt;
&lt;p&gt;
Last but not least, let us talk about Deequ [&lt;a class="org-ref-reference" href="http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html#Schelter2018"&gt;Schelter2018&lt;/a&gt;]. Deequ is a data checking library written in Scala and targeted at Spark/PySpark dataframes; it aims to validate large datasets by leveraging Spark's optimizations to run in a performant manner. PyDeequ, as the name implies, is a Python wrapper offering the same API for PySpark.
&lt;/p&gt;

&lt;p&gt;
The idea behind Deequ is to create "&lt;i&gt;unit tests for data&lt;/i&gt;". To do that, Deequ computes &lt;code&gt;Metrics&lt;/code&gt; through &lt;code&gt;Analyzers&lt;/code&gt;, and assertions are verified against those metrics. A &lt;code&gt;Check&lt;/code&gt; is a set of assertions to be verified. One interesting feature of (Py)Deequ is that it allows comparing metrics across different runs, enabling assertions on changes in the data (e.g. an unexpected jump in the number of rows of a dataframe).
&lt;/p&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydeequ.checks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Check&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydeequ.verification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VerificationSuite&lt;/span&gt;

&lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CheckLevel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Review Check"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;checkResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;VerificationSuite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;addCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
	&lt;span class="n"&gt;check&lt;/span&gt;
	&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hasSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sz&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# we expect 5 rows&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isComplete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# should never be None/Null&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isUnique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# should not contain duplicates&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isComplete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"productName"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# should never be None/Null&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isContained_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isNonNegative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
	  &lt;span class="c1"&gt;# at least half of the descriptions should contain a url&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containsUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
	  &lt;span class="c1"&gt;# half of the items should have less than 10 views&lt;/span&gt;
	  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hasQuantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"numViews"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
	&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;checkResult_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VerificationResult&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checkResultsAsDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;checkResult_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;
After calling run, PyDeequ computes some metrics on the data. Afterwards it
invokes your assertion functions (e.g. &lt;code&gt;lambda sz: sz == 5&lt;/code&gt; for the size check)
on those metrics to see whether the constraints hold. The computed metrics
can be stored in a &lt;code&gt;MetricRepository&lt;/code&gt; (e.g. on S3 or disk) for future
reference and for comparisons between metrics of different runs.
&lt;/p&gt;
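&lt;p&gt;
The repository idea can be sketched with the standard library alone: persist the metrics of each run (here as JSON files on disk) and compare the newest run against the previous one. The file layout and metric names are made up for illustration; (Py)Deequ's actual repository API differs:
&lt;/p&gt;

```python
import json
import pathlib
import tempfile

# Sketch: persist per-run metrics and compare the latest two runs.
repo = pathlib.Path(tempfile.mkdtemp())

def save_metrics(run_id, metrics):
    (repo / ("run-%03d.json" % run_id)).write_text(json.dumps(metrics))

def last_two_runs():
    files = sorted(repo.glob("run-*.json"))
    return [json.loads(f.read_text()) for f in files[-2:]]

save_metrics(1, {"Size": 5, "Completeness(description)": 0.6})
save_metrics(2, {"Size": 3, "Completeness(description)": 0.66})

prev, curr = last_two_runs()
# e.g. assert that the number of rows did not decrease between runs
rows_dropped = prev["Size"] > curr["Size"]
print("rows dropped since last run:", rows_dropped)
```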

&lt;p&gt;
(Py)Deequ allows for differential computation of metrics: the metrics
calculated for a dataset can be updated when the data grows, without
recomputing them from the whole dataset.
&lt;/p&gt;
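&lt;p&gt;
The trick is to keep metric &lt;i&gt;state&lt;/i&gt; that can be merged: for a mean, store the pair (count, sum) so a new batch only updates the state instead of rescanning everything. A stdlib-only sketch of the idea (not Deequ's actual implementation):
&lt;/p&gt;

```python
# Sketch: incremental mean via a mergeable (count, sum) state.
def init_state(values):
    return (len(values), sum(values))

def update_state(state, new_values):
    count, total = state
    return (count + len(new_values), total + sum(new_values))

def mean(state):
    count, total = state
    return total / count

state = init_state([0, 0, 5, 10, 12])  # initial dataset
state = update_state(state, [7, 9])    # new data arrives: no rescan needed
print(mean(state))                     # mean over all 7 values
```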

&lt;p&gt;
Another unique feature of (Py)Deequ is anomaly detection: whereas
GreatExpectations allows for single thresholds, (Py)Deequ allows checks
based on a running average and standard deviation of the computed metrics.
&lt;/p&gt;
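&lt;p&gt;
Checking a new metric value against the mean and standard deviation of its history can be sketched as follows; the three-standard-deviations threshold is an arbitrary illustrative choice, not Deequ's default:
&lt;/p&gt;

```python
import statistics

# Sketch: flag a metric value as anomalous when it is more than
# k standard deviations away from the mean of its history.
def is_anomalous(history, value, k=3.0):
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > k * stdev

sizes = [100, 102, 98, 101, 99]   # row counts of previous runs
print(is_anomalous(sizes, 100))   # within the normal range
print(is_anomalous(sizes, 500))   # an unexpected jump
```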

&lt;p&gt;
Similar to Pandera, PyDeequ is easy to integrate into your existing code base,
as it is PySpark/Python code.
&lt;/p&gt;
&lt;/div&gt;

&lt;div id="outline-container-org6c98443" class="outline-4"&gt;
&lt;h4 id="org6c98443"&gt;Deequ for Pandas DataFrames&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-org6c98443"&gt;
&lt;p&gt;
You might be wondering whether you can use (Py)Deequ with Pandas, and sadly it is not possible. However, almost a year ago I developed an experimental port of Deequ to Pandas, called &lt;a href="https://github.com/mfcabrera/hooqu"&gt;Hooqu&lt;/a&gt;.
Due to personal constraints I haven't been able to maintain it, but it is still functional (albeit relying on a lot of Pandas hacks) and you can install it via pip.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org6d52e57" class="outline-4"&gt;
&lt;h4 id="org6d52e57"&gt;The Good&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-org6d52e57"&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Use PySpark to parallelize otherwise expensive checks.&lt;/li&gt;
&lt;li&gt;Support for external metric repositories.&lt;/li&gt;
&lt;li&gt;Data profiling.&lt;/li&gt;
&lt;li&gt;Constraint suggestion.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="outline-container-orgdfda640" class="outline-4"&gt;
&lt;h4 id="orgdfda640"&gt;The not so good&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orgdfda640"&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;This is not a pure Python project but a wrapper over a Scala/Spark library, so the code might not look Pythonic.&lt;/li&gt;
&lt;li&gt;It only makes sense to use it if you are already using a (Py)Spark cluster.&lt;/li&gt;
&lt;li&gt;It is your responsibility to load the data from wherever it resides into a Spark dataframe; there are no off-the-shelf "connectors" or "loaders".&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org85bc633" class="outline-2"&gt;
&lt;h2 id="org85bc633"&gt;Comparison table&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org85bc633"&gt;
&lt;p&gt;
Let's finish with a table summarizing the features of the different libraries:
&lt;/p&gt;


&lt;table border="0" cellspacing="0" cellpadding="6" rules="groups" frame="hsides" class="dataframe table table-bordered table-hover center" align="center"&gt;


&lt;colgroup&gt;
&lt;col class="org-left"&gt;

&lt;col class="org-left"&gt;

&lt;col class="org-left"&gt;

&lt;col class="org-left"&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th scope="col" class="org-left"&gt; &lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;GreatExpectations&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;Pandera&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;PyDeequ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class="org-left"&gt;Checks Extension dimension  (Values)&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Checks the intension dimension (Schema)&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt; Pandas support &lt;sup&gt;1&lt;/sup&gt; &lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Spark support&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Multiple data sources (Database loaders, etc.)&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Data Profiling&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Constraint/Check Suggestion&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Hypothesis Testing&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Incremental computation of the checks&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt;Simple Anomaly Detection&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-left"&gt; Complex Anomaly Detection &lt;sup&gt;2&lt;/sup&gt; &lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-remove"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;div class="text-center"&gt; &lt;span class="glyphicon glyphicon-ok"&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;


&lt;ol class="org-ol"&gt;
&lt;li&gt;Hooqu offers a PyDeequ-like API for Pandas dataframes.&lt;/li&gt;
&lt;li&gt;Using running averages and standard deviations computed incrementally.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;


&lt;div id="outline-container-org0e93ee6" class="outline-2"&gt;
&lt;h2 id="org0e93ee6"&gt;Final Notes&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org0e93ee6"&gt;
&lt;p&gt;
So, after all this deluge of information, which library should you use? All of these libraries have their strong points, and the best choice will depend on your goal, the environment you are familiar with, and the sort of checks you want to perform.
&lt;/p&gt;

&lt;p&gt;
For small Pandas-heavy projects, I would recommend Pandera (or Hooqu if you are a brave soul). If your organization is larger, you like Jupyter notebooks, and you do not mind the learning curve, I would recommend GreatExpectations as it currently has a lot of traction. If you write your pipelines mostly in (Py)Spark and you care about performance, I would go for (Py)Deequ. Both Deequ and PyDeequ are Apache-licensed, are easy to integrate with your codebase, and will make better use of your Spark cluster.
&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;&lt;h2 class="org-ref-bib-h2"&gt;References&lt;/h2&gt;
&lt;ul class="org-ref-bib"&gt;&lt;li&gt;&lt;a id="batini09"&gt;[batini09]&lt;/a&gt; &lt;a name="batini09"&gt;&lt;/a&gt;Carlo Batini, Cinzia Cappiello, Chiara, Francalanci &amp;amp; Andrea Maurino, Methodologies for Data Quality Assessment and  Improvement, &lt;i&gt;ACM Computing Surveys&lt;/i&gt;, &lt;b&gt;41(3)&lt;/b&gt;, 1-52 (2009). &lt;a href="https://doi.org/10.1145/1541880.1541883"&gt;link&lt;/a&gt;. &lt;a href="http://dx.doi.org/10.1145/1541880.1541883"&gt;doi&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a id="Schelter2018"&gt;[Schelter2018]&lt;/a&gt; &lt;a name="Schelter2018"&gt;&lt;/a&gt;Schelter, Lange, Schmidt, Celikel, Biessmann &amp;amp; Grafberger, Automating large-scale data quality verification, 1781-1794, in in: Proceedings of the VLDB Endowment, edited by Association for Computing Machinery (2018)&lt;/li&gt;
&lt;li&gt;&lt;a id="niels_bantilan-proc-scipy-2020"&gt;[niels_bantilan-proc-scipy-2020]&lt;/a&gt; &lt;a name="niels_bantilan-proc-scipy-2020"&gt;&lt;/a&gt;Niels Bantilan,  pandera: Statistical Data Validation of Pandas Dataframes ,  116 - 124 , in in:  Proceedings of the 19th Python in Science Conference , edited by Meghann Agarwal, Chris Calloway, Dillon Niederhut &amp;amp; David Shupe, ( 2020 )&lt;/li&gt;
&lt;/ul&gt;

&lt;/div&gt;
&lt;/div&gt;</description><category>dataqa</category><category>machine_learning</category><category>pandas</category><category>spark</category><guid>http://mfcabrera.com/blog/pandas-dataa-validation-machine-learning.html</guid><pubDate>Thu, 21 Oct 2021 08:00:00 GMT</pubDate></item></channel></rss>