Data Quality Validation for Python Dataframes

TL;DR

In this blog post, I review some interesting libraries for checking the quality of the data using Pandas and Spark data frames (and similar implementations). This is not a tutorial (I was actually trying out some of the tools while I wrote) but rather a review of sorts, so expect to find some opinions along the way.

Intro - Why Data Quality?

Data quality might be one of the areas data scientists tend to overlook the most. Why? Well, let's face it: it is boring, and most of the time it is cumbersome to perform data validation. Furthermore, sometimes you do not know if your effort is going to pay off. Luckily, some libraries can help with this laborious task and standardize the process in a data science team or even across an organization.

But first things first. Why would I choose to spend my time doing data quality checks when I could be writing some amazing code that trains a bleeding-edge deep convolutional logistic regression? Here are a couple of reasons:

  • It is hard to ensure data constraints in the source system. Particularly true for legacy systems.
  • Companies rely on data to guide business decisions (forecasting, buying decisions), and missing or incorrect data affects those decisions.
  • ML systems are increasingly fed with this data, and they are often highly sensitive to it, since the deployed model relies on assumptions about the characteristics of its inputs.
  • Subtle errors introduced by changes in the data can be hard to detect.

Data Quality Dimensions

The quality of the data can refer to the extension of the data (data values) or to the intension (not a typo) of the data (schema) [batini09].

Extension Dimension

Extracted from [Schelter2018]:

Completeness
The degree to which an entity includes the data required to describe a real-world object. Presence of null values (missing values). Depends on the context.

Example: Notebooks might not have the shirt_size property.

Consistency
The degree to which a set of semantic rules are violated.
  • Valid range of values (e.g. sizes {S, M, L})
  • There might be intra-relation constraints, e.g. if the category is "shoes" then the sizes should be in the range 30-50.
  • Inter-relation constraints may involve multiple tables and columns, e.g. product_id may only contain entries from the product table.
Accuracy
The correctness of the data. It can be measured in two ways: syntactically and semantically.
Syntactic
Compares the representation of a value with a corresponding definition domain.
Semantic
Compares a value with its real world representation.

Example: blue is a syntactically valid value for the column color (even if a product is of color red). XL would be neither semantically nor syntactically accurate.

Most of the data quality libraries I am going to explore focus on the extension dimension. This is particularly important when the ingested data comes from semi-structured or non-curated sources. It is on the extension of the data where the richest set of checks can be done (checking the schema only verifies that a field is of a certain type, but not additional logic such as which values are valid for a string field).
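
To make the distinction concrete, here is a tiny hypothetical example: a schema-level (intension) check only looks at the type of a column, while a value-level (extension) check looks at its actual contents.

import pandas as pd

sizes = pd.DataFrame({"size": ["S", "M", "XXXL"]})

# Intension (schema): the column holds strings -- this passes.
print(pd.api.types.is_string_dtype(sizes["size"]))  # True

# Extension (values): every value is in the allowed domain -- this fails for "XXXL".
print(sizes["size"].isin({"S", "M", "L"}).all())  # False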

Libraries

The following are the libraries I will quickly evaluate. The idea is to show how writing quality checks works in each of them and to describe a bit of the workflow. I selected these libraries as they are the ones I have either been using, reading about, or seeing at conferences. If there is a library that you think should make the list, please let me know in the comment section.

Sample Data

I will use a sample dataset to show how the different libraries check similar properties:

import pandas as pd
df = pd.DataFrame(
       [
	   (1, "Thingy A", "awesome thing.", "high", 0),
	   (2, "Thingy B", "available at http://thingb.com", None, 0),
	   (3, None, None, "low", 5),
	   (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
	   (5, "Thingy E", None, "high", 12),
       ],
       columns=["id", "productName", "description", "priority", "numViews"]
)

   id  productName  description                     priority  numViews
0   1  Thingy A     awesome thing.                  high             0
1   2  Thingy B     available at http://thingb.com  None             0
2   3  None         None                            low              5
3   4  Thingy D     checkout https://thingd.ca      low             10
4   5  Thingy E     None                            high            12

Things that I will check on this toy data:

  • there are 5 rows in total.
  • values of the id attribute are never Null/None and unique.
  • values of the productName attribute are never null/None.
  • the priority attribute can only contain "high" or "low" as value.
  • numViews should not contain negative values.
  • at least half of the values in description should contain a url.
  • the median of numViews should be less than or equal to 10.
  • the productName column contents match the regex r'Thingy [A-Z]+' (a plain-pandas sketch of all these checks follows below).
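
Before looking at the libraries, here is a rough plain-pandas sketch of these checks (no validation library involved), just to make the properties concrete; note that a few of them intentionally fail on the toy data:

checks = {
    "has exactly 5 rows": len(df) == 5,
    "id is never null and unique": df["id"].notnull().all() and df["id"].is_unique,
    "productName is never null": df["productName"].notnull().all(),
    "priority is only 'high' or 'low'": df["priority"].dropna().isin({"high", "low"}).all(),
    "numViews is non-negative": (df["numViews"] >= 0).all(),
    "at least half of descriptions contain a url": df["description"].str.contains("http", na=False).mean() >= 0.5,
    "median of numViews is <= 10": df["numViews"].median() <= 10,
    "productName matches 'Thingy [A-Z]+'": df["productName"].str.match(r"Thingy [A-Z]+", na=False).all(),
}
for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")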

Great Expectations

Calling Great Expectations (GE) a library is a bit of an understatement. This is a full-fledged framework for data validation, leveraging existing tools like Jupyter Notebook and integrating with several data stores, both for validating the data originating from them and for storing the validation results.

The main concept of Great Expectations (GE) is, well, the expectation, which, as the name indicates, runs an assertion on the expected values of a particular column.

The simplest way to use GE is to wrap the dataframe or data source with a GE DataSet and quickly check individual conditions. This is useful for exploring the data and refining the data quality check.

import great_expectations as ge
ge_df = ge.from_pandas(df)
ge_df.expect_table_row_count_to_equal(5)
ge_df.expect_column_values_to_not_be_null("id")
ge_df.expect_column_values_to_not_be_null("description")
ge_df.expect_column_values_to_be_in_set("priority", {"high", "low"})
ge_df.expect_column_values_to_be_between("numViews", 0)
print(ge_df.expect_column_median_to_be_between("numViews", 0, 10))

If run interactively in a notebook, for each expectation we get a JSON representation of it, as well as some metadata regarding the values and whether the expectation failed:

{
  "expectation_config": {
    "meta": {},
    "expectation_type": "expect_column_median_to_be_between",
    "kwargs": {
      "column": "numViews",
      "min_value": 0,
      "max_value": 10,
      "result_format": "BASIC"
    }
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": 5.0,
    "element_count": 5,
    "missing_count": null,
    "missing_percent": null
  }
}

However, this is not the optimal way to use GE. The documentation states that it is better to properly configure the data sources and generate a standard directory structure. This is done through a Data Context and requires some scaffolding and generating files using the command line:

[miguelc@machine]$ great_expectations --v3-api init

Using v3 (Batch Request) API

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-<
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~

Let's configure a new Data Context.

First, Great Expectations will create a new directory:

    great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- notebooks
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- documentation
 (...)

Basically, the process goes as follows:

  1. Generate the directory structure (using for example the command above)
  2. Generate a new data source. This opens a Jupyter notebook where you configure the data source and store the configuration under great_expectations.yml.
  3. Create the expectation suite, again using Jupyter Notebooks and the built-in expectations. The expectations are stored as JSON in the expectations directory. A nice way to get started is to use the automated data profiler that examines the data source and generates the expectations.
  4. Once you execute the notebook, the data docs are shown. Data docs show the result of the expectations and other metadata in a nice HTML format that can be useful to learn more about the data.

Once you have created the initial set of expectations, you can edit them using the command great_expectations --v3-api suite edit articles.warning. You will have to choose whether you want to interact with a batch (sample) of data or not. This will also open a Notebook where, depending on your choice, you will be able to edit the existing expectations in slightly different ways.

Now that you have your expectations set up, you can use them to validate a new batch of data. For that, you need to learn an additional concept called Checkpoints. A Checkpoint bundles Batches of data with corresponding Expectation Suites for validation. To create a checkpoint you need, you guessed right, another nice command line call and another Jupyter Notebook.

[miguelc@machine]$ great_expectations --v3-api checkpoint new my_checkpoint

If you execute the above command, it will open a Jupyter Notebook where you can configure a bunch of stuff using YAML. The key idea is that with a Checkpoint you link an expectation suite with a particular data asset coming from a data source.

Optionally, you can run the checkpoint (the full expectation suite on the data source) and see the results in the already familiar data_docs interface.

As for deployment, one pattern would be to run the checkpoint as a task in some sort of workflow manager (such as Airflow or Luigi). You can also run Checkpoints programmatically using Python or straight from the terminal.
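
For illustration, a minimal sketch of running a Checkpoint programmatically with the v3 API, assuming the project scaffolding from great_expectations init is in place and a checkpoint named my_checkpoint exists (the exact call may differ slightly between GE versions):

from great_expectations.data_context import DataContext

context = DataContext()  # picks up the great_expectations/ directory of the project
result = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(result["success"])  # overall pass/fail of the validation run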

I recently found out that if you use dbt, you get GE installed by default, and it can be used to extend the unit tests of the SQL queries you write.

The Good

  • Interactive validation and expectation testing. The instant feedback helps to refine and add checks for data.
  • When an expectation fails, you get a sample of the data that makes the expectation fail. This is useful for debugging.
  • It is not limited to pandas data frames, it comes with support for many data sources including SQL databases (via SQLAlchemy) and Spark dataframes.

The not so good

  • It seems heavy and full of concepts; getting started might not be that easy, as there is a lot to master.
  • Although it might seem natural for many potential users, the coupling with Jupyter Notebook/Lab might make some uncomfortable.
  • Expectations are stored as JSON instead of code.
  • They received some funding recently and they are changing many of the already existing (and already numerous) concepts and APIs, making the learning process even more challenging.

Pandera

Pandera [niels_bantilan-proc-scipy-2020] is a "statistical data validation" library for pandas. Using Pandera is simple: after installing the package, you define a schema object where each column has a set of checks. Columns can optionally be marked as nullable; that is, checking for nulls is not a check per se but a quality/characteristic of a column.

import pandas as pd
import pandera as pa

df = pd.DataFrame(
       [
	   (1, "Thingy A", "awesome thing.", "high", 0),
	   (2, "Thingy B", "available at http://thingb.com", None, 0),
	   (3, None, None, "low", 5),
	   (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
	   (5, "Thingy E", None, "high", 12),
       ],
       columns=["id", "productName", "description", "priority", "numViews"]
)

schema = pa.DataFrameSchema({
    "id": pa.Column(int, nullable=False),
    "description": pa.Column(str, nullable=False),
    "priority": pa.Column(str, checks=pa.Check.isin(["high", "low"]), nullable=True),
    "numViews": pa.Column(int, checks=[
	pa.Check.greater_than_or_equal_to(0),
	pa.Check(lambda c: c.median() >= 0 and c.median() <= 10)
	]
    ),
    "productName": pa.Column(str, nullable=False),

})

validated_df = schema(df)
print(validated_df)

If you run the validation an exception will be raised:

Traceback (most recent call last):
  File "<stdin>", line 26, in <module>
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 648, in __call__
    return self.validate(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 594, in validate
    error_handler.collect_error("schema_component_check", err)
  File ".../lib/python3.9/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 586, in validate
    result = schema_component(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 1826, in __call__
    return self.validate(
  File ".../lib/python3.9/site-packages/pandera/schema_components.py", line 214, in validate
    validate_column(check_obj, column_name)
  File ".../lib/python3.9/site-packages/pandera/schema_components.py", line 187, in validate_column
    super(Column, copy(self).set_name(column_name)).validate(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 1720, in validate
    error_handler.collect_error(
  File ".../lib/python3.9/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: non-nullable series 'description' contains null values: {2: None, 4: None}

The code looks similar to other data validation libraries (e.g. Marshmallow). Also, compared to GE, the library offers the Schema abstraction, which you might or might not like.

With Pandera, if a check fails, it will raise a proper exception (you can disable this and turn it into a RuntimeWarning). Depending on how you want to integrate the checks into the larger pipeline, this might be useful or plainly annoying. Furthermore, if you look closely, Pandera only reports one validation error as the cause of the failure, even though more than one column does not comply with the specification.
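
As an aside, if you want to collect all the failures at once instead of stopping at the first one, pandera has (as far as I know) a lazy validation mode; a small sketch, reusing the schema defined above:

import pandera as pa

try:
    schema.validate(df, lazy=True)  # collect all failures instead of raising on the first one
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # a dataframe listing every failing column, check and value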

Given that this is a Python library, it is relatively easy to integrate into any existing pipeline. It can be a task in Luigi/Airflow, for example, or something that runs as part of a larger task.

The Good

  • Familiar API based on schema checking that makes the library easy to get started with.
  • Support for hypothesis testing on the columns.
  • Data profiling and recommendation of checks that could be relevant.

The not so good

  • Very few checks are included under the pa.Check class.
  • The message is not very informative if the check is done through a lambda function.
  • Errors during the checking procedure will raise a run-time exception by default.
  • It apparently only works with Pandas; it is not clear whether it would work with other dataframe implementations or with Spark.
  • I did not find a way to test for properties on the size of the dataframe or to do comparisons across different runs (i.e. the number of rows should not decrease between runs of the check).

Deequ/PyDeequ

Last but not least, let us talk about Deequ [Schelter2018]. Deequ is a data checking library written in Scala, targeted at Spark/PySpark dataframes; it aims to check large datasets, making use of Spark optimizations to run in a performant manner. PyDeequ, as the name implies, is a Python wrapper offering the same API for PySpark.

The idea behind Deequ is to create "unit tests for data". To do that, Deequ calculates Metrics through Analyzers, and assertions are verified against those metrics. A Check is a set of assertions to be checked. One interesting feature of (Py)Deequ is that it allows comparing metrics across different runs, making it possible to assert on changes in the data (e.g. an unexpected jump in the number of rows of a dataframe).

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# spark is an existing SparkSession and df a Spark dataframe with the sample data
check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check
        .hasSize(lambda sz: sz == 5)  # we expect 5 rows
        .isComplete("id")  # should never be None/Null
        .isUnique("id")  # should not contain duplicates
        .isComplete("productName")  # should never be None/Null
        .isContainedIn("priority", ["high", "low"])
        .isNonNegative("numViews")
        # at least half of the descriptions should contain a url
        .containsURL("description", lambda d: d >= 0.5)
        # half of the items should have less than 10 views
        .hasApproxQuantile("numViews", 0.5, lambda v: v <= 10)
    )
    .run()
)

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

After calling run, PyDeequ will compute some metrics on the data. Afterwards it invokes your assertion functions (e.g. lambda sz: sz == 5 for the size check) on these metrics to see if the constraints hold. The calculated metrics can be stored in a MetricsRepository (e.g. on S3 or disk) for future reference and to make comparisons between metrics of different runs.
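
As a sketch of how a metrics repository can be wired in, reusing the spark session, dataframe, and check from the example above (the file path and tag are made up for illustration):

from pydeequ.repository import FileSystemMetricsRepository, ResultKey

# Store the computed metrics in a local JSON file (an S3 path works similarly on EMR).
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, "metrics.json")
repository = FileSystemMetricsRepository(spark, metrics_file)
key = ResultKey(spark, ResultKey.current_milli_time(), {"tag": "review_check"})

checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .useRepository(repository)
    .addCheck(check)
    .saveOrAppendResult(key)
    .run()
)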

(Py)Deequ allows for incremental computation of metrics: the metrics calculated for a dataset can be updated when the data grows, without having to recalculate them over the whole dataset.

Another unique feature of (Py)Deequ is anomaly detection: whereas Great Expectations allows for single thresholds, (Py)Deequ allows for checks based on a running average and standard deviation of the calculated metrics.
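
A sketch of what such an anomaly check looks like in PyDeequ, reusing the repository and key from the previous snippet; the strategy and threshold are illustrative:

from pydeequ.analyzers import Size
from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy

# Warn if the row count more than doubles compared to the previously stored runs.
anomalyResult = (
    VerificationSuite(spark)
    .onData(df)
    .useRepository(repository)
    .saveOrAppendResult(key)
    .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease=2.0), Size())
    .run()
)
VerificationResult.checkResultsAsDataFrame(spark, anomalyResult).show()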

Similar to Pandera, PyDeequ is easy to integrate into your existing codebase, as it is PySpark/Python code.

Deequ for Pandas DataFrames

You might be wondering if you can use (Py)Deequ for Pandas, and sadly it is not possible. However, almost a year ago I developed an experimental port of Deequ to Pandas. I called it Hooqu. Due to personal constraints I haven't been able to maintain it, but it is still functional (albeit relying on a lot of Pandas hacks) and you can install it via pip.

The Good

  • Use PySpark to parallelize otherwise expensive checks.
  • Support for external metric repositories.
  • Data profiling.
  • Constraint suggestion.
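
For the profiling and constraint suggestion part, PyDeequ ships a ConstraintSuggestionRunner; a minimal sketch, again assuming an existing SparkSession (spark) and the dataframe from the example above:

import json
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(DEFAULT())
    .run()
)
print(json.dumps(suggestionResult, indent=2))  # suggested checks per column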

The not so good

  • This is not a pure Python project but rather a wrapper over a Scala/Spark library, so the code might not look pythonic.
  • It only makes sense to use it if you are already using a (Py)Spark cluster.
  • It is your responsibility to load the data from wherever it resides into a Spark dataframe. There are no "connectors" or "loaders" off-the-shelf.

Comparison table

Let's finish with a table summarizing the features of the different libraries:

                                                  GreatExpectations   Pandera   PyDeequ
Checks the extension dimension (values)           Yes                 Yes       Yes
Checks the intension dimension (schema)           Yes                 Yes       Yes
Pandas support                                    Yes                 Yes       No (1)
Spark support                                     Yes                 No        Yes
Multiple data sources (database loaders, etc.)    Yes                 No        No
Data profiling                                    Yes                 Yes       Yes
Constraint/check suggestion                       Yes                 Yes       Yes
Hypothesis testing                                No                  Yes       No
Incremental computation of the checks             No                  No        Yes
Simple anomaly detection                          Yes                 No        Yes
Complex anomaly detection (2)                     No                  No        Yes

  1. Hooqu offers a PyDeequ-like API for Pandas dataframes.
  2. Based on running averages and standard deviations computed incrementally.

Final Notes

So, after all this deluge of information, which library should you use? Well, all of these libraries have their strong points, and the best choice will depend on your goal, the environment you are familiar with, and the sort of checks you want to perform.

For small Pandas-heavy projects, I would recommend Pandera (or Hooqu if you are a brave soul). If your organization is larger, you like Jupyter notebooks, and you do not mind the learning curve, I would recommend Great Expectations as it currently has a lot of traction. If you write your pipelines mostly in (Py)Spark and you care about performance, I would go for (Py)Deequ. Both are Apache-licensed projects, are easy to integrate with your codebase, and will make better use of your Spark cluster.

References

  • [batini09] Carlo Batini, Cinzia Cappiello, Chiara Francalanci & Andrea Maurino, "Methodologies for Data Quality Assessment and Improvement", ACM Computing Surveys, 41(3), 1-52 (2009).
  • [Schelter2018] Schelter, Lange, Schmidt, Celikel, Biessmann & Grafberger, "Automating Large-Scale Data Quality Verification", Proceedings of the VLDB Endowment, 1781-1794 (2018).
  • [niels_bantilan-proc-scipy-2020] Niels Bantilan, "pandera: Statistical Data Validation of Pandas Dataframes", Proceedings of the 19th Python in Science Conference, 116-124 (2020).

Using mypy for Improving your Codebase

TL;DR

In this article I use mypy to document and add static type checking to an existing codebase, and I describe the reasons why I believe using mypy can help in the refactoring and documentation of legacy code while following the Boy Scout Rule.

Intro

We all love Python, it is a multi-paradigm dynamic programming language very popular in Data Science and Machine Learning. Besides some small quirky things in the language, I am quite happy with how it is evolving. However, there are some areas where I thought Python could do better for improving programming productivity in specific contexts:

  • While it is easy to hack scripts together and get something running quickly, maintaining the result becomes an issue once the codebase grows large and complex.
  • Many times while reading other people's code (heck, even my own), and even when it is documented, it is really hard to figure out what a method or function is doing without clear knowledge of the types you are working with. In many cases, having just the type information (e.g. via a simple comment) would make understanding the code a whole lot faster.

I have also spent a lot of time debugging just because the wrong type was passed to a function or method (e.g. the wrong variable was passed, wrong argument order, etc.). Because of Python's dynamic typing, the interpreter and/or linter could not warn me. Plus, some of those errors only became evident at execution time, generally in edge cases.

Although we all like working on greenfield projects, in the real world you will have to work with legacy code, and it will generally be ugly and full of issues. Let's take a look at some Python 2.7 legacy code I have to maintain:

# snippets.py
def get_hotel_type_snippets(self, hotel_type_id, cat_set):
	snippets = self.get_snippets(hotel_type_id, "pos")
	snippets += list(it.chain.from_iterable(
		self.get_snippets(
			rel_cat,
			cat_set[rel_cat].sentiment
		)
		for rel_cat
		in cat_set[hotel_type_id].cat_def.related_cats
		if rel_cat in cat_set and cat_set[rel_cat].sentiment == "pos"
	))
	return snippets[:self.max_snippets]

Don't focus too much on the fact that it has no documentation and forget about the ugly comprehension inside.

In order to understand this code I have to answer the following questions:

  • What type is hotel_type_id? Is it an int?
  • What type is cat_set? It looks like a dictionary containing something else.

These two issues could be fixed with a proper docstring. However, comments sometimes don't contain all the information required, don't include the types of the parameters being passed, or can easily become inconsistent, as the code might have been changed but the comment not updated.

If I want to understand the code I will have to look for its usage, maybe grepping through the code for something called related_cats or sentiment. If you have a large codebase, you might even find many classes implementing the same method name.

I have two choices when I need to modify existing code like this. I can either hack my way around, modifying it just enough to make it do what I want, or I can look for a way to make this code better (i.e. the Boy Scout Rule). Besides adding the needed documentation, it would be cool to have a way to specify the types that could then be used by a static linter.

Enter mypy

Luckily I was not the only one with this problem (or desire), and that's one of the reasons PEP 484 came to life. Its goal is to provide Python with optional type annotations that allow an offline static linter to check for type issues. However, I believe making the code easier to understand (via type documentation) is an awesome side-product.

There is an implementation of this PEP called mypy that was in fact the inspiration for the PEP itself. Mypy provides a static type checker that works in Python 3 (using type annotations) and Python 2.7 (using specially crafted comments).
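
To make the two flavors concrete, here is the same toy function annotated with Python 3 syntax and with a Python 2.7 type comment (this example is mine, not from the codebase below):

from typing import List

# Python 3: annotations are part of the function signature.
def tokenize(text: str) -> List[str]:
    return text.split()

# Python 2.7: the same information goes into a type comment that mypy understands.
def tokenize_py2(text):
    # type: (str) -> List[str]
    return text.split()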

At TrustYou we have a lot of Python 2.7 legacy code that suffers from many of the issues mentioned above, so I decided to give mypy a try in a new project I was working on, and I have to say it helped catch some issues early in the development stage. I also tried it in an existing codebase that, because of its structure, was hard to read.

Let's go back to the example code I shared before and let's document the code using type annotations:

from typing import Any, List, Dict  # noqa: F401
from metaprecomp.tops_flops_bake.category import CategorySet  # noqa: F401

def get_hotel_type_snippets(self, hotel_type_id, cat_set):
	# type: (str, CategorySet) -> List[Dict[str, Any]]

	snippets = self.get_snippets(hotel_type_id, "pos")
	# (...) as before

As you might guess, (str, CategorySet) are the types of the method parameters. What follows -> is the return type, in this example a list of dictionaries from str to Any. Any is a catch-all type. It helps when you don't know the type (in this case, I would have had to read the code further, and I was too lazy) or when the function can return literally any type.

Some notes from the code above:

  • You might have noticed the from typing import Any, ... line. The typing module brings the required types into Python 2.7, even when they are used only in comments. So yeah, you will need to add it to your requirements.txt.
  • I had to explicitly import CategorySet from the category module (even though I only use it in a comment). I find that good, as I am stating that there is a relationship or dependency between those modules.
  • Finally, note the # noqa: F401. This is there to prevent flake8 or pylint from complaining about unused imports. It is not nice, but it is a minor annoyance.

Installing and running mypy

So far we have used mypy syntax (actually PEP 484 - Type Hints) to do some annotation, but all this hassle should bring something to the table besides nifty documentation. So let's install mypy and try the command line.

Running mypy requires a Python 3 environment, so if your main Python environment is 2.7 you will need to install it in a separate one. Luckily, you can call the binary directly (even when your Py27 environment is activated). If you use Anaconda, you can easily create a dedicated environment for mypy:

[miguelc@machine]$ conda create -n mypy python=3.6
(...)
[miguelc@machine]$ source activate mypy
(mypy)[miguelc@machine]$ pip install mypy  # to get the latest mypy
(mypy)[miguelc@machine]$ ln -s `which mypy` $HOME/bin/mypy   # I have $HOME/bin in my $PATH
(mypy)[miguelc@machine]$ source deactivate
[miguelc@machine]$ mypy --help    # this should work

With that out of the way, we can start using mypy executable for checking our source code. I run mypy the following way:

[miguelc@machine]$ mypy --py2 --ignore-missing-imports  --check-untyped-defs  [directory or files]
  • --py2: indicates that the code to check is a Python 2 codebase.
  • --ignore-missing-imports: tells mypy to ignore error messages when imports cannot be resolved, e.g. when they don't exist in the environment mypy is running in.
  • --check-untyped-defs: checks functions but does not fail if the arguments are not typed.

The command line tool provides a lot of options and the documentation is very good. An interesting feature is that it allows you to generate reports that can be displayed using CI tools like Jenkins.

Checking for type errors

Let's take a look at another method I annotated to exemplify the kind of errors you can find using mypy after adding type annotations:

from typing import Any, List, Dict, FrozenSet  # noqa: F401

def get_snippets(
		self, category_id, sentiment,
		pos_contradictory_subcat_ids=frozenset(),
		neg_contradictory_subcat_ids=frozenset()):
		# type: (str, str, FrozenSet[str],  FrozenSet[str]) -> List[Dict[str, str]]

		# (...) not relevant code...

Indeed, another method with no documentation whatsoever, so I had to read a little bit of the code to figure out what the input and return types are. Now let's imagine that somewhere in the code something like this happens:

# bake_reduce.py
cat = 13
# (...)
snippets_generator = SnippetsGenerator(
    snippets_by_cat_sent,
    self.metacategory_bundle[lang]
)
snippets_generator.get_snippets(cat, "pos")

If I run mypy I would get the following error:

[miguelc@machine]$ mypy --ignore-missing-imports  --check-untyped-defs  --py2  metaprecomp/tops_flops_bake/bake_reduce.py
metaprecomp/tops_flops_bake/bake_reduce.py:238: error: Argument 1 to "get_snippets" of "SnippetsGenerator" has incompatible type "int"; expected "str"

If you come from the world of statically typed languages this should look really normal to you, but for Python developers finding an error like this (in particular in large codebases) requires spending quite a bit of time debugging (and sometimes the use of Voodoo magic).

When to use mypy

Optional type annotations are just that: optional. You can start hacking as usual, at the speed that Python's dynamic typing gives you, and once your code is stable enough you can gradually add type annotations to help avoid bugs and to document the code. The mypy FAQ lists some scenarios in which a project will benefit from static type annotations:

  • Your project is large or complex.
  • Your codebase must be maintained for a long time.
  • Multiple developers are working on the same code.
  • Running tests takes a lot of time or work (type checking may help you find errors early in development, reducing the number of testing iterations).
  • Some project members (devs or management) don’t like dynamic typing, but others prefer dynamic typing and Python syntax. Mypy could be a solution that everybody finds easy to accept.
  • You want to future-proof your project even if currently none of the above really apply.

In the particular case of my team, a lot of the code we write ends up running for quite a long time inside of MapReduce (Hadoop) jobs, so being able to detect bugs ahead of time would save precious developer time and make everyone happier.

Adding support to Emacs

By now you might be thinking that it would be cool to integrate mypy checks into your editor. Some, like PyCharm, already support this. For Emacs you can integrate mypy into Flycheck via flycheck-mypy. You can install it via M-x package-install flycheck-mypy. Configuring it is a matter of setting a couple of variables:

(set-variable 'flycheck-python-mypy-executable "/Users/miguel/anaconda2/envs/py35/mypy/mypy")
(set-variable 'flycheck-python-mypy-args '("--py2"  "--ignore-missing-imports" "--check-untyped-defs"))

Mypy recommends disabling all other linters/checkers like flake8 when using it; however, I wanted to keep both at the same time (call me paranoid). In Emacs, you can accomplish this with the following configuration:

(flycheck-add-next-checker 'python-flake8 'python-mypy)

Final words and references

Using mypy won't magically find errors in your code; it will only be as good as the type annotations you add and the way you structure the code. Also, it is not a replacement for proper documentation. Sometimes there are methods/functions that become easier to read just by adding type annotations, but documenting key parts of the code is vital for ensuring code maintainability and extensibility.

I did not mention all the features of mypy, so please check the official documentation to learn more.

There are a couple of talks that can serve as a nice introduction to the topic. The first one is given by Guido van Rossum, who is pushing the project a lot, so I expect mypy to become more popular in the coming years. Happy hacking.

This Site Runs on Nikola

I have changed the site from Jekyll to Nikola, mainly because I am mostly a Python coder nowadays and because the Jekyll version I was using was kind of hacky and full of patches I had created. So let's see if I finally get to write more often. I have been thinking about becoming an Iron Blogger, but it might be too much. If I start blogging more with this setup I will definitely consider it.

I really dig that Nikola offers good support for the org-mode format (which I use to store my notes and other personal information). Actually, this post was written using org-mode from Emacs. It also has good support for blogging with Jupyter Notebooks, which I also use a lot at work, and it comes with capabilities for importing from systems like Wordpress or Jekyll. I do have to change the Disqus URLs to match the new format.

The rest of the post is going to be me trying out some features of Nikola and org-mode in Emacs.

for x in range(1, 10):
    print(x)

print(list(range(0, 9)))

def func(x):
    print("Hola mundo")
    if x > 0:
        print("Fuck")

With org-babel I can run the code inside the post and get the output. I find that pretty neat when blogging and showing code snippets.

Short codes need to be surrounded by special delimiters.

It took me a while to get used to Nikola's style and to properly configure it. I should have written the steps down, but I forgot. However, there are plenty of resources.

Ha-doopla! Tool for displaying Python Hadoop Streaming Errors

Introducing Doopla

At TrustYou we use Luigi a lot for building our data pipelines, mostly made of batch Hadoop MapReduce jobs. We have a couple of clusters: one using a pretty old version of Hadoop, and a more recent one where we use HDP 2.7.

Writing Hadoop MR jobs in Python is quite nice, and it is even more straightforward using Luigi's support. The big issue is when you are developing and you have to debug. Although I try to reduce the amount of in-cluster debugging (for example by using domain classes and writing unit tests against them), sometimes you have no choice.

And then the pain comes. Once your mapper or your reducer fails, most of the time Luigi cannot show you the reason for the failure, and you have to go to the web interface and manually click through many pages until you sort of find the error message, hopefully with enough debugging information.

So after debugging my MR jobs this way for a while I got really annoyed and decided to automate that part, and I created Doopla, a small script that fetches the output (generally stderr) of a failed mapper and/or reducer and, using Pygments, highlights the failing Python code. If no job id is specified, it will fetch the output of the last failed job. It was a two-hour hack at the beginning, so it is not code I am proud of, but I made it public and even sent it to PyPI (a chance to learn something new as well), so it can be installed easily by just running pip install doopla.

It initially only supported our old Hadoop version, but the latest version works with HDP 2.7 (and I guess it might work with other Hadoop versions). Newer versions of Hadoop offer a REST API for querying job status and information, but I kept scraping the information (hey, it is a hack).

You can also integrate that in Emacs (supporting the highlighting and everything) with code like:

And then hit M-x doopla to obtain the same without leaving your lovely editor.

Attending Europython 2016

I attended this year once again and it was pretty amazing. The weather did not help a lot, but both the content and the people were really good. This time I gave a short talk about some of the things I have learned in these couple of years using Python for data processing. The idea was to cover a wide range of topics, so the talk was not a hardcore technical talk but quite basic.

There is also a video available for the curious.

Overall a good experience, with some really good talks; in particular the lightning talk sessions were really entertaining. The keynotes were really well selected and the food was OK. I missed some of the nice sponsors from last year, and Microsoft did not even show up.

Next year it is going to be in Milan, so I am really looking forward to next year's Europython.

Running Luigi Hadoop JobTask in a Virtual Environment

Virtual Environments and Hadoop Streaming

If you are using Python to write Hadoop streaming jobs, you might have experienced the issues of keeping the nodes provisioned with the required packages. Furthermore, if you happen to have different sets of jobs, workflows or pipelines that require different versions of packages, you might find yourself in a not so comfortable situation.

A former work colleague wrote about how to alleviate this by using Python's virtual environments. So I am going to assume you quickly browsed through that article and you are wondering how to do something similar but with Luigi.

Before talking about Luigi, a summary of running streaming jobs with virtualenvs (without Luigi):

Normally, if you don't need a virtualenv, you write a Python script for the mapper and one for the reducer, and assuming you already have the data you need to process on HDFS, you call it with something like this:

[mc@host]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="My Cool MR Job" \
    -files reducer.py,mapper.py \
    -mapper mapper.py \
    -reducer reducer.py

So here mapper.py is the mapper and reducer.py is the reducer. Nothing new if you have used Python for Hadoop streaming. So, let's assume we want a particular module that is not installed at the system level on the nodes of the cluster. Say, spaCy:

[mc@host]$ virtualenv-2.7 demoenv
New python executable in demoenv/bin/python2.7 ... done.
[mc@host]$ source demoenv/bin/activate
(demoenv)[mc@host]$ pip install spacy
Collecting spacy
...
Successfully built plac semver
Installing collected packages: cymem, preshed, murmurhash, thinc, semver, sputnik, cloudpickle, plac, spacy
Successfully installed cloudpickle-0.2.1 cymem-1.31.1 murmurhash-0.26.3 plac-0.9.1 preshed-0.46.3 semver-2.4.1 spacy-0.100.6 sputnik-0.9.3 thinc-5.0.7
(demoenv)[mc@host]$ deactivate
[mc@host]$ virtualenv-2.7 --relocatable demoenv
[mc@host]$ cd demoenv; zip --quiet --recurse-paths ../demoenv.zip *
[mc@host]$ hadoop fs -put -f demoenv.zip

I make the virtualenv relocatable so that it can be moved and both the binaries and libraries are referenced using relative paths. Bear in mind that the documentation also says this feature is experimental and has some caveats, so use it at your own risk. I also compress it and upload it to HDFS.

Now, to run it we need to do two things: change the shebang of the scripts to point to the venv, and point to the archive with the -archives parameter when running the Hadoop streaming job. Assuming we are creating a link to the archive with the name demoenv, we change the beginning of mapper.py and reducer.py:

#!./demoenv/bin/python

import spacy
....

And then we execute:

[mc@host]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="My Cool MR Job" \
    -files reducer.py,mapper.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -archives hdfs://[host]:8020/user/mc/demoenv.zip#demoenv

Note the -archives parameter with the symlink. That symlink has to match the path specified in the shebang.

Running Luigi HadoopJobTask in a Python Venv

So far I have shown nothing new, just a compressed version of Hennig's post. Until now it was impossible to do something similar with Luigi unless you created a JobRunner by basically rewriting (i.e. copying and pasting) some of Luigi's code. So I decided to make a small contribution to Luigi that would allow me to implement something similar to what is described in the previous section.

With that change in Luigi's code, it is easy to create a new base class that pulls the virtual environment location from Luigi's configuration and sets up a runner that passes the parameter to add the archive to the underlying Hadoop streaming call.

I created a VenvJobTask that reads the virtual environment location from the configuration. It can be local or located on HDFS. It overrides the job_runner method to properly set up the Python executable path (so no shebang modification is needed in this case). It references a small custom runner class that changes the default behavior of DefaultHadoopJobRunner to pass the appropriate parameter.

At the time of writing, this change has not yet been released, so it will probably be part of Luigi 2.1.2. In the meantime, you can install Luigi directly from GitHub. I tested the above code on Python 2.7.6 and Hadoop 2.7.1 on Hortonworks HDP 2.3.

Starting Over

The best way to avoid starting over is never to stop doing something. Easy to say, hard to accomplish. The good thing is that this is mostly a personal exercise, as I don't have many readers and I generally put only technical stuff here. Anyway, a lot has changed since the last time I actually wrote something (besides a boring list of links, which I will probably keep doing).

  • I moved to Berlin (it's been a year, amazing experience).
  • I still work for TrustYou, but now remotely.
  • I started a Medium blog, mostly in Spanish.
  • I presented at PyData Berlin 2015. Met Awesome people. Good experience.
  • I attended Europython 2016, my first Python event. Another good experience. Bilbao and the surrounding area are very beautiful.
  • I started playing my accordion more often. Now I take classes of the Colombian style via Skype with Luis Javier. An awesome teacher.
  • I traveled to Colombia for a couple of months. I worked 6 weeks from there. (Does that make me a "Digital Nomad"?).
  • I gave a talk in Colombia (in Spanish) about my work. Another highlight from the trip.
  • I got into ethnomusicology, in particular Gaita music and the traditional tropical rhythms of Colombia. I still need to read some books by Peter Wade.
  • I bought some Gaita Colombiana flutes (Kuisi in Kogi or Chuanas in Zenu language) from a good instrument maker in the proximity of Sahagún (Colombia). I plan to learn how to play them, at least a couple of songs.

Berlin is way different from Munich. Munich is clean, beautiful and organized. It is close to the Alps and perfect for doing things outdoors, when the weather allows. The tech scene is OK in Munich, but not that crazy. Although poorer, Berlin has a more dynamic startup environment and way more events and things to do inside the city. It has lots of interest groups, both around technology and hobbies. It is way more multicultural, so you can find events of many types. There are huge scenes for different kinds of music and art. In particular, I am very interested in the art-and-tech, music technology and maker scenes and events of the city. There are some things I miss about Munich, but I am pretty happy with the change. I think I will stay in this city for some years.

Regarding this blog, I am going to try to keep posting my link logs, but I will also try to write more about what I do day to day, mostly technical stuff. I will write on the Medium blog, in Spanish, when posts are not so technical.

Linklog - Week 20

This blog post was in the making for a while. I think I got distracted by many things (including a move to Berlin, more on that later). I have also started writing some posts on Medium; I will continue using it mostly for non-technical stuff.

For now, let's just publish this:

Burda Hackday - The Future of Finance

I participated in the Burda Hackathon about the future of finance. Here's my report on the experience.

Colombian Salaries Survey.

A bit of self-promotion here. Famous Colombian entrepreneur Alexander Torrenegra recently released the results of a survey about the salaries of Colombian software developers. I wrote an article on Medium that uses the data in a slightly different way.

Mining Massive Data Sets

This year I started some very interesting MOOCs, though I couldn't complete all of them. However, I did finish Mining Massive Data Sets from Coursera and got the certificate. I have to say that this is one of the best online courses I have taken so far. The diversity and usefulness of the topics are high, and it made me look at topics I had previously seen in a different way. The only thing I did not like about the course is Prof. Ullman. I mean, he is great, but his videos were the most "heavy" and the least entertaining. However, the overall course structure and videos are great. I would recommend this course to anyone wanting to dive deep into massive data processing. A background in machine learning helps.

What the F**K

I found this nice tool called thefuck that tries to match a wrong command to a correct one using your command history. Basically if after writing a wrong command line you write fuck it will try to automatically fix it.

Pig DataFu

Lately I have been working with Pig, a dataflow programming language that is really useful for juggling data. I have been writing UDFs and playing with some of its advanced functionality. At TrustYou we use a library that I did not know before, called DataFu, which contains a lot of useful utilities.

Random Links

  • I finally watched the GapMinder presentation. Really interesting how the world has changed and how we still hold somewhat old world views.
  • This GitHub repo contains interesting Notebooks, some of them extracted from nice books.
  • O'Reilly launched https://beta.oreilly.com/ideas

Linklog - Week 11

Machine Learning / Data Science Links

A couple of links to interesting NLP-related classes from the Stanford NLP group (these are not MOOCs, but some materials, maybe videos, are going to be posted online).

Other random links:

Data Journalism

A really nice tutorial on how to do data journalism (in German though): http://howto.ddjdach.de/. On the right side there are various sidebars with links to nice tools that help in different stages of the data journalism (and, to some extent, data science) process.

Download Music from Soundcloud

We could discuss all day the ethics of downloading a song from Soundcloud. However, sometimes that is what you want to do. I found a little free application for Mac written by a Swiss guy.

http://black-burn.ch/scd/

You can thank me later.

Nu-Cumbia and Alternative-Folk

Talking of Soundcloud, I am creating a couple of music lists. One is of Cumbia remixes/covers of famous pop songs (yeah, that thing actually exists). The other is what I call Colombian Alternative-Folk: Colombian groups that play Colombian folk music (there are many genres and subgenres) but with a twist.

Delete Trailing Whitespace on Emacs (for Python)

I had issues keeping my Python source files free of trailing whitespace. After a bit of research I came up with my own version (just for Python files, as the global setting was affecting other modes like Deft/Org).

(defun py-delete-whitespace ()
  (when (eq (buffer-local-value 'major-mode (current-buffer)) 'python-mode)
    (delete-trailing-whitespace)))

(add-hook 'before-save-hook 'py-delete-whitespace)

Learning a bit of e-lisp every day.

Spacemacs / Prelude Emacs Distributions

Apparently the latest trend is to use neither Emacs nor Vim but a mixture of them. I found an interesting Emacs distribution, Spacemacs, that aims to provide Vim-like functionality while keeping the power of Emacs. I also found another interesting distribution called Prelude, which seems to bring a batteries-included distribution of Emacs. I am always looking for ways to improve my workflow, so I am going to try out both and decide whether to stay with my current Emacs configuration or switch to one of these distributions.

Mapfrappe

I found a nice tool called MAPfrappe to "compare" maps and create cartographic "mixtures".

Munich on Berlin

Berlin on Munich

Linklog - Week 9

Open Data and Civic Apps

  • I found an interesting company called DataMade, from Chicago, a civic technology company working on projects related to open data.
  • They are also the creators of Dedupe, a Python project to perform data deduplication. It uses a similar approach to the one we use for hotel deduplication @TrustYou.

Colombian Software Developers

Recently two interesting articles about software developers in Colombia were written:

The first one is from Alexander Torrenegra and the other was written by Juan Pablo Buritica. Juan Pablo rebuts some of Alexander's claims. I think both make interesting points, and they have the experience and the authority to talk about the subject. I share Alexander's opinion a bit more, as I think Computer Science education in Colombia is not good and needs an overhaul. I would love to write a small piece on my experience, as I graduated from a top Colombian university and I believe the education on computer-science-related topics is precarious.

Social Network Extravaganza

Thanks to the Mining Massive Data Sets class on Coursera I started getting curious about social network analysis. I wanted to explore a bit more, so I downloaded my friends dataset from Facebook and imported it into Gephi, a tool for graph visualization and analysis.

Gephi interface.

Below are a couple of interesting links for people who want to do the same:

Install Gephi on Mac OSX Yosemite

There are some issues with installing Gephi on the latest version of OSX. This helped me solve them: http://sumnous.github.io/blog/2014/07/24/gephi-on-mac/

Extract your Friend Network in Facebook

A nice tool that helped me extract my Facebook friend graph in a Gephi-compatible file format: http://snacourse.com/getnet

Storify

https://storify.com/ I found this tool really nice for keeping a log of events and adding social-network-based stories.

SimString

http://www.chokkan.org/software/simstring/ I have been taking the MMDS class on Coursera. Challenging but super interesting. I wanted to use LSH to do a sort of hashing of near-duplicate short sentences. It is a small dataset, so I believe this algorithm will actually work better.

Delete Trailing Whitespace on Save and Compact Empty Lines on Emacs

http://ergoemacs.org/emacs/elisp_compact_empty_lines.html Awesome snippet, especially for Python development on Emacs.