Miguel Cabrera’s Blog

Databricks Berlin User Group: A Recap and a Surprise

2026-04-29T09:00:00+00:00

Last week I gave my first technical talk in years at the Databricks Berlin User Group. I want to write down a few things while they are still fresh, because they were not what I expected.

The talk

Databricks UG Berlin · 2026-04-28 · The takeaway slide

The premise: at Plato we ship 25+ ML algorithms across 50+ wholesale-distributor tenants, and we do it without hand-rolling deployments. The plumbing under that is Databricks Asset Bundles (recently rebranded to Declarative Automation Bundles, or DABs) plus a small generator we wrote called dabgen, plus a Claude Code skill that drives the onboarding end-to-end.

Title was “From Days to Minutes: How We Taught an AI to Onboard 50+ Tenants on our AI Features.” Slides are on Speaker Deck.

It went well. The food was great. The conversations after were better.

The surprise

I had built the talk assuming most of the room was already using DABs in production, and most engineers in the room were using AI coding assistants daily. The whole framing was “advanced workflow tricks.” Stuff like: how to layer overrides cleanly, where to draw the line between Jinja templates and runtime config, what a CI/CD pipeline looks like when an AI is the one writing your tenant configs.

Turns out fewer people than I thought were actually using DABs in production. Same story with the AI coding wave (Claude Code, Codex, Cursor, the whole stack). I was talking like these were table stakes. They are not, yet.

That changed how I read my own talk afterwards. For a lot of the audience, the value was not in the specific tricks. It was in seeing that this stuff is real, and that small teams are running it, and that the resulting workflow is calmer than the one they have today. Good signal for the next version of the talk. Less “here is the advanced pattern,” more “here is what a working setup actually looks like, and why the boring parts matter.”

Two questions worth writing down

Two of the questions during Q&A are still in my head, because the answers are the kind of thing I had not bothered to articulate before someone asked.

“Why skills and not just scripts?” This came up because half of what a skill ends up doing is “run a thing, parse the output, decide what to do next.” Which is what scripts do. So why the indirection?

The honest answer is that a script encodes one path. A skill encodes a capability: the description, the inputs it expects, the failure modes it knows about, the tools it composes. When something goes sideways (and at 50 tenants, something is always going sideways), a script gives up at the first unhandled case. A skill negotiates: it tries an alternate path, asks the operator for a confirmation, drops back to a fallback tool. The five-tier scaffolding around the skill is what makes that negotiation actually go somewhere instead of into a loop.

The TL;DR I gave at the meetup: scripts are great when the world is fixed. Skills earn their keep when the world is messy and you want the agent to keep going. (More on this in a follow-up.)

“MCP server or CLI tool?” This one I have a strong opinion on. I prefer CLIs.

Two reasons. First, every MCP roundtrip is tokens. Tool definitions, schemas, the wrapper boilerplate. It adds up fast on long tasks, and on a tenant onboarding the agent is spending most of its budget on glue, not on actually thinking about your problem. A python query_databricks.py "..." call is one tool invocation, one stdout, done. Specifically the Databricks SQL MCP is the one we kept reaching for, and the one a small query_databricks.py script replaces nicely. Second, CLIs degrade better. When the hosted MCP service has a hiccup mid-flow (which has happened to us in production), the agent that also knows how to invoke the local CLI finishes the job. The one that only knows the MCP gets stuck. So our preference is: build the CLI first, expose it as an MCP later if it earns the convenience tax.
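To make the "one tool invocation, one stdout" idea concrete, here is a minimal sketch of what such a query CLI could look like. It is not the actual query_databricks.py: an in-memory SQLite database stands in for the Databricks SQL connection, and the names run_query and main are assumptions made for illustration.

```python
import argparse
import json
import sqlite3
import sys


def run_query(sql, connection):
    """Run one SQL statement and return the rows as a list of dicts."""
    cursor = connection.execute(sql)
    columns = [col[0] for col in cursor.description or []]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def main(argv=None, connection=None):
    parser = argparse.ArgumentParser(description="Run one SQL query, print JSON rows.")
    parser.add_argument("sql", help="SQL statement to execute")
    args = parser.parse_args(argv)

    # Assumption: SQLite stands in for the real warehouse connection.
    connection = connection or sqlite3.connect(":memory:")
    try:
        rows = run_query(args.sql, connection)
    except sqlite3.Error as exc:
        # Errors go to stderr with a non-zero exit code, so the caller
        # (human or agent) can tell a failure from an empty result.
        print(f"query failed: {exc}", file=sys.stderr)
        return 1
    json.dump(rows, sys.stdout)
    sys.stdout.write("\n")
    return 0
```

Calling sys.exit(main()) under the usual if __name__ == "__main__" guard would turn this into a script. The agent then only needs to know one command line and how to read JSON from stdout.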

I want to write up the MCP-vs-CLI argument properly, because it cuts against the default advice you’ll see, and it has cost-and-reliability evidence behind it. That’s on the queue.

The personal part

This was my first talk since before corona. Five-ish years of public-speaking rust. I had forgotten how much I miss the feeling of a real audience asking real questions, instead of typing into a void on LinkedIn.

What’s next

I am going to expand a few of the bits I had to cut from the slides into separate posts. The current shortlist:

Knowledge scaffolding for AI agents. The five-tier pattern (data model → CLAUDE.md → rules → skills → tools) we use to make a coding agent productive in a real production codebase. This is the most stealable idea from the talk and the one I get the most questions about.
Generator of generators. What dabgen actually does, and why “the same Jinja template renders the bundle template and the bundle” turned out to be the right design.
MCP vs CLI: a token and reliability argument. The longer version of the answer above. Why we default to CLIs, when MCP is worth the cost, and what we measured.
Tool-teaching beats prompt-engineering. When a hosted MCP drops mid-flow, the agent that also knows how to call your fallback Python script will finish the job. The one that just had a great prompt will not.

If you run a Databricks or AI engineering event in Europe and want a longer version of any of this, you can find me on LinkedIn.

The slides

From Days to Minutes · Databricks UG Berlin · April 2026

Data Verification for Machine Learning - A Review of DataFrame Validation Libraries

2021-10-21T07:27:47+00:00

TL;DR

In this blog post, I review some interesting libraries for checking the quality of the data using Pandas and Spark data frames (and similar implementations). This is not a tutorial (I was actually trying out some of the tools while I wrote) but rather a review of sorts, so expect to find some opinions along the way.

Intro - Why Data Quality?

Data quality might be one of the areas data scientists tend to overlook the most. Why? Well, let’s face it: it is boring, and most of the time data validation is cumbersome to perform. Furthermore, sometimes you do not know if your effort is going to pay off. Luckily, some libraries can help with this laborious task and standardize the process in a data science team or even across an organization.

But first things first. Why would I choose to spend my time doing data quality checks when I could spend it writing some amazing code that trains a bleeding-edge deep convolutional logistic regression? Here are a couple of reasons:

It is hard to ensure data constraints in the source system. Particularly true for legacy systems.
Companies rely on data to guide business decisions (forecasting, buying decisions), and missing or incorrect data affect those decisions.
ML systems are increasingly fed with this data, and they are often highly sensitive to it, because the deployed model relies on assumptions about the characteristics of its inputs.
Subtle errors introduced by changes in the data can be hard to detect.

Data Quality Dimensions

The quality of the data can refer to the extension of the data (data values) or to the intension (not a typo) of the data (schema).

Extension Dimension

Extracted from Schelter et al. (2018):

Completeness: The degree to which an entity includes the data required to describe a real-world object. It is usually measured by the presence of null (missing) values, and what counts as required depends on context.

Example: Notebooks might not have the shirt_size property.

Consistency: The degree to which a set of semantic rules is violated.
- Valid range of values (e.g. sizes {S, M, L}).
- Intra-relation constraints, e.g. if the category is “shoes” then the sizes should be in the range 30-50.
- Inter-relation constraints, which may involve multiple tables and columns, e.g. product_id may only contain entries from the product table.
Accuracy: The correctness of the data, which can be measured in two ways, syntactic and semantic.
- Syntactic: Compares the representation of a value with a corresponding definition domain.
- Semantic: Compares a value with its real-world representation.

Example: blue is a syntactically valid value for the column color (even if the product is actually red, in which case it is not semantically accurate). XL would be neither semantically nor syntactically accurate.
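As a rough illustration of the consistency rules above, the following sketch checks a valid-value set and the intra-relation constraint for shoes in plain Python. The product records and the function name are made up for this example.

```python
VALID_SIZES = {"S", "M", "L"}


def consistency_violations(products):
    """Return (row_index, reason) pairs for rows that break a rule."""
    violations = []
    for index, product in enumerate(products):
        category, size = product["category"], product["size"]
        if category == "shoes":
            # Intra-relation constraint: shoe sizes are numeric, 30 to 50.
            if not (isinstance(size, int) and 30 <= size <= 50):
                violations.append((index, "shoe size outside 30-50"))
        elif size not in VALID_SIZES:
            # Valid range of values for every other category.
            violations.append((index, "size not in {S, M, L}"))
    return violations


# Hypothetical sample records
products = [
    {"category": "shirts", "size": "M"},
    {"category": "shirts", "size": "XL"},
    {"category": "shoes", "size": 42},
    {"category": "shoes", "size": 12},
]
print(consistency_violations(products))
```

The second and fourth records are reported, the other two pass.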

Most of the data quality libraries I am going to explore focus on the extension dimension. This is particularly important when the ingested data comes from semi-structured or non-curated sources. The extension of the data is also where the richest set of checks can be done: checking the schema only verifies that a field is of a certain type, but not additional logic such as which values are valid for a string field.

Libraries

The following are the libraries I will quickly evaluate. The idea is to show how writing quality checks works in each of them and to describe a bit of the workflow. I selected these libraries because they are the ones I have been using, reading about, or seeing at conferences. If there is a library that you think should make the list, please let me know in the comment section.

Great Expectations
Pandera
Deequ/PyDeequ

Sample Data

I will use a sample dataset to exemplify how different libraries will check similar properties:

import pandas as pd
df = pd.DataFrame(
       [
           (1, "Thingy A", "awesome thing.", "high", 0),
           (2, "Thingy B", "available at http://thingb.com", None, 0),
           (3, None, None, "low", 5),
           (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
           (5, "Thingy E", None, "high", 12),
       ],
       columns=["id", "productName", "description", "priority", "numViews"]
)

id	productName	description	priority	numViews
1	Thingy A	awesome thing.	high	0
2	Thingy B	available at http://thingb.com	None	0
3	None	None	low	5
4	Thingy D	checkout https://thingd.ca	low	10
5	Thingy E	None	high	12

Things that I will check on this toy data:

there are 5 rows in total.
values of the id attribute are never Null/None and unique.
values of the productName attribute are never null/None.
the priority attribute can only contain “high” or “low” as value.
numViews should not contain negative values.
at least half of the values in description should contain a url.
the median of numViews should be less than or equal to 10.
the content of the productName column matches the regex r'Thingy [A-Z]+'.
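Before looking at the libraries, it can help to see what these checks amount to without any of them. The sketch below evaluates the list above in plain Python on the same rows. Two assumptions are baked in: null values are skipped for the value-set and regex checks, and the url ratio is computed over all rows, nulls included.

```python
import re
import statistics

rows = [
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", None, 0),
    (3, None, None, "low", 5),
    (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
    (5, "Thingy E", None, "high", 12),
]
ids, names, descriptions, priorities, views = (list(col) for col in zip(*rows))

url = re.compile(r"https?://\S+")
name_pattern = re.compile(r"Thingy [A-Z]+")
url_ratio = sum(bool(d and url.search(d)) for d in descriptions) / len(rows)

results = {
    "row count is 5": len(rows) == 5,
    "id is never null and unique": None not in ids and len(set(ids)) == len(ids),
    "productName is never null": None not in names,
    "priority is high or low": all(p in {"high", "low"} for p in priorities if p is not None),
    "numViews is never negative": all(v >= 0 for v in views),
    "at least half of description contains a url": url_ratio >= 0.5,
    "median of numViews is at most 10": statistics.median(views) <= 10,
    "productName matches the regex": all(name_pattern.fullmatch(n) for n in names if n is not None),
}
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

On the toy data two checks fail: productName contains a null, and only two of the five descriptions contain a url.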

Great Expectations

Calling Great Expectations (GE) a library is a bit of an understatement. It is a full-fledged framework for data validation that leverages existing tools like Jupyter Notebook and integrates with several data stores, both for validating the data that originates from them and for storing the validation results.

The main concept of Great Expectations is, well, expectations, which, as the name indicates, run assertions on the expected values of a particular column.

The simplest way to use GE is to wrap the dataframe or data source with a GE DataSet and quickly check individual conditions. This is useful for exploring the data and refining the data quality check.

import great_expectations as ge

# Wrap the pandas dataframe so that the expect_* methods become available
ge_df = ge.from_pandas(df)
ge_df.expect_table_row_count_to_equal(5)
ge_df.expect_column_values_to_not_be_null("id")
ge_df.expect_column_values_to_not_be_null("description")
ge_df.expect_column_values_to_be_in_set("priority", {"high", "low"})
# Only a lower bound is given: numViews must be >= 0
ge_df.expect_column_values_to_be_between("numViews", 0)
# Each call returns a validation result; print one to inspect it
print(ge_df.expect_column_median_to_be_between("numViews", 0, 10))

If run interactively in a Notebook, each expectation returns a JSON representation of the expectation, as well as some metadata regarding the values and whether the expectation failed:

{
  "expectation_config": {
    "meta": {},
    "expectation_type": "expect_column_median_to_be_between",
    "kwargs": {
      "column": "numViews",
      "min_value": 0,
      "max_value": 10,
      "result_format": "BASIC"
    }
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": 5.0,
    "element_count": 5,
    "missing_count": null,
    "missing_percent": null
  }
}

However, this is not the optimal way to use GE. The documentation states that it is better to properly configure the datasets and generate a standard directory structure. This is done through a Data Context and requires some scaffolding and generating some files using the command line:

[miguelc@machine]$ great_expectations --v3-api init

Using v3 (Batch Request) API

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-<
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~

Let's configure a new Data Context.

First, Great Expectations will create a new directory:

    great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- notebooks
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- documentation
 (...)

Basically, the process goes as follows:

Generate the directory structure (using for example the command above).
Generate a new data source. This opens a Jupyter notebook where you configure the data source and store the configuration in great_expectations.yml.
Create the expectation suite from the built-in expectations, again using Jupyter Notebooks. The expectations are stored as JSON in the expectations directory. A nice way to get started is to use the automated data profiler, which examines the data source and generates the expectations.
Once you execute the notebook, the data docs are shown. Data docs show the result of the expectations and other metadata in a nice HTML format that can be useful to learn more about the data.

Once you have created the initial set of expectations you can edit them using the command great_expectations --v3-api suite edit articles.warning. You will have to choose whether you want to interact with a batch (sample) of data or not. This also opens a Notebook where, depending on your choice, you can edit the existing expectations in slightly different ways.

Now that you have your expectations set up, you can use them to validate a new batch of data. For that, you need to learn one more concept: Checkpoints. A Checkpoint bundles Batches of data with corresponding Expectation Suites for validation. To create a checkpoint you need, you guessed right, another nice command line and another Jupyter Notebook.

[miguelc@machine]$ great_expectations --v3-api checkpoint new my_checkpoint

If you execute the above command, it will open a Jupyter Notebook where you can then configure a bunch of stuff using YAML. The key idea here is that with this Checkpoint you link an expectation_suite with a particular data asset coming from a data source.

Optionally, you can run the checkpoint (the full expectation on the data source) and see the results on the already familiar data_docs interface.

As for deployment, one pattern would be to run the checkpoint as a task in some sort of workflow manager (such as Airflow or Luigi). You can also run Checkpoints programmatically using Python or straight from the terminal.

I recently found out that if you use dbt, you get GE installed by default, and that it can be used to extend the unit tests of the SQL queries you write.

The Good

Interactive validation and expectation testing. The instant feedback helps to refine and add checks for data.
When an expectation fails, you get a sample of the data that makes the expectation fail. This is useful for debugging.
It is not limited to pandas data frames: it comes with support for many data sources, including SQL databases (via SQLAlchemy) and Spark dataframes.

The not so good

Seems heavy and full of things. Getting started might not be as easy as there are many concepts to master.
Although it might seem natural for many potential users, the coupling with Jupyter Notebook/Lab might make some uncomfortable.
Expectations are stored as JSON instead of code.
They received some funding recently and are changing many of the existing (and already numerous) concepts and APIs, making the whole process of learning even more challenging.

Pandera

Pandera is “statistical data validation for pandas”. Using Pandera is simple: after installing the package, you define a Schema object where each column has a set of checks. Columns can optionally be nullable. That is, checking for nulls is not a check per se but a characteristic of a column.

import pandas as pd
import pandera as pa

df = pd.DataFrame(
       [
           (1, "Thingy A", "awesome thing.", "high", 0),
           (2, "Thingy B", "available at http://thingb.com", None, 0),
           (3, None, None, "low", 5),
           (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
           (5, "Thingy E", None, "high", 12),
       ],
       columns=["id", "productName", "description", "priority", "numViews"]
)

schema = pa.DataFrameSchema({
    "id": pa.Column(int, nullable=False),
    "description": pa.Column(str, nullable=False),
    "priority": pa.Column(str, checks=pa.Check.isin(["high", "low"]), nullable=True),
    "numViews": pa.Column(int, checks=[
        pa.Check.greater_than_or_equal_to(0),
        pa.Check(lambda c: c.median() >= 0 and c.median() <= 10)
        ]
    ),
    "productName": pa.Column(str, nullable=False),

})

validated_df = schema(df)
print(validated_df)

If you run the validation an exception will be raised:

Traceback (most recent call last):
  File "", line 26, in 
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 648, in __call__
    return self.validate(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 594, in validate
    error_handler.collect_error("schema_component_check", err)
  File ".../lib/python3.9/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 586, in validate
    result = schema_component(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 1826, in __call__
    return self.validate(
  File ".../lib/python3.9/site-packages/pandera/schema_components.py", line 214, in validate
    validate_column(check_obj, column_name)
  File ".../lib/python3.9/site-packages/pandera/schema_components.py", line 187, in validate_column
    super(Column, copy(self).set_name(column_name)).validate(
  File ".../lib/python3.9/site-packages/pandera/schemas.py", line 1720, in validate
    error_handler.collect_error(
  File ".../lib/python3.9/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: non-nullable series 'description' contains null values: {2: None, 4: None}

The code looks similar to other data validation libraries (e.g. Marshmallow). Also, compared to GE, the library offers the Schema abstraction, which you might or might not like.

With Pandera, if a check fails, it will raise a proper exception (you can disable this and turn it into a RuntimeWarning). Depending on how you want to integrate the checks into the larger pipeline, this might be useful or plainly annoying. Furthermore, if you look closely, Pandera only reports one failing column as the cause of the error, although more than one column does not comply with the specification.

Given that this is a Python library, it is relatively easy to integrate into any existing pipeline. It can be a task in Luigi/Airflow, for example, or something that runs as part of a larger task.

The Good

Familiar API based on schema checking that makes the library easy to get started with.
Support for hypothesis testing on the columns.
Data profiling and recommendation of checks that could be relevant.

The not so good

Very few checks included under the pa.Check class
The message is not very informative if the check is done through a lambda function.
Errors during the checking procedure will raise a run-time exception by default.
It apparently only works with Pandas; it is not clear if it would work with other dataframe implementations or with Spark.
I did not find a way to test for properties on the size of the dataframe or to do comparisons across different runs (i.e. the number of rows should not decrease between runs of the check).

Deequ/PyDeequ

Last but not least, let us talk about Deequ. Deequ is a data checking library written in Scala and targeted at Spark dataframes. It aims to check large datasets, making use of Spark optimizations to run in a performant manner. PyDeequ, as the name implies, is a Python wrapper offering the same API for PySpark.

The idea behind Deequ is to create “unit tests for data”. To do that, Deequ calculates Metrics through Analyzers, and assertions are verified against those metrics. A Check is a set of assertions to be checked. One interesting feature of (Py)Deequ is that it can compare metrics across different runs, which allows assertions on changes in the data (e.g. an unexpected jump in the number of rows of a dataframe).

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

# Assumes `spark` is an active SparkSession and `df` is a Spark dataframe
check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check
        .hasSize(lambda sz: sz == 5)  # we expect 5 rows
        .isComplete("id")  # should never be None/Null
        .isUnique("id")  # should not contain duplicates
        .isComplete("productName")  # should never be None/Null
        .isContainedIn("priority", ["high", "low"])
        .isNonNegative("numViews")
        # at least half of the descriptions should contain a url
        .containsURL("description", lambda d: d >= 0.5)
        # half of the items should have less than 10 views
        .hasApproxQuantile("numViews", 0.5, lambda v: v <= 10)
    )
    .run()
)

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

After calling run, PyDeequ will compute some metrics on the data. Afterwards it invokes your assertion functions (e.g., lambda sz: sz == 5 for the size check) on these metrics to see if the constraints hold on the data. The metrics calculated can be stored in a MetricRepository (e.g. S3 or disk) for future reference and to make comparison between metrics of different runs.
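The two-step flow described above (compute a metric first, then apply the user-supplied assertion to it) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Deequ is implemented; the function names are made up.

```python
def size(rows):
    """Metric: number of rows."""
    return len(rows)


def completeness(column):
    """Build a metric: fraction of rows where `column` is not None."""
    def metric(rows):
        return sum(row[column] is not None for row in rows) / len(rows)
    return metric


def run_checks(rows, checks):
    """Compute each metric, then apply the assertion to the metric value."""
    report = []
    for name, metric, assertion in checks:
        value = metric(rows)
        report.append((name, value, bool(assertion(value))))
    return report


rows = [
    {"id": 1, "productName": "Thingy A"},
    {"id": 2, "productName": "Thingy B"},
    {"id": 3, "productName": None},
    {"id": 4, "productName": "Thingy D"},
    {"id": 5, "productName": "Thingy E"},
]
checks = [
    ("hasSize", size, lambda sz: sz == 5),
    ("isComplete(id)", completeness("id"), lambda c: c == 1.0),
    ("isComplete(productName)", completeness("productName"), lambda c: c == 1.0),
]
report = run_checks(rows, checks)
for line in report:
    print(line)
```

Keeping the metric value separate from the pass/fail verdict is what makes it possible to store the metrics and compare them later.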

(Py)Deequ allows for differential calculations of metrics, that is, the metrics calculated for a dataset can be updated when the data increases without having to recalculate the metrics from the whole dataset.
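A small sketch of why this can work: if a metric is derived from a mergeable state, the state for new rows can be combined with the stored state instead of rescanning everything. The class below does this for the mean; MeanState is a made-up name for this example and not part of the (Py)Deequ API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanState:
    """Sufficient statistics for the mean: running total and row count."""
    total: float = 0.0
    count: int = 0

    @classmethod
    def from_values(cls, values):
        values = list(values)
        return cls(total=sum(values), count=len(values))

    def merge(self, other):
        # Combining two states never needs the original rows.
        return MeanState(self.total + other.total, self.count + other.count)

    def metric(self):
        return self.total / self.count if self.count else float("nan")


state = MeanState.from_values([0, 0, 5, 10, 12])     # first batch
state = state.merge(MeanState.from_values([3, 40]))  # only the new rows
print(state.metric())
```

The merged state is identical to the one computed from all seven values at once, so the metric is the same as a full recomputation.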

Another unique feature of (Py)Deequ is anomaly detection. Whereas Great Expectations allows for single thresholds, (Py)Deequ allows for checks based on a running average and standard deviation of the calculated metrics.

Similar to Pandera, PyDeequ is easy to integrate into your existing code base as it is PySpark/Python code.
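To give a feel for a check based on the average and standard deviation of past metrics, here is a minimal sketch in plain Python. It is a generic deviation rule with a hypothetical function name and threshold, not the strategy (Py)Deequ actually ships.

```python
import statistics


def is_anomalous(history, new_value, max_deviations=3.0):
    """Flag new_value if it lies more than max_deviations standard
    deviations away from the mean of the metric in earlier runs."""
    if len(history) < 2:
        return False  # not enough runs to estimate the spread
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        return new_value != mean
    return abs(new_value - mean) > max_deviations * spread


# Hypothetical row counts stored from five earlier runs
row_counts = [1000, 1020, 990, 1010, 1005]
print(is_anomalous(row_counts, 1015))
print(is_anomalous(row_counts, 2000))
```

A count of 1015 is within the usual variation, while a jump to 2000 is flagged. Unlike a fixed threshold, the acceptable range moves with the history of the metric.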

Deequ for Pandas DataFrames

You might be wondering if you can use (Py)Deequ with Pandas. Sadly, it is not possible. However, almost a year ago I developed an experimental port of Deequ to Pandas, which I called Hooqu. Due to personal constraints I haven’t been able to maintain it, but it is still functional (albeit by using a lot of Pandas hacks) and you can install it via pip.

The Good

Use PySpark to parallelize otherwise expensive checks.
Support for external metric repositories.
Data profiling.
Constraint suggestion.

The not so good

This is not a pure Python project, rather a wrapper over a Scala/Spark library, and thus the code might not look pythonic.
It only makes sense to use it if you are already using a (Py)Spark cluster.
It is your responsibility to load the data from wherever it resides into a Spark dataframe. There are no “connectors” or “loaders” off-the-shelf.

Comparison table

Let’s finish with a table summarizing the features of the different libraries:

Feature	GreatExpectations	Pandera	PyDeequ
Checks Extension dimension (Values)	✓	✓	✓
Checks the intension dimension (Schema)	✗	✓	✗
Pandas support¹	✓	✓	✗
Spark support	✓	✗	✓
Multiple data sources (Database loaders, etc.)	✓	✗	✗
Data Profiling	✓	✗	✓
Constraint/Check Suggestion	✓	✗	✓
Hypothesis Testing	✗	✓	✗
Incremental computation of the checks	✗	✗	✓
Simple Anomaly Detection	✓	✗	✓
Complex Anomaly Detection²	✗	✗	✓

¹ Hooqu offers a PyDeequ-like API for Pandas dataframes.
² Using running averages and standard deviation of incremental computation.

Final Notes

So, after all this deluge of information, which library should I use? Well, all these libraries have their strong points, and the best choice will depend on your goal, the environment you are familiar with, and the sort of checks you want to perform.

For small Pandas-heavy projects, I would recommend using Pandera (or Hooqu if you are a brave soul). If your organization is larger, you like Jupyter notebooks, and you do not mind the learning curve, I would recommend Great Expectations as it currently has a lot of traction. If you write your pipelines mostly in (Py)Spark and you care about performance, I would go for (Py)Deequ. Both Deequ and PyDeequ are Apache-licensed, are easy to integrate with your codebase, and will make better use of your Spark cluster.

Testing Spark tasks with PyTest, Mock and Luigi

2017-09-17T12:18:52+00:00

TL;DR

In this blog post I describe briefly how to test PySpark tasks using a combination of Luigi, PyTest and Mock.

Intro

At TrustYou we have a lot of Hadoop jobs. Most of them are Hadoop streaming jobs written in Python (and some are written in Pig). One of the things that bothered me a lot about working this way is that testing can become complicated, as simulating the cluster setting imposes some restrictions.

Although it is not the only reason, the complexity of testing these kinds of processing pipelines might contribute to skipping tests altogether, mostly under the belief that testing is not needed or not worth the effort. The trickiest part is that a problem in one part of a data processing pipeline might only become evident in a downstream stage, making debugging difficult.

Luckily, Spark and PySpark make testing simpler, as they allow running a Spark application in a local cluster while making all the high-level abstractions, such as DataFrames, available. This can be combined with Pytest, Pytest fixtures and Luigi.

PySpark Tasks with Luigi

Let’s start with the basics of how to run a PySpark job with Luigi. Luigi has the concept of a Task, which is basically a step in a data pipeline, for example dumping data from a database or running a MapReduce job. To run a Spark job you simply need to set the Spark configuration in the Luigi configuration file (luigi.cfg) and create a class that inherits from luigi.contrib.spark.PySparkTask:

from luigi.contrib.spark import PySparkTask
from luigi.contrib.hdfs import HdfsTarget

class SamplePySparkTask(PySparkTask):
    # Spark options can be set as class attributes
    driver_memory = '4g'
    executor_memory = '16g'
    num_executors = 8
    executor_cores = 2

    def main(self, sparkContext):
        # This is where you implement the job
        pass

    def output(self):
        # After executing main, this file should exist for the task to be considered complete
        return HdfsTarget('myresult.txt')

    def requires(self):
        # This should return either a task or a required file.
        pass

Above is the basic structure of a task. The method main receives the Spark context as an argument. For Luigi it does not matter what we do with the context as long as we have the output declared in the output method.

Now let’s test, for example, a task that loads a CSV with the following structure.

TODO: find out how make a good looking table with Bootstrap

Our little Spark task will group by user, compute the average, and output the result as a JSON Lines file using the following format.

{
  "customer": "Mario X.",
  "month": "June",
  "average": 123.42
}
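For a test, it is handy to have the expected result computed independently of Spark. The sketch below does the aggregation in plain Python so that a test can compare it with the task output. The input structure is not shown above, so the field names customer, month and amount, and the grouping by customer and month, are assumptions derived from the output format.

```python
import json
from collections import defaultdict


def average_per_customer_month(records):
    """Group by (customer, month) and average the hypothetical `amount` field."""
    totals = defaultdict(lambda: [0.0, 0])
    for record in records:
        key = (record["customer"], record["month"])
        totals[key][0] += record["amount"]
        totals[key][1] += 1
    return [
        {"customer": customer, "month": month, "average": total / count}
        for (customer, month), (total, count) in sorted(totals.items())
    ]


def to_json_lines(rows):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Hypothetical input rows
records = [
    {"customer": "Mario X.", "month": "June", "amount": 100.0},
    {"customer": "Mario X.", "month": "June", "amount": 150.0},
    {"customer": "Luisa Y.", "month": "June", "amount": 80.0},
]
expected = average_per_customer_month(records)
print(to_json_lines(expected))
```

Sorting the groups makes the expected output deterministic, which matters because Spark gives no ordering guarantee for grouped results.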

The necessary Luigi configuration would be as follows (assuming Spark is installed):

[spark]
master: local

Testing with fixtures

To run a Luigi pipeline we need a Luigi configuration loaded into memory. In a real-world pipeline it will contain Luigi-specific configuration along with application-specific settings.

This is a good example for a fixture…
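Since the post stops here, the following is only a guess at the kind of helper such a fixture could wrap: it writes a throwaway luigi.cfg with the local Spark master shown above. The function name and the my_app section are hypothetical, and the sketch uses only the standard library so it does not depend on Pytest or Luigi being installed.

```python
import configparser
import tempfile
from pathlib import Path


def write_test_luigi_config(directory, extra_sections=None):
    """Write a luigi.cfg for local Spark runs and return its path."""
    config = configparser.ConfigParser()
    config["spark"] = {"master": "local"}
    # Application-specific sections can be added per test.
    for section, values in (extra_sections or {}).items():
        config[section] = values
    path = Path(directory) / "luigi.cfg"
    with path.open("w") as handle:
        config.write(handle)
    return path


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = write_test_luigi_config(tmp, {"my_app": {"batch_size": "10"}})
    loaded = configparser.ConfigParser()
    loaded.read(cfg_path)
    print(loaded["spark"]["master"])
```

A Pytest fixture could call this helper with its temporary directory and then make Luigi load the file before the task under test runs. How exactly Luigi is pointed at the file depends on the Luigi version and setup.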

Note: This post appears to be incomplete in the original source.

Getting Fancier - Using Hypothesis to generate test data

To be continued…

Schema Testing with JSON schemas and Voluptuous

To be continued…

Using mypy for Improving your Codebase

2017-05-14T12:18:52+00:00

TL;DR

In this article I use mypy to document and add static type checking to an existing codebase, and I describe the reasons why I believe using mypy can help in the refactoring and documentation of legacy code while following the Boy Scout Rule.

Intro

We all love Python. It is a multi-paradigm dynamic programming language that is very popular in Data Science and Machine Learning. Besides some small quirky things in the language, I am quite happy with how it is evolving. However, there are some areas where I thought Python could do better for improving programming productivity in specific contexts:

While it is easy to hack scripts together and get something running really fast, maintaining the result becomes an issue once the codebase is large and complex enough.
Many times while reading other people’s code (heck, even my own code), and even when documented, it is really hard to figure out what a method or function is doing without a clear knowledge of the types you are working with. In many cases having just the type information (i.e. via a simple comment) would make understanding the code a whole lot faster.

I have also spent a lot of time debugging just because the wrong type was passed to a function/method (e.g. the wrong variable was passed to a method, wrong argument order, etc.). Because of Python’s dynamic typing the interpreter and/or linter could not warn me. Plus, some of those errors were only evident at execution time, generally in edge cases.

Although we all like working on greenfield projects, in the real world you will have to work with legacy code and it will generally be ugly and full of issues. Let’s take a look at some Python 2.7 legacy code I have to maintain:

# snnipets.py
import itertools as it  # the snippet below uses the alias `it`
def get_hotel_type_snippets(self, hotel_type_id, cat_set):
    snippets = self.get_snippets(hotel_type_id, "pos")
    snippets += list(it.chain.from_iterable(
        self.get_snippets(
            rel_cat,
            cat_set[rel_cat].sentiment
        )
        for rel_cat
        in cat_set[hotel_type_id].cat_def.related_cats
        if rel_cat in cat_set and cat_set[rel_cat].sentiment == "pos"
    ))
    return snippets[:self.max_snippets]

Don’t focus too much on the fact that it has no documentation and forget about the ugly comprehension inside.

In order to understand this code I have to answer the following questions:

What type is hotel_type_id (is it an int?)
What type is cat_set, it looks like a dictionary containing something else.

These two issues could be fixed with a proper docstring. However, comments sometimes don’t contain all the information required, don’t include the types of the parameters being passed, or easily become inconsistent when the code is changed but the comment is not updated.

If I want to understand the code I will have to look for its usage, maybe grepping through the code for something called related_cats or sentiment. If you have a large codebase, you might even find many classes implementing the same method name.

I have two choices when I need to modify existing code like this. I can either hack my way around, modifying it enough to make it do what I want, or I can look for a way to make this code better (i.e. follow the Boy Scout Rule). Besides adding the needed documentation, it would be cool to specify the types in a way that a static linter could also use.

Enter mypy

Luckily I was not the only one with this problem (or desire), and that’s one of the reasons PEP-484 came to life. The goal is to provide Python with optional type annotations that allow an offline static linter to check for type issues. However I believe making the code easier to understand (via type documentation) is an awesome side-product.

There is an implementation of this PEP called mypy, which was in fact the inspiration for the PEP in the first place. Mypy provides a static type checker that works with Python 3 (using type annotations) and Python 2.7 (using specially crafted comments).

At TrustYou we have a lot of Python 2.7 legacy code that suffers from many of the issues mentioned above, so I decided to give it a try in a new project I was working on, and I have to say it helped catch some issues early in the development stage. I also tried it in an existing code base that was hard to read because of its structure.

Let’s go back to the example code I shared before and let’s document the code using type annotations:

from typing import Any, List, Dict  # noqa: F401
from metaprecomp.tops_flops_bake.category import CategorySet  # noqa: F401

def get_hotel_type_snippets(self, hotel_type_id, cat_set):
    # type: (str, CategorySet) -> List[Dict[str, Any]]

    snippets = self.get_snippets(hotel_type_id, "pos")
    # (...) as before

As you might guess, (str, CategorySet) are the types of the method parameters. What follows -> is the return type, in this example a list of dictionaries from str to Any. Any is a catch-all type. It helps when you don’t know the type (in this case, I would have had to read the code further, and I was too lazy) or when the function can return literally any type.

Some notes from the code above:

You might have noticed the from typing import Any, ... line: the typing library brings the required types into Python 2.7, even when they are used only in comments. So yeah, you will need to add it to your requirements.txt.
You also noticed I had to explicitly import CategorySet from the category module (even though I only use it in a comment). I find that good, as I am stating there’s a relationship or dependency between those modules.
Finally, you also noticed the # noqa: F401. This keeps flake8 or pylint from complaining about unused imports. It is not nice, but it is a minor annoyance.
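
To make the two notations concrete, here is a minimal sketch of the same signature written both ways. The function names and the Snippet alias are made up for this illustration and are not part of the code base discussed above.

```python
from typing import Any, Dict, List

# Hypothetical alias, used only in this sketch.
Snippet = Dict[str, Any]


def top_snippets_py2(snippets, limit):
    # type: (List[Snippet], int) -> List[Snippet]
    """Comment-style signature, the form mypy reads for Python 2.7 code."""
    return snippets[:limit]


def top_snippets_py3(snippets: List[Snippet], limit: int) -> List[Snippet]:
    """Inline annotations, the Python 3 way of saying the same thing."""
    return snippets[:limit]


snippets = [{"text": "Great pool"}, {"text": "Friendly staff"}]
# Both functions behave identically; only the notation differs.
assert top_snippets_py2(snippets, 1) == top_snippets_py3(snippets, 1)
```

Note that the type comment has to sit directly under the def line, before the docstring, for the checker to pick it up.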

Installing and running mypy

So far we have used mypy syntax (actually PEP 484 - Type Hints) to do some annotation, but all this hassle should bring something to the table besides nifty documentation. So let’s install mypy and try the command line.

Running mypy requires a Python 3 environment, so if your main Python environment is 2.7 you will need to install it in a separate one. Luckily you can call the binary directly (even when your Py27 environment is activated). If you use Anaconda you can easily create a dedicated environment for mypy:

[miguelc@machine]$ conda create -n mypy python=3.6
(...)
[miguelc@machine]$ source activate mypy
(mypy)[miguelc@machine]$ pip install mypy  # to get the latest mypy
(mypy)[miguelc@machine]$ ln -s `which mypy` $HOME/bin/mypy   # I have $HOME/bin in my $PATH
(mypy)[miguelc@machine]$ source deactivate
[miguelc@machine]$ mypy --help    # this should work

With that out of the way, we can start using the mypy executable to check our source code. I run mypy the following way:

[miguelc@machine]$ mypy --py2 --ignore-missing-imports  --check-untyped-defs  [directory or files]

--py2: indicates that the code to check is a Python 2 codebase.
--ignore-missing-imports: tells mypy to ignore error messages when imports cannot be resolved, e.g. when the packages are not installed in the environment mypy is running in.
--check-untyped-defs: also type-checks the bodies of functions that have no type annotations, which mypy skips by default.
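
As a rough illustration of what --check-untyped-defs changes, consider the sketch below. The describe_rating function is made up for this example. Because it has no annotations, mypy does not look inside its body by default; with the flag enabled, I would expect the int + str line to be reported without ever running the code.

```python
def describe_rating(score):
    # No type comment here, so by default mypy does not check this body.
    rounded = int(score)
    # Bug: adding a str to an int. Python only complains when this line runs.
    return rounded + " stars"


try:
    describe_rating(4.6)
except TypeError as exc:
    print("only found at run time:", exc)
```

This is why I keep the flag on for legacy code: most functions start out unannotated, and without it they would be skipped entirely.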

The command line tool provides a lot of options and the documentation is very good. An interesting feature is that it allows you to generate reports that can be displayed using CI tools like Jenkins.

Checking for type errors

Let’s take a look at another method I annotated to illustrate the kind of errors you can find with mypy after adding type annotations:

from typing import Any, List, Dict, FrozenSet  # noqa: F401

def get_snippets(
        self, category_id, sentiment,
        pos_contradictory_subcat_ids=frozenset(),
        neg_contradictory_subcat_ids=frozenset()):
        # type: (str, str, FrozenSet[str], FrozenSet[str]) -> List[Dict[str, str]]

        # (...) not relevant code...

Indeed, another method with no documentation whatsoever. So I had to read a little bit of the code to figure out what the input and return types are. Now let’s imagine that somewhere in the code something like this happens:

# bake_reduce.py
cat = 13
# (...)
snippets_generator = SnippetsGenerator(
    snippets_by_cat_sent,
    self.metacategory_bundle[lang]
)
snippets_generator.get_snippets(cat, "pos")

If I run mypy I would get the following error:

[miguelc@machine]$ mypy --ignore-missing-imports  --check-untyped-defs  --py2  metaprecomp/tops_flops_bake/bake_reduce.py
metaprecomp/tops_flops_bake/bake_reduce.py:238: error: Argument 1 to "get_snippets" of "SnippetsGenerator" has incompatible type "int"; expected "str"

If you come from the statically typed language world this should look really normal to you, but for Python developers finding an error like this (in particular in large code bases) requires spending quite a bit of time debugging (and sometimes the use of Voodoo magic).
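
To show why this kind of mistake is so expensive to find by hand, here is a small self-contained sketch. The SnippetsGenerator below is a toy stand-in I made up for this post, not the real class: it keeps the (str, str) signature but stores snippets in a plain dictionary.

```python
from typing import Dict, List, Tuple


class SnippetsGenerator(object):
    """Toy stand-in for the real class; only the signature matters here."""

    def __init__(self, snippets_by_cat_sent):
        # type: (Dict[Tuple[str, str], List[str]]) -> None
        self.snippets_by_cat_sent = snippets_by_cat_sent

    def get_snippets(self, category_id, sentiment):
        # type: (str, str) -> List[str]
        return self.snippets_by_cat_sent.get((category_id, sentiment), [])


generator = SnippetsGenerator({("13", "pos"): ["Great pool"]})

# Passing the int 13 instead of the str "13" raises nothing at run time:
# the dictionary lookup simply misses and an empty list comes back.
print(generator.get_snippets(13, "pos"))    # wrong type, silent miss
print(generator.get_snippets("13", "pos"))  # correct type, snippet found
```

Nothing crashes here, so the bug would only show up as missing snippets somewhere downstream. With the type comments in place, a checker should flag the first call in the same way as the error shown above.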

When to use mypy

Optional type annotations are just that: optional. You can start hacking as normal, using the speed that Python’s dynamic typing gives you, and once your code is stable enough you can gradually add type annotations to help avoid bugs and to document the code. The mypy FAQ lists some scenarios in which a project will benefit from using static type annotations:

Your project is large or complex.
Your codebase must be maintained for a long time.
Multiple developers are working on the same code.
Running tests takes a lot of time or work (type checking may help you find errors early in development, reducing the number of testing iterations).
Some project members (devs or management) don’t like dynamic typing, but others prefer dynamic typing and Python syntax. Mypy could be a solution that everybody finds easy to accept.
You want to future-proof your project even if currently none of the above really apply.

In the particular case of my team, a lot of the code we write ends up running for quite a long time inside of MapReduce (Hadoop) jobs, so being able to detect bugs ahead of time would save precious developer time and make everyone happier.

Adding support to Emacs

By now you might be thinking that it would be cool to integrate mypy checks into your editor. Some, like PyCharm, already support this. For Emacs you can integrate mypy into Flycheck via flycheck-mypy. You can install it via M-x package-install flycheck-mypy. Configuring it is a matter of setting a couple of variables:

(set-variable 'flycheck-python-mypy-executable "/Users/miguel/anaconda2/envs/py35/mypy/mypy")
(set-variable 'flycheck-python-mypy-args '("--py2"  "--ignore-missing-imports" "--check-untyped-defs"))

Mypy recommends disabling all other linters/checkers like flake8 when using it. However, I wanted to keep both at the same time (call me paranoid). In Emacs, you can accomplish this with the following configuration:

(flycheck-add-next-checker 'python-flake8 'python-mypy)

Final words and references

Using mypy won’t magically find errors in your code. It will only be as good as the type annotations you add and the way you structure the code. Also, it is not a replacement for proper documentation. Sometimes there are methods/functions that become easier to read just by adding type annotations, but documenting key parts of the code is vital for ensuring code maintainability and extensibility.

I did not mention all the features of mypy, so please check the official documentation to learn more.

There are a couple of talks that can serve as a nice introduction to the topic:

Introducing Type Annotations for Python - by Guido, Greg Price and David Fisher
Static Types for Python PyCon 2017 - by Jukka Lehtosalo and David Fisher

The first one is given by Guido, who’s pushing the project a lot. Thus, I expect mypy to become more popular in the coming years. Happy hacking.