# Using mypy for Improving your Codebase

## TL;DR

In this article I use mypy to document and add static type checking to an existing codebase, and I describe why I believe mypy can help with refactoring and documenting legacy code while following the Boy Scout Rule.

## Intro

We all love Python: it is a multi-paradigm dynamic programming language that is very popular in Data Science and Machine Learning. Besides some small quirks in the language, I am quite happy with how it is evolving. However, there are some areas where I think Python could do better for programming productivity in specific contexts:

• While it is easy to hack together scripts and get something running fast, maintaining a code base becomes a problem once it grows large and complex enough.
• Many times while reading other people's code (heck, even my own code), even when it is documented, it is really hard to figure out what a method or function is doing without a clear knowledge of the types involved. In many cases just having the type information (e.g. via a simple comment) would make understanding the code a whole lot faster.

I have also spent a lot of time debugging just because the wrong type was passed to a function/method (e.g. the wrong variable was passed to a method, wrong argument order, etc.). Because of Python's dynamic typing, neither the interpreter nor the linter could warn me. Plus, some of those errors only became evident at execution time, generally in edge cases.

Although we all like working on greenfield projects, in the real world you will have to work with legacy code, and it will generally be ugly and full of issues. Let's take a look at some Python 2.7 legacy code I have to maintain:

    # snnipets.py
    import itertools as it  # the snippet uses itertools under the alias `it`

    def get_hotel_type_snippets(self, hotel_type_id, cat_set):
        snippets = self.get_snippets(hotel_type_id, "pos")
        snippets += list(it.chain.from_iterable(
            self.get_snippets(
                rel_cat,
                cat_set[rel_cat].sentiment
            )
            for rel_cat
            in cat_set[hotel_type_id].cat_def.related_cats
            if rel_cat in cat_set and cat_set[rel_cat].sentiment == "pos"
        ))
        return snippets[:self.max_snippets]


Don't focus too much on the fact that it has no documentation and forget about the ugly comprehension inside.

In order to understand this code I have to answer the following questions:

• What type is hotel_type_id (is it an int?)
• What type is cat_set, it looks like a dictionary containing something else.

These two issues could be fixed with a proper docstring. However, comments sometimes don't contain all the information required, don't include the types of the parameters being passed, or easily become inconsistent when the code changes but the comment is not updated.

If I want to understand the code I will have to look for its usage, maybe grepping through the code for something called related_cats or sentiment. If you have a large codebase, you might even find many classes implementing the same method name.

I have two choices when I need to modify existing code like this. I can either hack my way around, modifying it just enough to make it do what I want, or I can look for a way to make this code better (i.e. the Boy Scout Rule). Besides adding the needed documentation, it would be cool to have a way to specify the types so that a static linter could use them.

## Enter mypy

Luckily I was not the only one with this problem (or desire), and that's one of the reasons PEP-484 came to life. The goal is to provide Python with optional type annotations that allow an offline static linter to check for type issues. However I believe making the code easier to understand (via type documentation) is an awesome side-product.

There is an implementation of this PEP called mypy, which was in fact the inspiration for the PEP itself. Mypy provides a static type checker that works with Python 3 (using type annotations) and Python 2.7 (using specially crafted comments).

At TrustYou we have a lot of Python 2.7 legacy code that suffers from many of the issues mentioned above, so I decided to give mypy a try in a new project I was working on, and I have to say it helped catch some issues early in the development stage. I also tried it in an existing code base that was hard to read because of its structure.

Let's go back to the example code I shared before and let's document the code using type annotations:

    from typing import Any, List, Dict  # noqa: F401
    from metaprecomp.tops_flops_bake.category import CategorySet  # noqa: F401

    def get_hotel_type_snippets(self, hotel_type_id, cat_set):
        # type: (str, CategorySet) -> List[Dict[str, Any]]

        snippets = self.get_snippets(hotel_type_id, "pos")
        # (...) as before


As you might guess, (str, CategorySet) are the types of the method parameters. What follows -> is the return type, in this example a list of dictionaries from str to Any. Any is a catch-all type. It helps when you don't know the type (in this case, I would have had to read the code further, and I was too lazy) or when the function can return literally any type.

Some notes from the code above:

• You might have noticed the from typing import Any, ... line. The typing library brings the required types into Python 2.7, even when they are used only in comments. So yeah, you will need to add it to your requirements.txt.
• You also noticed that I had to import CategorySet explicitly from the category module (even though I only used it in a comment). I find that good, as I am stating that there is a relationship or dependency between those modules.
• Finally, you also noticed the # noqa: F401. This keeps flake8 or pylint from complaining about unused imports. It is not nice, but it is a minor annoyance.
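
As a minimal, self-contained sketch of the comment syntax (the function and its names are hypothetical and not taken from the code base above), the following should run unchanged on Python 3 and on Python 2.7 with the typing backport installed:

```python
from typing import Any, Dict, List  # noqa: F401


def limit_snippets(snippets, max_snippets):
    # type: (List[Dict[str, Any]], int) -> List[Dict[str, Any]]
    """Return at most max_snippets items, keeping the original order."""
    return snippets[:max_snippets]


# Hypothetical sample data, only here to exercise the function.
example = [{"text": "great pool"}, {"text": "friendly staff"}]
print(limit_snippets(example, 1))
```

The type comment is ignored by the interpreter, so the function behaves exactly as before; only a checker such as mypy reads it.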

## Installing and running mypy

So far we have used mypy syntax (actually PEP 484 - Type Hints) to do some annotation, but all this hassle should bring something to the table besides a nifty documentation. So let's install mypy and try the command line.

Running mypy requires a Python 3 environment, so if your main Python environment is 2.7 you will need to install it in a separate one. Luckily you can call the binary directly (even when your Py27 environment is activated). If you use Anaconda you can easily create a dedicated environment for mypy:

    [miguelc@machine]$ conda create -n mypy python=3.6
    (...)
    [miguelc@machine]$ source activate mypy
    (mypy)[miguelc@machine]$ pip install mypy  # to get the latest mypy
    (mypy)[miguelc@machine]$ ln -s `which mypy` $HOME/bin/mypy  # I have $HOME/bin in my $PATH
    (mypy)[miguelc@machine]$ source deactivate
    [miguelc@machine]$ mypy --help  # this should work

With that out of the way, we can start using the mypy executable to check our source code. I run mypy the following way:

    [miguelc@machine]$ mypy --py2 --ignore-missing-imports --check-untyped-defs [directory or files]

• --py2: indicates that the code to check is a Python 2 codebase.
• --ignore-missing-imports: tells mypy not to report errors for imports it cannot resolve, e.g. packages that are not installed in the environment mypy is running in.
• --check-untyped-defs: type-checks the bodies of functions even when they have no type annotations, instead of skipping them.

The command line tool provides a lot of options and the documentation is very good. An interesting feature is that it allows you to generate reports that can be displayed using CI tools like Jenkins.

## Checking for type errors

Let's take a look at another method I annotated to show the kind of errors you can find with mypy after adding type annotations:

    from typing import Any, List, Dict, FrozenSet  # noqa: F401

    def get_snippets(
        self, category_id, sentiment,
        # (...) remaining parameters elided
    ):
        # type: (str, str, FrozenSet[str], FrozenSet[str]) -> List[Dict[str, str]]

        # (...) not relevant code...


Indeed, another method with no documentation whatsoever. So I had to read a little bit of the code to figure out what are the input and return types. Now let's imagine that somewhere in the code something like this happens:

    # bake_reduce.py
    cat = 13
    # (...)
    snippets_generator = SnippetsGenerator(
        snippets_by_cat_sent,
        self.metacategory_bundle[lang]
    )
    snippets_generator.get_snippets(cat, "pos")
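
To see why a mistake like this is hard to spot at runtime, here is a rough sketch with a simplified, hypothetical stand-in for SnippetsGenerator (the real class is not shown in this article, and I am assuming a plain dictionary lookup keyed by category id and sentiment):

```python
from typing import Dict, List, Tuple  # noqa: F401


class SnippetsGenerator(object):
    """Hypothetical stand-in: looks snippets up by (category_id, sentiment)."""

    def __init__(self, snippets_by_cat_sent):
        # type: (Dict[Tuple[str, str], List[Dict[str, str]]]) -> None
        self.snippets_by_cat_sent = snippets_by_cat_sent

    def get_snippets(self, category_id, sentiment):
        # type: (str, str) -> List[Dict[str, str]]
        return self.snippets_by_cat_sent.get((category_id, sentiment), [])


generator = SnippetsGenerator({("13", "pos"): [{"text": "great pool"}]})

found = generator.get_snippets("13", "pos")  # category id passed as str
missing = generator.get_snippets(13, "pos")  # int instead of str: no exception
print(found, missing)
```

Nothing fails at runtime in this sketch. The call with the int simply finds no matching key and quietly returns an empty list, which is the kind of bug that takes a while to track down without a type checker.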


If I run mypy I would get the following error:

    [miguelc@machine]$ mypy --ignore-missing-imports --check-untyped-defs --py2 metaprecomp/tops_flops_bake/bake_reduce.py
    metaprecomp/tops_flops_bake/bake_reduce.py:238: error: Argument 1 to "get_snippets" of "SnippetsGenerator" has incompatible type "int"; expected "str"

If you come from the world of statically typed languages this should look really normal to you, but for Python developers finding an error like this (in particular in large code bases) takes quite a bit of debugging time (and sometimes the use of Voodoo magic).

## When to use mypy

Optional type annotations are just that, optional. You can start hacking as usual, with the speed that Python's dynamic typing gives you, and once your code is stable enough you can gradually add type annotations to help avoid bugs and to document the code.

The mypy FAQ contains some scenarios in which a project will benefit from using static type annotations:

• Your project is large or complex.
• Your codebase must be maintained for a long time.
• Multiple developers are working on the same code.
• Running tests takes a lot of time or work (type checking may help you find errors early in development, reducing the number of testing iterations).
• Some project members (devs or management) don’t like dynamic typing, but others prefer dynamic typing and Python syntax. Mypy could be a solution that everybody finds easy to accept.
• You want to future-proof your project even if currently none of the above really apply.

In the particular case of my team, a lot of the code we write ends up running for quite a long time inside MapReduce (Hadoop) jobs, so being able to detect bugs ahead of time saves precious developer time and makes everyone happier.

## Adding support to Emacs

By now you might be thinking that it would be cool to integrate mypy checks into your editor. Some, like PyCharm, already support this. For Emacs you can integrate mypy into Flycheck via flycheck-mypy.
You can install it via M-x package-install flycheck-mypy. Configuring it is a matter of setting a couple of variables:

    (set-variable 'flycheck-python-mypy-executable "/Users/miguel/anaconda2/envs/py35/mypy/mypy")
    (set-variable 'flycheck-python-mypy-args '("--py2" "--ignore-missing-imports" "--check-untyped-defs"))

Mypy recommends disabling all other linters/checkers like flake8 when using it. However, I wanted to keep both at the same time (call me paranoid). In Emacs, you can accomplish this with the following configuration:

    (flycheck-add-next-checker 'python-flake8 'python-mypy)

## Final words and references

Using mypy won't magically find errors in your code; it will only be as good as the type annotations you add and the way you structure the code. Also, it is not a replacement for proper documentation. Some methods/functions become easier to read just by adding type annotations, but documenting key parts of the code is vital for keeping it maintainable and extensible.

I did not mention all the features of mypy, so please check the official documentation to learn more.

There are a couple of talks that can serve as a nice introduction to the topic. The first of them is given by Guido, who is pushing the project a lot. Thus, I expect mypy to become more popular in the coming years.

Happy hacking.

# This Site Runs on Nikola

I have changed the site from Jekyll to Nikola, mainly because I am mostly a Python coder nowadays and because the Jekyll version I was using was kind of hacky and full of patches I had created.

So let's see if I finally get to write more often. I have been thinking about becoming an Iron Blogger, but it might be too much. If I start blogging more using this setup I will definitely consider it.

I really dig that Nikola offers good support for the org-mode format (which I use to store my notes and other personal information). Actually, this post was written using org-mode from Emacs.
It also has good support for blogging with Jupyter Notebooks, which I also use a lot at work. It also comes with capabilities for importing from systems like Wordpress or Jekyll. I do have to change the Disqus URLs to match the new format.

The rest of the post is going to be me trying out some features of Nikola and Emacs org-mode.

    for x in range(1, 10):
        print(x)
    print(list(range(0, 9)))

    1
    2
    3
    4
    5
    6
    7
    8
    9
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

With org-babel I can run the code inline and get the output. I find that pretty neat when blogging and showing code snippets.

A random graph, trying out how to use Nikola's short codes.

[Chart: "Browser usage evolution (in %)", x-axis 2002 to 2012, series: Firefox, Chrome, IE, Others]

Short codes need to be surrounded by #+BEGIN_EXAMPLE #+END_EXAMPLE.

It took me a while to get used to Nikola's style and to configure it properly. I should have written the steps down but I forgot.
However, there are plenty of resources.

# Ha-doopla! Tool for displaying Python Hadoop Streaming Errors

## Introducing Doopla

At TrustYou we use Luigi a lot for building our data pipelines, mostly made of batch Hadoop MapReduce jobs. We have a couple of clusters, one using a pretty old version of Hadoop and a more recent one, where we use HDP 2.7.

Writing Hadoop MR jobs in Python is quite nice, and it is even more straightforward with Luigi's support. The big issue comes when you are developing and have to debug. Although I try to reduce the amount of in-cluster debugging (for example by using domain classes and writing unit tests against them), sometimes you have no choice.

And then the pain comes. Once your mapper or reducer fails, most of the time Luigi cannot show you the reason for the failure, and you have to go to the web interface and click through manually many times until you sort of find the error message, hopefully with enough debugging information.

So after debugging my MR jobs this way for a while I got really annoyed and decided to automate that part. I created Doopla, a small script that fetches the output (generally stderr) of a failed mapper and/or reducer and uses Pygments to highlight the failing Python code. If no job ID is specified, it fetches the output of the last failed job.

It was a two-hour hack at the beginning, so it is not code I am proud of, but I made it public and even published it on PyPI (a chance to learn something new as well), so it can be installed easily with pip install doopla. It initially only supported our old Hadoop version, but the latest release works with HDP 2.7 (and I guess it might work with other Hadoop versions). Newer versions of Hadoop offer a REST API for querying job status and information, but I kept scraping the information (hey, it is a hack).
You can also integrate it into Emacs (highlighting and everything) with a few lines of Emacs Lisp, and then hit M-x doopla to obtain the same output without leaving your lovely editor.

# Attending Europython 2016

I attended this year once again and it was pretty amazing. The weather did not help a lot, but both the content and the people were really good. This time I gave a short talk about some of the things I learned in these couple of years using Python for data processing. The idea was to cover a wide range of topics, so the talk was not hardcore technical but quite basic. There is also a video available for the curious.

Overall a good experience with some really good talks; the lightning talk show in particular was really entertaining. The keynotes were really well selected and the food was OK. I missed some of the nice sponsors from last year, and Microsoft did not even show up.

Next year is going to be in Milan, so I am really looking forward to next year's Europython.

# Running Luigi Hadoop JobTask in a Virtual Environment

## Virtual Environments and Hadoop Streaming

If you are using Python to write Hadoop streaming jobs, you might have experienced the problem of keeping the nodes supplied with the required packages. Furthermore, if you happen to have different sets of jobs, workflows or pipelines that require different versions of packages, you might find yourself in a not so comfortable situation.

A former work colleague wrote about how to alleviate this by using Python's virtual environments. So I am going to assume you have quickly browsed the article and are wondering how to do something similar with Luigi.
Before talking about Luigi, here is a summary of running streaming jobs with virtualenvs (without Luigi).

Normally, if you don't need a virtualenv, you write a Python script for the mapper and one for the reducer and, assuming the data you need to process is already on HDFS, you call it something like this:

    [mc@host]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapreduce.job.name="My Cool MR Job" \
        -files reducer.py,mapper.py \
        -mapper mapper.py \
        -reducer reducer.py


So here mapper.py is the mapper and reducer.py is the reducer. Nothing new if you have used Python for Hadoop Streaming. Now, let's assume we want a particular module that is not installed at the system level on the nodes of the cluster. Say, spaCy:

    [mc@host]$ virtualenv-2.7 demoenv
    New python executable in demoenv/bin/python2.7
    ... done.
    [mc@host]$ source demoenv/bin/activate
    (demoenv)[mc@host]$ pip install spacy
    Collecting spacy
    ...
    Successfully built plac semver
    Installing collected packages: cymem, preshed, murmurhash, thinc, semver, sputnik, cloudpickle, plac, spacy
    Successfully installed cloudpickle-0.2.1 cymem-1.31.1 murmurhash-0.26.3 plac-0.9.1 preshed-0.46.3 semver-2.4.1 spacy-0.100.6 sputnik-0.9.3 thinc-5.0.7
    (demoenv)[mc@host]$ deactivate
    [mc@host]$ virtualenv-2.7 --relocatable demoenv
    [mc@host]$ cd demoenv; zip --quiet --recurse-paths ../demoenv.zip *
    [mc@host]$ hadoop fs -put -f demoenv.zip

I make the virtualenv relocatable so that it can be moved and both the binaries and the libraries are referenced using relative paths. Bear in mind that the documentation also says that this feature is experimental and has some caveats, so use it at your own risk. I also compress it and upload it to HDFS.

Now, to run it we need to do two things: change the shebang of the scripts to point to the venv, and point to the archive with the -archives parameter when running the Hadoop streaming job. Assuming we create a link to the archive with the name demoenv, we change the beginning of mapper.py and reducer.py:

    #!./demoenv/bin/python
    import spacy
    ....

And then we execute:

    [mc@host]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapreduce.job.name="My Cool MR Job" \
        -files reducer.py,mapper.py \
        -mapper mapper.py \
        -reducer reducer.py \
        -archives hdfs://[host]:8020/user/mc/demoenv.zip#demoenv


Note the archives parameter with the symlink. That symlink has to match the path specified on the shebang.
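
To make the shebang part concrete, here is a minimal sketch of what such a mapper.py could look like. The word-count logic and the function names are my own illustration, not code from the original job, and I left out the spaCy import so the sketch runs anywhere:

```python
#!./demoenv/bin/python
"""Hypothetical mapper.py: emits one "token<TAB>1" line per input token."""
import io
import sys


def map_line(line):
    """Split a line on whitespace and return (token, 1) pairs."""
    return [(token.lower(), 1) for token in line.split()]


def run_mapper(stdin, stdout):
    """Read lines from stdin and write tab-separated key/value pairs."""
    for line in stdin:
        for token, count in map_line(line):
            stdout.write("%s\t%d\n" % (token, count))


# In the real script this would be run_mapper(sys.stdin, sys.stdout);
# an in-memory stream is used here so the example runs without Hadoop.
out = io.StringIO()
run_mapper(io.StringIO(u"Great pool great staff\n"), out)
sys.stdout.write(out.getvalue())
```

The only venv-specific piece is the first line: ./demoenv is resolved relative to the task's working directory, which is where Hadoop places the symlink named after the # in the -archives value.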

So far I have showed nothing new but a compressed version of Hennig's post. So far it was impossible to do something similar with Luigi unless you created a JobRunner by basically rewriting (i.e. copy and pasting) some of Luigi's code. So I decided to make a small contribution to Luigi that would allow me to implement something similar to the things described in the previous section.

With that change in Luigi's code, it is easy to create a new base class that pulls the virtual environment location from Luigi's configuration and sets up a runner that passes the parameter to add the archive to the underlying Hadoop streaming call.

I created the VenvJobTask class, which reads the virtual environment location from the configuration. The location can be local or on HDFS. It overrides the job_runner method to set up the Python executable path properly (so no shebang modification is needed in this case). It references a small custom runner class that changes the default behavior of DefaultHadoopJobRunner to pass the appropriate parameter.
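The core of that idea can be sketched without Luigi at all: given the location of the zipped virtual environment and a link name, produce the extra arguments for the Hadoop streaming call and the interpreter path that replaces the shebang trick. This is only an illustration under my own assumptions, not Luigi's code; the function names `venv_streaming_args` and `build_streaming_command` and their parameters are hypothetical.

```python
def venv_streaming_args(venv_archive, link_name="venv", python_bin="bin/python"):
    """Return (archives_args, python_executable) for a Hadoop streaming call.

    venv_archive: location of the zipped virtualenv (local path or hdfs:// URL).
    link_name: name of the symlink created in the task's working directory.
    """
    # "archive#link" asks Hadoop to unpack the archive and expose it as ./link
    archives_args = ["-archives", "{}#{}".format(venv_archive, link_name)]
    # The interpreter lives inside the unpacked archive, relative to the task cwd
    python_executable = "./{}/{}".format(link_name, python_bin)
    return archives_args, python_executable


def build_streaming_command(streaming_jar, venv_archive, link_name="venv"):
    """Assemble the argument list for a streaming job that runs inside the venv."""
    archives_args, python = venv_streaming_args(venv_archive, link_name)
    return (
        ["hadoop", "jar", streaming_jar]
        + archives_args  # generic options go before the streaming options
        + ["-files", "mapper.py,reducer.py"]
        # Calling the interpreter explicitly means the scripts need no special shebang
        + ["-mapper", "{} mapper.py".format(python)]
        + ["-reducer", "{} reducer.py".format(python)]
    )
```

Calling `build_streaming_command` with the streaming jar path, the archive URL and `"demoenv"` as the link name yields an argument list equivalent to the command line from the previous section, with the mapper and reducer started through `./demoenv/bin/python`.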

At the time of writing, this change has not yet been released, so it will probably be part of Luigi 2.1.2. In the meantime, you can install Luigi directly from GitHub. I tested the above code on Python 2.7.6 and Hadoop 2.7.1 in Hortonworks HDP 2.3.

# Starting Over

The best way to avoid starting over is never to stop doing something. Easy to say, hard to accomplish. The good thing is that this is mostly a personal exercise, as I don't have many readers and I generally post only technical stuff here. Anyway, a lot has changed since the last time I actually wrote something (besides a boring list of links, which I will probably keep doing).

• I moved to Berlin (it's been a year, amazing experience).
• I still work for TrustYou, but now remotely.
• I started a Medium blog, mostly in Spanish.
• I presented at PyData Berlin 2015. Met awesome people. Good experience.
• I attended EuroPython 2016, my first Python event. Another good experience. Bilbao and the surrounding area are very beautiful.
• I started playing my accordion more often. Now I take classes of the Colombian style via Skype with Luis Javier. An awesome teacher.
• I traveled to Colombia for a couple of months. I worked 6 weeks from there. (Does that make me a "Digital Nomad"?).
• I gave a talk in Colombia (in Spanish) about my work. Another highlight from the trip.
• I got into Ethnomusicology, in particular Gaita Music and the traditional tropical rhythms of Colombia. I still need to read some books by Peter Wade.
• I bought some Gaita Colombiana flutes (Kuisi in Kogi or Chuanas in Zenu language) from a good instrument maker in the proximity of Sahagún (Colombia). I plan to learn how to play them, at least a couple of songs.

Berlin is very different from Munich. Munich is clean, beautiful and organized. It is close to the Alps, and it is perfect for doing things outdoors, when the weather allows. The tech scene in Munich is OK but not that crazy. Although poorer, Berlin has a more dynamic startup environment and way more events and things to do inside the city. It has a lot of interest groups, for both technology and hobbies. It is way more multicultural, so you can find events of many types. There are huge scenes for different types of music and art. In particular, I am very interested in the Art and Tech, Music Technology and Maker scenes and events of the city. There are some things I miss about Munich, but I am pretty happy with the change. I think I will stay in this city for some years.

Regarding this blog, I am going to try to keep posting my link logs, but I will also try to write more about what I do day to day, mostly technical stuff. I will write on the Medium blog in Spanish and when posts are not so technical.

This blog post was in the making for a while. I think I got distracted with many things (including a move to Berlin, more on that later). I also started writing some posts on Medium, which I will continue using mostly for non-technical stuff.

For now let's just publish this:

## Burda Hackday - The Future of Finance

I participated in the Burda Hackathon about the future of finance. Here's my report on the experience.

## Colombian Salaries Survey.

A bit of self-promotion here. Famous Colombian entrepreneur Alexander Torrenegra recently released the results of a survey about the salaries of Colombian software developers. I wrote an article on Medium that uses the data in a slightly different way.

## Mining Massive Data Sets

This year I started some very interesting MOOCs, though I couldn't complete all of them. However, I did finish Mining Massive Data Sets from Coursera and got the certificate. I have to say that this is one of the best online courses I have taken so far. The topics are diverse and useful, and the course made me look at topics I had seen before in a different way. The only thing I did not like about the course was Prof. Ullman's videos. I mean, he is great, but his videos were the "heaviest" and way less entertaining than the rest. However, the overall course structure and videos are great. I would recommend this course to anyone wanting to dive deep into massive data processing. A background in machine

## What the F**K

I found this nice tool called thefuck that tries to match a wrong command to a correct one using your command history. Basically if after writing a wrong command line you write fuck it will try to automatically fix it.

## Pig DataFu

Lately I have been working with Pig, a dataflow programming language that is really useful for juggling data. I have been writing UDFs and playing with some of its advanced functionality. At TrustYou we use a library that I did not know before. It is called DataFu and it contains a lot of useful techniques.

## Postinternet

• I finally watched the GapMinder presentation. Really interesting how the world has changed and how we still hold sort of old world views.
• This GitHub repository contains interesting Notebooks, some of them extracted from nice books.
• O'Reilly launched https://beta.oreilly.com/ideas

## Machine Learning / Data Science Links

A couple of links to interesting NLP-related classes from the Stanford NLP group (these are not MOOCs, but some materials, maybe videos, are going to be posted online).

## Data Journalism

A really nice tutorial on how to do data journalism (in German though): http://howto.ddjdach.de/. On the right side there are various bars with links to nice tools that help with different stages of the data journalism (and to some extent data science) process.

We could discuss the ethics of downloading a song from Soundcloud the whole day. However, sometimes it is what you want to do. I found a little free application for Mac written by a Swiss guy.

You can thank me later.

## Nu-Cumbia and Alternative-Folk

Talking of Soundcloud, I am creating a couple of music lists. One is of Cumbia remixes / covers of famous pop songs (yeah, that thing actually exists). The other is what I call Colombian Alternative-Folk: Colombian groups that play Colombian folk music (there are many genres and subgenres) but with a twist.

## Delete Trailing whitespace on Emacs (for Python)

I had issues keeping my Python source code free of trailing whitespace. After a bit of research I came up with my own version (just for Python files, as it was affecting other modes like Deft/Org).

(defun py-delete-whitespace ()
  "Delete trailing whitespace, but only in `python-mode' buffers."
  (when (eq (buffer-local-value 'major-mode (current-buffer)) 'python-mode)
    (delete-trailing-whitespace)))



Learning a bit of e-lisp every day.

## Spacemacs / Prelude Emacs Distributions

Apparently the latest trend is to use neither Emacs nor Vim but a mixture of them. I found an interesting Emacs distribution that aims to provide Vim-like functionality while keeping the Emacs power. I also found another interesting distribution called Prelude. I am always looking for ways to improve my workflow, and Prelude seems to be a batteries-included distribution of Emacs. I am going to try them both out and decide whether to stay with my current Emacs configuration or switch to one of these distributions.

## Mapfrappe

I found a nice tool called MAPfrappe to "compare" maps and create cartographic "mixtures".

Munich on Berlin

Berlin on Munich

## Open Data and Civic Apps

• I found an interesting company called Data Made, from Chicago. A civic technology company working on projects related to open data.
• They are also the creators of Dedupe, a Python project to perform data deduplication. It uses a similar approach to the one we use for hotel deduplication @TrustYou.

## Colombian Software Developers

Recently two interesting articles about software developers in Colombia were written:

The first one is from Alexander Torrenegra and the other was written by Juan Pablo Buritica. Juan Pablo somewhat rebuts some of Alexander's claims. I think both make interesting points, and I think they have the experience and the authority to talk about the subject. I lean a bit more towards Alexander's opinion, as I think Computer Science education in Colombia is not good and needs an overhaul. I would love to write a small piece on my experience, as I graduated from a top Colombian university and I believe the education on computer science related topics is precarious.

## Social Network Extravaganza

Thanks to the Mining Massive Data Sets class on Coursera, I started getting curious about social networks. I wanted to explore a bit more, so I downloaded my friends dataset from Facebook and imported it into Gephi, a tool for graph visualization and analysis.

Gephi interface.

Below are a couple of interesting links for people who want to do the same:

### Install Gephi on Mac OSX Yosemite

There are some issues with installing Gephi on the latest version of OS X. This helped me solve the issue: http://sumnous.github.io/blog/2014/07/24/gephi-on-mac/

A nice tool that helped me extract my Facebook friend graph in a Gephi-compatible file format: http://snacourse.com/getnet

## Storify

https://storify.com/ I found this tool to be really nice for keeping a log of events and adding social network based stories.

## SimString

http://www.chokkan.org/software/simstring/ I have been taking the MMDS class on Coursera. Challenging but super interesting. I wanted to use LSH to do a sort of hashing of near-duplicate short sentences. It is a small dataset, so I believe SimString will actually work better.
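To make the idea concrete: on a small dataset, near-duplicate sentences can be found by comparing character n-gram sets directly, with no hashing at all. The sketch below is a brute-force illustration of that family of similarity measures, not SimString's API or its indexing algorithm; the function names and the default threshold are my own assumptions.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / float(len(a | b))


def near_duplicates(sentences, threshold=0.6, n=3):
    """Return index pairs (i, j), i < j, whose n-gram similarity reaches threshold."""
    grams = [char_ngrams(sentence, n) for sentence in sentences]
    pairs = []
    # Quadratic comparison: acceptable for a small dataset, too slow for a large one
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if jaccard(grams[i], grams[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The all-pairs loop is exactly the cost that LSH or an n-gram index avoids, so this only makes sense while the number of sentences stays small.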

## Delete Trailing whitespace on Save and compact Empty Lines on Emacs

http://ergoemacs.org/emacs/elisp_compact_empty_lines.html Awesome snippet, especially for Python development on Emacs.

## Intro

As it generally takes a lot of time to write a proper blog post, I have decided to give in to my laziness and start posting more often by simply sharing links to things I found during the week / month. It will also help me keep track of links that I find interesting.

## Syncing files automatically using Rsync and inotifywait / fswatch

Although Emacs has the power to connect directly to servers through SSH, I like to have my code locally first and then upload it to a server, for example to run a Hadoop job or some machine learning algorithm. For that Rsync is really nice, and I initially wrote a small function that I assigned to a keyboard shortcut:

(defun sync-sentency ()
  "Upload the sentency project to vmx with rsync, running in the background."
  (interactive)
  (shrink-window-if-larger-than-buffer)
  (shell-command "cd ~/development ; rsync --exclude='lib/stanford-ner' --exclude=files -az --progress sentency vmx:sentency &"))
(global-set-key (kbd "C-c C-y") 'sync-sentency)


But then I grew tired of always hitting that command, so I did a bit of research and found a couple of commands: fswatch and inotifywait. The first is available for many *NIX systems. Below are a couple of interesting links and scripts:

I ended up modifying an existing project to adapt it to my needs, and you can find the code here. It basically checks for changes to a file and uploads it accordingly using rsync. So no more hitting shortcuts. https://github.com/mfcabrera/fswatch-rsync
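The watch-then-upload loop can be sketched in a few lines of Python. Note that this version polls file modification times instead of using fswatch or inotifywait, so it is a simplified stand-in rather than what the linked project does; the function names and the polling interval are assumptions of mine.

```python
import os
import subprocess
import time


def snapshot(root):
    """Map each file path under root to its last-modification time."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime
            except OSError:
                # The file vanished between listing and stat; skip it
                continue
    return mtimes


def changed_paths(before, after):
    """Return paths added, removed or modified between two snapshots."""
    return sorted(
        path
        for path in set(before) | set(after)
        if before.get(path) != after.get(path)
    )


def rsync_command(local_dir, remote, excludes=()):
    """Build an rsync invocation in the spirit of the Emacs function above."""
    cmd = ["rsync", "-az"]
    cmd += ["--exclude={}".format(pattern) for pattern in excludes]
    return cmd + [local_dir, remote]


def watch(local_dir, remote, excludes=(), interval=1.0):
    """Poll local_dir forever and run rsync whenever something changed."""
    previous = snapshot(local_dir)
    while True:
        time.sleep(interval)
        current = snapshot(local_dir)
        if changed_paths(previous, current):
            subprocess.call(rsync_command(local_dir, remote, excludes))
        previous = current
```

Something like `watch("sentency", "vmx:sentency", excludes=["lib/stanford-ner", "files"])` would then replace the keyboard shortcut. Polling is wasteful compared to the kernel notifications that fswatch and inotifywait rely on, which is why those tools are the better choice for large trees.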

## Elpy, PEP8 and friends

I thought my coding style in Python was pretty good. Then I started using PEP8 and friends to check my code and I realized I sucked big time. The good thing is that thanks to modes like Elpy (a nice Python mode for Emacs) I can check my code while I write, and I can learn about the proper style. Hopefully that will help me become a better Pythonista (or Pythoneer?). A nice tutorial can be found here.

## Scratch Programming Language

I took a look at the Scratch programming language and went through a couple of tutorials. Really intuitive. I am thinking about teaching it to my niece. It has a large community, documentation in many languages and a recently released MOOC.

## Github aliases

Although for Git I use the super cool Magit (a mode for Emacs), sometimes I like to use the command line directly. At TrustYou we use Git, although not as professionally as we would like. I found a nice post featuring a lot of useful Git aliases from one of the developers of GitHub: http://haacked.com/archive/2014/07/28/github-flow-aliases/

## Working for Open Data Notebooks

While looking for information on how to integrate JS and IPython Notebook I stumbled upon the files for the class Working for Open Data. There are no videos of the class, but the slides, the readings and the IPython notebooks are really nice.

## Share Your Stack

Interesting project where you share your technology stack: http://stackshare.io/stacks

## MOOCS

Some MOOCs I am trying to follow (in that order):

There are some MOOCs running that although I find them really interesting, I have no time to follow:

I actually took the last one, but I did not complete all the homework. It gives a really nice overview of music production.

Upcoming MOOCS that I find attractive and I am going to try to follow: