Ha-doopla! Tool for displaying Python Hadoop Streaming Errors

Introducing Doopla

At TrustYou we use Luigi a lot for building our data pipelines, mostly made of batch Hadoop MapReduce jobs. We have a couple of clusters: one running a pretty old version of Hadoop and a more recent one running HDP 2.7.

Writing Hadoop MR jobs in Python is quite nice, and it is even more straightforward with Luigi's support. The big issue comes when you are developing and have to debug. Although I try to keep in-cluster debugging to a minimum (for example by extracting the logic into domain classes and writing unit tests against them), sometimes you have no choice.
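
For context, a job in this setup looks roughly like the classic word-count example from the Luigi documentation; the HDFS paths below are just placeholders:

    import luigi
    import luigi.contrib.hadoop
    import luigi.contrib.hdfs


    class InputText(luigi.ExternalTask):
        """Text files that already exist on HDFS (the path is a placeholder)."""

        def output(self):
            return luigi.contrib.hdfs.HdfsTarget("/tmp/doopla-demo/input")


    class WordCount(luigi.contrib.hadoop.JobTask):
        """Minimal Hadoop streaming job: count the words in the input files."""

        def requires(self):
            return InputText()

        def output(self):
            return luigi.contrib.hdfs.HdfsTarget("/tmp/doopla-demo/output")

        def mapper(self, line):
            # Runs on the cluster; Luigi hands each input line to us as a string.
            for word in line.strip().split():
                yield word, 1

        def reducer(self, key, values):
            # Values arrive grouped by key, just as in plain Hadoop streaming.
            yield key, sum(values)

Because mapper and reducer are plain Python methods, a good chunk of the logic can be exercised in unit tests on your laptop; the pain only really starts when something fails on the real data in the cluster.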

And then the pain comes. Once your mapper or your reducer fails, most of the time Luigi cannot show you the reason for the failure, and you have to go to the web interface and click through it manually until you sort of find the error message, hopefully with enough debugging information.

So after debugging my MR jobs this way for a while, I got really annoyed, decided to automate that part, and created Doopla, a small script that fetches the output (generally stderr) of a failed mapper and/or reducer and, using Pygments, highlights the failing Python code. If no job ID is specified, it fetches the output of the last failed job. It started as a two-hour hack, so it is not code I am proud of, but I made it public anyway and even pushed it to PyPI (a chance to learn something new as well), so it can be installed easily by just running pip install doopla.
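
Under the hood there is no magic: once you have the stderr of the failed attempt, Pygments does the pretty part. Conceptually it boils down to something like this (the traceback below is a made-up illustration, not doopla's actual code):

    from pygments import highlight
    from pygments.formatters import TerminalFormatter
    from pygments.lexers import PythonTracebackLexer

    # In doopla this text comes from the failed attempt's logs;
    # here it is just a hard-coded example traceback.
    stderr_text = """Traceback (most recent call last):
      File "mapper.py", line 12, in <module>
        run()
    ValueError: invalid literal for int() with base 10: 'abc'
    """

    # Render the traceback with terminal colours.
    print(highlight(stderr_text, PythonTracebackLexer(), TerminalFormatter()))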

It initially only supported our old Hadoop version, but the latest release also works with HDP 2.7 (and I guess it might work for other Hadoop versions). Newer versions of Hadoop offer a REST API for querying job status and information, but I kept scraping the web interface instead (hey, it is a hack).
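
For the curious, that REST route looks roughly like this (the ResourceManager host and port are placeholders; this is just a sketch of the API, not what doopla does):

    import requests

    # Placeholder ResourceManager address; adjust for your cluster.
    RM = "http://resourcemanager.example.com:8088"

    # Ask YARN for failed applications via the ResourceManager REST API.
    resp = requests.get(RM + "/ws/v1/cluster/apps", params={"states": "FAILED"})
    apps = resp.json().get("apps") or {}
    for app in apps.get("app", []):
        print(app["id"], app["name"])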

You can also integrate it into Emacs (highlighting and all) with a bit of Emacs Lisp.
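
A minimal sketch could look like this (it assumes the doopla executable is on your PATH and uses ansi-color to render Pygments' terminal colours inside the buffer):

    (require 'ansi-color)

    (defun doopla ()
      "Show the output of the last failed Hadoop job, as fetched by doopla."
      (interactive)
      (let ((buffer (get-buffer-create "*doopla*")))
        (with-current-buffer buffer
          (erase-buffer)
          ;; doopla prints the highlighted stderr of the failed
          ;; mapper/reducer to stdout; capture it in this buffer.
          (call-process "doopla" nil buffer nil)
          ;; Turn the ANSI colour codes produced by Pygments into Emacs faces.
          (ansi-color-apply-on-region (point-min) (point-max))
          (goto-char (point-min)))
        (display-buffer buffer)))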

And then hit M-x doopla to get the same output without leaving your lovely editor.
