This Site Runs on Nikola

I have changed the site from Jekyll to Nikola, mainly because I am mostly a Python coder nowadays and because the Jekyll setup I was using was kind of hacky and full of patches I had created. So let's see if I finally get to write more often. I have been thinking about becoming an Iron Blogger, but it might be too much. If I start blogging more using this setup I will definitely consider it.

I really dig that Nikola offers good support for the org-mode format (which I use to store my notes and other personal information). Actually, this post was written in org-mode from Emacs. It also has good support for blogging with Jupyter Notebooks, which I also use a lot at work. It also comes with capabilities for importing from systems like Wordpress or Jekyll. I do have to change the Disqus URLs to match the new format.

The rest of the post is going to be me trying out some features of Nikola and of org-mode in Emacs.

for x in range(1, 10):
    print(x)
print(list(range(0, 9)))

1
2
3
4
5
6
7
8
9
[0, 1, 2, 3, 4, 5, 6, 7, 8]

With org-babel I can run the code inside the post and get the output. I find that pretty neat when blogging and showing code snippets.
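For reference, in org-mode the snippet above is just a source block. A minimal version looks like this (the header arguments here are one sensible choice, not the only one):

#+BEGIN_SRC python :results output :exports both
for x in range(1, 10):
    print(x)
#+END_SRC

Hitting C-c C-c inside the block runs it and inserts the results right below it.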

A random graph, trying out how to use Nikola's short codes.


Browser usage evolution (in %)
[Chart: browser usage evolution 2002-2007 (in %) for Firefox, Chrome, IE and Others.]

Short codes need to be surrounded by:

#+BEGIN_EXAMPLE

#+END_EXAMPLE

It took me a while to get used to Nikola's style and to configure it properly. I should have written the steps down but I forgot. However, there are plenty of resources.

Ha-doopla! Tool for displaying Python Hadoop Streaming Errors

Introducing Doopla

At TrustYou we use Luigi a lot for building our data pipelines, mostly made of batch Hadoop MapReduce jobs. We have a couple of clusters: one using a pretty old version of Hadoop and a more recent one, where we use HDP 2.7.

Writing Hadoop MR jobs in Python is quite nice, and it is even more straightforward using Luigi's support. The big issue is when you are developing and you have to debug. Although I try to reduce the amount of in-cluster debugging (for example by using domain classes and writing unit tests against them), sometimes you have no choice.
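To make that concrete, here is the kind of separation I mean, sketched with made-up names: the parsing / business logic lives in a plain function that you can unit test without any Hadoop around, and the mapper just calls it.

def parse_review(line):
    """Pure domain logic used by the mapper: trivial to unit test."""
    hotel_id, score = line.rstrip("\n").split("\t")
    return hotel_id, float(score)

def test_parse_review():
    assert parse_review("h123\t4.5") == ("h123", 4.5)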

And then the pain comes. Once your mapper or your reducer fails, most of the time Luigi cannot show you the reason for the failure, and you have to go to the web interface and manually click through many pages until you sort of find the error message, hopefully with enough debugging information.

So after debugging my MR jobs this way for a while I got really annoyed and decided to automate that part, so I created Doopla, a small script that fetches the output (generally stderr) of a failed mapper and/or reducer and highlights the failing Python code using Pygments. If no job id is specified, it will fetch the output of the last failed job. It was a two-hour hack at the beginning, so it is not code I am proud of, but I made it public and even sent it to PyPI (a chance to learn something new as well), so it can be installed easily by just writing pip install doopla.

It initially only supported our old Hadoop version, but the latest one works with HDP 2.7 (and I guess it might work for other Hadoop versions). Newer versions of Hadoop offer a REST API for querying job status and information, but I kept scraping the information (hey, it is a hack).
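So the whole setup boils down to this (as mentioned above, a bare invocation targets the last failed job):

$ pip install doopla
$ doopla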

You can also integrate that in Emacs (supporting the highlighting and everything) with code like:
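Something along these lines works (a minimal sketch that just shells out to the doopla command and shows its output in a buffer):

(defun doopla ()
  "Fetch the output of the last failed Hadoop job via doopla."
  (interactive)
  ;; assumes the doopla script is on the PATH Emacs sees
  (shell-command "doopla"))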

And then hit M-x doopla to obtain the same without leaving your lovely editor.

Attending Europython 2016

I attended Europython once again this year and it was pretty amazing. The weather did not help a lot, but both the content and the people were really good. This time I gave a short talk about some of the things I learned in these couple of years using Python for data processing. The idea was to show a wide range of topics, so it was not a hardcore technical talk but rather a basic one.

There is also a video available for the curious.

Overall a good experience, with some really good talks; in particular the lightning talk session was really entertaining. The keynotes were really well selected, and the food was OK. I missed some of the nice sponsors from last year, and Microsoft did not even show up.

Next year it is going to be in Milan, and I am really looking forward to it.

Running Luigi Hadoop JobTask in a Virtual Environment

Virtual Environments and Hadoop Streaming

If you are using Python to write Hadoop streaming jobs, you might have experienced the issues of keeping the nodes provisioned with the required packages. Furthermore, if you happen to have different sets of jobs, workflows or pipelines that require different versions of packages, you might find yourself in a not so comfortable situation.

A former work colleague wrote about how to alleviate this by using Python's virtual environments. So I am going to assume you quickly browsed through the article and are wondering how to do something similar with Luigi.

Before talking about Luigi, a summary of running streaming jobs with virtualenvs (without Luigi):

Normally, if you don't need a virtualenv, you write a Python script for the mapper and one for the reducer, and assuming you already have the data you need to process on HDFS, you call them with something like this:

[mc@host]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="My Cool MR Job" \
    -files reducer.py,mapper.py \
    -mapper mapper.py \
    -reducer reducer.py

So here mapper.py is the mapper and reducer.py is the reducer. Nothing new if you have used Python for Hadoop streaming. Now, let's assume we want to use a particular module that is not installed at the system level on the nodes of the cluster. Say, spaCy:

[mc@host]$ virtualenv-2.7 demoenv
New python executable in demoenv/bin/python2.7 ... done.
[mc@host]$ source demoenv/bin/activate
(demoenv)[mc@host]$ pip install spacy
Collecting spacy
...
Successfully built plac semver
Installing collected packages: cymem, preshed, murmurhash, thinc, semver, sputnik, cloudpickle, plac, spacy
Successfully installed cloudpickle-0.2.1 cymem-1.31.1 murmurhash-0.26.3 plac-0.9.1 preshed-0.46.3 semver-2.4.1 spacy-0.100.6 sputnik-0.9.3 thinc-5.0.7
(demoenv)[mc@host]$ deactivate
[mc@host]$ virtualenv-2.7 --relocatable demoenv
[mc@host]$ cd demoenv; zip --quiet --recurse-paths ../demoenv.zip *
[mc@host]$ hadoop fs -put -f demoenv.zip

I make the virtualenv relocatable so that it can be moved and both the binaries and libraries are referenced using relative paths. Bear in mind that the documentation also says this feature is experimental and has some caveats, so use it at your own risk. I then compress it and upload it to HDFS.

Now, to run it we need to do two things: change the shebang of the scripts to point to the venv, and point to the archive with the -archives parameter when running the Hadoop streaming job. Assuming we are creating a link to the archive with the name demoenv, we change the beginning of mapper.py and reducer.py:

#!./demoenv/bin/python

import spacy
....

And then we execute:

[mc@host]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="My Cool MR Job" \
    -files reducer.py,mapper.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -archives hdfs://[host]:8020/user/mc/demoenv.zip#demoenv

Note the -archives parameter with the symlink. The symlink has to match the path specified in the shebang.

Running Luigi HadoopJobTask in a Python Venv

So far I have shown nothing new, just a compressed version of Hennig's post. Until recently it was impossible to do something similar with Luigi unless you created a JobRunner by basically rewriting (i.e. copying and pasting) some of Luigi's code. So I decided to make a small contribution to Luigi that would allow me to implement something similar to what I described in the previous section.

With that change in Luigi's code, it is easy to create a new base class that pulls the virtual environment location from Luigi's configuration and sets up a runner that passes the parameter to add the archive to the underlying Hadoop streaming call.

I created the VenvJobTask, which reads the virtual environment location from the configuration. The environment can be local or located on HDFS. It overrides the job_runner method to set up the Python executable path properly (so no shebang modification is needed in this case). It references a small custom runner class that changes the default behavior of DefaultHadoopJobRunner to pass the appropriate parameter.
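A condensed sketch of the idea looks roughly like this (the configuration section, option names and runner internals here are illustrative, not the actual implementation):

import luigi
import luigi.configuration
import luigi.contrib.hadoop as hadoop

class VenvHadoopJobRunner(hadoop.DefaultHadoopJobRunner):
    # Like the default runner, but ships the venv archive with the
    # streaming call (relying on the archives support added by the
    # contribution mentioned above).
    def __init__(self, venv_archive):
        super(VenvHadoopJobRunner, self).__init__()
        self.archives = ['%s#demoenv' % venv_archive]

class VenvJobTask(hadoop.JobTask):
    # Pulls the venv location (local or on HDFS) from luigi.cfg.
    def job_runner(self):
        venv = luigi.configuration.get_config().get('venv', 'location')
        return VenvHadoopJobRunner(venv)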

At the time of writing, this change has not yet been released, so it will probably be part of Luigi 2.1.2. In the meantime, you can install Luigi directly from GitHub. I tested the above code with Python 2.7.6 and Hadoop 2.7.1 on Hortonworks HDP 2.3.

Starting Over

The best way to avoid starting over is never to stop doing something. Easy to say, hard to accomplish. The good thing is that this is mostly a personal exercise, as I don't have many readers and I generally put here only technical stuff. Anyway, a lot has changed since the last time I actually wrote something (besides a boring list of links, which I will probably keep doing, though).

  • I moved to Berlin (it's been a year, amazing experience).
  • I still work for TrustYou, but now remotely.
  • I started a Medium blog, mostly in Spanish.
  • I presented at PyData Berlin 2015. Met awesome people. Good experience.
  • I attended Europython 2016, my first Python event. Another good experience. Bilbao and the surrounding area are very beautiful.
  • I started playing my accordion more often. Now I take classes in the Colombian style via Skype with Luis Javier, an awesome teacher.
  • I traveled to Colombia for a couple of months. I worked 6 weeks from there. (Does that make me a "Digital Nomad"?).
  • I gave a talk in Colombia (in Spanish) about my work. Another highlight from the trip.
  • I got into ethnomusicology, in particular Gaita music and the traditional tropical rhythms of Colombia. I still need to read some books by Peter Wade.
  • I bought some Gaita Colombiana flutes (Kuisi in Kogi or Chuanas in the Zenu language) from a good instrument maker near Sahagún (Colombia). I plan to learn how to play them, at least a couple of songs.

Berlin is way different from Munich. Munich is clean, beautiful and organized. It is close to the Alps and perfect for doing things outdoors, when the weather allows. The tech scene in Munich is OK, but not that crazy. Although poorer, Berlin has a more dynamic startup environment and way more events and things to do inside the city. It has lots of interest groups, for both technology and hobbies. It is way more multicultural, so you can find events of many types. There are huge scenes around different types of music and art. In particular I am very interested in the city's art and tech, music technology and maker scenes and events. There are some things I miss about Munich, but I am pretty happy with the change. I think I will stay in this city for some years.

Regarding this blog, I am going to try to keep posting my link logs, but I will also try to write more about what I do day to day, mostly technical stuff. I will keep writing on the Medium blog when posts are in Spanish or not so technical.

Linklog - Week 20

This blog post was in the making for a while. I think I got distracted by many things (including a move to Berlin, more on that later). I also started writing some posts on Medium; I will continue using it mostly for non-technical stuff.

For now, let's just publish this:

Burda Hackday - The Future of Finance

I participated in the Burda Hackathon about the future of finance. Here's my report on the experience.

Colombian Salaries Survey

A bit of self-promotion here. Famous Colombian entrepreneur Alexander Torrenegra recently released the results of a survey about the salaries of Colombian software developers. I wrote an article on Medium that uses the data in a slightly different way.

Mining Massive Data Sets

This year I started some very interesting MOOCs, although I couldn't complete all of them. However, I did finish Mining Massive Data Sets on Coursera and got the certificate. I have to say that this is one of the best online courses I have taken so far. The diversity and usefulness of the topics are high, and the course made me look at topics I had seen before in a different way. The only thing I did not like about the course were Prof. Ullman's videos. I mean, he is great, but his videos were the "heaviest" and way less entertaining. However, the overall course structure and videos are great. I would recommend this course to anyone wanting to dive deep into massive data processing. A background in machine learning helps, though.

What the F**K

I found this nice tool called thefuck that tries to match a wrong command to a correct one using your command history. Basically, if after writing a wrong command line you write fuck, it will try to automatically fix it.
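In practice a session looks something like this (paraphrasing the example from the project's README):

$ apt-get install vim
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
$ fuck
sudo apt-get install vim [enter/↑/↓/ctrl+c]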

Pig DataFu

Lately I have been working with Pig, a dataflow programming language that is really useful for juggling data. I have been writing UDFs and playing with some of its advanced functionality. At TrustYou we use a library that I did not know before, called DataFu, which contains a lot of useful techniques.

Random Links

  • I finally watched the GapMinder presentation. Really interesting how the world has changed and how we still hold somewhat old world views.
  • This GitHub repository contains interesting notebooks, some of them extracted from nice books.
  • O'Reilly launched https://beta.oreilly.com/ideas

Linklog - Week 11

Machine Learning / Data Science Links

A couple of links to interesting NLP-related classes from the Stanford NLP group (these are not MOOCs, but some materials - maybe videos - are going to be posted online)

Other random links:

Data Journalism

A really nice tutorial on how to do data journalism (in German though): http://howto.ddjdach.de/. On the right side there are various bars with links to nice tools that help with the different stages of the data journalism (and to some extent data science) process.

Download Music from Soundcloud

We could discuss the ethics of downloading a song from Soundcloud the whole day. However, sometimes that is what you want to do. I found a little free application for Mac written by a Swiss guy.

http://black-burn.ch/scd/

You can thank me later.

Nu-Cumbia and Alternative-Folk

Talking of Soundcloud, I am creating a couple of music playlists. One is Cumbia remixes / covers of famous pop songs (yeah, that thing actually exists). The other is what I call Colombian Alternative-Folk: Colombian groups that play Colombian folk music (there are many genres and subgenres) but with a twist.

Delete Trailing Whitespace on Emacs (for Python)

I had issues keeping my Python source code free of trailing whitespace. After a bit of research I came up with my own version (just for Python files, as it was affecting other modes like Deft/Org).

(defun py-delete-whitespace ()
  (when (eq (buffer-local-value 'major-mode (current-buffer)) 'python-mode)
    (delete-trailing-whitespace)))

(add-hook 'before-save-hook 'py-delete-whitespace)

Learning a bit of elisp every day.

Spacemacs / Prelude Emacs Distributions

Apparently the latest trend is to use neither Emacs nor Vim but a mixture of the two. I found an interesting Emacs distribution, Spacemacs, that aims to provide Vim-like functionality while keeping the power of Emacs. I also found another interesting distribution called Prelude. I am always looking for ways to improve my workflow, and Prelude seems to be a batteries-included distribution of Emacs. I am going to try them both out and decide whether to stay with my current Emacs configuration or switch to one of these distributions.

Mapfrappe

I found a nice tool called MAPfrappe that lets you "compare" maps and create cartographic "mixtures".

Munich on Berlin

Berlin on Munich

Linklog - Week 9

Open Data and Civic Apps

  • I found an interesting company called DataMade, from Chicago: a civic technology company working on projects related to open data.
  • They are also the creators of Dedupe, a Python project to perform data deduplication. It uses an approach similar to the one we use for hotel deduplication @TrustYou.

Colombian Software Developers

Recently two interesting articles about software developers in Colombia were written:

The first one is from Alexander Torrenegra and the other was written by Juan Pablo Buritica, who rebuts some of Alexander's claims. I think both make interesting points, and they have the experience and the authority to talk about the subject. I share Alexander's opinion a bit more, as I think computer science education in Colombia is not good and needs an overhaul. I would love to write a small piece on my experience, as I graduated from a top Colombian university and I believe the education on computer science related topics is precarious.

Social Network Extravaganza

Thanks to the Mining Massive Data Sets class on Coursera, I started getting curious about social network analysis. I wanted to explore a bit more, so I downloaded my friends dataset from Facebook and imported it into Gephi, a tool for graph visualization and analysis.

Gephi interface.

Below a couple of interesting links for people that want to do the same:

Install Gephi on Mac OSX Yosemite

There are some issues with installing Gephi on the latest version of OSX. This helped me solve them: http://sumnous.github.io/blog/2014/07/24/gephi-on-mac/

Extract your Friend Network in Facebook

A nice tool that helped me extract my Facebook friend graph in a Gephi-compatible file format: http://snacourse.com/getnet

Storify

https://storify.com/ I found this tool to be really nice for keeping a log of events and adding social-network-based stories.

SimString

http://www.chokkan.org/software/simstring/ I have been taking the MMDS class on Coursera. Challenging but super interesting. I wanted to use LSH to do a sort of hashing of near-duplicate short sentences, but it is a small dataset, so I believe this algorithm will actually work better.
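For context, the similarity measure involved is easy to sketch in plain Python (character trigrams plus Jaccard similarity; what SimString adds is doing this kind of lookup fast with approximate indexing):

def trigrams(s):
    # character trigrams of a lowercased, padded string
    s = "  %s  " % s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / float(len(ta | tb))

# near-duplicate sentences score close to 1.0
print(jaccard("great hotel near the beach",
              "great hotel near the beach!"))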

Delete Trailing Whitespace on Save and Compact Empty Lines on Emacs

http://ergoemacs.org/emacs/elisp_compact_empty_lines.html Awesome snippet, especially for Python development on Emacs.

Linklog - Feb 2015

Intro

As it generally takes a lot of time to write a proper blog post, I have decided to give in to my laziness and start posting more often by simply sharing links to things I found during the week / month. It will also help me keep track of links that I find interesting.

Syncing files automatically using rsync and inotifywait / fswatch

Although Emacs has the power to connect directly to servers through SSH, I like to have my code locally first and then upload it to a server, for example to run a Hadoop job or some machine learning algorithm. For that, rsync is really nice, and I initially wrote a small function that I assigned to a keyboard shortcut:

(defun sync-sentency ()
  (interactive)
  (shrink-window-if-larger-than-buffer)
  (shell-command "cd ~/development ; rsync --exclude='lib/stanford-ner' --exclude=files  -az --progress sentency  vmx:sentency & " ))
(global-set-key (kbd "C-c C-y ") 'sync-sentency)

But then I grew tired of hitting that shortcut all the time, so I did a bit of research and found a couple of commands: fswatch and inotifywait. The first is available for many *NIX systems. Below are a couple of interesting links and scripts:

I ended up modifying an existing project to adapt it to my needs; you can find the code here: https://github.com/mfcabrera/fswatch-rsync. It basically watches for a change to a file and uploads it accordingly using rsync. So no more hitting shortcuts.
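The essence of the approach fits in a few lines of shell. A minimal sketch (paths and host are placeholders from my setup):

#!/bin/sh
# fswatch -o prints one line per batch of changes; sync on each batch.
fswatch -o ~/development/sentency | while read num; do
    rsync -az --progress --exclude='lib/stanford-ner' --exclude=files \
          ~/development/sentency vmx:sentency
done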

Elpy, PEP8 and friends

I thought my coding style in Python was pretty good. Then I started using pep8 and friends to check my code and realized I sucked big time. The good thing is that thanks to modes like Elpy (a nice Python mode for Emacs) I can check my code while I write it and learn the proper style. Hopefully that will help me become a better Pythonista (or Pythoneer?). A nice tutorial can be found here.
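For the curious, running the checker by hand is as simple as this (the file name and the specific warning are just an example):

$ pip install pep8
$ pep8 mymodule.py
mymodule.py:3:1: E302 expected 2 blank lines, got 1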

Scratch Programming Language

I took a look at the Scratch programming language and went through a couple of tutorials. Really intuitive. I am thinking about teaching it to my niece. It has a large community, documentation in many languages and a recently released MOOC.

Github aliases

Although for Git I use the super cool Magit (an Emacs mode), sometimes I like to use the command line directly. At TrustYou we use Git, although not as professionally as we would like. I found a nice post featuring a lot of useful Git aliases from one of the developers of GitHub: http://haacked.com/archive/2014/07/28/github-flow-aliases/

Working with Open Data Notebooks

While looking for information on how to integrate JS and the IPython Notebook, I stumbled upon the files for the class Working with Open Data. There are no videos of the class, but the slides, the readings and the IPython notebooks are really nice.

Share Your Stack

An interesting project where you share your technology stack: http://stackshare.io/stacks

MOOCS

Some MOOCs I am trying to follow (in that order):

There are some MOOCs running that although I find them really interesting, I have no time to follow:

I actually took the last one, but did not complete all the homework; a really nice overview of music production.

Upcoming MOOCS that I find attractive and I am going to try to follow:

Moving to Berlin with the help of IPython and friends

Berlin Skyline

TL;DR

I helped my girlfriend look for a flat using (I)Python and friends. I plot on a map the apartments that match her criteria, along with the time it takes to reach her workplace using public transportation. I showcase some Python libraries like Pandas and Scrapy, along with some features of the IPython notebook for working with the Google Maps API. Code and notebooks are available on GitHub.

Intro

Moving to a new city is not an easy task. Among all the things, one of the most time consuming is finding a place to live. It is not easy: there are many variables to take into account, and if you don't use an agency, looking for one can be boring and repetitive.

Assuming you use the web to find some candidate apartments, once you find a good one, you generally have to check the address, check the surroundings (e.g. stores, cafes) and which public transportation is available. You generally also have to check how long it will take you to get to your workplace or the city center, either by car or by public transportation. This is important, as Stutzer and Frey found "that a person with a one-hour commute has to earn 515 Euro more (or 40% of an average monthly wage in Germany) to compensate for the dissatisfaction caused by their long commute" [source].

If you use Google or specialized websites like ImmobilienScout24 (in Germany), you probably have to go through the process of searching, checking whether an apartment matches your criteria (i.e. number of rooms, size, rent price, etc.). In addition to that, you have to check how far away it is or how much time you will need to get to work.

There is actually a nice tool written by a Berliner that can help you with the last part, called Mapnificent. Mapnificent can show you graphically the areas you can reach with public transport in a given time, and it is available for many cities. However, in order to use the tool you have to add the latitude and longitude coordinates manually for each of the candidate apartments.

That is the problem that my girlfriend is facing. She is moving to Berlin next month and she wants an apartment that matches her criteria, from which she can reach her workplace in the shortest possible amount of time. So I decided to help her (us?) a bit with some assistance from Python/IPython and some web services.

Ever since I read Karim's blog post and attended his presentation at the Munich Datageeks Meetup, I have been interested in how to harness open data to automate or improve otherwise boring and time-consuming tasks.

I also googled a little bit before coding and stumbled upon a nice article by Robin Clarke, a guy living in Munich, about how he looked for an area of the city from which he could reach the center of Munich within a specific time. He even built a super duper visualization that you can see below:

In [49]:
from IPython.display import HTML
HTML('<iframe src="https://www.google.com/fusiontables/embedviz?viz=MAP&q=select+col1+from+2304677+&h=false \
     &lat=48.19187395469069&lng=11.499547000000007&z=10&t=1&l=col1" width=800 height=400></iframe>')
Out[49]:

The lighter the area, the less time you need from that location to reach Munich's city center. In theory you could calculate something similar from any point to another arbitrary point in a city (and that's what Mapnificent does), but I did not want to do anything that complex, as I am more into Street Fighting Data Science.

However, this gave me an idea: why don't I plot on a Google Map only the apartments that have the characteristics I want (well, she wants), along with the time it takes to get to my girlfriend's workplace? I don't know the Google Maps API or Javascript, but it can't be that hard.

Getting the Data

Here is where web scraping comes in handy. Although I had never done it, I knew there was a popular framework for Python called Scrapy, so this was a nice opportunity to learn a bit about it. I wrote a small Python project that scrapes ImmoScout24 listings and stores the results in a JSON file. Before doing that, it uses Google Maps services to geocode the address and calculate the distance to my girlfriend's workplace using public transportation. To do that, I use what Scrapy calls an ItemPipeline and the Google Maps services client. I limited the search to Kreuzberg, Schöneberg and Charlottenburg, as they are still close to the city center but also in the direction of her workplace.

The class that actually does the magic looks like this (you can find the full code along with this notebook on GitHub):

In [16]:
import googlemaps  # client library for the Google Maps web services

class AddDistanceToMPIPipeline(object):

    latlong_mpi = str((52.444311, 13.273748))

    def __init__(self, ):
        self.gm_client = googlemaps.Client("_PUT_API_KEY_HERE")

    def process_item(self, item, spider):
        orig = item["addr"]
        geoloc = self.gm_client.geocode(orig)

        if len(geoloc) > 0:
            for k in ('lat', 'lng'):
                item[k] = geoloc[0]['geometry']['location'][k]

        directions_result = self.gm_client.directions(str((item['lat'], item['lng'])),
                                                      self.latlong_mpi,
                                                      mode="transit",
                                                      departure_time=1421307820)

        #  Pick the fastest way
        chosen_leg = None
        if len(directions_result) > 0:
            for dr in directions_result:
                for l in dr["legs"]:
                    if chosen_leg is None:
                        chosen_leg = l
                    if chosen_leg is not None and \
                       chosen_leg["duration"]["value"] > l["duration"]["value"]:
                        chosen_leg = l

        if chosen_leg is None:
            return
        item["time_to"] = chosen_leg["duration"]["value"]/60.0
        return item

Taking a look at the data

So after scraping the website for a while, we have a file with all the available apartments. We can use Pandas to load the data and take a look at it:

In [33]:
import pandas

with open('../ichbineinberliner/items.json') as f:
    data =  pandas.io.json.read_json(f)

Pandas can also do some SQL-like filtering of the data. So let's assume my girlfriend wants a 3-room apartment (in Germany the living room is counted as one Zimmer (room)). She also wants to be able to get to her job in less than 40 minutes, and the monthly rent should be less than 800 euros.

In [52]:
apartments =  data[data.zimmer == 3][data.miete <= 800][data.time_to <= 40].sort('time_to')
apartments[['addr', 'link', 'sqm', 'time_to', 'zimmer']].head(10)
Out[52]:
addr link sqm time_to zimmer
541 Schöneberg (Schöneberg), 12157 Berlin http://www.immobilienscout24.de/expose/78832352 98.74 26.716667 3
524 Dominicusstraße 40, Schöneberg (Schöneberg), 1... http://www.immobilienscout24.de/expose/78752552 86.37 28.916667 3
581 Sybelstraße 17, Charlottenburg (Charlottenburg... http://www.immobilienscout24.de/expose/78081827 72.25 29.533333 3
582 Ebersstrasse 15, Schöneberg (Schöneberg), 1082... http://www.immobilienscout24.de/expose/78826435 76.00 29.900000 3
301 Charlottenburg (Charlottenburg), 10629 Berlin http://www.immobilienscout24.de/expose/76914555 79.00 31.233333 3
511 Charlottenburg (Charlottenburg), 10625 Berlin http://www.immobilienscout24.de/expose/77718892 62.00 33.750000 3
489 Dernburgstr. 43, Charlottenburg (Charlottenbur... http://www.immobilienscout24.de/expose/77665277 70.00 34.900000 3
636 Sachsendamm 78, Schöneberg (Schöneberg), 10829... http://www.immobilienscout24.de/expose/78941951 73.34 35.516667 3
20 Otto-Suhr-Allee , Charlottenburg (Charlottenbu... http://www.immobilienscout24.de/expose/56455442 69.00 35.750000 3
15 Olbersstr. 2, Charlottenburg (Charlottenburg),... http://www.immobilienscout24.de/expose/59454790 60.00 36.233333 3

Visualizing the Data

So now we have some apartments that match our criteria. At this point I could send myself the table above via email and make the crawl run each day, so I get notified of newly available apartments. But I really wanted to visualize it in a better way.

I discovered that the IPython notebook can embed and execute Javascript and HTML, so embedding a Google Map in a cell is possible. The notebooks from the class Working with Open Data at UC Berkeley helped me get started. Doing this is not that simple (better support would be nice), but it is not hard either.

The first thing is to initialize the Google Maps API:

In [35]:
from IPython.core.display import HTML, Javascript
def gmap_init():
    js = """
window.gmap_initialize = function() {};
$.getScript('https://maps.googleapis.com/maps/api/js?v=3&sensor=false&callback=gmap_initialize');
"""
    return Javascript(data=js)
gmap_init()
Out[35]:

Then we declare the properties of the div where we are going to display the map:

In [36]:
%%html
<style type="text/css">
  .map-canvas { height: 400px; }
</style>

Rendering the Map

Now comes the part where we generate the map. What we are going to do is generate the Javascript code that renders the map. Then we can either display it in a cell using the IPython notebook HTML object, or store it in an HTML file and upload it somewhere.

I created the small function below that generates the map (check the code comments for more info):

In [37]:
from IPython.core.display import HTML, Javascript

def map_pos_apartments(apartments, display=True, lat=52.4798023, lng=13.3563576, zoom=12):

    div_id = "miete" # name of the div where are we are going to display the map.
    html = """<div id="%s" class="map-canvas"/>""" % (div_id)


    # This is a template for the infobox that we are going to present to the user when he clicks a 
    # Marker
    content_template = """'<ul style="list-style: none;padding:0; margin:0;">' + 
    '<li> <a href="{link}" target="_blank"> {addr} </a></li>' +
    '<li><b>Time to MPI</b>: {time_to:.2f} min</b> </li><b>Size:</b> {sqm} m<sup>2</sup><li></li>' +
    '<li><b>Rent:</b> &#8364; {miete}</li></ul> '
    
    """
    # This is the template for a Marker on the map.  It also contains the code for generating the "Infowindow"
    # That appears when clicked. 
    marker_template = """
        var myLatlng = new google.maps.LatLng({lat},{lng});
        var marker_{i} = new google.maps.Marker({{
        position: myLatlng,
        map: map,
        title:"{title}"
        }});
    
         var contentString = {content};

          var infowindow_{i} = new google.maps.InfoWindow({{
          content: contentString
          }});
    
          google.maps.event.addListener(marker_{i}, 'click', function() {{
            infowindow_{i}.open(map,marker_{i});
            if (lastWindow) {{
                lastWindow.close();
            }}
            lastWindow = infowindow_{i}
      }});
    
    """
    ## JS initialization code.
    js_init = """
    <script type="text/Javascript">
      (function(){
        var mapOptions = {
            zoom: %s,
            center: new google.maps.LatLng(%s, %s)
          };

        var map = new google.maps.Map(document.getElementById('%s'),
              mapOptions);
              
        var lastWindow = false;
        
        var transitLayer = new google.maps.TransitLayer();
        transitLayer.setMap(map);
              
              """ % (zoom, lat, lng, div_id)

    # closing script
    js_end = """
      })();  
    </script>
    
    """

    # Now the actual part that generates the markers based on
    # the data crawled.

    js_markers = ""
    for i,r in enumerate(apartments.iterrows()):
        d = r[1]
        addr = d.addr.encode('utf-8')
        content = content_template.format(link=d.link, addr=addr,
                                           time_to=d.time_to, miete=d.miete,
                                           sqm=d.sqm)
        js_markers +=  marker_template.format(i=i, lat=d.lat, lng=d.lng,
                                              title=addr, content=content)

    html = html+js_init+js_markers+js_end
    if display:
        return HTML(html)
    else:
        return html

Now we can call this function and see the map:

In []:
map_pos_apartments(apartments)

The only issue is that this code is executed on the fly, so in order to visualize it here, the code would have to be stored first or loaded automatically somehow. As a replacement, I am attaching an IFrame showing the results of the code above.

In [47]:
HTML('<iframe src="http://mfcabrera.com/files/ichbineinberliner/" width=800 height=400></iframe>')
Out[47]:

Now we have a nice responsive and interactive map with the apartments matching our criteria. If we click a marker we get more information about each available apartment.

As the HTML constructor only takes HTML/JS source code as text, we can also store it in a file, so we can embed it somewhere else.

In [54]:
html_src = map_pos_apartments(apartments, display=False)

init_script = """ <script type="text/javascript"
      src="https://maps.googleapis.com/maps/api/js?key=AIzaSyD1tR9ag8ImBLr4BJdr-ZMTP0bFOXPJFUk">
    </script>"""

with open("index.html", "w") as f:
    f.write("<html><head> " )

    f.write(init_script)

    f.write('<style type="text/css"> \
            .map-canvas { height: 800px; } \
            </style>')


    f.write('<script type="text/javascript">')
    f.write("google.maps.event.addDomListener(window, 'load', initialize);")
    f.write("</script>")
    f.write("\n\n {} </head><body><div id='miete' class='map-canvas'/>".format(html_src ))
    f.write("</body></html>")
In [55]:
!open index.html

Conclusion

We managed to build a nice visualization of candidate apartments that is definitely helpful when moving to a new city. It definitely does not get her an apartment automatically. However, linking the apartment listings with transit information narrows down the search a lot and automates some of the most boring tasks.

This small project even made me take a look into Open Data and the Open Knowledge movement and their standing in Germany.

I am also gladly surprised by the capabilities of IPython Notebooks, and this made me realize that I finally need to learn to code properly in Javascript.