Moving to Berlin with the help of IPython and friends

Berlin Skyline

TL;DR

I help my girlfriend look for a flat using (I)Python and friends. I plot on a map the apartments that match her criteria, along with the time it takes to reach her workplace by public transportation. I showcase some Python libraries like Pandas and Scrapy, along with some features of the IPython notebook for working with the Google Maps API. Code and notebooks are available on Github.

Intro

Moving to a new city is not an easy task. Among other things, one of the most time-consuming is finding a place to live. There are many variables to take into account, and if you don't use an agency the search can be boring and repetitive.

Assuming you use the web to find possible apartments, once you find a good candidate you generally have to check the address, the surroundings (e.g. stores, cafés) and which public transportation is available. You also have to check how long it will take you to get to your workplace or the city center, either by car or by public transportation. This is important, as Stutzer and Frey found "that a person with a one-hour commute has to earn 515 Euro more (or 40% of an average monthly wage in Germany) to compensate for the dissatisfaction caused by their long commute" [source].

If you use Google or specialized websites like ImmobilienScout24 (in Germany), you have to go through the process of searching for a listing and checking whether the apartment matches your criteria (i.e. number of rooms, size, rent price, etc.). In addition, you have to check how far away it is, or how much time you will need to get to work.

There is actually a nice tool written by a Berliner, called Mapnificent, that can help you with the last part. Mapnificent graphically shows you the areas you can reach by public transport in a given time, and it is available for many cities. However, in order to use it you have to add the latitude and longitude coordinates manually for each candidate apartment.

That is the problem my girlfriend is facing. She is moving to Berlin next month, and she wants an apartment that matches her criteria and from which she can reach her workplace in the shortest possible time. So I decided to help her (us?) a bit with the assistance of Python/IPython and some web services.

Ever since I read Karim's blog post and attended his presentation at the Munich Datageeks Meetup, I have been interested in how to harness open data to automate or improve otherwise boring and time-consuming tasks.

I also googled a bit before coding and stumbled upon a nice article by Robin Clarke, a guy living in Munich, about how he looked for an area of the city from which he could reach the center of Munich within a specific time. He even built a super duper visualization that you can see below:

In [49]:
from IPython.display import HTML
HTML('<iframe src="https://www.google.com/fusiontables/embedviz?viz=MAP&q=select+col1+from+2304677+&h=false \
     &lat=48.19187395469069&lng=11.499547000000007&z=10&t=1&l=col1" width=800 height=400></iframe>')
Out[49]:

The lighter the area, the less time you need from that location to reach Munich's city center. In theory you could calculate something similar from any point to another arbitrary point in a city (and that's what Mapnificent does), but I did not want to do anything that complex, as I prefer Street Fighting Data Science.

However, this gave me an idea: why don't I plot on a Google Map only the apartments that have the characteristics I want (she wants), along with the time it takes to get to my girlfriend's workplace? I don't know the Google Maps API or Javascript, but it can't be that hard.

Getting the Data

This is where web scraping comes in handy. Although I had never done any scraping, I knew there was a popular Python framework for it called Scrapy, so this was a nice opportunity to learn a bit about it. I wrote a small Python project that scrapes ImmobilienScout24 listings and stores the results in a JSON file. Before storing each item, it uses Google Maps services to geocode the address and calculate the travel time to my girlfriend's workplace by public transportation. To do that, I use what Scrapy calls an ItemPipeline together with the Google Maps services client. I limited the search to Kreuzberg, Schöneberg and Charlottenburg, as they are still close to the city center but also in the direction of her workplace.

The class that actually does the magic looks like this (you can find the full code, along with this notebook, on Github):

In [16]:
import googlemaps


class AddDistanceToMPIPipeline(object):

    latlong_mpi = str((52.444311, 13.273748))

    def __init__(self):
        self.gm_client = googlemaps.Client("_PUT_API_KEY_HERE")

    def process_item(self, item, spider):
        orig = item["addr"]
        geoloc = self.gm_client.geocode(orig)

        if not geoloc:
            # Without coordinates we cannot ask for directions, so skip the item.
            return

        for k in ('lat', 'lng'):
            item[k] = geoloc[0]['geometry']['location'][k]

        directions_result = self.gm_client.directions(str((item['lat'], item['lng'])),
                                                      self.latlong_mpi,
                                                      mode="transit",
                                                      departure_time=1421307820)

        # Pick the fastest leg across all suggested routes
        chosen_leg = None
        for dr in directions_result:
            for leg in dr["legs"]:
                if chosen_leg is None or \
                   leg["duration"]["value"] < chosen_leg["duration"]["value"]:
                    chosen_leg = leg

        if chosen_leg is None:
            return
        item["time_to"] = chosen_leg["duration"]["value"] / 60.0
        return item
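For completeness, a pipeline like this also has to be enabled in the project's Scrapy settings. A minimal sketch, assuming the project module is called `ichbineinberliner` (the dotted path is my assumption based on the repository name; the actual module layout may differ):

```python
# settings.py -- enable the geocoding/distance pipeline.
# The dotted path is an assumption based on the project name used in this post;
# the number controls the order in which pipelines run (lower runs first).
ITEM_PIPELINES = {
    'ichbineinberliner.pipelines.AddDistanceToMPIPipeline': 300,
}
```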

Taking a look at the data

So after scraping the website for a while, we have a file with all available apartments. We can use Pandas to load the data and take a look at it:

In [33]:
import pandas

with open('../ichbineinberliner/items.json') as f:
    data = pandas.io.json.read_json(f)

Pandas can also do some SQL-like filtering of the data. So let's assume my girlfriend wants a 3-room apartment (in Germany the living room is counted as one Zimmer (room)). She also wants to be able to get to her job in less than 40 minutes, and the monthly rent should be less than 800 euros.

In [52]:
apartments = data[(data.zimmer == 3) & (data.miete <= 800) & (data.time_to <= 40)].sort('time_to')
apartments[['addr', 'link', 'sqm', 'time_to', 'zimmer']].head(10)
Out[52]:
addr link sqm time_to zimmer
541 Schöneberg (Schöneberg), 12157 Berlin http://www.immobilienscout24.de/expose/78832352 98.74 26.716667 3
524 Dominicusstraße 40, Schöneberg (Schöneberg), 1... http://www.immobilienscout24.de/expose/78752552 86.37 28.916667 3
581 Sybelstraße 17, Charlottenburg (Charlottenburg... http://www.immobilienscout24.de/expose/78081827 72.25 29.533333 3
582 Ebersstrasse 15, Schöneberg (Schöneberg), 1082... http://www.immobilienscout24.de/expose/78826435 76.00 29.900000 3
301 Charlottenburg (Charlottenburg), 10629 Berlin http://www.immobilienscout24.de/expose/76914555 79.00 31.233333 3
511 Charlottenburg (Charlottenburg), 10625 Berlin http://www.immobilienscout24.de/expose/77718892 62.00 33.750000 3
489 Dernburgstr. 43, Charlottenburg (Charlottenbur... http://www.immobilienscout24.de/expose/77665277 70.00 34.900000 3
636 Sachsendamm 78, Schöneberg (Schöneberg), 10829... http://www.immobilienscout24.de/expose/78941951 73.34 35.516667 3
20 Otto-Suhr-Allee , Charlottenburg (Charlottenbu... http://www.immobilienscout24.de/expose/56455442 69.00 35.750000 3
15 Olbersstr. 2, Charlottenburg (Charlottenburg),... http://www.immobilienscout24.de/expose/59454790 60.00 36.233333 3

Visualizing the Data

So now we have some apartments that match our criteria. At this point I could email myself the table above and make the crawler run every day, so that I get notified of newly available apartments. But I really wanted a better way to visualize it.
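The email half of that idea is easy to sketch with just the standard library; the addresses and SMTP host below are placeholders, not anything from the actual project:

```python
# Minimal sketch: email an HTML table of apartments to myself.
# Sender, recipient and SMTP host are placeholders.
import smtplib
from email.mime.text import MIMEText

def mail_table(html_table,
               sender="me@example.com",
               recipient="me@example.com",
               smtp_host="localhost"):
    msg = MIMEText(html_table, "html")
    msg["Subject"] = "New apartments matching your criteria"
    msg["From"] = sender
    msg["To"] = recipient
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)

# e.g. mail_table(apartments.to_html()), scheduled via cron after each crawl
```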

I discovered that the IPython notebook can embed and execute Javascript and HTML, so embedding a Google Map in a cell is possible. The notebooks from the Working with Open Data class at UC Berkeley helped me get started. Doing this is not that simple (better support would be desirable), but it is not hard either.

The first thing is to initialize the Google Maps API:

In [35]:
from IPython.core.display import HTML, Javascript
def gmap_init():
    js = """
window.gmap_initialize = function() {};
$.getScript('https://maps.googleapis.com/maps/api/js?v=3&sensor=false&callback=gmap_initialize');
"""
    return Javascript(data=js)
gmap_init()
Out[35]:

Then we declare the properties of the div where we are going to display the map:

In [36]:
%%html
<style type="text/css">
  .map-canvas { height: 400px; }
</style>

Rendering the Map

Now comes the part where we generate the map: we generate the Javascript code that renders it. Then we can either display it in a cell using the IPython notebook HTML object, or store it in an HTML file and upload it somewhere.

I created the small function below that generates the map (check the code comments for more info):

In [37]:
from IPython.core.display import HTML, Javascript

def map_pos_apartments(apartments, display=True, lat=52.4798023, lng=13.3563576, zoom=12):

    div_id = "miete"  # name of the div where we are going to display the map
    html = """<div id="%s" class="map-canvas"/>""" % (div_id)


    # This is a template for the infobox that we are going to show to the user
    # when he clicks a marker
    content_template = """'<ul style="list-style: none;padding:0; margin:0;">' +
    '<li><a href="{link}" target="_blank"> {addr} </a></li>' +
    '<li><b>Time to MPI:</b> {time_to:.2f} min</li>' +
    '<li><b>Size:</b> {sqm} m<sup>2</sup></li>' +
    '<li><b>Rent:</b> &#8364; {miete}</li></ul> '
    """
    # This is the template for a marker on the map. It also contains the code
    # for generating the "InfoWindow" that appears when the marker is clicked.
    marker_template = """
        var myLatlng = new google.maps.LatLng({lat},{lng});
        var marker_{i} = new google.maps.Marker({{
            position: myLatlng,
            map: map,
            title: "{title}"
        }});

        var contentString = {content};

        var infowindow_{i} = new google.maps.InfoWindow({{
            content: contentString
        }});

        google.maps.event.addListener(marker_{i}, 'click', function() {{
            infowindow_{i}.open(map, marker_{i});
            if (lastWindow) {{
                lastWindow.close();
            }}
            lastWindow = infowindow_{i};
        }});
    """
    ## JS initialization code.
    js_init = """
    <script type="text/Javascript">
      (function(){
        var mapOptions = {
            zoom: %s,
            center: new google.maps.LatLng(%s, %s)
          };

        var map = new google.maps.Map(document.getElementById('%s'),
              mapOptions);
              
        var lastWindow = false;
        
        var transitLayer = new google.maps.TransitLayer();
        transitLayer.setMap(map);
              
              """ % (zoom, lat, lng, div_id)

    # closing script
    js_end = """
      })();  
    </script>
    
    """

    # Now the actual part that generates the markers, based on the
    # data crawled.

    js_markers = ""
    for i,r in enumerate(apartments.iterrows()):
        d = r[1]
        addr = d.addr.encode('utf-8')
        content = content_template.format(link=d.link, addr=addr,
                                           time_to=d.time_to, miete=d.miete,
                                           sqm=d.sqm)
        js_markers +=  marker_template.format(i=i, lat=d.lat, lng=d.lng,
                                              title=addr, content=content)

    html = html+js_init+js_markers+js_end
    if display:
        return HTML(html)
    else:
        return html

Now we can call this function and see the map:

In []:
map_pos_apartments(apartments)

The only issue is that this code is executed on the fly, so to visualize it in the published post, the generated code would have to be stored first or loaded automatically somehow. As a replacement, I am attaching an IFrame showing the results of the code above.

In [47]:
HTML('<iframe src="http://mfcabrera.com/files/ichbineinberliner/" width=800 height=400></iframe>')
Out[47]:

Now we have a nice responsive and interactive map with the apartments matching our criteria. If we click a marker we get more information about each available apartment.

As the HTML constructor only takes HTML/JS source code as text, we can also store it in a file, so we can embed it somewhere else.

In [54]:
html_src = map_pos_apartments(apartments, display=False)

init_script = """ <script type="text/javascript"
      src="https://maps.googleapis.com/maps/api/js?key=AIzaSyD1tR9ag8ImBLr4BJdr-ZMTP0bFOXPJFUk">
    </script>"""

with open("index.html", "w") as f:
    f.write("<html><head> " )

    f.write(init_script)

    f.write('<style type="text/css"> \
            .map-canvas { height: 800px; } \
            </style>')


    f.write('<script type="text/javascript">')
    # "initialize" must exist before the load listener references it;
    # an empty function is enough, as html_src sets up the map itself.
    f.write("function initialize() {}")
    f.write("google.maps.event.addDomListener(window, 'load', initialize);")
    f.write("</script>")
    f.write("\n\n {} </head><body><div id='miete' class='map-canvas'/>".format(html_src ))
    f.write("</body></html>")
In [55]:
!open index.html

Conclusion

We managed to build a nice visualization of candidate apartments that is definitely helpful when moving to a new city. It definitely does not get her an apartment automatically. However, linking the apartment listings with transit information narrows down the search a lot and automates some of the most boring tasks.

This small project even made me take a look into Open Data and the Open Knowledge movement and their standing in Germany.

I am also gladly surprised by the capabilities of IPython notebooks, and this made me realize that I finally need to learn to code properly in Javascript.

Attending the Lisbon Machine Learning Summer School

LxMLS 2014 logo

TL;DR

This year I had the chance to attend the Lisbon Machine Learning (Summer) School, LxMLS. In this post I want to share my experiences, as well as give my opinion on how some things could be improved.

Introduction

The Lisbon Machine Learning Summer School (LxMLS) is an intensive school on machine learning held in the beautiful city of Lisbon, Portugal. What is special about this school is its specific emphasis on applications of machine learning in the field of Natural Language Processing (NLP). I believe this is because most of the organizers are somehow connected to NLP research groups and companies in Lisbon, as well as to the CMU Language Technology Institute and the IST. This was appealing to me, as I wrote my M.Sc. thesis on applications of word vector representations. However, I only used them for document classification, while their more interesting applications are in the field of NLP.

The summer school was designed in such a way that in the morning we received a tutorial/lecture on a specific topic of ML & NLP, and in the afternoon a practical session was held, directly related to the morning's lecture. After the labs, a short presentation featuring more research results was scheduled.

In this post I briefly describe the summer school and give an account of my experience. This is my personal opinion and thus might not be shared by many of the attendees, as the school experience will vary based on their backgrounds and expectations.

Day 0 - Tuesday

The first day started with a quick introduction and presentation of the summer school. Shortly afterwards, Prof. Mário Figueiredo kicked off the school with a review of basic probability concepts. I have to admit that it was helpful to refresh some things. Next came an introduction to Python by Luis Pedro Coelho from the EMBL, which was necessary for those with no previous knowledge of the language, as the programming exercises of the lab sessions required basic Python knowledge. The afternoon lab session was focused on getting Python installed, plus a couple of exercises on gradient descent. The first day we did not have an evening short talk, but instead a welcome reception where we could do some networking while eating snacks and drinking wine.

Day 1 - Wednesday

On the second day we had an introduction to machine learning and linear classifiers by Ryan McDonald from Google. I liked this presentation; however, at the beginning I had issues understanding how the feature extraction worked (it was a bit NLP-focused). At times the slides got filled with mathematical derivations that, in my opinion, did not help make the concepts clearer.

In the evening we had to implement Multinomial Naïve Bayes for document classification. We basically had to fill in the train method of an existing class (it reminded me of the ML class from Coursera). As it happens, I had never implemented it myself (it is basically well-motivated counting), so it was nice to finally do it.
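To give an idea of why the training really is just counting, here is my own toy sketch of Multinomial Naive Bayes training with Laplace smoothing (not the LxMLS toolkit code; the function and argument names are mine):

```python
# Toy Multinomial Naive Bayes training: count words per class, then
# turn the counts into smoothed log-probabilities.
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns (log-priors, per-class log-likelihoods) with Laplace smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)                 # how many documents per class
    word_counts = defaultdict(Counter)           # word counts per class
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)

    log_prior = {y: math.log(n / len(docs)) for y, n in class_docs.items()}
    log_lik = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        log_lik[y] = {w: math.log((counts[w] + alpha) /
                                  (total + alpha * len(vocab)))
                      for w in vocab}
    return log_prior, log_lik
```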

The evening talk was a tutorial on Scikit-Learn by Andreas Müller, one of the main contributors to the project. The funny thing is that I had used parts of his tutorial for my talk on SVMs at the Datageeks Meetup.

Day 2 - Thursday

On day three we had a talk on sequence models by Noah Smith, leader of CMU's Noah's Ark group. I was particularly interested in this talk because I had no experience with sequence models or their application to NLP tasks. I really liked the presentation style and the slides; however, by the end the talk became a bit esoteric for me and was hard to follow.

In the practical session we were required to implement the Viterbi algorithm. At the beginning I was pretty lost, but with the help of some of the instructors I was able to complete the task.

After the lab we went to the LxMLS Demo Day, where local companies and research groups showed their products and research. Of all the booths, I liked two local companies: Unbabel and Priberam. Unbabel is a Y Combinator-backed company that offers crowdsourced, human-corrected machine translation. I found the service pretty cool. Priberam is a company offering NLP-related services. They have a strong research group connected to the Instituto Superior Técnico.

Day 3 - Friday

On Friday morning we had a talk on learning structured predictors by Xavier Carreras of Xerox Research. This was another one I liked and had been looking forward to. For most of the talk I could somehow follow, but once again, by the end it was a little hard to get everything. The afternoon lab was about implementing the structured perceptron algorithm.

I have to say I could not follow the evening talk at all. It was on spectral learning, by Ariadna Quattoni. I think it was too specialized, and the scope was not appropriate for the summer school.

lxmls presentation

Noah Smith's Lecture - Photo by @DH_FBK

Day 4 - Saturday

On Saturday we had a talk on syntax and parsing from Slav Petrov. This one was really hard to follow; I guess that is to be expected if you don't have a good background in NLP. However, the main concepts were understandable. In the lab session we were required to play with existing code related to parsing.

The evening talk was given by Dipanjan Das from Google, on cross-lingual learning in natural language syntax. A pretty advanced topic, but I really liked how the presenter moved from basic concepts to more complex techniques. This was one of my favorite presentations.

Day 5 - Monday

After the free day (which I used to visit Sintra along with some cool people I met at the school. Hi guys :D!), we returned for the last two days. Monday was the day for Big Data topics, with CMU Prof. Chris Dyer. I liked this one too. It not only covered the basics of MapReduce, but also strategies for implementing ML and NLP algorithms using this paradigm. The afternoon lab introduced the basic concepts of MapReduce with the canonical word-counting problem.
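The word-counting problem maps nicely onto the two MapReduce phases. A toy single-process simulation (my own sketch, not the lab code) looks like this:

```python
# Toy word count in MapReduce style: a mapper emitting (word, 1) pairs,
# a shuffle that groups pairs by key, and a reducer that sums the counts.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    return (word, sum(counts))

def run_wordcount(lines):
    # Simulate the shuffle phase: sort intermediate pairs so that
    # equal keys end up adjacent, then group them
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(k, (c for _, c in group))
                for k, group in groupby(pairs, key=itemgetter(0)))
```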

The afternoon lecture was about cross-lingual semantics, by Prof. Ivan Titov of the University of Amsterdam. I couldn't follow much of this talk either.

Day 6 - Tuesday (Final Day)

I was really looking forward to the talk on deep learning by Richard Socher. I had already watched his tutorial on deep learning, so I was familiar with many of the topics. His tutorial was really helpful for understanding the basic concepts. For the afternoon lab we were required to write/execute a MapReduce version of the expectation-maximization algorithm. I did not do much here because I was writing this post :D

I could not attend the afternoon talk because there were problems on the red line of the metro, which connects the city with the airport. So I preferred to go to the airport a bit early, and I could only catch the first 15 minutes of the lecture.

lxmls presentation

The Good

There were many things that I liked about the summer school. I will list the ones I believe are the most important:

  • The People: From the organizers to the attendees, all the people I had the chance to interact with were really nice. I had interesting conversations with all of them.
  • The Location: Lisbon. What a beautiful city. It is the perfect location for such an event. The IST campus is quite central and easy to reach.
  • The talk ordering was well planned. Every lecturer built on the knowledge acquired in the previous session.
  • The Topics: The school covered everything from the very basics to current topics such as deep learning.
  • The Speakers: All of the speakers are renowned researchers and academics coming from prestigious institutions and companies. We had speakers from Amazon and Google, as well as from universities such as CMU and Stanford.
  • The Organization Team: The organizers/tutors were always ready to help. Every time I had a question or doubt, they tried their best to explain.
  • Using Python for the Labs: I think Python is becoming the standard for NLP / ML / data science. It is also an easy programming language to learn.

Things to improve

Not everything can be perfect, right? There are quite a few things that, in my opinion, could be improved for future schools:

  • Use IPython notebooks for the labs. I think the toolkit provided by the school is pretty nice, but the fact that people have to install Python makes the first day not very smooth. IPython can be served remotely very easily, so people could access it through the web interface. Alternatively, a virtual machine could be prepared so that people do not need to tinker around installing Python and the required packages.
  • I did not much like the way the labs were executed. I liked the chosen topics, and the instructors/tutors were really helpful. However, I believe the labs should be made more interactive. Also, an explanation of the basics of each exercise, as well as of how each part of the algorithm relates to the code, was missing. I had a lot of trouble understanding a simple part that could have been easily explained (but was not clear in the guide).
  • More social events / group activities. I would have loved to have more group activities. We had dinner in a fancy restaurant, but doing other things together would have been nice. A group visit to the castle would have been a nice idea.
  • The Auditorium. The auditorium where the main lectures were held was OK except for one thing: the space between the chairs was minimal. I am not that tall, and whenever I couldn't get a place in the last row or the middle one (both with extra space), I was uncomfortable the whole time.
  • Add an (optional) poster session alongside the demo day. I think that might be helpful for early-stage researchers to discuss their approaches with the speakers (who are generally experts in the field).
  • The Canteen. The canteen was relatively small and shared with students, so people had to rush to be able to eat within the 1-hour time slot. Either giving more time for lunch or looking for alternatives is necessary for future events.

Conclusion

I had a great experience. I had the chance to meet interesting people, as well as to learn (or at least get informed) about ML and its applications to NLP. The talks were relevant and the topics well chosen. I really enjoyed most of the lectures.

I don't think people come to the summer school expecting to master the material on the spot. It is really hard to fully understand advanced topics in machine learning and NLP in just one day. One can, however, get informed about the subjects and learn what needs to be learned. Also, the networking is really important, even more so for early Ph.D. students, as they have the chance to validate their approaches with other researchers and experts in the field.

Lisbon is magical. I really loved it. It is a beautiful and organized city. The public transportation works pretty well, and there are many interesting places to visit, both in the city and in the surrounding area.

I would recommend this summer school to anyone interested in NLP or ML in general. You will not only have the chance to learn, but you will also enjoy what beautiful Lisbon and Portugal have to offer.

Taking notes and blogging with style

HIPSTER IMAGE

TL;DR

I explore a set-up to share notes among all my devices (namely my laptop, tablet and smartphone) and make them available in the cloud, plus how to work with them from Emacs. I even show how I use the same system for blogging. If you are an OS X and Emacs user and also happen to have a device from Apple (e.g. a MacBook Pro or iPhone), read below.

I have always been obsessed with personal productivity, and even though I sometimes think the whole thing is pointless, I keep trying to improve my "workflow" to make me more productive. Around 6 months ago I found a way to combine some applications to take notes everywhere, and I would like to share it with the world.

I have separated the configuration into levels of "geekness". Level 1 is for more normal people, while the upper levels are more for über geeks (those who use Emacs / org-mode). I will also show how I use the same note-taking system to write the blog you are reading right now.

I know many org-mode users reading this might think I am not using the full power of org-mode, and that org-mode offers other alternatives, but this way is simple enough for me, and everyone is free to adapt it as they want (isn't that what org-mode's flexibility is all about?).

Intro

You can safely skip this part if you just want to get the set-up running. Here I just want to rant a little bit about my experience in the note-taking field.

As I mentioned before, I have always been obsessed with personal productivity, in particular to-do lists and note-taking. Back in the day, I used simple text files to store my notes; then I switched to personal wiki solutions like the once-popular Tomboy; then to integrated solutions like org-mode. I think org-mode is pure awesomeness, but I just do not like how the notes are stored and accessed. The main issues I have found when using org-mode are the following:

  • In the usual org-mode set-up, all the notes are stored in one or a few files in a tree fashion.
  • Search in org-mode is not as flexible and quick as I would like it to be.
  • Accessing my org files from devices other than my computer was a PITA, and MobileOrg did not cover my needs (and was also difficult to set up).
  • Integrating org-mode with other tools, although possible, is sometimes too difficult or cumbersome.

However, org-mode has a lot of things that make it rock:

  • You can use the awesome Emacs to take notes.
  • A simple format for storing information that can be used with or without syntax highlighting.
  • The ability to export to many formats, including LaTeX, PDF and HTML.
  • Syntax highlighting when exporting.
  • Execution of source code via Org Babel.
  • And more, much more.

So I started looking for ways to access my notes from everywhere. Once I got that figured out, I started looking for ways to use org-mode to take notes while at the same time having read/write access from wherever I wanted. Below you will find my solution.

Set-Up

Before describing the setup, I want to clarify what hardware I possess. The setup can probably be replicated with a different combination of devices or operating systems; I will try to mention the alternatives.

  • Laptop with Mac OSX: Macbook Pro
  • Cellphone: iPhone (>=4)
  • Tablet: iPad (>=3)

(Yeah, yeah… you can call me an Apple fanboy, but I'm not!)

LvL 1: Notes in all devices

To achieve this we only need a couple of applications: Notational Velocity and SimpleNote.

Notational Velocity & SimpleNote

Notational Velocity is an open source, lightweight text editor with really nice functionality. It features a simple interface with a search bar and a list of notes.

Notational Velocity

Notational Velocity

Notational Velocity allows the user to select the directory in which to store the notes. So if we select a Dropbox folder, we can browse it from basically everywhere.

To access the notes on other devices, I use SimpleNote (if you don't know this application, you should totally check it out). Notational Velocity can sync your notes with SimpleNote. The web version of SimpleNote is free, but you have to pay for the app in order to use it on an iOS device. It is cheap, and it totally makes it easier to share your computer notes with your iOS device(s).

Here is actually how SimpleNote looks on the iPad:

SimpleNote on iPad

So by using Dropbox, Notational Velocity and SimpleNote you can:

  • Take plain-text notes using a free, cool app.
  • Access them everywhere using Dropbox (if you need to, but this is more of an emergency option).
  • Take notes online using the SimpleNote web interface or any of its clients. Notational Velocity will sync them back to your hard drive and Dropbox folder.
  • Sync and take notes using any of your devices, whether iPhone or Emacs. I guess this configuration would also work on Android-based devices, but I have no experience with them. By this point the main idea should be clear, and it should not be hard to replicate.

So you can stop reading here if you are not an Emacs user (or if this level of geekness is enough for you).

LvL 2: Using Emacs to take notes

OK, if you are reading this, it means you are an active Emacs user and you want to see how to integrate the previous workflow with it. Well, it is actually quite easy.

There is a special Emacs mode called Deft which basically provides Emacs with a note-taking interface similar to Notational Velocity's.

Deft Mode

Deft's configuration is simple (add this to your .emacs):

(when (require 'deft nil 'noerror) 
  (setq
   deft-extension "txt"
   deft-directory "~/Dropbox/Notational Data/"
   deft-text-mode 'org-mode
   deft-use-filename-as-title t
   )
  (global-set-key (kbd "<C-f9>") 'deft))

This should be self-explanatory, but the important bits are the deft-directory option, with which I select the same directory in my Dropbox folder, and the deft-text-mode option, which lets me choose a default mode for my note files.

LvL 3: Writing a blog from everywhere

So far we have the ability to read and write our text- or org-based notes from everywhere. This can be used to write drafts of posts for your blog from basically anywhere. In my case I use Jekyll to generate this blog. Jekyll natively supports markup formats like Markdown or Textile, but you can use raw HTML as well. This last feature allows me to write my posts in org-mode format and export them to HTML. This can easily be automated by defining a project:

(setq org-publish-project-alist
   '(

     ("mf" :components ("mf-org"
                         "mf-img"))

      ("mf-org"
               :base-directory "/Users/miguel/Dropbox/Notational Data"
               :recursive t 
               :base-extension "blog.org.txt"
               :publishing-directory "/Users/miguel/Dropbox/blog-stuff/mfcabrera.com/_posts"
               :site-root "http://mfcabrera.com"
               :jekyll-sanitize-permalinks t
               :publishing-function org-html-publish-to-html
               :section-numbers nil
               :headline-levels 4
               :table-of-contents nil
               :auto-index nil
               :auto-preamble nil
               :body-only t
               :auto-postamble nil)

      ("mf-img"
               :base-directory "/Users/miguel/Dropbox/blog-stuff/source/"
               :recursive t
               :exclude "^publish"
               :base-extension "jpg\\|gif\\|png|jpeg"
               :publishing-directory "/Users/miguel/Dropbox/blog-stuff/mfcabrera.com/files/images"

               :publishing-function org-publish-attachment)



))

There are three projects: mf, mf-org and mf-img. In mf-org I define the base directory (where I keep all my notes). :base-extension tells org-mode to only take into account files ending in .blog.org.txt (so I don't publish all my notes). With :body-only I tell org to export only the body of the HTML. mf-img takes images from a directory and copies them to the /files/images/ directory using the org-publish-attachment function.

This allows me, for example, to start writing a blog post on my iPad (or even my cellphone) and then, when I get access to a computer, just format it a bit and generate it. Apparently this has not made my blog more active, but I will try to post more often now that I have finally finished my M.Sc. degrees :).

Comments? Ideas? Improvements? Leave a comment or write me an e-mail.

Playing with word2vec and German

I know I had promised before to blog more often but I always find excuses for not doing it :).

I am currently writing my master's thesis, which is good because it is about time to finish my M.Sc. program. After exploring the current offerings at my university, I finally made up my mind and decided to write it at the company I am currently working for. This is actually cool, because it allows me to try things with real-world data. The main topic is going to be the use of deep learning for Natural Language Processing (NLP) tasks, in particular to improve the models the company currently uses. We might explore other interesting use cases as well.

My main advisor is going to be Christoph Ringlstetter, chief scientist at Gini and Research Fellow at LMU's Center for Information and Language Processing (CIS). However, I am also advised by other cool people who agreed to help out and offer guidance.

As I mentioned, the focus of my project is going to be deep learning applications for NLP tasks. The language of focus is going to be German (as most of the data at Gini is in German). That is also funny, because my German kind of sucks and I am far from being able to speak it fluently. But I think that is also fine: the idea of deep learning is to learn features without knowing anything about the data in advance. Also, I will have the support of my team at Gini and my advisor, so I think it will be all right (plus a good opportunity to improve my German skills).

As I mentioned before, I am going to be working in deep learning, which is currently one of the hottest fields in machine learning. Deep learning focuses on learning high-level abstract features from unlabeled data. These features can then be used to solve existing problems.

In the field of Natural Language Processing (NLP), a recent model called Word2Vec has caught the attention of practitioners with its "theoretically-not-so-well-founded-but-pragmatically-superior" mode. Released as an open source project, Word2Vec is a neural network language model developed by Tomas Mikolov and colleagues at Google. It creates meaningful vector representations of words. Each component of the vector is somehow a similarity dimension which captures both syntactic and semantic information about a word. Check ThisPlusThat.me for an example usage and this presentation for a nice explanation.

To write the article I am going to use Python, which I've fallen in love with this year given the huge amount of tools for doing data science / machine learning related tasks. There are basically two ways to use/train Word2Vec word vectors from Python: one is the word2vec wrapper that Daniel Rodriguez developed; the other is to use it through Gensim by Radim Rehurek.

IPython, Word2Vec and Gensim

As I mentioned, I am interested in the behavior of the word representations for the German language, so I trained word2vec on 3×10^9 bytes of a German Wikipedia dump. To train on the Wikipedia data, we have to get the XML dump and strip the markup from it. To do that I adapted the script found at the end of this page to German, basically by replacing the German "funky" characters. I uploaded the adapted version as a Gist.
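The full markup stripping lives in the Gist linked above; the German-specific step is essentially transliterating umlauts and ß into plain ASCII. A minimal sketch of that idea (the exact mapping below is my assumption of what those "funky" characters are, not a copy of the Gist):

```python
# Transliterate German umlauts and eszett before feeding text to word2vec.
# This is a simplified sketch of the character-replacement step, not the
# full Wikipedia-cleaning script. It also explains why the model's
# vocabulary contains 'koenig' rather than 'könig'.
UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def transliterate(text):
    for german, ascii_equiv in UMLAUT_MAP.items():
        text = text.replace(german, ascii_equiv)
    return text

print(transliterate("König und Königin aßen Brezn"))
# Koenig und Koenigin assen Brezn
```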

As for the training parameters for this particular test I used the skip-gram model and called word2vec like this:

  time word2vec -train dewiki3e9.txt -output de3E9.bin -skipgram 5 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 -save-vocab defull-vocab.txt 

For the purposes of this blog post I am choosing Gensim, because it is a native re-implementation in Python and already offers nice functionality. For a nice interactive programming environment I used IPython notebooks and embedded the HTML output below.

So, stop talking and let's start coding:

In [1]:
# Let's get Gensim. I am assuming you have successfully installed it.

from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('../wordvecs/de3E9.bin',binary=True)

Loading takes time for this particular file: the vector file is almost 1 GB and it has to be loaded into memory.
Once the model is loaded, we can run some of the experiments found in the paper and see how this particular model performs.
One of the cool examples is that you can take the vector representing 'king', add the vector of 'woman', subtract the vector of 'man', and get a vector whose cosine distance is closest to the vector representing 'queen'. Let's see if that is true for this model:

In [3]:
model.most_similar(positive=['koenig', 'frau'], negative=['mann'])
Out[3]:
[('gemahlin', 0.72522426),
 ('gattin', 0.64882195),
 ('edgith', 0.64861459),
 ('koenigs', 0.64086556),
 ('vladislavs', 0.63747227),
 ('mitregentin', 0.63738412),
 ('koenigsgemahlin', 0.63574708),
 ('koenigin', 0.63131845),
 ('thronansprueche', 0.62454271),
 ('regentin', 0.62117279)]

Well, it does not. But that does not surprise me: we do not have all the data available, and the training parameters were chosen arbitrarily, so it is no surprise that it does not work. However, we did get the word 'gemahlin', which is normally used to refer to the wife of a king (consort). The word 'gattin' is also used for 'spouse'. We also see 'koenigin' and 'koenigsgemahlin', which are the translations of 'queen' and 'royal consort'. Let's see what happens if I just add the words:

In [17]:
model.most_similar(positive=['koenig', 'frau'])
Out[17]:
[('gemahlin', 0.72934431),
 ('koenigin', 0.70212948),
 ('ehefrau', 0.67596328),
 ('gattin', 0.67325604),
 ('lieblingstochter', 0.66053975),
 ('maetresse', 0.65074563),
 ('nantechild', 0.64813584),
 ('koenigsgemahlin', 0.64198864),
 ('eadgifu', 0.6408422),
 ('gemahl', 0.64082003)]

Drawing

Wow, well, almost :) - just adding 'frau' to 'koenig' put both 'queen' and 'consort' in the top positions.
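Under the hood, most_similar is just vector arithmetic followed by a cosine-similarity ranking over the vocabulary. A toy numpy sketch of that idea (the tiny "embeddings" below are made up purely for illustration, not taken from the trained model):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors divided by
    # the product of their norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional "embeddings" just to illustrate the arithmetic.
vectors = {
    "koenig":   np.array([0.9, 0.1, 0.8]),
    "mann":     np.array([0.8, 0.0, 0.1]),
    "frau":     np.array([0.1, 0.9, 0.1]),
    "koenigin": np.array([0.2, 0.9, 0.8]),
    "brezn":    np.array([0.5, 0.2, 0.1]),
}

# koenig - mann + frau, then rank remaining words by cosine similarity.
query = vectors["koenig"] - vectors["mann"] + vectors["frau"]
best = max((w for w in vectors if w not in ("koenig", "mann", "frau")),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # koenigin
```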

As I live in Munich, we often go on Fridays to have a Weißwurstfrühstück, a traditional Munich/Bavarian breakfast. It is basically white sausage, sweet mustard and a pretzel (accompanied by an optional Weissbier, or wheat beer). Let's see if our Word2Vec model can differentiate the components of this delicious meal.

In [26]:
model.doesnt_match("wurst senf brezn apfel".split())
Out[26]:
'apfel'

This actually worked pretty well. The model was able to capture that an apple is not part of the traditional breakfast :)

The papers referenced on the word2vec web page describe some tasks, both semantic and syntactic. Let's try one of those and see how it works. This question basically asks: 'berlin' is to 'deutschland' what 'london' is to 'england'. So basically, country-capital relationships.

In [29]:
q = ["berlin", "deutschland", "london", "england"]
model.most_similar(positive=[q[0],q[3]],negative=[q[1]])
Out[29]:
[('dorset', 0.55140525),
 ('london', 0.54855478),
 ('sussex', 0.54572964),
 ('cornwall', 0.54447097),
 ('suffolk', 0.54392934),
 ('essex', 0.53380001),
 ('oxfordshire', 0.51856804),
 ('warwickshire', 0.51826203),
 ('edinburgh', 0.51790893),
 ('surrey', 0.51409358)]

So, the top answer is 'dorset', which is a county in the far south of England. But the second one is actually London. So, not bad. As I mentioned above, this model was trained basically with default parameters and with a dataset that is not necessarily big (compared to the one in the paper). Therefore, the embeddings might not be as accurate as desired or capture all the information we would like.

Well, this was just a basic test with a basic model. I will continue trying different parameters and, of course, understanding the model and its implementation a bit more. There are also interesting questions to be answered, like how to represent long documents using word vectors or how to match phrases properly. However, I see a lot of potential for applications in NLP, ranging from basic document classification to more complex named-entity recognition.
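On the long-document question, one common baseline is to simply average the word vectors of the tokens in a document. A minimal numpy sketch, assuming a word-to-vector mapping like the one the Gensim model provides (the vectors and helper name here are hypothetical, for illustration only):

```python
import numpy as np

def document_vector(tokens, vectors):
    """Average the vectors of the tokens we actually have embeddings for.

    A crude but common baseline for representing a whole document;
    `vectors` can be any word -> numpy array mapping (e.g. a trained
    Gensim word2vec model).
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in document")
    return np.mean(known, axis=0)

# Toy 2-dimensional vocabulary; unknown words are simply skipped.
vectors = {"koenig": np.array([1.0, 0.0]), "frau": np.array([0.0, 1.0])}
doc = document_vector("koenig frau unbekannt".split(), vectors)
print(doc)  # [0.5 0.5]
```

Averaging throws away word order, which is exactly why more elaborate document representations are an open question worth exploring.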

Introducing Munich DataGeeks

Some weeks ago the second meeting of Munich DataGeeks took place. Munich DataGeeks is a meetup group I am organizing with Florian Hartl about what we like to call intelligent data processing. After two meetups with around 45+ attendees each, I feel comfortable saying that it has been a success. In this post I am going to describe the idea behind the meetup, its structure and some plans for the future.

You might be wondering what we are referring to when we say intelligent data processing. By that we mean all the technologies and techniques used to extract, process and get insight from data. We don't limit ourselves to a particular technology, buzzword or programming language, so the topics discussed will be really diverse. A presentation in our group will probably cover topics like NLP, machine learning, information retrieval, event processing and data science, and technologies such as Hadoop, NoSQL, Python, R, etc.

The idea of creating this group came one day after Florian and I shared some Bier (many good ideas come after doing that). We realized that in Munich there was no place where people interested in these areas could share knowledge and meet other like-minded persons. We decided to fill that gap by organizing such a group, with regular meetings featuring talks on those topics. For the name, we took inspiration from an already existing group in Paris.

When planning the first event we decided to write down a set of guidelines containing the philosophy behind the group. We thought that doing so would help us keep the meetings consistent. They are not carved in stone, but so far we believe they have worked pretty well. I am going to share them so you can have an idea of what to expect in our meetups:

  • (Almost) monthly meetups: We want to have one meeting every 4-6 weeks at most. It is not a hard limit, but we don't want a lot of time to pass between meetings. We want an engaged community and a continuous exchange of ideas :).

The First meeting at TrustYou.
  • Networking is important: We don't want people to come and listen to talks like zombies. We actually want people interested in similar or related fields to get to know each other. For that reason we will have time for networking at the beginning, between talks and at the end. The sponsor will provide Bier and food to loosen people up and spark friendly conversations :).
  • No product pitches: We like talks that mix concepts with practical applications. However, we don't want to be a venue for closed or paid products or companies. We know that sometimes people work in environments that are not open (e.g. Matlab); we think that if the concepts are more important than the tool itself, there is no problem, but if the talk is centered on a specific tool, then the tool should be free / open source / available.
  • Research and industry: We want to get software developers and data scientists into the same room with researchers. We believe both sides can benefit from it: researchers can see the current problems the industry faces, and developers can see fresh results and current research trends.
  • Talks are held in English: Although we are in Germany and German is the official language, there are plenty of people who are not fluent enough in it (I include myself in this group). In addition, we want to include recently arrived M.Sc. and Ph.D. students, postdocs, developers and data scientists who do not necessarily speak German.
  • From the community, to the community: The idea of the meetings is to share knowledge from within the community in Munich, so we encourage attendees to present. We do invite people we know who work on these subjects and whose work we find interesting, even if they don't live in Munich, but we prefer and love to receive talk proposals from the same people who attend the events.

The second meetup at Stylight.

Using these guidelines as a basis, we have successfully organized the first two meetings. TrustYou (the company where Florian works) sponsored the first meetup, providing the location and the food/drinks. In this first meeting we had two interesting talks, one coming from the university and the other from industry. Han Xiao, from the chair of computer security at TUM, shared with us his research in the field of adversarial machine learning, and Jan Stepien shared his experiences analyzing his personal internet navigation data.

We were really happy with the results of the first meetup. As I mentioned, more than 45 people attended (during a sunny afternoon in Munich, which makes it even more significant). We had a really spacious room with a top-notch projector and sound, and in my personal view both talks were really good and understandable (even when Han showed his math mojo in order to properly describe a problem).


Food for thought and for the body

The second meetup was sponsored by Stylight. They have an awesome office at an excellent location. This time we had three talks: one about MongoDB by Dimitar, another about event processing by Alex, and one about real-time robot control using neural networks by Justin. Besides some minor problems with the projector, the meetup went smoothly and we got really good feedback from the attendees.

You can find all the slides of the talks on the Speakerdeck account we created for storing them.

For the next meeting we have some nice ideas that we want to try, and really cool talks as well. We will be announcing the next meetup shortly; I think it will be shortly after Oktoberfest (Wiesn). It will probably be in the brand new Gini office, but I will confirm this soon. We are constantly looking for sponsors, so if your company would like to sponsor our meetup, please just drop us a line.

If you are reading this (and you happen to be in Munich) and would like to attend or present in our group, please sign up on our Meetup.com page and send us an e-mail (the e-mail is necessary only if you want to present). We would love to have more DataGeeks to share knowledge with.

UPDATE

I forgot to mention the origin of the logo. As neither Florian nor I are designers, I asked my sister, who runs a design agency together with her husband, to take an hour out of her busy schedule to design something. Well, she did, and you can see the result at the top of this page. Thanks, Sis :)

Hiking in the Alps

One of the things I like about my university is that it has a really cool international office: the TUMi. Among other things, TUMi organizes activities and trips to nearby cities. These activities are generally targeted at international and exchange students, with the purpose of allowing them to get to know the culture of Bavaria and Germany.

Normally I don't attend TUMi events as much as I would like. However, this time there was a hiking trip to the Bavarian Alps. Since I arrived in Germany I had not had the opportunity to experience this part of the Bavarian geography during summer. I was not really sure what to expect from the trip, but I have to say it was an amazing experience. I had the chance to enjoy the Bavarian landscapes and, at the same time, to meet really nice people.


The Soiern group of mountains from the distance.

So, the idea of the trip was to hike to the Soiernhaus, a small cottage next to the Soiernspitze, the highest peak of the Soiern group, a group of mountains in the south of Germany. Once there, we would spend the night and return the following day, this time by going over the top of the mountain and down into the valley.

It might not sound that hard, but it was my first time hiking in the mountains, and therefore I did not have the appropriate equipment for such an adventure (no hiking shoes but sneakers, no good jacket, and so on). However, we all managed to get there without major problems, with only some sore muscles the day after as a result of the long walk.


A view of the Soiern group with the Soiernsee (Soiern lake)

To get there we took a train from Munich Hauptbahnhof (central station) to a station close to Garmisch-Partenkirchen. From there we headed to a nearby town called Krün, and then we started going up through the mountains. The hike was good. It had a really challenging part, due to its steepness, and as most of us were not exactly fit, we took some breaks along the way. During one of those breaks the guide took some speakers out of his backpack and started playing traditional Bavarian music, very suitable for the occasion I would say. It was a bit surreal walking up the mountains while listening to this type of music. To give you an idea of the overall route, below you can see the GPS data of the first day.

After arriving at the cottage we could finally get some rest. Some of us went to the lake, but sadly it started raining almost immediately and we quickly rushed back. After the rain stopped we had the chance to watch a beautiful sunset. We spent the evening inside the cottage having some Bier and eating some traditional food. We had a funny conversation about how to say "spooning" in different languages (it looks like there's no word for that in Bulgarian) and we played some cards. Some people went to bed quite early, but others, like me, stayed up a little late enjoying the conversation.


View from the Soiernspitze

The next day some of our group decided to go on a morning hike to a nearby peak. I was too tired and decided to sleep a bit more (yeah, yeah, I am a lazyass). Around 8:00 AM we had breakfast and started walking again. The idea was to cross the main mountain (which we had gone around the first day) and descend through the valley to reach Mittenwald, where we would take the train back to Munich.


Hiking/Climbing the Soiernspitze

This was pretty tough. The path, although easy to find, was hard and steep. It also had many patches of snow, making it slippery. That is not good at all when you are high on a mountain walking on a narrow path. I can say I felt scared at times, because I was not wearing proper equipment. Luckily, nothing happened in the end and we could enjoy the great view from the top of the Soiernspitze while having fun going down the mountain. By the end of the day, though, we were exhausted and only looking forward to catching the train. Below is the GPS data for the second day.

This was a really unique experience for me. I did not know how popular the hiking paths in the south of Bavaria were, or that hiking was such a popular activity among Germans. I fell in love with the Bavarian landscapes and the imposing mountains of the Bavarian Alps. I would like to thank the TUMi crew for such a great opportunity and the whole group of "TUMis" for being such a great bunch of people to hike with. Now I am just wondering when and where my next hike is going to be. I am really looking forward to going south again and exploring another part of the beautiful Bavarian Alps.

New Emacs configuration

This year I wanted to blog more, but it is really hard to focus on writing while working, studying, learning German and taking Coursera courses (more on that, hopefully, in another post soon).

For now I just wanted to share that, after years of having the same Emacs configuration, I changed it. I had been using the same configuration for years, adding stuff on top of it. It was slow and full of bugs. Today I discovered that my Python+Yasnippet+AC setup had stopped working for unknown reasons, and that made me realize it was time to create a new configuration.

I also changed the color theme to "Misterioso" and I am liking it a lot. I am not using all the packages I was using before; in particular, for now I have stopped using Emacs Code Browser, but I will probably install it again soon.

You can find more details and all the configuration files in the new repo I created to track the changes I will continue making to it: https://github.com/mfcabrera/dotemacs

Below is a screenshot taken while I was writing this blog post and coding some Python :)