Playing with word2vec and German

I know I had promised before to blog more often but I always find excuses for not doing it :).

I am currently writing my master's thesis, which is good because it is about time I finished my M.Sc. program. After exploring the current offerings at my university, I finally made up my mind and decided to write it at the company I currently work for. This is actually cool, because it allows me to try things with real-world data. The main topic is going to be deep learning for Natural Language Processing (NLP) tasks, in particular to improve the models the company currently uses. We might explore other interesting use cases as well.

My main advisor is going to be Christoph Ringlstetter, chief scientist at Gini and Research Fellow at LMU's Center for Information and Language Processing (CIS). However, I am also advised by other cool people who agreed to help out and offer guidance.

As I mentioned, the focus of my project is going to be Deep Learning applications for NLP tasks. The language of focus is going to be German (as most of the data at Gini is in German). That is also funny, because my German kind of sucks and "Ich bin weit davon entfernt, fließend Deutsch sprechen zu können" (I am far from being able to speak German fluently). But I think that is also good: the idea of Deep Learning is to learn features without knowing anything about the data in advance. Also, I will have the support of my team at Gini and my advisor, so I think it will be all right (plus a good opportunity to improve my German screaming skills).

As I mentioned before, I am going to be working on Deep Learning, which is currently one of the hottest fields in Machine Learning. Deep Learning focuses on learning high-level abstract features from unlabeled data. These features can then be used to solve existing problems.

In the field of Natural Language Processing (NLP), a recent model called Word2Vec has caught the attention of practitioners with its "theoretically-not-so-well-founded-but-pragmatically-superior-mode". Released as an open source project, Word2Vec is a neural network language model developed by Tomas Mikolov and other folks at Google. It creates meaningful vector representations of words. Each component of the vector is somehow a similarity dimension that captures both syntactic and semantic information about a word. Check for an example usage and this presentation for a nice explanation.

To write this article, I am going to use Python, which I've fallen in love with this year given the huge number of tools for data science and machine learning tasks. There are basically two ways to use or train word2vec word vectors from Python: one is the word2vec wrapper that Daniel Rodriguez developed; the other is Gensim by Radim Rehurek.

iPython, Word2Vec and Gensim

As I mentioned, I am interested in the behavior of the word representations for the German language, so I trained word2vec on 3x10^9 bytes of a German Wikipedia dump. To train on Wikipedia, we have to get the XML dumps and "clean" them of tags. To do that, I adapted the script found at the end of this page to German, basically replacing the German "funky" characters. I uploaded the adapted version as a Gist.
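To give an idea of what such a cleaning step does, here is a minimal Python sketch. It is not the Gist itself, just an illustration under assumptions: the function name `clean_text` and the transliteration table are made up for this example, and I am assuming that "funky" characters means umlauts and ß (which matches tokens like 'koenig' that show up later in the model).

```python
import re

# Assumed transliteration table for German characters (hypothetical,
# chosen to be consistent with tokens such as 'koenig' and 'maetresse').
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def clean_text(raw):
    """Strip tags, lowercase, transliterate umlauts, keep only a-z words."""
    text = re.sub(r"<[^>]+>", " ", raw)      # drop XML/HTML-style tags
    text = text.lower()                      # word2vec input is usually lowercased
    for char, replacement in UMLAUTS.items():
        text = text.replace(char, replacement)
    text = re.sub(r"[^a-z]+", " ", text)     # everything else becomes a space
    return text.strip()


print(clean_text("<text>Der König und die Königin aßen Weißwürste.</text>"))
# der koenig und die koenigin assen weisswuerste
```

A real Wikipedia dump needs more care (wiki markup, templates, numbers), so treat this only as the general shape of the preprocessing.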

As for the training parameters, for this particular test I used the skip-gram model and called word2vec like this:

  time word2vec -train dewiki3e9.txt -output de3E9.bin -skipgram 5 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 -save-vocab defull-vocab.txt 

For the purpose of this blog post, I am choosing Gensim because it is a native re-implementation in Python and already offers nice functionality. For a nice interactive programming environment I used IPython Notebooks and embedded the HTML output below.

So, stop talking and let's start coding:

In [1]:
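# Let's get Gensim. I am assuming you have successfully installed it.
# Note: newer gensim releases expose this loader as
# gensim.models.KeyedVectors.load_word2vec_format instead.

from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('../wordvecs/de3E9.bin', binary=True)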

This takes time for this particular file. The vector file is almost 1GB and it has to be loaded into memory.
Once the model is loaded, we can try some of the experiments found in the paper and see how this particular model performs.
One of the cool examples is that you can take the vector representing 'king', add the vector of 'woman' and subtract the vector of 'man', and you will get a vector whose closest neighbor by cosine similarity is the vector representing 'queen'. Let's see if that is true for this model:

In [3]:
model.most_similar(positive=['koenig', 'frau'], negative=['mann'])
[('gemahlin', 0.72522426),
 ('gattin', 0.64882195),
 ('edgith', 0.64861459),
 ('koenigs', 0.64086556),
 ('vladislavs', 0.63747227),
 ('mitregentin', 0.63738412),
 ('koenigsgemahlin', 0.63574708),
 ('koenigin', 0.63131845),
 ('thronansprueche', 0.62454271),
 ('regentin', 0.62117279)]

Well, it does not. But that does not surprise me: we do not have all the data available and the training parameters were chosen arbitrarily. However, we got the word 'gemahlin', which is normally used to refer to the wife of a king (consort). The word 'gattin' is also used for spouse. We also see the words 'koenigin' and 'koenigsgemahlin', which are the translations of 'queen' and 'royal consort'. Let's see what happens if I just add the words:

In [17]:
model.most_similar(positive=['koenig', 'frau'])
[('gemahlin', 0.72934431),
 ('koenigin', 0.70212948),
 ('ehefrau', 0.67596328),
 ('gattin', 0.67325604),
 ('lieblingstochter', 0.66053975),
 ('maetresse', 0.65074563),
 ('nantechild', 0.64813584),
 ('koenigsgemahlin', 0.64198864),
 ('eadgifu', 0.6408422),
 ('gemahl', 0.64082003)]


Wow, well, almost :) Just adding 'frau' to 'koenig' gave me both 'consort' and 'queen' in the top positions.
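The arithmetic behind both queries can be sketched in a few lines of plain Python. This is only a toy illustration, not Gensim's code: the 4-dimensional vectors are invented by hand (roughly "royalty, male, female, food") and the helper names are made up. The idea, as I understand it, is to add the normalized positive vectors, subtract the negative ones, and rank all other words by cosine similarity to the result, leaving out the query words themselves.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    inputs = set(positive) | set(negative)   # query words are not returned
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in inputs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors (invented for this example, not from the real model).
toy = {
    "koenig":   [1.0, 1.0, 0.0, 0.0],
    "koenigin": [1.0, 0.0, 1.0, 0.0],
    "mann":     [0.0, 1.0, 0.0, 0.0],
    "frau":     [0.0, 0.0, 1.0, 0.0],
    "apfel":    [0.0, 0.0, 0.0, 1.0],
}

print(most_similar(toy, positive=["koenig", "frau"], negative=["mann"])[0][0])
# koenigin
```

With real embeddings the dimensions are of course not this clean, which is one reason the top answer can end up being a near-synonym like 'gemahlin'.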

As I live in Munich, we often go on Fridays to have a Weißwurstfrühstück, a traditional Münchner/Bayerisch breakfast. It is basically white sausage, sweet mustard and a pretzel (accompanied by an optional Weißbier, or wheat beer). Let's see if our Word2Vec model can differentiate the components of this delicious meal.

In [26]:
model.doesnt_match("wurst senf brezn apfel".split())

This actually worked pretty well. The model was able to capture that an 'apple' is not part of the traditional breakfast :)
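For the curious, here is a rough sketch of how an "odd one out" query like this can work. As far as I understand Gensim's approach, it averages the normalized vectors of the given words and returns the word that is least similar to that average. The code below is my own plain-Python illustration with invented 2-dimensional toy vectors, not the library's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][i] for w in words) / len(words) for i in range(dims)])
    # Dot product of unit vectors is the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Invented toy vectors: first axis "savory breakfast", second axis "fruit".
toy = {
    "wurst": [1.0, 0.2],
    "senf":  [0.9, 0.3],
    "brezn": [0.8, 0.1],
    "apfel": [0.1, 1.0],
}

print(doesnt_match(toy, "wurst senf brezn apfel".split()))
# apfel
```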

The papers referenced on the word2vec web page describe some tasks, both semantic and syntactic. Let's try one of those and see how it works. This question basically asks: 'berlin' is to 'deutschland' what 'london' is to 'england'. So basically, country-capital relationships.

In [29]:
q = ["berlin", "deutschland", "london", "england"]
[('dorset', 0.55140525),
 ('london', 0.54855478),
 ('sussex', 0.54572964),
 ('cornwall', 0.54447097),
 ('suffolk', 0.54392934),
 ('essex', 0.53380001),
 ('oxfordshire', 0.51856804),
 ('warwickshire', 0.51826203),
 ('edinburgh', 0.51790893),
 ('surrey', 0.51409358)]

So, the top answer is 'dorset', which is a county of England way down in the south. But the second one is actually London. So, not bad. As I mentioned above, this model was trained basically with default parameters and not necessarily with a big dataset (like the one in the paper). Therefore, the embeddings might not be as accurate as desired or capture all the information that we would like.
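The cell above only shows the question list and the results, so here is a hedged sketch of how such an analogy question can be turned into a `most_similar` query and how to check where the expected answer lands. The helper names `analogy_kwargs` and `rank_of` are made up for this example, and the choice of 'london' as the word to predict is my assumption: it fits the results, since Gensim leaves the query words out of what it returns and 'london' does show up in the list.

```python
def analogy_kwargs(a, b, c):
    """Arguments for "a is to b what ? is to c", i.e. the vector a - b + c."""
    return {"positive": [a, c], "negative": [b]}


def rank_of(expected, results):
    """1-based position of `expected` in a most_similar result list, or None."""
    for position, (word, _score) in enumerate(results, start=1):
        if word == expected:
            return position
    return None


q = ["berlin", "deutschland", "london", "england"]

# Assumed query: berlin - deutschland + england, hoping to land near 'london'.
kwargs = analogy_kwargs(q[0], q[1], q[3])
# With a loaded model this would be: model.most_similar(**kwargs)

# Top three results copied from the output above.
results = [("dorset", 0.55140525), ("london", 0.54855478), ("sussex", 0.54572964)]

print(kwargs)
print(rank_of(q[2], results))
# 2
```

Counting how often the expected word lands at rank 1 over a whole list of such questions gives a simple accuracy score for comparing training parameters.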

Well, this was just a basic test with a basic model. I will continue trying different parameters and, of course, understanding the model and implementation a bit more. There are also interesting questions to be answered, for example how to represent long documents using word vectors or how to match phrases properly. However, I see a lot of potential for applications in NLP, ranging from basic document classification to more complex named-entity recognition.

