1. The Optimal Transport (OT) problem
2. Example: Word Mover Distance
Word Mover Distance (WMD)
Supervised Word Mover Distance
3. Classifying Reuters documents - Python code
We now apply the previous machinery to the Reuters dataset, available here. Below we use the file r52-train-stemmed.txt, available here. We first load the Python modules needed for matrix computations, data manipulation, word embeddings, optimal transport and visualization. The Python documentation for FastText is available here, and details about the Python module OT are presented here.
We are ready to load the dataset for training.
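A minimal sketch of the loading step, using only the standard library. It assumes each line of r52-train-stemmed.txt holds a class label, a tab, and then the stemmed text of one document; the function name `parse_labeled_lines` is hypothetical.

```python
def parse_labeled_lines(lines):
    """Split 'label<TAB>text' lines into two parallel lists.

    Assumption: one document per line, label and text separated by a tab.
    """
    labels, texts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        labels.append(label)
        texts.append(text)
    return labels, texts


# Typical use (the file must be downloaded first):
# with open("r52-train-stemmed.txt", encoding="utf-8") as f:
#     labels, texts = parse_labeled_lines(f)
```

Keeping labels and texts in parallel lists makes it easy to index both with the same document number later on.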
Then two lines of code are sufficient to compute an embedding for each word in the corpus of documents.
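As a sketch of what this step could look like: FastText trains from a plain text file, so the documents are first written one per line without their labels, and then passed to `fasttext.train_unsupervised`. The helper names, the file name `corpus.txt` and the choice of the skipgram model are assumptions; the dimension 100 matches the embedding size used in this post.

```python
from pathlib import Path


def write_corpus(texts, path):
    """Write one document per line (labels removed) for unsupervised training."""
    Path(path).write_text("\n".join(texts) + "\n", encoding="utf-8")
    return path


def train_embeddings(corpus_path, dim=100):
    """Train FastText word vectors on a text file (skipgram is an assumption)."""
    import fasttext  # imported lazily: only needed when training actually runs

    return fasttext.train_unsupervised(str(corpus_path), model="skipgram", dim=dim)


# Typical use:
# write_corpus(texts, "corpus.txt")
# model = train_embeddings("corpus.txt")
```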
We collect the word embeddings in a dictionary mapping words to vectors, and in an embedding matrix whose rows are vectors of dimension 100.
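One way to build both structures is sketched below. It relies on the `words` property and the `get_word_vector` method of a trained FastText model; the function name `build_embeddings` is hypothetical, and the rows of the matrix are assumed to follow the order of `model.words`.

```python
import numpy as np


def build_embeddings(model):
    """Return (word_to_vec, X) for a FastText-like model.

    word_to_vec maps each word to its vector; row i of X is the vector
    of the i-th word in model.words.
    """
    words = list(model.words)
    word_to_vec = {w: np.asarray(model.get_word_vector(w)) for w in words}
    X = np.vstack([word_to_vec[w] for w in words])
    return word_to_vec, X
```

With a model trained in dimension 100, `X` has shape `(number of words, 100)`.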
We can now compute the cost matrix associated with the OT problem from the FastText word embedding matrix X:
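A minimal NumPy sketch of this computation, assuming the ground cost between two words is the Euclidean distance between their embeddings (the usual choice for WMD). The function name `cost_matrix` is hypothetical; library routines such as `ot.dist(X, X, metric='euclidean')` or `scipy.spatial.distance.cdist` should give the same matrix.

```python
import numpy as np


def cost_matrix(X):
    """Pairwise Euclidean distances between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(D2, 0.0, out=D2)  # clip small negatives caused by rounding
    np.fill_diagonal(D2, 0.0)    # distance from a word to itself is exactly 0
    return np.sqrt(D2)
```

For a vocabulary of n words the result is an n x n matrix, so memory grows quadratically with the vocabulary size.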
The next function is the core of this experiment. It computes the WMD between two given documents A and B from a cost matrix M and a FastText model.
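The computation can be sketched as follows: each document becomes a normalized bag-of-words histogram, the cost matrix is restricted to the words of the two documents, and the resulting transport linear program is solved. To keep the sketch free of extra dependencies, the LP is solved here with `scipy.optimize.linprog`, swapped in for the OT module; with that module installed, `ot.emd2(a, b, C)` should return the same value. The names `nbow` and `wmd` are hypothetical, and `word_id` is assumed to behave like the model's `get_word_id`, returning -1 for unknown words.

```python
from collections import Counter

import numpy as np
from scipy.optimize import linprog


def nbow(tokens, word_id):
    """Vocabulary ids of the distinct known tokens and their normalized frequencies."""
    counts = Counter(word_id(t) for t in tokens)
    counts.pop(-1, None)  # drop out-of-vocabulary words (assumed id -1)
    if not counts:
        raise ValueError("document has no in-vocabulary words")
    idx = np.array(sorted(counts))
    w = np.array([counts[i] for i in idx], dtype=float)
    return idx, w / w.sum()


def wmd(doc_a, doc_b, M, word_id):
    """Word Mover Distance between two whitespace-tokenized documents."""
    ia, a = nbow(doc_a.split(), word_id)
    ib, b = nbow(doc_b.split(), word_id)
    C = M[np.ix_(ia, ib)]  # costs between words of A (rows) and words of B (columns)
    n, m = C.shape
    # Transport plan T (flattened row by row): rows sum to a, columns sum to b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun


# Typical use with a trained model and its cost matrix M:
# d = wmd(texts[0], texts[1], M, model.get_word_id)
```

Restricting `M` to the words that actually occur keeps the linear program small, since its size depends on the two documents and not on the whole vocabulary.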
Finally, you just have to compute the WMD matrix over the set of documents in the training set as follows:
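A sketch of the pairwise loop, written for any distance function between two documents. It assumes the distance is symmetric and zero between a document and itself, which holds for WMD with a Euclidean ground cost, so only the upper triangle is computed. The name `pairwise_distance_matrix` is hypothetical.

```python
import numpy as np


def pairwise_distance_matrix(docs, dist):
    """Symmetric matrix D with D[i, j] = dist(docs[i], docs[j])."""
    n = len(docs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it across the diagonal
            D[i, j] = D[j, i] = dist(docs[i], docs[j])
    return D
```

The loop makes n(n-1)/2 calls to the distance, each of which solves one transport problem, so it is worth trying on a small subset of documents first.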
Here is what you should see!
I hope you enjoyed this post, and that you are convinced of the usefulness of OT for machine learning and of the beauty of this theory. Of course, there exist nicer implementations of Word Mover Distance (see gensim, for instance), but the most important thing is to understand the mathematical foundations of WMD and how it is computed on your CPU/GPU! And of course I hope it gives you inspiration to apply OT to other problems! If so, maybe you should apply to our post-doc position :-)