Siraj Raval’s video on how to make word vectors out of five A Song Of Ice And Fire books is a helpful demonstration of word embeddings, but not so helpful as a tutorial, because he uses a number of smaller libraries that let you train the model in just a couple of lines of code. You can find that video here. It is a useful demonstration of what word embeddings are and what they can be used for, but you do not really learn how to make word vectors yourself.
That’s why I’m writing this post. There is little point in learning those smaller libraries, which have all of the code already set up for you. Instead, I am going to show you how to write his Game of Thrones word vectors program using Tensorflow.
If you don’t already know, Tensorflow is an open-source library used for machine learning. It was initially developed as an internal Google tool, but it has since been released for public use.
While there are a couple of other widely used Python libraries, like PyTorch and Theano, I like Tensorflow the best. That is the main reason why I’m using it over the other libraries, but Tensorflow has also proven itself as a stable tool that scales well.
Building the dataset
After we’ve read the source books (you can find the full source code here), we need to build a dataset which we can use to train a Tensorflow model.
We will need a dictionary (in which we use integers as keys and strings (words) as values), a reverse dictionary (in which the keys are strings and the values are integers), and the data from the books.
The count is a 2×n array, where one value is a word and the other is the number of times that word appears in the text.
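The dataset-building step can be sketched roughly like this, following the conventions above (the function name and vocabulary size are assumptions, modeled on the word2vec example in the Tensorflow repository):

```python
import collections

def build_dataset(words, vocabulary_size=50000):
    """A minimal sketch: turn a list of words into integer ids,
    the count array, and the two dictionaries described above."""
    # Count word frequencies; everything outside the top words becomes 'UNK'.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))

    # Reverse dictionary: word -> integer id.
    reverse_dictionary = {}
    for word, _ in count:
        reverse_dictionary[word] = len(reverse_dictionary)

    # Replace every word in the text with its integer id.
    data = []
    unk_count = 0
    for word in words:
        index = reverse_dictionary.get(word, 0)  # 0 is the id of 'UNK'
        if index == 0:
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count

    # Dictionary: integer id -> word.
    dictionary = dict(zip(reverse_dictionary.values(),
                          reverse_dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
```

From here on, the model only ever sees the integer ids in `data`; the dictionaries are just there so we can translate back and forth when inspecting results.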
In order to make the model as accurate as possible, we need to segment the data we use for training. We cannot just dump the entire dataset into an optimizer and hope that the result is accurate. Instead, we train on batches of sample data; in this case, batches of 128 words.
In this example, we use the skip-gram model. In the skip-gram model, the input is a single word, and the output is a set of context words, i.e. words that appear near the input word in the text. How many context words we predict per input word is set by us.
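Batch generation for skip-gram can be sketched as follows (names and the `num_skips`/`skip_window` parameters are assumptions, modeled on the Tensorflow word2vec example):

```python
import collections
import random

import numpy as np

data_index = 0  # global cursor into the data


def generate_batch(data, batch_size=128, num_skips=2, skip_window=1):
    """Sketch of skip-gram batch generation.

    batch_size  -- (input, context) pairs per batch (128 here)
    num_skips   -- how many context words to sample per input word
    skip_window -- how many words to consider on each side of the input
    """
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window

    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

    span = 2 * skip_window + 1  # [ skip_window, target, skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size // num_skips):
        # The word in the middle of the buffer is the input word;
        # sample num_skips of its neighbours as labels.
        context_indices = [w for w in range(span) if w != skip_window]
        for j, context_word in enumerate(random.sample(context_indices,
                                                       num_skips)):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
```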
We can check our batches by doing something like this:
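For instance, we can print each (input, context) pair as words. The toy `batch`, `labels`, and `dictionary` below are stand-ins for illustration; in the actual program they come from the dataset and batch steps above:

```python
import numpy as np

# Toy stand-ins (assumptions for illustration only).
dictionary = {0: 'UNK', 1: 'winter', 2: 'is', 3: 'coming'}
batch = np.array([2, 2], dtype=np.int32)
labels = np.array([[1], [3]], dtype=np.int32)

# Print every (input word -> context word) pair in the batch.
for i in range(len(batch)):
    print(dictionary[batch[i]], '->', dictionary[labels[i, 0]])
```

If the pairs printed here are words that really do appear next to each other in the books, the batching step is working.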
Build a graph
Tensorflow uses graphs, which consist of variables, tensors, ops, etc.
We need three variables in order to make this model: weights, biases and embeddings. Weights and biases are standard stuff in deep learning. If you don’t already know what weights and biases are, go check out this post in which I write a simple neural network using only numpy.
Embeddings, on the other hand, are something new. A word embedding is a representation of a word as an n-dimensional array (here, 128 dimensions). We do not feed the program the words themselves, but their embeddings, and we adjust the embeddings during training just like the weights and biases.
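Putting the three variables together, the graph can be sketched like this. The hyperparameter values are assumptions, and the code is written against the Tensorflow 1.x graph API of the post’s era (imported via `tf.compat.v1` so it also runs under Tensorflow 2); the NCE loss here mirrors the Tensorflow word2vec example:

```python
import math

import tensorflow.compat.v1 as tf  # v1 graph API, as used when this was written

vocabulary_size = 50000
embedding_size = 128   # each word becomes a 128-dimensional vector
batch_size = 128
num_sampled = 64       # negative samples for the NCE loss

graph = tf.Graph()
with graph.as_default():
    # One batch of (input word id, context word id) pairs.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

    # The embedding matrix: one 128-dimensional row per word in the vocabulary.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Weights and biases of the (sampled) output layer.
    weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Noise-contrastive estimation loss, averaged over the batch.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=weights, biases=biases,
                       labels=train_labels, inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
```

Note that the optimizer updates `embeddings` along with `weights` and `biases`; the trained embedding matrix is the actual output we care about.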
The training itself is the same as always. I made the program pick a word and print the eight most similar words to it, repeating this every 10,000 steps.
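The similarity check boils down to cosine similarity against the learned embedding matrix. A self-contained sketch (the function name is an assumption; `embeddings` is the trained matrix, and `dictionary` maps integer ids to words, as above):

```python
import numpy as np

def nearest_words(word_id, embeddings, dictionary, top_k=8):
    """Return the top_k most similar words to word_id by cosine similarity."""
    # Normalize every row so a dot product becomes cosine similarity.
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Similarity of every word in the vocabulary to the chosen word.
    sim = normalized @ normalized[word_id]
    nearest = (-sim).argsort()[1:top_k + 1]  # skip the word itself
    return [dictionary[i] for i in nearest]
```

During training you would call this for a handful of sample words every 10,000 steps and print the results to watch the neighbourhoods improve.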
In order to plot the words used in this program, we need to reduce them from 128 dimensions to just 2.
We do this with t-SNE (t-Distributed Stochastic Neighbor Embedding). You can read more about it here.
Luckily, sklearn has a class for that. It is called TSNE.
After calling the method fit_transform, we get a list of two dimensional coordinates which we can plot in a two dimensional graph using matplotlib.
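That step can be sketched as follows (the function name, the number of words plotted, and the t-SNE parameters are assumptions; `final_embeddings` is the trained embedding matrix, and `dictionary` maps integer ids to words, as above):

```python
import matplotlib
matplotlib.use('Agg')  # render to a file instead of opening a window
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(final_embeddings, dictionary, plot_only=500,
                    perplexity=30, filename='tsne.png'):
    """Project high-dimensional embeddings down to 2-D and plot them."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init='pca',
                random_state=0)
    low_dim = tsne.fit_transform(final_embeddings[:plot_only])

    plt.figure(figsize=(18, 18))
    for i, (x, y) in enumerate(low_dim):
        plt.scatter(x, y)
        # Label each point with the word it represents.
        plt.annotate(dictionary[i], xy=(x, y), xytext=(5, 2),
                     textcoords='offset points')
    plt.savefig(filename)
    return low_dim
```

In the resulting plot, words that the model considers similar (character names, titles, place names) should end up clustered together.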
This example was inspired by Siraj Raval’s video on how to make word embeddings out of the A Song Of Ice And Fire books, which you can find here, and by the examples from the Tensorflow GitHub repo.
You can find the full source code for this example here.