Text Understanding from Scratch: Paper Summary

This is a new format that I am going to try out over the next couple of weeks: paper summaries.

This is a paper by Xiang Zhang and Yann LeCun from NYU. Be sure to read the full paper after reading this summary. You can find the paper here.

In this paper, they demonstrated that Convolutional Neural Networks can be used for text understanding from character level inputs without the knowledge of words, phrases or semantic structure.

For the last decade or so, the norm for text understanding in machine learning has been tokenization. You would take your input text, tokenize it into words or phrases, and then feed it into a model of some sort. The most popular approach works at the word level, usually with word embeddings such as word2vec.

This type of model needs a dictionary, be it of words or phrases. The problem is that it limits the model to a narrowly defined domain.

The reason this paper is called “Text Understanding from Scratch” is that convolutional networks do not need to understand the language structure. They do not even need to know what a word is. Rather, they can work at the character level. Previously, people have used ConvNets with word2vec, but because the dictionaries are so large, the high-dimensional input can become too computationally expensive.

Key modules

The main component of the model is the temporal convolutional module.

If we have a discrete input function g(x) \in [1, l] \rightarrow \mathbb{R} and a discrete kernel function f(x) \in [1, k] \rightarrow \mathbb{R}, the convolution h(y) between f(x) and g(x) with stride d would be:

h(y) = \sum^{k}_{x=1} f(x)\cdot g(y\cdot d - x + c)

Here c = k - d + 1 is an offset constant.

Just as in ConvNets in computer vision, the module is parameterised by a set of kernel functions f_{ij}(x), called weights, acting on a set of inputs g_i(x) and producing outputs h_j(y). Each output h_j(y) is obtained by a sum over i of the convolutions between g_i(x) and f_{ij}(x).
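To make the formula concrete, here is a minimal sketch in plain Python. It is not the authors' Torch 7 code: the function names `temporal_conv` and `conv_layer` are hypothetical, the bias term is left out, and the output length \lfloor (l - k) / d \rfloor + 1 is simply the number of positions where the kernel fits inside the input.

```python
def temporal_conv(f, g, d=1):
    """Compute h(y) = sum_{x=1..k} f(x) * g(y*d - x + c), with c = k - d + 1.

    f (kernel) and g (input) are plain lists. The formula is 1-indexed,
    so 1 is subtracted whenever a Python list is indexed.
    """
    k, l = len(f), len(g)
    c = k - d + 1
    out_len = (l - k) // d + 1  # positions where the kernel fits
    h = []
    for y in range(1, out_len + 1):
        h.append(sum(f[x - 1] * g[y * d - x + c - 1] for x in range(1, k + 1)))
    return h


def conv_layer(weights, inputs, d=1):
    """Sum the convolutions over all input features for every output feature.

    weights[i][j] is the kernel from input feature i to output feature j
    (an assumed layout), and inputs[i] is the i-th input feature.
    """
    m, n = len(inputs), len(weights[0])
    outputs = []
    for j in range(n):
        per_input = [temporal_conv(weights[i][j], inputs[i], d) for i in range(m)]
        outputs.append([sum(vals) for vals in zip(*per_input)])
    return outputs


print(temporal_conv([1, 0, -1], [1, 2, 3, 4, 5]))       # [2, 2, 2]
print(temporal_conv([1, 0, -1], [1, 2, 3, 4, 5], d=2))  # [2, 2]
```

Note that the kernel is applied flipped, so this is a true convolution and not a cross-correlation.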

They also used temporal max-pooling, just like in computer vision, except that it is in 1-D. Given a discrete input function g(x), the max-pooling function m(y) would be

m(y) = \max^{k}_{x=1} g(y\cdot d - x + c)
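The pooling formula can be sketched the same way. This is an illustration in plain Python, assuming the same offset c = k - d + 1 as in the convolution; the function name `temporal_max_pool` is hypothetical.

```python
def temporal_max_pool(g, k, d):
    """Compute m(y) = max_{x=1..k} g(y*d - x + c), with c = k - d + 1.

    g is a plain list, k is the pooling window and d is the stride.
    The formula is 1-indexed, hence the "- 1" when indexing the list.
    """
    c = k - d + 1
    out_len = (len(g) - k) // d + 1
    return [
        max(g[y * d - x + c - 1] for x in range(1, k + 1))
        for y in range(1, out_len + 1)
    ]


print(temporal_max_pool([1, 3, 2, 5, 4, 0], k=3, d=3))  # [3, 5]
print(temporal_max_pool([1, 3, 2, 5, 4, 0], k=2, d=1))  # [3, 3, 5, 5, 4]
```

With k equal to d the windows do not overlap, which is the usual setup for shrinking the sequence by a factor of k.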

The non-linearity they used is the threshold function l(x) = max{0, x}, the same rectifier used in ReLU units. The learning algorithm was SGD with a minibatch size of 128. The implementation was done using Torch 7.

Character quantization

Their model took as input a sequence of encoded characters. Given an alphabet of size m, each character was quantized using 1-of-m encoding. Characters not in the alphabet, including blank spaces, were encoded as all-zero vectors. The model only took in a sequence of fixed length, so any characters exceeding that length were cut. They also quantized the characters in backward order, so the latest characters read were placed near the beginning.

You can see which characters they chose for their alphabet in their original paper, which you can find here.
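The encoding described above could be sketched as follows. The alphabet here is a shortened, hypothetical one (the paper's alphabet is larger), and lowercasing the text and padding short inputs with all-zero vectors are my assumptions, not details taken from this summary.

```python
# Hypothetical, shortened alphabet; see the paper for the real one.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, length):
    """Return `length` one-hot vectors, one per character, in backward order."""
    truncated = text.lower()[:length]  # cut anything past the fixed length
    frames = []
    for ch in reversed(truncated):     # latest character comes first
        vec = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:           # blanks and unknown characters stay all-zero
            vec[CHAR_INDEX[ch]] = 1
        frames.append(vec)
    while len(frames) < length:        # assumed: pad short inputs with zeros
        frames.append([0] * len(ALPHABET))
    return frames


frames = quantize("ab c", length=6)
print(len(frames), frames[0].index(1))  # 6 2
```

In the usage line, the first frame encodes "c" (index 2) because the text is read backwards, and the second frame is all zeros because it encodes the space.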

Model design

They designed two networks, a large one and a small one. Both consisted of 6 convolutional layers and 3 fully-connected layers. This is the illustration of the model:

Illustration of the model

There are also two dropout layers in between the three fully-connected layers, each with a dropout probability of 0.5.

You can find the sizes of the convolutional layers used in their experiments in the original paper.
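To get a feeling for how the sequence shrinks through the six convolutional layers, here is a small sketch that applies the output-length rule \lfloor (l - k) / d \rfloor + 1 layer by layer. The kernel sizes, pooling sizes and input length below are assumed values for illustration only; check them against the tables in the original paper before relying on them. The function names are hypothetical.

```python
def output_length(l, k, d=1):
    """Length after a temporal convolution or pooling with window k and stride d."""
    return (l - k) // d + 1


def frames_after(input_length, layers):
    """Apply each (conv_kernel, pool_size) pair in turn; pool_size may be None."""
    length = input_length
    for conv_kernel, pool_size in layers:
        length = output_length(length, conv_kernel, 1)
        if pool_size is not None:
            # non-overlapping pooling: stride equals the window size
            length = output_length(length, pool_size, pool_size)
    return length


# ASSUMED values, not taken from this summary: (kernel size, pooling size)
ASSUMED_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]
ASSUMED_INPUT_LENGTH = 1014

print(frames_after(ASSUMED_INPUT_LENGTH, ASSUMED_LAYERS))  # 34
```

The number of frames left after the last convolutional layer, multiplied by the number of output features, gives the input size of the first fully-connected layer.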

Data augmentation

Data augmentation for image or speech data is pretty straightforward. For example, when working with images, you could flip the photo or crop it. But text augmentation is a whole different story.

The text still needs to have the same meaning. And, of course, it needs to make sense when you read it. In a perfect world, you would have humans rewriting, or rather rephrasing, the text. But this is unrealistic, because the body of text would be too large and the work too expensive.

Instead, they used an English thesaurus from mytheas (the thesaurus component of LibreOffice), which was in turn obtained from WordNet. In it, every synonym of a word or phrase is ranked by semantic closeness. They randomly chose which parts of the text to replace, among the parts that had a corresponding synonym, and which synonym to use when there was more than one.
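A rough sketch of this kind of synonym replacement is shown below. The thesaurus is a toy, hypothetical one, and the sampling scheme (geometric-style draws that favour few replacements and the closest synonyms, with parameters `p` and `q`) is my assumption about one reasonable way to do it, not a reproduction of the authors' code.

```python
import random

# Toy thesaurus (hypothetical): synonyms are ordered from closest to farthest.
TOY_THESAURUS = {
    "good": ["great", "fine", "decent"],
    "movie": ["film", "picture"],
}


def geometric_index(rng, p, upper):
    """Sample an index in [0, upper) where small values are more likely."""
    i = 0
    while i < upper - 1 and rng.random() < p:
        i += 1
    return i


def augment(words, thesaurus, rng, p=0.5, q=0.5):
    """Return a copy of `words` with some replaceable words swapped for synonyms."""
    result = list(words)
    replaceable = [i for i, w in enumerate(words) if w in thesaurus]
    # how many words to replace: 0 up to all replaceable ones
    count = geometric_index(rng, p, len(replaceable) + 1)
    for i in rng.sample(replaceable, count):
        synonyms = thesaurus[words[i]]
        # prefer synonyms that are ranked as semantically closer
        result[i] = synonyms[geometric_index(rng, q, len(synonyms))]
    return result


rng = random.Random(0)
print(augment(["a", "good", "movie"], TOY_THESAURUS, rng))
```

Words without an entry in the thesaurus are never touched, so the sentence keeps its length and structure.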

Data and results

For comparison, they also created two standard models: a bag-of-words model and a bag-of-centroids model built on word2vec.

In the original paper, they went through five datasets. But here, I am going to just briefly go through the first two. I encourage you to go and read about all of the others.

The first dataset they used is DBpedia. It is a crowd-sourced community effort to extract structured information from Wikipedia.

The dataset was constructed by picking 14 non-overlapping classes from DBpedia 2014. From each class, they randomly chose 40,000 training samples and 5,000 testing samples.

Their large network without augmentation reached 99.96% accuracy on the training set and 98.27% on the test set, while the bag-of-words model got 96.29% and 96.19%, and the word2vec model got 89.32% and 89.09%.

The second dataset they used contained reviews from Amazon.com. Each review came with a rating from 1 to 5. They attempted to predict the rating of the product from the review text via sentiment analysis.

Alongside the standard 1-5 prediction, they also did a polarity test, converting grades 1 and 2 to negative and grades 4 and 5 to positive.

Of course, they found that their model was better at predicting the polarity, positive or negative, than the grade from 1 to 5. The highest score they got on the full score dataset was 69.24%, and the baselines did worse: bag-of-words got ~54% and word2vec ~36%.

On the other hand, the polarity test hit a high of 97.57%. Naturally, the model was expected to be better at predicting a binary result than a five-point rating.

Conclusion and review

I found this paper rather interesting. Though I am a novice in deep learning, I have always been taught that anything related to natural language processing is supposed to be done with RNNs and similar models.

I myself am far more familiar with convolutional networks than recurrent ones, so naturally this paper caught my attention. I have to say that I really like it, and I think that this approach could be used in production because it uses far less memory.