2016-10-25

Author: Harsha Mudumby, Lead R&D and Machine Translation Engineer

Complex problems have always fascinated me. To me, the tougher the problem, the more challenging and fun it is to solve. There is a sense of completeness when you encounter a problem, analyse it from a variety of perspectives and, finally, build a working solution for it.

Ever since I started my career, I have always solved problems for people on the other side of the planet. I used to wonder how I was positively affecting their lives. Were my solutions making their lives any simpler? Were they actually effective?

There was also one other question that was always nagging at the back of my head: what about my people, the ones on my side of the planet? Specifically, those around me.

Then Reverie Language Technologies happened to me. It combined two of my favourite things: languages and technology. Not only that. During my series of interviews before joining Reverie, I came to know about some of the hardest problems they were trying to solve locally: the rampant language divide on digital platforms. I was instantly hooked.

If, like me, you look forward to a day when a kid, sitting in a remote village of India, can read Wikipedia in her own language and learn the world’s knowledge, then I would like to give you a tiny peek into some of the basic yet complex problems and solutions my team and I work with on a daily basis.

Let’s start with Word Vectors.

Numbers and Words

If you drill down into any digital system, it ultimately reduces to numbers. Just two.

0 and 1.

While they may seem simple on the surface, you can use these digits to build bigger numbers, and you can perform a great deal of arithmetic on those bigger numbers. Suddenly, you have an extremely powerful system with a high level of abstraction!

Now, let us bring in language. We humans string together characters and make words. Put words together and you have sentences. Put together sentences and you have concepts and communication. Suddenly, again, you have an extremely powerful system with a high level of abstraction!

Except, computers don’t understand words. They understand numbers.

Yes, they store, process and manipulate words at a superficial level but can they wield arithmetic wizardry on words?

It naturally leads you to another question:

Can words be represented by numbers? And can we play with them just like we play with numbers?!

Word embeddings:

Vocab-based word vector: a simple approach

Let us begin with a primitive scheme called one-hot encoding. Take a word. Any word. We mark a 1 in that word's position and a 0 for every other word in the vocabulary.

For instance, if we have a vocabulary of five words: [watch, is, on, the, table], each of them can be represented as:

watch    =    [1 0 0 0 0]

is    =    [0 1 0 0 0]

on    =    [0 0 1 0 0]

the   =    [0 0 0 1 0]

table   =    [0 0 0 0 1]

These are called vectors/embeddings. But there are a few problems with an approach like this:

The vectors grow linearly with vocabulary size. If we have a vocabulary of 10,000 words, we will end up with a single 1 and 9,999 0s for every word.

There is no mathematical similarity between the vectors of words that are semantically similar. For example, the vectors of sport and game are likely to be very different from each other, even though the words mean similar things. (The short sketch below illustrates both problems.)
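
As an illustration, here is a minimal Python sketch of the one-hot scheme, using the toy five-word vocabulary from above:

```python
import numpy as np

# Toy vocabulary from the example above
vocab = ["watch", "is", "on", "the", "table"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("watch"))  # [1 0 0 0 0]
print(one_hot("table"))  # [0 0 0 0 1]

# The second problem in action: every pair of distinct words is equally dissimilar.
print(np.dot(one_hot("watch"), one_hot("table")))  # 0 -- no notion of similarity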

Corpus-based word vector: a slightly better approach

Now, instead of marking words as they appear in the vocabulary, we can count their occurrences in the documents/articles they appear in.

Say we have a set of three sample documents labeled [A, B, C], and we plot the occurrences of the same five words [watch, is, on, the, table] across them.

It might look something like:

watch    =    [21 6 2]

is    =    [33 61 48]

on    =    [16 21 30]

the   =    [65 84 54]

table   =    [25 4 2]

Words like [the, is, on] are very common and occur across all three documents, while words like [watch] and [table] are less common. If we look closely, we see that the vectors,

watch    =    [21 6 2]

table   =    [25 4 2],

look more similar to each other than to the rest, with most of their counts concentrated in document [A]. So we could guess that document [A] probably talks about physical objects.

This solves the previously mentioned problem of numeric similarity among related words. However, these vectors depend heavily on the documents we have in the corpus, and they grow very large as the corpus size increases.
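
To make the comparison concrete, here is a small sketch that measures cosine similarity between the count vectors above (the counts are the illustrative numbers from this post, not from a real corpus):

```python
import numpy as np

# Illustrative counts of each word across documents [A, B, C], taken from the example above
counts = {
    "watch": np.array([21, 6, 2]),
    "is":    np.array([33, 61, 48]),
    "on":    np.array([16, 21, 30]),
    "the":   np.array([65, 84, 54]),
    "table": np.array([25, 4, 2]),
}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(counts["watch"], counts["table"]))  # close to 1: very similar distributions
print(cosine(counts["watch"], counts["the"]))    # noticeably lower
```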

A more refined corpus-based word embedding approach is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF scores words not just by how often they occur in a specific document, but also by how many documents across the corpus they appear in.

TF(w) = (occurrences of word w in the document) / (total number of words in the document)

IDF(w) = (total number of documents) / (number of documents containing word w)

TF-IDF(w) = TF(w) * IDF(w)

This is important because even seemingly significant words, if they appear all over the corpus, are not very informative. For instance, the word government may be important in a mixed corpus, but it is not that important in a corpus made up of only government reports.
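
Here is a minimal sketch that computes TF-IDF exactly as defined above (note that most libraries apply a logarithm to the IDF term; the plain ratio is kept here to match the formulas):

```python
def tf(word, document):
    """Term frequency: occurrences of the word in the document / total words."""
    words = document.lower().split()
    return words.count(word) / len(words)

def idf(word, corpus):
    """Inverse document frequency: total documents / documents containing the word."""
    containing = sum(1 for doc in corpus if word in doc.lower().split())
    return len(corpus) / containing  # assumes the word appears in at least one document

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

# Made-up toy corpus for illustration
corpus = [
    "the watch is on the table",
    "the match is on today",
    "the report is on the table",
]
print(tf_idf("watch", corpus[0], corpus))  # 'watch' is rare in the corpus, so it scores higher
print(tf_idf("the", corpus[0], corpus))    # 'the' appears everywhere, so it scores lower
```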

scikit-learn has a neat TfidfVectorizer that quickly generates these vectors, given a corpus.
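
For example, a quick sketch with scikit-learn's TfidfVectorizer (which by default uses a smoothed, log-scaled IDF rather than the plain ratio above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same made-up toy corpus as before
corpus = [
    "the watch is on the table",
    "the match is on today",
    "the report is on the table",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # one row per document, one column per word

print(vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names() in older versions)
print(tfidf_matrix.toarray())              # TF-IDF weights per document
```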

Language-model based word vector: A revolutionary approach

A more recent landmark advancement in this field is the building of dense vectors via word2vec.

word2vec uses nothing more than the surrounding context words to build the vector of a word. It does so with a shallow neural net trained with either the skip-gram or the Continuous Bag-of-Words (CBOW) algorithm.

In simple words:

Skip-gram asks which words (w1, w2, …, wn) surround a given word w. For example, given the word “football,” we can predict surrounding words like [play, match, messi], and so on.

CBOW asks which word w fits in the context, given words w1, w2, …, wn. For example, given a context like [I, am, feeling], we can predict words like good, angry, happy, etc.



CBOW and skip gram. Source image: https://silvrback.s3.amazonaws.com/uploads/60a81cd5-5189-4550-9709-523b3feef3d1/sentiment_01_large.png

word2vec employs a shallow neural net whose input and output layers are made up of one-hot vectors (as discussed previously); the weights of the hidden layer are what become the final word vectors after training.
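
For instance, here is a minimal training sketch with the gensim library. The tiny corpus and parameter values are placeholders; in practice you would train on millions of sentences. (In gensim versions before 4.0, the vector_size parameter was called size.)

```python
from gensim.models import Word2Vec

# Placeholder corpus: each sentence is a list of tokens.
sentences = [
    ["the", "watch", "is", "on", "the", "table"],
    ["i", "am", "feeling", "good"],
    ["they", "play", "a", "football", "match"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors (the hidden-layer size)
    window=2,         # context window on each side of the target word
    min_count=1,      # keep every word, since this corpus is tiny
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

print(model.wv["watch"])               # the learned 50-dimensional vector
print(model.wv.most_similar("watch"))  # nearest words by cosine similarity
```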

If you are interested in the details of how it uses neural nets, backpropagation and gradient descent, this paper is a great reference.



word2vec architecture. Source image: http://i.stack.imgur.com/fYxO9.png

The resultant vector for a word like apple from a 50-dimensional model (the length of a row in the hidden layer) looks like this:

[0.52042 -0.8314 0.49961 1.2893 0.1151 0.057521 -1.3753 -0.97313 0.18346 0.47672 -0.15112 0.35532 0.25912 -0.77857 0.52181 0.47695 -1.4251 0.858 0.59821 -1.0903 0.33574 -0.60891 0.41742 0.21569 -0.07417 -0.5822 -0.4502 0.17253 0.16448 -0.38413 2.3283 -0.66682 -0.58181 0.74389 0.095015 -0.47865 -0.84591 0.38704 0.23693 -1.5523 0.64802 -0.16521 -1.4719 -0.16224 0.79857 0.97391 0.40027 -0.21912 -0.30938 0.26581]

This may be harder for us to read, but it captures a high degree of similarity with other related words. More importantly, this approach is highly customizable (by the number of dimensions and the word window) and doesn’t require anything more than a huge corpus. It is fully unsupervised.

And these vectors capture some amazing real-world relationships!



Vector relationships. Source image: https://www.tensorflow.org/versions/r0.11/images/linear-relationships.png
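
With a model trained on a large enough corpus (the toy gensim model sketched earlier will not show this), such relationships can be queried directly; for example, the classic king/queen analogy illustrated in the image:

```python
# Assumes a word2vec model trained on a large corpus, e.g. via gensim as sketched earlier.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # with a well-trained model, "queen" is typically the top result
```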

So, now what?

Now that we have managed to convert words into numbers, the fun part begins!

We can use these vectors as starting features for many traditional Machine Learning algorithms and quickly build some very interesting and useful models.
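
As a simple illustration of what such features might look like, one common starting point (a sketch, assuming a trained word2vec model like the gensim one above) is to average the word vectors of a sentence into a single fixed-length feature vector:

```python
import numpy as np

def sentence_vector(tokens, model):
    """Average the word2vec vectors of the tokens the model knows about."""
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# Each sentence becomes one row of features for a downstream classifier or clusterer.
features = sentence_vector(["the", "watch", "is", "on", "the", "table"], model)
print(features.shape)  # (50,) for the 50-dimensional model above
```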

Stay tuned for the next post, in which we will look into clustering concepts with k-NN (k-nearest neighbours), classifying them using SVMs (support vector machines) and even building a sentiment analyzer with CNNs (convolutional neural networks).
