2016-08-19

Author: Sanket Patil, Data Scientist, Reverie Language Technologies


Claude Shannon liked to play games when he wasn’t particularly busy building mathematical models.

In one of his party games (the Shannon Switching Game), he drew a graph with two special nodes, A and B. Each edge of the graph is either coloured or deleted, and the initial state of each edge is decided randomly.

Two players, one named Cut and the other Short, move alternately. Short’s job is to colour an edge on each of his turns; he wins if he can establish a coloured path from A to B. Cut’s job is to prevent Short from doing so; if Cut manages to separate A from B, she wins.

The Shannon Switching Game is quite enthralling. There are several articles that discuss winning strategies, and their implementations can be found online.

Graph Theory (after which the Shannon Switching Game is modelled) is fascinating: many real world problems can be modelled as graph problems, with nodes representing concepts and edges encoding relations between nodes.

While I love graph theory, I’m here to discuss something that I love even more, something that’s richer, more expressive, and a whole lot more complex–language. Specifically, in this post and the next couple in this series, let’s discuss the statistical nature of languages, how we can build a language model, its uses, and a host of other pertinent questions.

Before we dive into that, let’s explore some history.

Information Entropy

Shannon particularly loved games involving languages. He was interested in understanding the underlying patterns in them. A pattern essentially implies redundancy: the more discernible the patterns, the greater the redundancy.

What’s an example of redundancy in language?

In English, wherever the character ‘q’ appears, you know that the next character is almost always ‘u’ (except in rare cases for words like Qatar, most of which are arguably not English words).

Shannon intuitively knew that languages have a lot of redundancy. He wanted to measure it because that would have implications in communication system design.

Communication systems are inherently unreliable (they are noisy channels). So, the less information one needs to communicate from point A (or, equivalently, the more redundancy there is), the better the chances of reproducing the message faithfully at point B.

Shannon developed a way of measuring the amount of information, or information content, in a message, which came to be known as Shannon Entropy. It’s one of those powerful concepts that revolutionised communication (analogue then, digital now). While the mainstream formalism and use of language models started only in the 1980s, it owes much to Shannon’s work from the 1940s.
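To make the idea a little more concrete, here is a minimal Python sketch that estimates the entropy of a piece of text from its character frequencies alone. This is only the zeroth-order entropy (it ignores context entirely), and the example strings are purely illustrative; Shannon’s own estimates of English used much longer contexts.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Entropy, in bits per character, of the character distribution of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A repetitive (redundant) string needs fewer bits per character
# than a string whose characters are spread more evenly.
print(shannon_entropy("ababababab"))                                   # 1.0 bit/char
print(shannon_entropy("the quick brown fox jumps over the lazy dog"))  # noticeably higher
```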

Corpus Linguistics and Zipf’s Law

Corpus Linguistics is the study of languages by using language corpora that have been curated by researchers over time.

A corpus is assumed to be a representative sample of language usage in the real world or in a certain domain of the real world. Some example corpora are: Google Books Ngram Corpus, Corpus of Contemporary American English, and the Brown Corpus. Among other things, corpora are used to analyse languages, derive language rules and statistics, and build training sets for machine learning models.

Let’s first look at some terminology.

Corpus: A collection of text in a natural language like English, Tamil, or Mandarin. It is treated as a sequence of tokens.

Tokens: A token is a word or a punctuation symbol in the chosen language. For example, [‘to’, ‘be’, ‘or’, ‘not’, ‘to’, ‘be’, ‘-’, …] is a list of English tokens.

Type: A type is a distinct token (from the above list, the distinct ones are: [‘to’, ‘be’, ‘or’, ‘not’, ‘-’])

Vocabulary: A vocabulary is the set of all types, that is, the set of distinct tokens in a language.
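To make the token/type distinction concrete, here is a tiny, purely illustrative Python sketch using the ‘to be or not to be’ example above.

```python
# Tokens vs. types, using the example token list from above.
tokens = ['to', 'be', 'or', 'not', 'to', 'be', '-']
vocabulary = set(tokens)   # the set of types, i.e. distinct tokens

print(len(tokens))      # 7 tokens: the size of this (tiny) corpus
print(len(vocabulary))  # 5 types: the size of its vocabulary
```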

About 10 years ago, Google released a corpus of n-grams and their frequencies (the Web 1T 5-gram corpus), built from roughly one trillion words of publicly available web pages. To put the above definitions in perspective, let’s look at some numbers. Google’s corpus has:

1 trillion tokens (size of corpus)

13 million types (size of vocabulary)

English has ~1 million dictionary words

10 most common types cover ~33% of the corpus

The token ‘the’ covers 2.2%

The top 1000 types cover ~66% of the corpus!

The top 100,000 types cover ~98% of the corpus

The last set of statistics is particularly interesting. It tells us something about the distribution of words in a language.

Word frequencies in natural languages follow what is called Zipf’s Law, which belongs to the class of power-law distributions.

In simple terms, it states that a word’s frequency is inversely proportional to its rank: the most frequent word occurs roughly twice as often as the second most frequent word, three times as often as the third, and so on.

This implies that if ‘the’ covers 2.2% of the corpus, then the next most frequent word, ‘of’, covers about 1.1%, the third most frequent word, ‘and’, roughly 0.7%, and so on. If you plot frequency against rank, you get a rapidly decaying power-law curve; on a log-log scale, it is a straight line.

Zipf’s law applies to characters and punctuation as well. If you have played Scrabble, you will recall why common letters like ‘E’ are given a low score, whereas rare letters like ‘Z’ get a much higher one.
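If you want to see this for yourself, here is a minimal sketch. It assumes you have some plain text saved locally as corpus.txt (a hypothetical file name) and simply checks whether rank × frequency stays roughly constant, as Zipf’s law predicts.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, count) pairs for token counts, most frequent first."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))

# corpus.txt is a stand-in for whatever text you have at hand;
# naive whitespace splitting is good enough for a quick look.
tokens = open('corpus.txt', encoding='utf-8').read().lower().split()

# Under Zipf's law, count(rank r) ~ count(rank 1) / r, so rank * count
# should stay in the same ballpark across the top ranks.
for rank, count in rank_frequency(tokens)[:10]:
    print(rank, count, rank * count)
```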

Natural Languages and Markov Chains

Actually, the idea of using statistics to model a language predates Shannon.

In the early part of the 19th century, Alexander Pushkin wrote his novel in verse, Eugene Onegin. It created a bit of confusion and controversy when it came to translating the novel: should the translation be in verse or in prose?

The mathematician Andrey Markov, though, had an altogether different question: is it possible to compute frequencies from the text and use them to predict the next character in the sequence? This was circa 1913.

A few decades later, Shannon used Markov Chains to build the formalism we discussed earlier. A Markov Chain is a set of states together with the transition probabilities between them. States could be weather conditions on subsequent days, or character sequences in Pushkin’s text. The Markov model gives us the probability that we will next move to state Y, given that we are currently in state X. The probabilities are estimated by observing weather patterns, analysing text corpora, and so on.
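Here is a small sketch of that idea at the character level: estimate transition probabilities from raw bigram counts, then ask which character is most likely to come next. This is only in the spirit of Markov’s experiment, not a reconstruction of his actual procedure, and the toy string is of course illustrative.

```python
from collections import Counter, defaultdict

def char_transitions(text):
    """Estimate P(next_char | current_char) from character bigram counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(text, text[1:]):
        counts[current][nxt] += 1
    return {
        current: {n: c / sum(following.values()) for n, c in following.items()}
        for current, following in counts.items()
    }

model = char_transitions("the theory of the thing")
# In this toy text, 't' is always followed by 'h':
print(max(model['t'], key=model['t'].get))   # -> 'h'
print(model['t']['h'])                       # -> 1.0
```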

Let’s do a small experiment. Can you read and understand the text below?

I cnduo’t bvleiee taht I culod aulaclty uesdtannrd waht I was rdnaieg. Unisg the icndeblire pweor of the hmuan mnid, aocdcrnig to rseecrah at Cmabrigde Uinervtisy, it dseno’t mttaer in waht oderr the lterets in a wrod are, the olny irpoamtnt tihng is taht the frsit and lsat ltteer be in the rhgit pclae. The rset can be a taotl mses and you can sitll raed it whoutit a pboerlm. Tihs is bucseae the huamn mnid deos not raed ervey ltteer by istlef, but the wrod as a wlohe. Aaznmig, huh? Yaeh and I awlyas tghhuot slelinpg was ipmorantt! See if yuor fdreins can raed tihs too.

I know it’s greatly jumbled, but I am sure you could still make complete sense of it albeit with some effort.

Languages have statistical structure. Humans internalise a language’s statistical structure intuitively. We know what word or phrase to expect given what we have seen so far. We have great predictive capability when it comes to language. It’s not just couples who have been together for a long time that can complete each other’s thoughts and sentences. The rest of us can too. (It’s not love and understanding, my dear fellows, just intuition and statistics!)

Add to this awareness of a language’s idiom, the context of a conversation, and the famed human intuition. No wonder we have great cognitive abilities.

Now the question is: Can machines do that? Or what can they do with human languages?

Let’s begin to look at what’s possible in the next section.

Language Model

A statistical language model, or simply a language model, is a probabilistic mechanism for generating text from learned frequencies. We learn these frequencies from history, typically in the form of large corpora such as the Brown Corpus or the Google Corpus.

Language models learn n-gram frequencies from a corpus. An n-gram is a sequence of n tokens. The value of n is typically in the range of [1, 5].

Suppose the probability of occurrence of a token W is P(W). For a sequence of tokens, the chain rule lets us write the probability of the whole sequence as a product of conditional probabilities; an n-gram model approximates each of these by conditioning a token only on the n-1 tokens that precede it.
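Written out in standard notation (nothing specific to this post), the chain rule and the n-gram approximation look like this, with the bigram case estimated from raw counts:

```latex
P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
                    \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) \approx \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}
```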

We can use n-gram probabilities to address several questions in Natural Language Processing. Here is a sample:

What’s the probability that the current word is ‘the,’ given that the previous one is ‘of’? P(W_i = ‘the’ | W_i-1 = ‘of’)?

What’s the probability that the word following ‘Credit’ is ‘Card’ (written in transliterated Kannada)? P(W_i = ‘ಕಾರ್ಡ್’ | W_i-1 = ‘ಕ್ರೆಡಿಟ್’)?

You have received a Hindi message in Roman script, say: kya kaha. Given that the first word is ‘क्या’, is the second word ‘कहाँ’ or ‘कहा’? Mathematically, which is greater: P(कहाँ | क्या) or P(कहा | क्या)?

We can use character n-grams to predict the next character. Transliteration input: ‘kaha’. So, which is more likely, P(हाँ | क) or P(हा | क)?
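To make the first question above concrete, here is a minimal sketch of how such conditional probabilities fall out of raw counts. The inline ‘corpus’ is purely illustrative; a real model would be estimated from millions (or trillions) of tokens and would need smoothing to handle unseen n-grams.

```python
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Estimate P(w_i | w_{i-1}) as count(w_{i-1}, w_i) / count(w_{i-1})."""
    pair_counts = defaultdict(Counter)
    for prev, curr in zip(tokens, tokens[1:]):
        pair_counts[prev][curr] += 1
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in pair_counts.items()
    }

# Purely illustrative corpus.
tokens = "the bank of the river is near the bank of the city".split()
model = bigram_model(tokens)

print(model['of']['the'])                       # P(the | of) in this toy corpus
print(max(model['the'], key=model['the'].get))  # most likely word after 'the'
```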

There are several applications of language models, including spelling correction, word segmentation, code breaking (yes!), spam filtering, plagiarism detection, author identification, word prediction, and of course, statistical machine translation.

We will be discussing some of these in the next post.

Epilogue

I began this post with Shannon, who’s certainly a hero to me. Allow me to end it by invoking him again. At the beginning of his seminal 1948 paper, which laid the foundations of Information Theory, he wrote:

[…] The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; […]

I’ve always found the above–especially the second sentence–profound and amusing in equal measure. I can imagine Shannon saying, “They think what they say and write have meaning. Yeah, well, I couldn’t care less!”

In fact, he does write the following: “These semantic aspects of communication are irrelevant to the engineering problem.” And he’s right.

Shannon was trying to address the problem of communication, of encoding messages, such that they can be reliably passed on from point A to point B. He rightly abstracted away the meaning aspect of messages because that came later. Reliable communication and reproduction of messages had little to do with meaning of those messages.

The problems of communication are largely solved, but not those of language.

Paraphrasing the above, the fundamental problem for us is that of reproducing in one language (and/or script) either exactly or approximately a message selected in a different language (and/or script). Almost always the messages have semantics, meaning, context, intent, and a myriad other things!

So, what is the right level of abstraction to address problems in Transliteration and Machine Translation without getting bogged down by the bloody mess that languages are (especially English!)?

We briefly explored Language Models in this post. Let’s look at them in some more depth and explore other thinking tools (including those from Graph Theory) in subsequent posts.

References

A Mathematical Theory of Communication. C. E. Shannon, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.

Web 1T 5-gram Version 1. Thorsten Brants and Alex Franz, Linguistic Data Consortium, 2006.

Hidden Markov Models, Chapter 8, Speech and Language Processing. Daniel Jurafsky & James H. Martin.
