2017-02-20



IMUG is the International Multilingual Group, a meetup and forum that lives at the heart of the trends and community for multilingual computing. Originally founded in 1987 as a special interest group of the Stanford Macintosh Users Group (SMUG), it is a Silicon Valley tradition now in its 30th year.

For this month’s meeting, IMUG returned to the Apple campus for the first time since 2010 for a rare peek into the technology at the heart of Siri. A crowd of 125 linguists and technologists filed up the stairs and filled the “Garage 1” conference room in Apple’s Infinite Loop 4 building.

Entitled “Let’s Come To An Agreement About Our Words,” last Thursday’s talk by Apple’s own George Rhoten gave a background on the complexities of linguistic modeling, and pulled back the curtain on the methods that allow Siri to make sense of our requests and respond to us in well-formed, human-sounding sentences.

Analyzing Linguistic Structure

The complexity of language is a difficult enough problem in a single language, like English, never mind all the languages Siri can speak. For example, George had us take a look at various parts of speech. How about the shortest and most common words in English? “A” and “an,” our indefinite articles. Simple, right?



Not so fast! In English, the general rule is to put “a” in front of words that begin with a consonant and “an” in front of words that begin with a vowel, but there are many exceptions. For instance, we say “an umbrella” but “a unicorn,” which requires an understanding of phonetic pronunciation (specifically, that the “u” in “unicorn” is pronounced with an initial “y-” consonant sound) beyond the simple rules of spelling. Likewise, “LED” is pronounced with an initial vowel sound (“el”), and thus breaks the rule for words that begin with a consonant letter.
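To make the problem concrete, here is a minimal sketch (my own illustration, not code from the talk) of an indefinite-article chooser that patches the spelling rule with a small phonetic exception list. A production system would consult a pronunciation lexicon instead of a hand-written set:

```java
import java.util.Set;

public class IndefiniteArticle {
    // Hypothetical exception lists: words whose first letter and first
    // *sound* disagree. A real system would use a pronunciation lexicon.
    private static final Set<String> CONSONANT_SOUND = Set.of("unicorn", "user", "one");
    private static final Set<String> VOWEL_SOUND = Set.of("LED", "hour", "MBA");

    public static String withArticle(String noun) {
        boolean vowelLetter = "aeiouAEIOU".indexOf(noun.charAt(0)) >= 0;
        boolean vowelSound = VOWEL_SOUND.contains(noun)
                || (vowelLetter && !CONSONANT_SOUND.contains(noun));
        return (vowelSound ? "an " : "a ") + noun;
    }

    public static void main(String[] args) {
        System.out.println(withArticle("umbrella")); // an umbrella
        System.out.println(withArticle("unicorn"));  // a unicorn
        System.out.println(withArticle("LED"));      // an LED
    }
}
```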

Definite articles in other languages can get even more complex, as in French, where “l’,” “les,” “la” and “le” each require knowledge of the gender and number of the nouns they precede.
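A similarly hypothetical sketch for French shows why this is harder: gender is lexical, so the code has to look it up rather than compute it:

```java
import java.util.Map;

public class FrenchArticle {
    enum Gender { MASCULINE, FEMININE }

    // Toy lexicon: gender comes from data; it cannot be derived from
    // spelling alone. (Illustrative entries only.)
    private static final Map<String, Gender> LEXICON = Map.of(
            "livre", Gender.MASCULINE,   // book
            "maison", Gender.FEMININE,   // house
            "école", Gender.FEMININE);   // school

    static String withArticle(String noun, boolean plural) {
        if (plural) return "les " + noun;          // les livres
        // Elision before a vowel (and before "h muet", which itself
        // requires lexical knowledge; simplified here):
        if ("aàâeéèêiîouùh".indexOf(noun.charAt(0)) >= 0) return "l'" + noun;
        return (LEXICON.get(noun) == Gender.FEMININE ? "la " : "le ") + noun;
    }

    public static void main(String[] args) {
        System.out.println(withArticle("maison", false)); // la maison
        System.out.println(withArticle("école", false));  // l'école
        System.out.println(withArticle("livre", true));   // les livres
    }
}
```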



Much of the rest of the hour was devoted to further examples of complexity. All of it is familiar to linguists, and often intuitively understood by native speakers, but for software developers working on natural language processing (NLP), it requires a great deal of analysis, forethought, planning, and… tons of exception handling!

Other elements to consider were listed off: gender, grammatical number, quantity, case, definiteness, pronouns, adpositions, and vowel properties.

Gender alone (grammatical gender, that is, not biological or social definitions of it) is a complexity. While Romance-language speakers are familiar with masculine and feminine gendered elements, other languages have neuter or common gender forms (such as the Scandinavian languages). Also, what does gender associate with? The object (as in French, Italian, Spanish and German)? The speaker (as in Japanese, Thai and French)? The audience (as in Arabic and Hebrew)? A few rare languages even distinguish animate vs. inanimate gender.

Numerics also add complexity, both in cardinal (1, 2, 3…) and ordinal (1st, 2nd, 3rd…) numbering, never mind the different ways to write equivalent numbers: “1st” or “first,” Arabic vs. Roman numerals, and so on. George used the examples of Russian and Hebrew to show how even simple “counting” has internal rules and exceptions.
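Much of this counting knowledge is already encoded in the stack George turned to next. As a taste, ICU4J’s RuleBasedNumberFormat (a real API; the example itself is mine, not the talk’s) can spell out cardinals and ordinals per locale:

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import com.ibm.icu.util.ULocale;

public class SpellOut {
    public static void main(String[] args) {
        RuleBasedNumberFormat en = new RuleBasedNumberFormat(
                ULocale.ENGLISH, RuleBasedNumberFormat.SPELLOUT);
        RuleBasedNumberFormat ru = new RuleBasedNumberFormat(
                new ULocale("ru"), RuleBasedNumberFormat.SPELLOUT);
        System.out.println(en.format(21)); // twenty-one
        System.out.println(ru.format(21)); // двадцать один

        // Ordinals come from a separate set of locale rules.
        RuleBasedNumberFormat enOrdinal = new RuleBasedNumberFormat(
                ULocale.ENGLISH, RuleBasedNumberFormat.ORDINAL);
        System.out.println(enOrdinal.format(21)); // 21st
    }
}
```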

Unicode: CLDR & ICU to the Rescue!

It was at this point that George broached the topic of how modern computing systems deal (at least partially) with all this linguistic complexity: through the Unicode Common Locale Data Repository (CLDR) and International Components for Unicode (ICU).

CLDR serves to enumerate how to parse and interpret a large number of patterns: everything from numbers to plurals, currencies, dates and times, time zones, and units of measurement; how to recognize character sets, scripts, writing directions, and languages; countries and regions; keyboard layouts; even emoji.

CLDR comes as a set of XML files (formally known as Locale Data Markup Language, or LDML) that codify these rules. It is at the heart of Apple’s iOS and macOS, Microsoft Windows, Google’s Android and Chrome, and many more vendors’ systems.

ICU is a set of C/C++ and Java libraries that provide Unicode and globalization support for software applications. Collating strings, formatting, regular-expression handling, left-to-right and right-to-left writing rules, and text boundary analysis (finding the extents of words, sentences and paragraphs) are all implemented in these libraries.

Basically, if CLDR defines the rules, ICU is the code that enacts them.
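A small ICU4J sketch of that division of labor (my example, not one from the talk): both APIs below are driven by CLDR data under the hood:

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class IcuBasics {
    public static void main(String[] args) {
        // Locale-aware collation: German phonebook order compares
        // "ä" like "ae", so "Äpfel" sorts before "Apfel".
        Collator de = Collator.getInstance(
                ULocale.forLanguageTag("de-DE-u-co-phonebk"));
        System.out.println(de.compare("Äpfel", "Apfel")); // negative

        // Text boundary analysis: word breaks for Thai, which is
        // written without spaces between words.
        BreakIterator th = BreakIterator.getWordInstance(new ULocale("th"));
        String text = "สวัสดีครับ";
        th.setText(text);
        int start = th.first();
        for (int end = th.next(); end != BreakIterator.DONE; start = end, end = th.next()) {
            System.out.println(text.substring(start, end));
        }
    }
}
```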

There are still gaps and limits to CLDR. Like many standards, it is a continuous work in progress, with many opportunities for expansion (especially for low-resource languages).

Representing Multilingual Requests in a Message

So how does this all work, really? Let’s go back to simple programming: printf statements.

So, let’s say your format string reads “You have %d messages,” but %d is only 1. Shouldn’t it say “1 message”? Yeah, that’s a problem. So you abstract it, separating the count (1, 2… n) from the counted noun (“message”/“messages”). Even then, it’s a trick to match the noun to the number.
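Sketched in Java rather than C (the pitfall is the same), that progression looks like this:

```java
public class PluralPitfall {
    public static void main(String[] args) {
        // Naive format string: wrong whenever n == 1.
        for (int n : new int[] {0, 1, 2}) {
            System.out.printf("You have %d messages%n", n);
        }
        // The usual fix separates the count from the noun, but the
        // plural logic is hard-coded English and won't transfer:
        int n = 1;
        String noun = (n == 1) ? "message" : "messages";
        System.out.printf("You have %d %s%n", n, noun);
    }
}
```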

What about GNU gettext? Geh! Unhandled gender and case problems. And ngettext gets complicated for phrases with multiple quantities, like currencies (dollars and cents) or multi-unit lengths (feet and inches).

George then showed a different way to solve it, using MessageFormat, a Java class that can be a bit more sophisticated than our old friends printf and gettext.

Now we can correctly choose the plural form of a noun for a number, at least in English. It will still fail in various languages, like Russian, unless we also bring in PluralFormat from the ICU library. So next, George explained how ICU can use SelectFormat, PluralFormat, and MessageFormat together to successfully match the linguistic case to its representation in multiple languages. It still requires some knowledge, such as gender (which isn’t uniform between languages). But now we’re really starting to be able to model all the parameters successfully. Or so it seems.
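Roughly, the combination looks like this with ICU4J’s MessageFormat, using CLDR plural categories for Russian and a select clause for gender (an illustrative sketch, not the code shown on screen):

```java
import com.ibm.icu.text.MessageFormat;
import com.ibm.icu.util.ULocale;
import java.util.Map;

public class IcuMessages {
    public static void main(String[] args) {
        // Russian needs more plural categories than English; CLDR
        // rules pick one/few/many/other from the number itself.
        MessageFormat ru = new MessageFormat(
                "{count, plural, one{# сообщение} few{# сообщения} "
                + "many{# сообщений} other{# сообщения}}",
                new ULocale("ru"));
        for (int n : new int[] {1, 2, 5, 21}) {
            System.out.println(ru.format(Map.of("count", n)));
        }
        // -> 1 сообщение, 2 сообщения, 5 сообщений, 21 сообщение

        // Select handles discrete categories such as gender, which
        // the caller still has to know and supply.
        MessageFormat fr = new MessageFormat(
                "{gender, select, female{Elle est allée} other{Il est allé}} à Paris.",
                ULocale.FRENCH);
        System.out.println(fr.format(Map.of("gender", "female")));
    }
}
```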

Again, just when you think you have everything going your way, linguistics can keep complicating your life. Oral pronunciation, or even culture-bound information, can make all the difference. For instance, for the literal string “10/11/12,” are we referring to a date in the US? If so, then we can pronounce that as “October Eleventh, Twenty-Twelve.” If it were in Europe, though, we might formally read it as “Tenth of November, Two Thousand and Twelve.” Or, mathematically, maybe it really is just “ten divided by eleven divided by twelve.” Context matters.
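The ambiguity is easy to demonstrate: the same string parses to two different dates depending on which locale convention you assume:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class AmbiguousDate {
    public static void main(String[] args) throws ParseException {
        String text = "10/11/12";
        // US convention: month/day/year.
        System.out.println(new SimpleDateFormat("MM/dd/yy").parse(text));
        // Much of Europe: day/month/year.
        System.out.println(new SimpleDateFormat("dd/MM/yy").parse(text));
        // Both parses succeed; only context can say which was meant
        // (or whether it was arithmetic all along).
    }
}
```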

Even in the existing frameworks, detecting such context isn’t possible. Gender detection doesn’t exist. Pronunciation detection doesn’t exist. Grammatical number support is hit-or-miss. The frameworks presume the developer can provide the context.

Siri. How does it work?

Siri employs a Unified Expression Language, based on JSP syntax (see the API defined in javax.el).
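We only got the outline, so nothing here reflects Siri’s internals, but javax.el itself is a standard, public API. A minimal sketch of evaluating an EL-style property path (this assumes an EL implementation, such as the GlassFish reference implementation, on the classpath):

```java
import javax.el.ELProcessor;

public class ElSketch {
    // A plain bean that the expression language can navigate.
    public static class Noun {
        private final String text;
        public Noun(String text) { this.text = text; }
        public String getText() { return text; }
    }

    public static void main(String[] args) {
        ELProcessor elp = new ELProcessor();
        elp.defineBean("noun", new Noun("unicorn"));
        // Unified EL resolves ${...}-style property paths via getters:
        System.out.println(elp.eval("noun.text")); // unicorn
    }
}
```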

So what can you do with all this? Well, how about numbers? The Number Format Tester page on Unicode.org (http://st.unicode.org/cldr-apps/numbers.jsp) gives a pragmatic example of how to use CLDR to map the way we count, cardinally and ordinally, in various languages.

For instance, consider counting in English, which results in a seven-column table:

While some of these columns might even seem redundant, remember that for the year-numbering column, you won’t see much difference until you pronounce a year as “twenty-seventeen” rather than “two thousand and seventeen.” Even so, the English listing is relatively straightforward.
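That year-reading distinction is directly accessible in ICU4J: the English spell-out rules include a dedicated %spellout-numbering-year rule set (again, my own illustration):

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import com.ibm.icu.util.ULocale;

public class YearSpellout {
    public static void main(String[] args) {
        RuleBasedNumberFormat en = new RuleBasedNumberFormat(
                ULocale.ENGLISH, RuleBasedNumberFormat.SPELLOUT);
        // Default cardinal spell-out vs. the year rule set:
        System.out.println(en.format(2017));
        // -> two thousand seventeen
        System.out.println(en.format(2017, "%spellout-numbering-year"));
        // -> twenty seventeen
    }
}
```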

Compare that to the complexity of Russian:

And this is only a fraction of the columns in the table. The full 50-column table for Russian looks more like this (and yes, the tiny text is unreadable in this image):

These sorts of resources are what you get built-in with CLDR. You’ll need more for a full framework for an app like Siri (or one you might build yourself).

George laid out the requirements: memory (potentially lots of it), structured dictionaries and lexicons, inflection tables, and grammatical properties. You need heuristics to guess when you encounter new or novel words. Machine learning can be used, but only when you have large enough data sets. And in cases where you can’t figure something out, you need methods to fail gracefully.

As the presentation went on, minds began to wander toward other domains, beyond voice recognition, where such NLP could apply. George confirmed there are multiple other use cases.

Imagine a database search. This sort of lexical understanding could help you find all instances of “shoe/shoes,” as well as “zapato/zapatos,” “scarpa/scarpe,” “chaussure/chaussures” and so on, without having to enumerate all the variants and translations. It could also be used to generate grammatically correct data, serve as a guide for training machine translation, or improve autocomplete algorithms.
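A toy sketch of that idea, with a hand-built lemma table standing in for the structured lexicons a real system would use:

```java
import java.util.List;
import java.util.Map;

public class LemmaSearch {
    // Toy inflection/translation table mapping surface forms to one
    // shared lemma; entirely hypothetical data.
    private static final Map<String, String> LEMMAS = Map.of(
            "shoe", "shoe", "shoes", "shoe",
            "zapato", "shoe", "zapatos", "shoe",
            "scarpa", "shoe", "scarpe", "shoe");

    public static void main(String[] args) {
        List<String> corpus = List.of("zapatos", "boots", "scarpa", "shoes");
        // One query ("shoe") matches every inflected or translated variant:
        corpus.stream()
              .filter(word -> "shoe".equals(LEMMAS.get(word)))
              .forEach(System.out::println); // zapatos, scarpa, shoes
    }
}
```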

As an example of how this sort of linguistically correct structured system could be used, George concluded his presentation with a demonstration of an app built on these underlying systems. It used, as a variable, ${noun.withIndefArticle}, which correctly resolved to “an auto repair shop” or “a restaurant.” He also threw in “uncle” and “unicorn,” which resolved to “an uncle” and “a unicorn,” proving that the system differentiates phonetically.
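A toy version of that template resolution (the ${noun.withIndefArticle} variable name comes from the demo; the pre-resolved table and everything else here is hypothetical):

```java
import java.util.Map;

public class TemplateResolution {
    // Pre-resolved article forms; a real resolver would compute these
    // from pronunciation data, as in the earlier article sketch.
    private static final Map<String, String> WITH_INDEF_ARTICLE = Map.of(
            "restaurant", "a restaurant",
            "auto repair shop", "an auto repair shop",
            "uncle", "an uncle",
            "unicorn", "a unicorn");

    public static void main(String[] args) {
        String template = "Searching for ${noun.withIndefArticle} near you.";
        for (Map.Entry<String, String> e : WITH_INDEF_ARTICLE.entrySet()) {
            System.out.println(template.replace(
                    "${noun.withIndefArticle}", e.getValue()));
        }
    }
}
```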

This is only the equivalent of a “Hello world” for indefinite articles. He then showed how other grammatical handling worked, including a complex example of numerics for Finnish:
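The Finnish demo itself isn’t reproducible here, but ICU4J hints at the scale of the problem: CLDR ships a large set of spell-out rule sets for Finnish, which you can enumerate and apply (my example, not George’s):

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import com.ibm.icu.util.ULocale;

public class FinnishNumbers {
    public static void main(String[] args) {
        RuleBasedNumberFormat fi = new RuleBasedNumberFormat(
                new ULocale("fi"), RuleBasedNumberFormat.SPELLOUT);
        // List every public spell-out rule set CLDR defines for Finnish:
        for (String name : fi.getRuleSetNames()) {
            System.out.println(name);
        }
        System.out.println(fi.format(21)); // kaksikymmentäyksi
    }
}
```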

Afterwards followed an engaging question-and-answer session. There is a lot already known about how language works, such as verb structure. There is still much else to do, and there are still ways to fool such rules-based systems. For instance, an audience member brought up the case of a capitalized noun in a title that really shouldn’t act, or be treated grammatically, like a proper noun. All true, George conceded.

My own observation was along these lines: it seems the industry began long ago with rules-based models for understanding the structure of language, then abandoned that approach to a degree in favor of statistical analysis for training machine translation systems, and has now moved on to deep learning. But here, the demonstration and talk were all about getting back to understanding the structured rules of language again. Yes, basically, George confirmed.

The main takeaway George hoped to emphasize was that this isn’t just internal proprietary work at Apple. He is eager to engage with other technologists on further developments for Unicode, CLDR and ICU. This was his first presentation of the work to a public audience, and he hopes to find peers and collaborators to take these developments further.

It was a rare and wonderful treat to have Apple open its doors to host IMUG. It is definitely the kind of event that makes you fully appreciate being located right in the heart of Silicon Valley.

How about you? Are you already working deep in the guts of the Unicode standard? Are CLDR and ICU part and parcel of your day job, or is this all new to you? We’d love to hear your opinions on how this might apply to your own internationalization engineering work and plans. Send us an email at info@e2f.com and let us know your thoughts. And of course, if this is right up your alley, you might want to get in touch with George Rhoten at Apple.
