2014-06-04

Guest post by John Moran

As both a translator and a software developer, I have much respect for the sophistication of the well-known proprietary standalone CAT tools like memoQ, Trados, DejaVu and Wordfast. I started with Trados 2.0 and have seen it evolve over the years. To greater and lesser extents these software publishers do a reasonable job at remaining interoperable and innovating on behalf of their main customers - us translators. Kudos in particular to Kilgray for using interoperability standards to topple the once mighty Trados from its monopolistic throne and forcing SDL to improve their famously shoddy customer support. Rotten tomatoes to Across for being a non-interoperable island and having a CAT tool that is unpopular with most (but curiously not all) of the freelance translators I work with in Transpiral.

But this piece is about OmegaT. Unlike some of the other participants in the OmegaT project, I became involved with OmegaT for purely selfish reasons. I am currently in the hopefully final stage of a Ph.D. in computer science with an Irish research institute called the Centre for Next Generation Localisation (www.cngl.ie). I wanted to gather activity data from translators working in a CAT tool for my research in a manner similar to translation process research tool called TransLog. My first thought was to do this in Trados as that was the tool I knew best as a translator but Trados’ Application Programming Interface did not let me communicate with the editor.

Thus, I was forced to look for an open-source CAT tool. After looking at a few alternatives like the excellent Virtaal editor and a really buggy Japanese one called Benten I decided on OmegaT. 

Aside from the fact that it was programmed in Java, a language I have worked with for about ten years as a freelancer programmer, it had most of the features I was used to working with in Trados.  I felt it must be reliable if translators are downloading it 4000 times every month. That was in 2010. Four years later that number is about to reach 10,000. Even if most of those downloads are updates, it should be a worrying trend for the proprietary CAT tools. Considering SDL report having 135,000 paid Trados licenses in total - that is a significant number.

Having downloaded the code, I added a logging feature to it called instrumentation (the “i” in iOmegaT) and programmed a small replayer prototype. Imagine pressing a record button in Trados and later replaying the mechanical act of crafting the translation as a video, character-by-character or segment-by-segment, and you will get the picture. So far we use the XML it generates mainly to measure the impact of machine translation on translation speed relative to not having MT. Funnily enough, when I developed it I assumed it would show me that MT was bunk. I was wrong. It can aid productivity, and my bias was caused by the fact that I had never worked with useful trained MT. My dreams of standing ovations at translator association meetings turned to dust.

If I can’t beat MT I might as well join it. About a year and a half ago, using a government research commercialization feasibility grant, I was joined by my friend Christian Saam on the iOmegaT project. We studied computational linguistics in Ireland and Germany on opposite sides of an Erasmus exchange programme, so we share a deep interest in language technology and a common vocabulary. We set about turning the software I developed in collaboration with Welocalize into a commercial data analysis application for large companies that use MT to reduce their translation costs.

However, MT post-editing is just one use case. We hope to be able to use the same technique to measure the impact of predictive typing and Automatic Speech Recognition on translators. I believe these technologies are more interesting to most translators as they impose less on word order.

At this point I should point out that CNGL is a really big research project with over paid 150 researchers in areas like speech and language technology. Localization is big business in Ireland. My idea is to funnel less commercially sensitive translator user activity data securely, legally, transparently and, in most cases anonymously from translators using instrumented CAT tools into a research environment to develop and, most importantly, test algorithms to help improve translation productivity. Someone once called it telemetry for offline CAT tools. My hope is that though translation companies take NDAs very seriously, it is also a fact that many modern content types like User Generated Content and technical support responses appear on websites almost as soon as they are written in the source language, so a controlled but automated data flow may be feasible. In the future it may also be possible to test algorithms for technologies like predictive typing without uploading any linguistic data from a working translator’s PC. Our bet is that researchers are data-tropic. If we build it they will come.

We have good cause to be optimistic. Welocalize, our industrial partner, is an enlightened kind of large translation company. They have a tendency to want to break down the walls of walled gardens. Many companies don’t trust anything that is free, but they know the dynamics of open-source. They had developed a complex but powerful open-source translation memory system called GlobalSight, and its timing was precipitous.

It was released around the same time SDL announced they were mothballing their TMS system to replace it with the newly acquired Idiom WorldServer (now SDL WorldServer). This panicked a number of corporate translation buyers, who suddenly realized how deeply networked their translation department was via its web services and how strategically important the SDL TMS system was. As the song goes, "you don’t know what you’ve got till its gone" – or, in this case, nearly gone.

SDL ultimately reversed the decision to mothball TMS and began to reinvest in its development, but that came too late for many corporates who migrated en-masse to GlobalSight. It is now one of the most implemented translation management  systems in the world in technology companies and Fortune 500’s. A lot of people think open-source is for hippies, but for large companies open-source can be an easy sell. They can afford engineering support, department managers won’t be caught with their pants down if the company doing the development ceases to exist, and most importantly their reliance on SDL’s famously expensive professional services division is reduced to zero. If they need a new web-service, they can program it themselves. GlobalSight is now used in many companies who are both customers of Welocalize and companies like Intel who are not. Across should pay heed. At a C-Suite level corporates don’t like risk.

However, GlobalSight had a weakness. Unlike Idiom WorldServer it didn’t have its own free CAT tool. Translators had a choice of download formats and could use Trados but Trados licenses are expensive and many translators are slow to upgrade. Smart big companies like to have as much technical control of their supply-chain as possible so Welocalize were on the lookout for a good open-source CAT tool. OpenTM2 was a runner for a while but it proved unsuitable. When I worked with Welocalize as an intern I saw wireframes for an XLIFF editor on the wall but work had not yet started. Armed with data from our productivity tests and Didier Briel, the OmegaT project manager, who was in Dublin to give a talk on OmegaT, I made the case for integrating OmegaT with GlobalSight. It was a lucky guess. Two years later it works smoothly and both applications benefit from each other. What did I have to gain from this?

Data.

So why this blog? Next week I plan to present our instrumentation work at the LocWorld tradeshow and I want Kilgray to pay heed. OmegaT is a threat to their memoQ Translator Pro sales and that threat is not going to reduce with time. Christian and I have implemented a sexy prototype of a two-column working grid, and we can do the same trick importing SDL packages with OmegaT as they do with memoQ. Other large LSPs are beginning to take note of OmegaT and GlobalSight.

However, I am a fan of memoQ, and even though the poison pill has been watered down to homeopathic levels, I also like Kilgray’s style. The translator community has nothing to gain if a developer of a good CAT tool suffers poor sales. This reduces manpower for new and innovative features. Segment-level A/B testing using time data is a neat trick. The recent editing time feature is a step in the right direction, but it could be so much better. The problem is that CAT tools waste inordinate amounts of translator time, and the recent trend towards CAT tools connected to servers makes that even worse. Slow servers that are based on request-response protocols instead of synchronization protocols, slow fuzzy matches, bad MT, bad predictive typing suggestions, hours wasted fixing automatic QA to catch a few double spaces. These are the problems I want to see fixed using instrumentation and independent reporting.

So here is my point in the second person singular. Kilgray – I know you read this blog. Listen! Implement instrumentation and support it as a standard. You can use your terminal server web application to report on the data or do it in memoQ directly. On our side, we plan to implement an offline application and web-application that lets translators analyse that data by manually importing it so they can see exactly how much they earn per hour for each client in any CAT tools that implement that standard. €10 says Trados will be last. A wise man once said you get the behavior you incentivize, and the per-word pricing model incentivizes agencies to not give a damn about how much a translator earns per hour. The important thing is to keep the choice about sharing translation speed data with the translator but let them share it with clients if they want to.  Web-based CAT tools don’t give them that choice, so play to your strengths. Instrumentation is a powerful form of telemetry and software QA.

So to summarize: OmegaT’s place in the language services industry is to keep proprietary CAT tool publishers on their toes!

*******

See also the CNGL interview with Mr. Moran....

Show more