This is another post from Juan Rowda, this time with his colleague Olga Pospelova, that provides useful information on this subject. Human evaluation of MT output is key to understanding several fundamental questions about MT technology including:
Is the MT engine improving?
Do the automated metrics (which MT systems developers use to guide development strategy) correlate with independent human judgments of the same MT output?
How difficult or easy is the post-editing (PEMT) task?
What is the fair and reasonable PEMT rate?
The difficulty with human evaluation of MT output is closely related to the difficulty of any discussion of translation quality in the industry in general. It is challenging to come up with an approach that is consistent over time and across different people, and being scientifically objective is a particular challenge. So, like much else in MT, estimates have to be made, and some rigor needs to be applied to ensure consistency over time and across evaluators. I will add some thoughts and references after Juan's post below, to provide a view on some new ways being explored to help humans evaluate MT output objectively.
--------------
Machine translation (MT) output evaluation is essential in machine translation development. It is key to determining the effectiveness of an existing MT system, estimating the level of post-editing required, negotiating the price, and setting reasonable expectations. As we discussed in our article on quality estimation, machine translation output can be evaluated automatically, using methods like BLEU and NIST, or by human judges. The automatic metrics use one or more human reference translations, which are considered the gold standard of translation quality. The difficulty with this approach is that there may be many alternative correct translations for a single source segment.
Human evaluation, however, also has a number of disadvantages, primarily because it is a costly and time-consuming process. Human judgment is also subjective in nature, so it is difficult to achieve a high level of intra-rater agreement (consistency of the same human judge) and inter-rater agreement (consistency across multiple judges). In addition, there are no standardized metrics and approaches to human evaluation.
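One common way to put a number on rater agreement is Cohen's kappa, which corrects the raw agreement rate for the agreement two judges would reach by chance. The sketch below is only an illustration: the function name `cohen_kappa` and the sample ratings are made up for this example, and the article does not say which agreement statistic eBay uses.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two sets of ratings of the same segments."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both rating lists must cover the same non-empty set of segments")
    n = len(ratings_a)
    # Observed agreement: share of segments where both ratings match exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: probability of a match if each judge rated at random
    # according to their own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists contain one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up ratings on a 1-5 scale, for illustration only.
judge_1 = [5, 4, 4, 2, 1, 3, 3, 5]
judge_2 = [5, 4, 3, 2, 1, 3, 4, 5]
print(round(cohen_kappa(judge_1, judge_2), 3))
```

The same function can be used for intra-rater agreement by passing two rating passes from the same judge over the same segments. Note that plain kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement; a weighted variant would be needed to account for distance on the scale.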
Let us explore the most commonly used types of human evaluation:
Rating:-
Judges rate translations based on a predetermined scale. For example, a scale from 1 to 5 can be used, where 1 is the lowest and 5 is the highest score. One of the challenges of this approach is establishing a clear description of each value in the scale and the exact differences between the levels of quality. Even if human judges have explicit evaluation guidelines, they still find it difficult to assign numerical values to the quality of the translation (Koehn & Monz, 2006).
The two main dimensions or metrics used in this type of evaluation are adequacy and fluency.
Adequacy, according to the Linguistic Data Consortium, is defined as "how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation."
The annotators must be bilingual in both the source and target language in order to judge whether the information is preserved across translation.
A typical scale used to measure adequacy is based on the question "How much meaning is preserved?"
5: all meaning
4: most meaning
3: some meaning
2: little meaning
1: none
Fluency refers to the target only, without taking the source into account; the main evaluation criteria are grammar, spelling, choice of words, and style.
A typical scale used to measure fluency is based on the question "Is the language in the output fluent?"
5: flawless
4: good
3: non-native
2: disfluent
1: incomprehensible
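Once judges have scored segments on the two scales above, the scores are usually averaged per MT system. The following is a minimal sketch of that aggregation, assuming each judgment arrives as a `(system, adequacy, fluency)` tuple; the function name and the tuple layout are assumptions made for this example.

```python
from statistics import mean


def average_scores(judgments):
    """Average adequacy and fluency per system.

    judgments: iterable of (system, adequacy, fluency) tuples,
    with both scores on the 1-5 scales described above.
    """
    by_system = {}
    for system, adequacy, fluency in judgments:
        if not (1 <= adequacy <= 5 and 1 <= fluency <= 5):
            raise ValueError("scores must be between 1 and 5")
        by_system.setdefault(system, []).append((adequacy, fluency))
    return {
        system: {
            "adequacy": mean(a for a, _ in scores),
            "fluency": mean(f for _, f in scores),
        }
        for system, scores in by_system.items()
    }


# Made-up judgments for two hypothetical systems.
sample = [("System A", 4, 5), ("System A", 2, 3), ("System B", 5, 5)]
print(average_scores(sample))
```

Keeping adequacy and fluency as separate averages, rather than merging them into one number, preserves the distinction between meaning preservation and target-language quality.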
Ranking:-
Judges are presented with two or more translations (usually from different MT systems) and are required to choose the best option. This task can be confusing when the ranked segments are nearly identical or contain difficult-to-compare errors. The judges must decide which errors have a greater impact on the quality of the translation (Denkowski & Lavie, 2010). On the other hand, it is often easier for human judges to rank systems than to assign absolute scores (Vilar et al., 2007), because it is difficult to quantify the quality of a translation.
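Ranking judgments can be turned into a per-system figure in several ways. One simple option, sketched here under the assumption that each judgment is a list of system names ordered from best to worst with no ties, is to count how often each system wins its pairwise comparisons. The function name `pairwise_win_rates` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_win_rates(rankings):
    """Share of pairwise comparisons each system wins.

    rankings: iterable of lists, each ordered from best to worst
    translation for one source segment.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for ranking in rankings:
        # combinations() keeps the input order, so the first item of each
        # pair is the system the judge ranked higher.
        for better, worse in combinations(ranking, 2):
            wins[better] += 1
            comparisons[better] += 1
            comparisons[worse] += 1
    return {system: wins[system] / comparisons[system] for system in comparisons}


# Two made-up judgments over three hypothetical systems.
print(pairwise_win_rates([["A", "B", "C"], ["A", "C", "B"]]))
```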
Error Analysis:-
Human judges identify and classify errors in MT output. Classification of errors might depend on the specific language and content type. Some examples of error classes are "missing words", "incorrect word order", "added words", "wrong agreement", "wrong part of speech", and so on. It is useful to have reference translations in order to classify errors; however, as mentioned above, there may be several correct ways to translate the same source segment. Accordingly, reference translations should be used with care.
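After judges have labeled errors, a per-class tally shows where a system struggles most. This is a minimal sketch, assuming annotations are collected as `(segment_id, error_class)` pairs; the function name and the normalization to errors per 100 segments are choices made for this illustration.

```python
from collections import Counter


def error_profile(annotations, segment_count):
    """Errors per 100 reviewed segments, for each error class.

    annotations: iterable of (segment_id, error_class) pairs from the judges.
    segment_count: total number of segments that were reviewed.
    """
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    counts = Counter(error_class for _, error_class in annotations)
    # most_common() orders classes from most to least frequent.
    return {cls: 100 * n / segment_count for cls, n in counts.most_common()}


# Made-up annotations over four reviewed segments.
annotations = [
    (1, "missing words"),
    (2, "missing words"),
    (2, "added words"),
]
print(error_profile(annotations, segment_count=4))
```

Normalizing by the number of reviewed segments makes profiles comparable across evaluation sets of different sizes.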
When evaluating the quality of eBay MT systems, we use all the aforementioned methods. However, our metrics also capture micro-level details about areas specific to eBay content. For example, one of the evaluation criteria is whether brand names and product names (the main noun or noun phrase identifying an item) were translated correctly. This information helps to identify the problem areas of MT and to focus improvement work on those areas.
Some types of human evaluation, such as error analysis, can only be conducted by professional linguists, while other types of judgment can be performed by annotators who are not linguistically trained.
Is there a way to cut the cost of human evaluation? Yes, but unfortunately, low-budget crowd-sourcing evaluations tend to produce unreliable results. How, then, can we save money without compromising the validity of our findings?
Start with a pilot test - a process of trying out your evaluation on a small data set. This can reveal critical flaws in your metrics, such as ambiguous questions or instructions.
Monitor response patterns to remove judges whose answers are outside the expected range.
Use dynamic judgments - a feature that allows fewer judgments on the segments where annotators agree, and more judgments on segments with a high inter-rater disagreement.
Use professional judgments that are randomly inserted throughout your evaluation job. Pre-labeled professional judgments will allow for the removal of judges with poor performance.
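Two of the tips above lend themselves to a short sketch: checking judges against pre-labeled professional judgments, and deciding when a segment needs more judgments. Everything here is an assumption made for illustration, including the function names, the one-point tolerance, and the thresholds; crowd-sourcing platforms implement these features in their own ways.

```python
def judge_accuracy_on_gold(responses, gold, tolerance=1):
    """Share of pre-labeled segments each judge scored within `tolerance`.

    responses: {judge: {segment_id: score}}
    gold: {segment_id: score assigned by a professional}
    """
    accuracy = {}
    for judge, answers in responses.items():
        checked = [seg for seg in answers if seg in gold]
        if not checked:
            continue  # this judge has not seen any pre-labeled segment yet
        correct = sum(abs(answers[seg] - gold[seg]) <= tolerance for seg in checked)
        accuracy[judge] = correct / len(checked)
    return accuracy


def needs_more_judgments(scores, max_spread=1, min_judgments=2, max_judgments=5):
    """Request another judgment only while the judges still disagree."""
    if len(scores) < min_judgments:
        return True
    if len(scores) >= max_judgments:
        return False  # stop collecting even if disagreement remains
    return max(scores) - min(scores) > max_spread


# Made-up responses: judge "j2" is far from the professional labels.
gold = {"s1": 5, "s2": 2}
responses = {"j1": {"s1": 5, "s2": 1}, "j2": {"s1": 2, "s2": 4}}
accuracy = judge_accuracy_on_gold(responses, gold)
trusted = [judge for judge, acc in accuracy.items() if acc >= 0.8]
print(trusted)
print(needs_more_judgments([4, 4]), needs_more_judgments([1, 5]))
```

The 0.8 cut-off for trusted judges is arbitrary; in practice it would be tuned during the pilot test.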
Human evaluation of machine translation quality is still very important, even though there is no clear consensus on the best method. It is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment.
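Validating an automatic metric against human judgment typically means computing a correlation coefficient between the metric's scores and the human scores for the same outputs. The sketch below implements the Pearson coefficient by hand as one possible choice; the function name and the sample numbers are made up, and rank-based coefficients such as Spearman's are also widely used for this purpose.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)


# Made-up numbers: automatic metric scores and average human ratings
# for the same five hypothetical test sets.
metric_scores = [0.21, 0.25, 0.30, 0.34, 0.41]
human_scores = [2.8, 3.1, 3.0, 3.6, 4.2]
print(round(pearson(metric_scores, human_scores), 3))
```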
Juan Rowda
Staff MT Language Specialist, eBay
https://www.linkedin.com/in/juan-martín-fernández-rowda-b238915
Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of more than 10 translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan has also helped to localize quite a few major video games.
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation.
------------------------------------------
So in this quest for evaluation objectivity, there are a number of recent initiatives that are useful. For those who are considering the use of Error Analysis, TAUS has attempted to provide a comprehensive error typology scheme together with some tools. These errors can then be used to develop a scoring system that weights errors differently and helps to develop a more objective approach to a discussion on MT quality.
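To make the idea of weighting errors concrete, here is a minimal sketch of a weighted penalty score normalized per 100 words. The weights and the function name are invented for this example and are not taken from the TAUS typology or tools.

```python
def weighted_error_score(error_counts, weights, word_count, default_weight=1.0):
    """Weighted error penalty per 100 words (lower is better).

    error_counts: {error_class: number of occurrences}
    weights: {error_class: severity weight}; classes without an entry
             fall back to default_weight.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(
        weights.get(error_class, default_weight) * count
        for error_class, count in error_counts.items()
    )
    return 100 * penalty / word_count


# Hypothetical weights: a missing word is treated as three times
# as serious as an agreement error.
weights = {"missing words": 3.0, "wrong agreement": 1.0}
counts = {"missing words": 2, "wrong agreement": 1}
print(weighted_error_score(counts, weights, word_count=100))
```

Agreeing on the weights up front, before the evaluation starts, is what makes a score of this kind more objective than an overall impression.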
This presentation by Alon Lavie provides more details on the various approaches that are being tried and includes some information on the Edit Distance measurements which can only be done after a segment is post-edited.
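For readers unfamiliar with edit distance, the sketch below counts the word-level insertions, deletions, and substitutions needed to turn raw MT output into its post-edited version, and normalizes by the length of the post-edited segment. It is plain Levenshtein distance over whitespace-separated tokens, a simplification of metrics such as TER, which also account for block shifts. The function names are made up for this example.

```python
def word_edit_distance(mt_output, post_edited):
    """Word-level Levenshtein distance between two segments."""
    source = mt_output.split()
    target = post_edited.split()
    # previous[j] holds the distance between the words of `source`
    # processed so far and the first j words of `target`.
    previous = list(range(len(target) + 1))
    for i, src_word in enumerate(source, start=1):
        current = [i]
        for j, tgt_word in enumerate(target, start=1):
            cost = 0 if src_word == tgt_word else 1
            current.append(min(
                previous[j] + 1,         # delete a word from the MT output
                current[j - 1] + 1,      # insert a word from the post-edit
                previous[j - 1] + cost,  # keep or substitute a word
            ))
        previous = current
    return previous[-1]


def edit_rate(mt_output, post_edited):
    """Edit distance divided by the number of words in the post-edit."""
    length = len(post_edited.split())
    if length == 0:
        raise ValueError("post-edited segment is empty")
    return word_edit_distance(mt_output, post_edited) / length


# One substitution over a three-word post-edit.
print(edit_rate("the red car", "the blue car"))
```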
This study is interesting since it provides evidence that monolingual editors can perform quite efficiently and consistently, provided they are only shown target data, i.e., without the source. In fact, they do this more efficiently and consistently than bilinguals who are shown both source and target. I also assume that an MT system has to reach a certain level of quality before this is true.
Eye-tracking studies are another way to get objective data and initial studies provide some useful information for MT practitioners. The premise is very simple: Lower quality MT output slows down an editor who has to look much more carefully at the output to correct it. It also shows that some errors cause more of a drag than others. In both charts shown below the shorter the bar the better. Three MT systems are compared here and it is clear that the Comp MT system has a lower frequency of most error types.
The results from the eye fixation times are consistent and correlated with automatic metrics like BLEU. In the chart below we see that Word Order errors cause a longer gaze time than other errors. This is useful information for productivity analysts who can use it to determine problem areas, or determine suitability of MT output for comprehension or post-editing.
All these efforts are slowly inching us forward on the basic questions, and perhaps some readers can shed light on new approaches to answering them. Adaptive MT and the Fairtrade initiative are also different ways to address the same fundamental issues. My vote so far is for Adaptive MT as the most promising approach for professional business translation, even though it is not strictly focused on static quality measurement.