This is an article from the Asia Online November 2011 Newsletter, authored by Dion Wiggins, CEO of Asia Online, that provides useful advice for making meaningful comparisons of MT engines. So the next time somebody promises you a BLEU of 60, be skeptical, and make sure you get the proper context and assurances that the measurement was properly done.
“What is your BLEU score?” This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked. BLEU scores and other translation quality metrics depend greatly on many factors that must be understood in order for a score to be meaningful. In some cases a BLEU score of 20 can represent better quality than a BLEU score of 50, or vice versa. Without understanding how a test set was measured, along with other details such as the language pair and the complexity of the domain, a BLEU score without context is little more than a meaningless number. (For a primer on BLEU, look here.)
BLEU scores and other translation quality metrics will vary based upon:
The test set being measured: Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the segments in the test set should be gold standard (i.e. validated as correct by humans). Lower quality test set data will give a less meaningful score.
How many human reference translations were used: If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
The complexity of the language pair: Relative to English, Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese. Typically, the more complex the source or target language, the lower the BLEU score.
The complexity of the domain: A patent has far more complex text and structure than a children’s storybook, and very different metric scores will be calculated depending on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
The capitalization of the segments being measured: When comparing metrics, the most common form of measurement is case insensitive. However, since case matters when publishing, a case sensitive score may also be measured (both variants are illustrated in the short scoring sketch below).
The measurement software: There are many measurement tools for translation quality. Each may vary slightly in how a score is calculated, or the tools may not be configured with the same settings. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge; this software measures a variety of quality metrics for a given test set.
It is clear from the above list of variables that a BLEU score by itself has no real meaning.
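To make the reference-count and capitalization effects concrete, the sketch below scores the same machine output under different conditions using the open-source sacrebleu Python package. This is only an illustrative stand-in for whichever measurement tool you standardize on (the article itself uses Language Studio™ Pro), and the sentences are hypothetical.

```python
# Illustrative sketch (hypothetical data): how reference count and casing
# change the BLEU score for the very same MT output.
# Requires the open-source sacrebleu package: pip install sacrebleu
import sacrebleu

# MT output for two segments.
hyps = ["The engine translated this sentence quickly.",
        "Translation quality depends on the test set."]

# One human reference stream (one reference per segment).
refs_one = [["The engine translated this sentence rapidly.",
             "Quality of translation depends on the test set."]]

# A second, independent human reference stream for the same segments.
refs_two = refs_one + [["This sentence was translated quickly by the engine.",
                        "Translation quality depends on the test data."]]

print("1 reference, case sensitive  :", sacrebleu.corpus_bleu(hyps, refs_one).score)
print("2 references, case sensitive :", sacrebleu.corpus_bleu(hyps, refs_two).score)
print("1 reference, case insensitive:", sacrebleu.corpus_bleu(hyps, refs_one, lowercase=True).score)
```

With a second reference there are more n-grams for the output to match, so the multi-reference score is typically higher; lowercasing typically raises the score as well, which is why case settings must never be mixed within one comparison.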
How BLEU scores and other translation metrics are measured
With BLEU scores, a higher score indicates higher quality. A BLEU score is not a linear metric: a 2 BLEU point increase from 20 to 22 will be considerably more noticeable than the same increase from 50 to 52. F-Measure and METEOR work in the same manner, where a higher score is also better. For Translation Error Rate (TER), a lower score is better. Language Studio™ Pro supports all of these metrics and can be downloaded for free.
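As a quick illustration of the two scale directions, the hedged sketch below computes BLEU and TER for the same hypothetical outputs with sacrebleu; Language Studio™ Pro or any other scorer should report the same pattern of higher BLEU and lower TER for the better translation.

```python
# Illustrative sketch (hypothetical data): BLEU rises as quality improves,
# while TER (an edit rate) falls.
import sacrebleu

refs = [["the cat sat on the mat",
         "translation quality varies considerably by domain"]]

rough = ["a cat is on a mat",
         "quality of machine translation changes with domain"]
close = ["the cat sat on the mat",
         "translation quality varies greatly by domain"]

for label, hyps in [("rough output", rough), ("closer output", close)]:
    bleu = sacrebleu.corpus_bleu(hyps, refs).score   # higher is better
    ter = sacrebleu.corpus_ter(hyps, refs).score     # lower is better
    print(f"{label}: BLEU {bleu:.2f}  TER {ter:.2f}")
```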
Basic Test Set Criteria Checklist
The criteria specified by this checklist are absolute. Not complying with any of
the checklist items will result in a score that is unreliable and less meaningful.
Test Set Data should be very high quality: If the test set data are of low
quality, then the measurement delivered will not be reliable.
Test set should be in domain: The test set should represent the type of information
that you are going to translate. The domain, writing style and vocabulary should
be representative of what you intend to translate. Testing on out-of-domain text
will not result in a useful metric.
Test Set Data must not be included in the training data: If you are creating an SMT engine, you must make sure that the data you are testing with, or very similar data, are not in the data that the engine was trained with. If the test data are in the training data, the scores will be artificially high and will not represent the level of quality that will be output when other "blind" data are translated.
Test Set Data should be data that can be translated: Test set segments should have a minimal amount of dates, times, numbers and names. While a valid part of a segment, these are not parts of the segment that are translated; they are usually transformed or mapped. The focus for a test set should be on words that are to be translated.
Test Set Data should have segments that are between 8 and 15 words in length: Short segments will artificially raise the quality scores, as most metrics do not take segment length into account. A short segment is more likely to get a perfect match on the entire phrase, which is less a translation than a 100% translation memory match. The longer the segment, the more opportunity there is for variation in how it is translated, which will result in artificially lower scores even when the translation is good. A small number of segments shorter than 8 words or longer than 15 words are acceptable, but these should be limited (the screening sketch after this checklist shows how such criteria can be checked automatically).
Test set should be at least 1,000 segments: While it is possible to get a metric from a shorter test set, a reasonable statistical representation of the metric can only be created when there are sufficient segments to build statistics from. When there are only a small number of segments, small anomalies in one or two segments can raise or lower the test set score artificially. Be skeptical of scores from test sets that contain only a few hundred sentences.
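Much of this checklist can be screened automatically before any engine is tested. The sketch below is a minimal, hypothetical example in Python: the file names, the whitespace word count and the exact-match overlap test are simplifying assumptions, not a prescribed tool.

```python
# Hypothetical screening of a candidate test set against the checklist above.
def load_segments(path):
    """Read one segment per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

train = set(load_segments("training_corpus.src"))       # source side of the training data
candidates = load_segments("candidate_test_set.src")    # proposed test segments

test_set = []
for seg in candidates:
    if not 8 <= len(seg.split()) <= 15:   # keep segments of translatable length
        continue
    if seg in train:                      # must be "blind": not seen during training
        continue
    test_set.append(seg)

print(f"{len(test_set)} usable segments after screening")
if len(test_set) < 1000:
    print("Warning: fewer than 1,000 segments - the resulting scores may not be stable.")
```

A real screening pass would also look for near-duplicates of training sentences and for segments dominated by dates, numbers and names, but even this simple filter catches the most common causes of inflated scores.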
Comparing Translation Engines - Initial Assessment Checklist
Language Studio™ can be used for calculating BLEU, TER, F-Measure and METEOR
scores.
All conditions of the Basic Test Set Criteria must be met: If
any condition
is not met, then the
results of the test could be flawed and not meaningful or reliable.
Test set must be consistent:
The exact same test set must be used for comparison
across all translation
engines. Do not use different test sets for different engines.
Test sets should be “blind”:
If the MT engine has seen the test set before
or included the test set
data in the training data, then the quality of the output
will be artificially
high and not represent the true quality of the system.
Tests must be carried out transparently: Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data. If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email. This removes any possibility of the MT vendor tampering with the output or fine-tuning the engine based on the output.
Word Segmentation and Tokenization must be consistent: If word segmentation is required (i.e. for languages such as Chinese, Japanese and Thai), then the same word segmentation tool should be used on the reference translations and on all of the machine translation outputs. The same tokenization should also be used. Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology; a minimal open-source sketch of the same idea follows this checklist.
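One simple way to keep segmentation consistent, sketched here with sacrebleu purely as an illustrative assumption, is to hand the raw (unsegmented) reference and every engine's raw output to a single scorer and let it apply one tokenizer to both sides:

```python
# Illustrative sketch (hypothetical data): score unsegmented Chinese text by
# letting one tokenizer ('zh') segment both reference and hypothesis identically.
import sacrebleu

refs = [["今天的天气非常好"]]   # raw, unsegmented human reference
hyps = ["今天天气非常好"]       # raw, unsegmented MT output

score = sacrebleu.corpus_bleu(hyps, refs, tokenize="zh")
print(score.score)
```

Whatever tool is used, the essential point is that identical segmentation and tokenization settings are applied to the references and to every engine being compared.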
Ability to Improve is More Important than Initial Translation Engine Quality
The initial scores of a machine translation engine, while indicative of initial quality, should be viewed as a starting point for rapid improvement, which is then measured with the test set and BLEU scores. Depending on the volume and quality of the data provided to the SMT vendor for training, the initial quality may be lower or higher. Most often, more important than the initial quality is how quickly the translation engine improves. Frequently a new translation engine will have gaps in vocabulary and grammatical coverage. Other machine translation vendors’ engines do not improve at all, or improve only marginally, unless huge volumes of data are added to the initial training data. Most vendors recommend retraining once you have gathered additional data amounting to at least 20% of the size of the initial training data. Even when this volume of data is added, only a small improvement is achieved. As a result, very few translation engines evolve in quality much beyond their initial level.
In stark contrast, Language Studio™ translation engines are created with millions of sentences of data that Asia Online has prepared in addition to the data that the customer provides. These translation engines improve rapidly with a very small amount of feedback: it is not uncommon to see a 1-2 BLEU point improvement from as little as a few thousand post-edited sentences. Language Studio™ has a unique four-step approach that leverages the benefits of Clean Data SMT and manufactures additional learning data by directly analyzing the edits made to the machine translated output. Consequently, even a small amount of post-edited feedback can improve Language Studio™ translation engine quality quite considerably, and it can do so much faster and with far less effort than with other machine translation vendors.
Asia Online provides complimentary Incremental Improvement Trainings to encourage rapid translation engine quality improvement with every full customization, and also offers additional complimentary Incremental Improvement Trainings when word packages are purchased, greatly reducing Total Cost of Ownership (TCO).
An investment in quality at the development stage of a translation engine directly reduces the cost of post-editing while increasing post-editing productivity. While work on rules, normalization, glossaries and non-translatable terms will assist the rate of improvement, the fastest and most efficient way to improve Language Studio™ engines is to post-edit the translations and feed them back into Language Studio™ for processing. The edits will be analyzed and new training data will be generated, directly addressing the primary cause of most errors. In other words, simply post-editing as part of a normal project will result in an immediate improvement; little or no extra effort is needed. By leveraging the standard post-editing process, the effort and cost of improvement, as well as the volume of data required to improve, are greatly reduced.
Depending on the initial training data provided by the client, a small number of Incremental Improvement Trainings are usually sufficient for most Language Studio™ translation engines to approach near-human quality.
Other machine translation vendors are now also claiming to build systems based on Clean Data SMT. Closer investigation reveals that their definition of “cleaning” is not the same as Asia Online’s: removing formatting tags is not cleaning data. Language Studio™ analyzes translation memories and other training data and ensures that only the highest quality in-domain data from trusted sources is included in the creation of your custom engine. The result is that improvements are rapid. Even with just a few thousand segments edited, the improvements are notable. When combined with the Language Studio™ hybrid rules and SMT approach to machine translation, the quality of the translation output can increase by as much as 10, 20 or even 30 BLEU points between versions.
Comparing Translation Engines – Translation Quality Improvement Assessment
Comparing Versions: When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
Comparing Machine Translation Vendors: When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can reuse the first test set, it is often difficult to ensure that the vendor has not adapted its system to suit that test set, which would deliver an artificially high score. It is also possible for the test set data to have been added to the engine’s training data, which will also bias the score.
As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data in training or adapted the engine to suit the test set, then a second “blind” test set is required. When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.
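In practice this comes down to scoring every engine's output on the same blind test set with identical settings. Below is a minimal sketch of such a comparison, assuming one plain-text output file per vendor and sacrebleu as a stand-in scorer; the file and vendor names are hypothetical.

```python
# Hypothetical comparison: identical blind reference set and identical metric
# settings for every vendor's output.
import sacrebleu

def load(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

refs = [load("blind_test_set.ref")]        # one blind human reference stream

outputs = {                                # each vendor translated the same blind source file
    "Vendor A": "vendor_a.hyp",
    "Vendor B": "vendor_b.hyp",
    "Vendor C": "vendor_c.hyp",
}

for name, path in outputs.items():
    hyps = load(path)
    bleu = sacrebleu.corpus_bleu(hyps, refs).score   # higher is better
    ter = sacrebleu.corpus_ter(hyps, refs).score     # lower is better
    print(f"{name}: BLEU {bleu:.2f}  TER {ter:.2f}")
```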
Bringing It All Together
The table below shows a real-world example of a version 1 translation engine from Asia Online and an improved version after feedback. Additional rules were added to the translation engine to meet specific client requirements, which resulted in a considerable improvement in translation quality. This is part of Asia Online’s standard customization process. Language Studio™ puts a very high level of control in the customer’s hands, where rules, runtime glossaries, non-translatable terms and other customization features ensure that the quality of the output is as close to human quality as possible and requires the least amount of editing.
BLEU Score Comparisons*

Case Sensitive
                         Asia Online   Asia Online   Asia Online          Google    Bing     Systran
                         V1 SMT        V2 SMT        V2 SMT + Rules
Reference 1              36.05         45.96         56.59                30.58     29.64    21.01
Reference 2              35.80         39.31         48.85                32.05     29.94    22.56
Reference 3              38.65         52.31         65.03                35.51     33.17    24.68
Combined References      50.45         66.52         80.48                44.58     41.65    30.26

Case Insensitive
                         Asia Online   Asia Online   Asia Online          Google    Bing     Systran
                         V1 SMT        V2 SMT        V2 SMT + Rules
Reference 1              41.30         52.65         59.25                32.18     31.49    22.49
Reference 2              41.01         45.32         51.24                33.67     31.64    23.88
Reference 3              43.99         58.97         67.49                37.15     35.01    25.92
Combined References      56.83         74.35         82.89                46.26     43.68    31.68

*Language Pair: English into French. Domain: Information Technology.
It can be seen clearly from the scores above that when all three human reference translations are combined, the BLEU score is significantly higher, and that the BLEU scores vary considerably between the individual human reference translations. The impact of the improvement and the application of client-specific rules can also be seen, raising the case sensitive BLEU score from 50.45 to 80.48 (an increase of 30.03 in just one improvement iteration). One interesting side effect of having multiple human references is that it is often possible to judge the quality of the human references as well. In the example above, the machine translation output is much closer to human reference 3, indicating a higher quality reference. The client later confirmed that the editor who prepared that reference was a senior editor, more skilled than the two editors who prepared human references 1 and 2.
A BLEU score, like other translation metrics, is just a meaningless number unless it is established in a controlled environment. Asking “What is your BLEU score?” could result in any one of the above scores being given. When controls are applied, translation metrics can be used both to measure improvements in a translation engine and to compare translation engines from different vendors. However, while automated metrics are useful, the ultimate measurement is still human assessment. Language Studio™ Pro also provides tools to assist in delivering balanced, repeatable and meaningful metrics for human quality assessment.