Entire books have been devoted to discussing what makes a translation a good translation. Relevant factors range from whether a translation should convey emotion over and above meaning to more down-to-earth questions such as the intended use of the translation itself.
Restricting our attention to machine translation, there are at least three different tasks which require a quantitative measure of quality:
1. assessing whether the output of an MT system can be useful for a specific application (absolute evaluation);
2. (a) comparing systems with one another, or similarly (b) assessing the impact of changes inside a system (relative evaluation);
3. in the case of systems based on learning, providing a loss function to guide parameter tuning.
Depending on the task, it can be more or less useful or practical to require human intervention in the evaluation process. On the one hand, humans can rely on extensive language and world knowledge, and their judgment of quality tends to be more accurate than any automatic measure. On the other hand, human judgments tend to be highly subjective, and have been shown to vary considerably between different judges, and even between different evaluations produced by the same judge at different times.
Whatever one’s position concerning the relative merits of human and automatic measures, there are contexts, such as (2(b)) and (3), where requiring human evaluation is simply impractical because it is too expensive or time-consuming. In such contexts fully automatic measures are necessary.
A good automatic measure should above all correlate well with the quality of a translation as it is perceived by human readers. The ranking of different systems given by such a measure (on a given sample from a given distribution) can then be reliably used as a proxy for the ranking humans would produce. Additionally, a good measure should also display low intrasystem variance (similar scores for the same system when, e.g., changing samples from the same dataset, or changing human reference translations for the same sample) and high intersystem variance (to reliably discriminate between systems with similar performance). If those criteria are met, then it becomes meaningful to compare scores of different systems on different samples from the same distribution.
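The two variance criteria can be made concrete with a minimal sketch. It assumes that each system has been scored on several samples drawn from the same distribution; the function name `variance_profile`, the system names, and the scores are all hypothetical and serve only as illustration.

```python
from statistics import mean, pvariance


def variance_profile(scores_by_system):
    """Return (intrasystem, intersystem) variance.

    scores_by_system maps a system name to a list of corpus-level scores,
    one per test sample drawn from the same distribution.
    """
    per_system_variance = [pvariance(s) for s in scores_by_system.values()]
    per_system_mean = [mean(s) for s in scores_by_system.values()]
    # Intrasystem: how much one system's score moves across samples (averaged).
    intrasystem = mean(per_system_variance)
    # Intersystem: how far apart the systems' average scores are.
    intersystem = pvariance(per_system_mean)
    return intrasystem, intersystem


# Hypothetical scores for two systems on three samples each.
scores = {
    "system_A": [0.30, 0.32, 0.31],
    "system_B": [0.40, 0.41, 0.42],
}
intra, inter = variance_profile(scores)
print(f"intrasystem variance: {intra:.6f}")
print(f"intersystem variance: {inter:.6f}")
```

Under these assumptions, a measure behaves well when the second number is large relative to the first, since score differences between systems then stand out from sample-to-sample noise.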
Correlation with human judgment is often assessed based on collections of (human and automatic) translations manually scored with adequacy and fluency marks on a scale from 1 to 5. Adequacy indicates the extent to which the information contained in one or more reference translations is also present in the translation under examination, whereas fluency measures how grammatical and natural the translation is. An alternative is to test directly a user’s comprehension of the source text, based on its translation (Jones et al., 2006).
A fairly large number of automatic measures have been proposed, as we will see, and automatic evaluation has become an active research topic in itself. In many cases new measures are justified in terms of correlation with human judgment. Many of the measures that we will briefly describe below can reach Pearson correlation coefficients of around 0.9 on the task of ranking systems using a few hundred translated sentences. Such a high correlation led to the adoption of some such measures (e.g., BLEU and NIST scores) by government bodies running comparative technology evaluations, which in turn explains their broad diffusion in the research community. The dominant approach to model parameter tuning in current SMT systems is “minimum error-rate training” (MERT; see section 1.5.3), where an automatic measure is explicitly optimized.
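As an illustration of how such a system-level correlation is computed, the following minimal sketch implements the Pearson coefficient between automatic scores and human judgments. The five systems and all of their scores are invented for the example; they are not data from any evaluation campaign.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one automatic score and one mean human adequacy
# mark (scale 1 to 5) per system.
automatic_scores = [0.21, 0.25, 0.28, 0.31, 0.35]
human_adequacy = [2.4, 2.9, 3.0, 3.6, 3.9]
print(f"system-level Pearson r = {pearson(automatic_scores, human_adequacy):.3f}")
```

Each data point here is a whole system, which is why a handful of systems scored on a few hundred sentences can yield a high coefficient.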
It is important to notice at this point that high correlation was demonstrated for existing measures only at the system level: when it comes to the score given to individual translated sentences, the Pearson correlation coefficient between automatic measures and human assessments of adequacy and fluency drops to between 0.3 and 0.5 (see, e.g., Banerjee and Lavie, 2005; Leusch et al., 2005). As a matter of fact, even the correlation between human judgments decreases drastically. This is an important observation in the context of this book, because many machine learning algorithms require the loss function to decompose over individual inferences/translations. Unlike in many other applications, in machine translation, loss functions that decompose over inferences are only pale indicators of quality as perceived by users. In document categorization it is entirely reasonable to penalize the number of misclassified documents, and the agreement between the system decision on a single document and its manually assigned label is a very good indicator of the perceived performance of the system on that document. By contrast, an automatic score computed on an individual sentence translation is a much less reliable indicator of what a human would think of it.
Assessing the quality of a translation is a very difficult task even for humans, as witnessed by the relatively low interannotator agreement even when quality is decomposed into adequacy and fluency. For this reason most automatic measures actually evaluate something different, sometimes called human likeness. For each source sentence in a test set a reference translation produced by a human is made available, and the measure assesses how similar the translation proposed by a system is to the reference translation. Ideally, one would like to measure how similar the meaning of the proposed translation is to the meaning of the reference translation: an ideal measure should be invariant with respect to sentence transformations that leave meaning unchanged (paraphrases). One source sentence can have many perfectly valid translations. However, most measures compare sentences based on superficial features which can be extracted very reliably, such as the presence or absence in the references of n-grams from the proposed translation.
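To illustrate what comparing sentences on superficial n-gram features amounts to, here is a minimal sketch of a clipped n-gram precision against a single reference. It is a simplified stand-in and not the definition of any particular published measure; the function names and the example sentences are assumptions made for the illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (clipping). Tokenization is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the cat sat on the mat"
paraphrase = "a cat was sitting on the mat"
print(ngram_precision(reference, reference, 2))   # identical sentence: 1.0
print(ngram_precision(paraphrase, reference, 1))  # 4 of 7 unigrams match
print(ngram_precision(paraphrase, reference, 2))  # 2 of 6 bigrams match
```

The paraphrase keeps the meaning of the reference but shares only two of its bigrams, so a purely n-gram-based score penalizes it heavily.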
As a consequence, these measures are far from being invariant with respect to paraphrasing. In order to compensate for this problem, at least in part, most measures allow considering more than one reference translation. This has the effect of improving the correlation with human judgment, although it imposes on the evaluator the additional burden of providing multiple reference translations.
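The effect of adding references can be sketched as follows. The example assumes one common convention, used by BLEU, in which each candidate n-gram is credited up to the maximum number of times it occurs in any single reference; other measures may combine references differently. The function names and sentences are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_reference_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references.

    The clipping limit for each n-gram is its maximum count over the
    individual references (an assumption following BLEU's convention).
    """
    cand = count_ngrams(candidate.split(), n)
    max_ref = Counter()
    for reference in references:
        for gram, count in count_ngrams(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / total


references = ["the cat sat on the mat", "a cat was sitting on the mat"]
candidate = "a cat sat on the mat"
print(multi_reference_precision(candidate, references[:1], 1))  # 5 of 6 unigrams
print(multi_reference_precision(candidate, references, 1))      # 6 of 6 unigrams
```

With a single reference the word "a" goes unmatched; the second reference licenses it, so a valid variant is no longer penalized.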
