Automatic Evaluation Metrics

Created on: 2011-06-15 08:57:52

 

Levenshtein-Based Measures
A first group of measures is inherited from speech recognition and is based on computing the edit distance between the candidate translation and the reference. This distance can be computed using simple dynamic programming algorithms.
Word error rate (WER) (Niessen et al., 2000) is the total number of insertions, deletions, and substitutions, normalized by the length of the reference sentence. A slight variant (WERg) normalizes this value by the length of the Levenshtein path, i.e., the sum of insertions, deletions, substitutions, and matches. This ensures that the measure lies between zero (when the produced sentence is identical to the reference) and one (when the candidate must be entirely deleted and all words in the reference must be inserted).
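A minimal sketch of both measures, assuming whitespace tokenization and a single, non-empty reference. The function names are hypothetical, and one choice here is an assumption not fixed by the text: when several minimal-edit alignments exist, the shortest Levenshtein path is used for WERg.

```python
def edit_ops(candidate, reference):
    """Return (edits, path_length) of a minimal-edit Levenshtein alignment."""
    c, r = candidate.split(), reference.split()
    # prev[j] holds (edit count, path length) for aligning c[:i-1] with r[:j]
    prev = [(j, j) for j in range(len(r) + 1)]
    for i in range(1, len(c) + 1):
        cur = [(i, i)]
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            match_or_sub = (prev[j - 1][0] + cost, prev[j - 1][1] + 1)
            deletion = (prev[j][0] + 1, prev[j][1] + 1)       # drop a candidate word
            insertion = (cur[j - 1][0] + 1, cur[j - 1][1] + 1)  # add a reference word
            # Tuples compare by edit count first, then by path length (assumed tie-break)
            cur.append(min(match_or_sub, deletion, insertion))
        prev = cur
    return prev[-1]


def wer(candidate, reference):
    """Edits normalized by reference length; can exceed 1."""
    edits, _ = edit_ops(candidate, reference)
    return edits / len(reference.split())


def werg(candidate, reference):
    """Edits normalized by Levenshtein path length; always in [0, 1]."""
    edits, path = edit_ops(candidate, reference)
    return edits / path if path else 0.0
```

For a candidate of four words against a two-word reference that matches its first two words, `wer` gives 1.0 while `werg` gives 0.5, which illustrates why the path-length normalization keeps the value bounded.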
Position-independent word error rate (PER) (Tillmann et al., 1997b) is a variant that does not take into account the relative position of words: it simply computes the size of the intersection of the bags of words of the candidate and the reference, seen as multi-sets, and normalizes it by the size of the bag of words of the reference.
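The bag-of-words overlap described above can be sketched as follows, assuming whitespace tokenization and a non-empty reference. The function name is hypothetical, and the sketch returns only the normalized overlap; how that quantity is turned into an error rate (for example by subtracting it from one) is not specified in the text and is left out.

```python
from collections import Counter


def bag_overlap(candidate, reference):
    """Size of the multiset intersection, normalized by reference length."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter & Counter keeps, for each word, the minimum of the two counts
    matched = sum((cand & ref).values())
    return matched / sum(ref.values())
```

Because word order is ignored, any reordering of the reference scores a full overlap, while repeated words in the candidate are only credited as often as they occur in the reference.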
A large U.S. government project called “Global Autonomous Language Exploitation” (GALE) introduced another variant called the translation edit rate (TER) (Snover et al., 2006). Like WER, TER counts the minimal number of insertions, deletions, and substitutions, but unlike WER it introduces a further unit-cost operation, called a “shift,” which moves a whole substring from one place to another in the sentence.
In the same project a further, semiautomatic measure is also used: the human-targeted translation edit rate (HTER). While WER and TER only consider a pre-defined set of references and compare candidates to them, in computing HTER a human is instructed to perform the minimal number of operations needed to turn the candidate translation into a grammatical and fluent sentence that conveys the same meaning as the references. Not surprisingly, Snover et al. (2006) show that HTER correlates with human judgments considerably better than TER, BLEU, and METEOR (see below), which are fully automatic.
N-Gram–Based Measures
A second group of measures, by far the most widespread, is based on notions derived from information retrieval, applied to the n-grams of different length that appear in the candidate translation. In particular, the basic element is the clipped n-gram precision, i.e., the fraction of n-grams in a set of translated sentences that can be found in the respective references.
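A minimal sketch of clipped n-gram precision for a single candidate sentence, assuming whitespace tokenization. The function names are hypothetical. With several references, each candidate n-gram is credited at most as many times as it occurs in the reference that contains it most often.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipping."""
    cand = ngrams(candidate.split(), n)
    max_ref = Counter()
    for ref in references:
        # Counter union keeps the maximum count seen in any single reference
        max_ref |= ngrams(ref.split(), n)
    # Counter intersection clips each candidate count to the reference maximum
    clipped = sum((cand & max_ref).values())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

Clipping is what prevents a degenerate candidate such as "the the the the the the the" from scoring a perfect unigram precision: against references in which "the" occurs at most twice, only two of its seven unigrams are credited.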
BLEU (Papineni et al., 2002) is the geometric mean of clipped n-gram precisions for different n-gram lengths (usually from one to four), multiplied by a factor (brevity penalty) that penalizes producing short sentences containing only highly reliable portions of the translation.
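In symbols, the definition above is commonly written as follows. The notation is an assumption introduced here, not taken from the text: p_n is the clipped n-gram precision, w_n = 1/N are uniform weights (N = 4 in the usual setting), c is the total length of the candidate translations, and r is the reference length.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```

With uniform weights the exponential term is exactly the geometric mean of p_1, ..., p_N, so a single zero precision drives the whole score to zero.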
BLEU was the starting point for a measure that was used in evaluations organized by the U.S. National Institute of Standards and Technology (NIST), and that has since been referred to as the NIST score (Doddington, 2002). NIST is the arithmetic mean of clipped n-gram precisions for different n-gram lengths, also multiplied by a (different) brevity penalty. In addition, when computing the NIST score, n-grams are weighted according to their frequency, so that less frequent (and thus more informative) n-grams are given more weight.
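The frequency weighting can be illustrated with the information weight usually associated with the NIST score. This is given as a sketch under the assumption that counts are taken over the set of reference translations; the notation is introduced here and is not part of the text.

```latex
\mathrm{Info}(w_1 \ldots w_n) =
\log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}
```

An n-gram that rarely follows its (n-1)-word prefix has a small denominator and therefore a large weight, which is the sense in which less frequent n-grams count as more informative.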
The Importance of Recall
BLEU and NIST are forced to include a brevity penalty because they are based only on n-gram precision. N-gram recall was not introduced because it was not immediately obvious how to define it meaningfully when multiple reference translations are available. A way to do so was presented in Melamed et al. (2003): the general text matcher (GTM) measure relies on first finding a maximum matching between a candidate translation and a set of references, and then computing the ratio between the size of this matching (modified to favor long contiguous matching n-grams) and the length of the translation (for precision) or the mean length of the references (for recall). The harmonic mean of precision and recall can then be taken to provide the F-measure, familiar in natural language processing. Two very similar measures are ROUGE-L and ROUGE-W, derived from automatic quality measures used for assessing document summaries and extended to MT (Lin and Och, 2004a). ROUGE-S, introduced in the same paper, computes precision, recall, and F-measure based on skip-bigram statistics, i.e., on the number of bigrams possibly interrupted by gaps.
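A rough sketch of precision, recall, and F-measure in the spirit of ROUGE-L and ROUGE-S, assuming whitespace tokenization, a single reference, non-empty sentences, and an unweighted harmonic mean. The function names are hypothetical, and the use of the longest common subsequence as the matching for the ROUGE-L style score is stated here as an assumption rather than taken from the text.

```python
from collections import Counter
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def prf(matched, cand_total, ref_total):
    """Precision, recall, and their harmonic mean (F-measure)."""
    p, r = matched / cand_total, matched / ref_total
    f = 2 * p * r / (p + r) if matched else 0.0
    return p, r, f


def lcs_scores(candidate, reference):
    """ROUGE-L style scores: matches are the longest common subsequence."""
    c, r = candidate.split(), reference.split()
    return prf(lcs_length(c, r), len(c), len(r))


def skip_bigram_scores(candidate, reference):
    """ROUGE-S style scores: matches are in-order word pairs with any gap."""
    c = Counter(combinations(candidate.split(), 2))
    r = Counter(combinations(reference.split(), 2))
    return prf(sum((c & r).values()), sum(c.values()), sum(r.values()))
```

Both functions return recall alongside precision, so no brevity penalty is needed: a candidate that is too short is penalized directly through its low recall.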
A further measure, which can be seen as a generalization of both BLEU and ROUGE (both -L and -S), is BLANC (Lita et al., 2005). In BLANC the score is computed as a weighted sum of all matches of all subsequences (i.e., n-grams possibly interrupted by gaps) between the candidate translation and the reference. Parameters of the scoring function can be tuned on corpora for which human judgments are available in order to improve correlation with adequacy, fluency, or any other measure that is deemed relevant.

Finally, the proposers of METEOR (Banerjee and Lavie, 2005) put more weight on recall than on precision in the harmonic mean, as they observed that this improved correlation with human judgment. METEOR also allows matching words which are not identical, based on stemming and possibly on additional linguistic processing.
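The recall-weighted harmonic mean can be sketched as below. The function name and the `recall_weight` parameter are hypothetical; the default of 9 is the ratio I recall from the original METEOR paper and should be treated as an assumption. The sketch covers only the weighted mean and leaves out the other components of METEOR, such as stemming-based matching.

```python
def weighted_harmonic_mean(precision, recall, recall_weight=9.0):
    """Harmonic mean of precision and recall, with recall weighted more heavily.

    recall_weight=1 gives the ordinary F-measure 2PR / (P + R);
    recall_weight=9 (assumed default) gives 10PR / (R + 9P).
    """
    if precision == 0 or recall == 0:
        return 0.0
    return ((1 + recall_weight) * precision * recall
            / (recall + recall_weight * precision))
```

With the assumed default, a translation with precision 0.5 and recall 1.0 scores about 0.91, while one with precision 1.0 and recall 0.5 scores about 0.53, which shows how strongly missing content is penalized relative to extra content.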

 

Photo by Plamen Invanov©
