Statistical machine translation systems conduct a nonexhaustive search of the (extremely large) space of all possible translations by keeping a list of the current n-best candidates. In practice, it was observed that the ranking of the candidates within the n-best list can be fairly poor, which means that the system is unable to return the best of the available N translations. In this chapter we propose a novel algorithm for reranking these n-best candidates. Our approach was successfully applied to large-scale, state-of-the-art commercial systems that are trained on up to three orders of magnitude more data than previously reported in reranking studies. In order to reach this goal, we create an ensemble of rerankers that are trained in parallel, each of them using just a fraction of the available data. Our empirical evaluation on two mature language pairs, Chinese-English and French-English, shows improvements of around 0.5 and 0.2 BLEU on corpora of 80 million and 1.1 billion words, respectively.
Statistical machine translation (SMT) systems, which are trained on parallel corpora of bilingual text (e.g., French and English), typically work as follows: for each sentence to be translated, they generate a plethora of possible translations, from which they keep a smaller n-best list of the most likely translations. Even though the typical n-best list contains mostly high-quality candidates, the actual ranking is far from accurate: as shown in Och et al. (2004), the n-best list usually contains many translations that are of higher quality than the top-ranked candidate.
In order to deal with this problem, researchers have proposed a variety of reranking approaches. Previous work (Liang et al., 2006; Shen and Joshi, 2005; Roark et al., 2004) shows that machine learning can be successfully used for this task. However, as the existing experiments were conducted on small corpora (the rerankers were trained on less than 1 million words), it is unclear how the results scale to real-world applications, in which one has to deal with several orders of magnitude more data.
We present here an efficient approach for building reranking systems for commercial, large-scale SMT systems. In this chapter, we make two main contributions. First, we introduce a novel reranking algorithm that can be trained efficiently on more than a billion words. In order to speed up its training process, we learn in parallel an ensemble of rerankers, each of which is using only a fraction of the training data. Second, we show that such rerankers can improve the performance of stateof-the-art SMT systems; more precisely, we used our novel approach for two mature language pairs, Chinese-English and French-English, which were trained on 80 million and 1.1 billion words, respectively. Our empirical results show that reranking has improved the quality of the translation for both systems by approximately 0.5 and 0.2 BLEU points (Papineni et al., 2002), respectively. Our reranking algorithm is designed to improve the performance of a baseline SMT system similar to the one described in Och et al. (2004). In keeping with the typical SMT methodology, such systems are trained on a large training set, tuned on a small development set, and then evaluated on a blind test set. In the remainder of this section we explain how the various data sets are used to build the system.
In order to build an SMT system that translates from the source to the target language (say, French into English), one starts with a large training corpus (i.e., parallel text) that is used to learn the system’s translation and language models.
The former consists of phrase pairs that are likely translations of each other, while the latter consists of the probabilities of the various phrases in the target language. For example, the translation model may learn that, according to the (imperfect) training data, it is highly likely that J’ai faim should be translated into I’m hungry, even though it may be sometimes be translated—less accurately—into I’m so hungry! In contrast, the language model contains the probabilities of encountering various English phrases in a large English corpus; e.g., it may be significantly more likely to encounter the phrase I’m hungry than I’m so hungry!
In practice, a typical training set is a large corpus (e.g., billions of words) that is obtained by concatenating documents from a plethora of genres that were translated with various levels of accuracy. Consequently, in order to optimize the system’s performance, researchers perform an additional tuning step in which they use a development set; this development set is fairly small in size (typically around 1000 sentences), but contains high-quality translations of genres similar to the ones in the blind test set.
During this tuning phase, for each source sentence in the development set, the system adjusts the weights of a log-linear model to favor the most accurate among the n-best translations (i.e., the one that is most similar to the reference translation). This log-linear model combines a variety of features such as the probabilities obtained from the language and translation models, the length (in words) of the translated sentence, etc. The ranking within each n-best list output by the SMT system is based on the so-called decoder cost (Och et al., 2004), which consists of the weighted combination of all the features according to the log-linear model.
In this chapter, we introduce a reranking algorithm that takes as input the rankedn-best lists for all sentences in both the training and development sets. Besidesthe decoder cost described above, we also use as features the phrases from thetranslation model that were used to generate the translations in the n-best list. Wewill show that our reranker improves the performance of the baseline system, andthat it can be efficiently trained on n-best lists from tens of millions of sentencesto find the most salient among the millions of available features.

Photo by Plamen Invanov©
Comments
No comments yet. Be the first to comment.
Would you like to comment?