Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation (2401.06688v2)

Published 12 Jan 2024 in cs.CL and cs.LG

Abstract: Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method that synthesizes translations using a quality estimation metric (QE), which correlates better with human judgments. QE-fusion leverages a pool of candidates sampled from a model, combining spans from different candidates using a QE metric such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to LLMs used for translation (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool.

Combining Machine Translation Hypotheses Using Quality Estimation: An Overview

Introduction

The paper "Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation" by Giorgos Vernikos and Andrei Popescu-Belis proposes QE-fusion, an innovative methodology designed to enhance neural machine translation (NMT) outputs. Traditionally, NMT systems estimate the probability of target sentences based on source sentences, often leveraging beam search and reranking techniques to enhance translation quality. However, such methods exhibit limitations, especially when candidate outputs contain complementary errors.

Methodology

The central contribution of this work is QE-fusion, an algorithm that synthesizes improved translations by combining spans from multiple candidates using quality estimation metrics like CometKiwi. Unlike beam search, QE-fusion begins with a pool of candidates generated via sampling techniques (e.g., nucleus sampling for LLMs and epsilon sampling for multilingual translation models). It identifies divergent spans among these candidates and creates new hypotheses, incrementally integrating spans that contribute the highest estimated quality according to the QE metric.
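
To make the combine-then-score loop concrete, here is a minimal greedy sketch. Everything in it, the function names, the word-level alignment via Python's difflib, and the hill-climbing stopping rule, is a simplification under stated assumptions, not the authors' exact algorithm:

```python
from difflib import SequenceMatcher

def qe_fusion_sketch(source, candidates, qe_score):
    """Greedy span fusion over a candidate pool.

    qe_score(source, hypothesis) -> float is assumed to be any
    reference-free QE metric (e.g., CometKiwi). Starting from the
    best-scoring candidate, swap in divergent spans from the other
    candidates whenever a swap raises the estimated quality.
    """
    best = max(candidates, key=lambda c: qe_score(source, c))
    improved = True
    while improved:
        improved = False
        base = best.split()
        for other in candidates:
            alt = other.split()
            # Word-level alignment exposes the spans where the two
            # hypotheses diverge ('replace', 'insert', 'delete').
            for op, i1, i2, j1, j2 in SequenceMatcher(None, base, alt).get_opcodes():
                if op == "equal":
                    continue
                variant = " ".join(base[:i1] + alt[j1:j2] + base[i2:])
                if qe_score(source, variant) > qe_score(source, best):
                    best, improved = variant, True
    return best
```

In the paper, candidates come from nucleus sampling (for LLMs) or epsilon sampling (for NLLB), and divergent spans are identified across the whole pool rather than pairwise; the sketch only conveys the incremental span-selection idea.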

Experimental Setup

The paper presents a rigorous evaluation framework encompassing multiple LLMs (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and multilingual NMT models (NLLB) across five language pairs, using the WMT22 and Flores-200 datasets. Performance is measured with BLEU, chrF, COMET, and BLEURT, with particular weight given to the neural metrics because they correlate better with human judgments.
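
For readers who want to reproduce the QE side of this setup, segment-level CometKiwi scores can be obtained with the unbabel-comet package. A minimal sketch; note that the wmt22-cometkiwi-da checkpoint is gated on Hugging Face, so it requires accepting the model license and logging in first:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# CometKiwi is reference-free: it scores (source, hypothesis) pairs.
model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

data = [
    {"src": "Der Vertrag wurde gestern unterzeichnet.",
     "mt": "The contract was signed yesterday."},
]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if available
print(output.scores)  # one quality estimate per segment
```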

Key Findings

Performance Improvements

QE-fusion consistently outperforms both traditional beam search and advanced reranking techniques, including Minimum Bayes Risk (MBR) decoding and QE-reranking. LLMs benefit most, because they produce more diverse candidate pools for the fusion step to draw on. The method generates novel translations, i.e. outputs present in none of the pooled candidates, in over half of the cases evaluated, indicating its ability to produce outputs that the model might not generate independently.
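
For contrast, the two strongest baselines are pure selection rules over the same pool. A schematic sketch with simplified, illustrative signatures: both return an existing candidate verbatim, whereas QE-fusion may return a string found in no candidate:

```python
def qe_rerank(source, candidates, qe_score):
    # QE-reranking: keep the single hypothesis the QE metric scores highest.
    return max(candidates, key=lambda h: qe_score(source, h))

def mbr_decode(candidates, utility):
    # Minimum Bayes Risk: keep the hypothesis with the highest total
    # utility (e.g., COMET) against the other candidates, which act
    # as pseudo-references.
    return max(
        candidates,
        key=lambda h: sum(utility(h, r) for r in candidates if r is not h),
    )
```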

Scalability

The authors empirically establish that QE-fusion scales linearly with the number of candidates, a critical factor given the computational expense associated with quality estimation metrics. This computational efficiency makes QE-fusion a practical choice for real-world applications without the need for retraining the underlying translation models.
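
A back-of-the-envelope cost model makes the difference visible (the notation and constants here are illustrative, not the paper's): sampling-based MBR evaluates every hypothesis against every other, while QE-fusion needs roughly one QE call per candidate plus a few per attempted span swap:

```python
def mbr_utility_calls(n):
    # Pairwise utility evaluations: quadratic in the pool size.
    return n * (n - 1)

def qe_fusion_qe_calls(n, variants_per_candidate=4):
    # One initial score per candidate plus a handful of variant scores;
    # assumes the number of divergent spans grows slowly with n.
    return n + variants_per_candidate * n

for n in (5, 25, 100, 200):
    print(f"n={n:>3}  MBR={mbr_utility_calls(n):>6}  QE-fusion~{qe_fusion_qe_calls(n):>5}")
```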

Theoretical and Practical Implications

Theoretically, QE-fusion challenges the conventional reranking paradigm by demonstrating that combining candidate spans can lead to superior translations. Practically, this approach circumvents the need for expensive retraining of LLMs, offering a more efficient path toward translation improvement. The results suggest potential applications beyond MT, such as enhancing general language generation tasks through integration with reward models from Reinforcement Learning from Human Feedback (RLHF).

Future Directions

Future research could focus on:

  1. Extending QE-fusion to Other Domains: Applying the combination strategy to diverse language generation tasks.
  2. Optimizations: Further reducing computational costs through advanced techniques such as pruning or model distillation.
  3. Handling Low-Resource Languages: Investigating the efficacy of QE-fusion in low-resource language pairs, possibly incorporating external linguistic resources to boost performance.

Conclusion

QE-fusion provides a robust way to exploit the diversity of model-generated candidates: rather than selecting one hypothesis, it assembles a better one from their spans. Its consistent gains over beam search and over strong reranking baselines, obtained without retraining the underlying models and at a cost that grows only linearly with the pool size, make it a scalable and practical route to higher-quality machine translation.
