Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation (2401.06688v2)

Published 12 Jan 2024 in cs.CL and cs.LG

Abstract: Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method that synthesizes translations using a quality estimation metric (QE), which correlates better with human judgments. QE-fusion leverages a pool of candidates sampled from a model, combining spans from different candidates using a QE metric such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to LLMs used for translation (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool.

Combining Machine Translation Hypotheses Using Quality Estimation: An Overview

Introduction

The paper "Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation" by Giorgos Vernikos and Andrei Popescu-Belis proposes QE-fusion, an innovative methodology designed to enhance neural machine translation (NMT) outputs. Traditionally, NMT systems estimate the probability of target sentences based on source sentences, often leveraging beam search and reranking techniques to enhance translation quality. However, such methods exhibit limitations, especially when candidate outputs contain complementary errors.

Methodology

The central contribution of this work is QE-fusion, an algorithm that synthesizes improved translations by combining spans from multiple candidates using quality estimation metrics like CometKiwi. Unlike beam search, QE-fusion begins with a pool of candidates generated via sampling techniques (e.g., nucleus sampling for LLMs and epsilon sampling for multilingual translation models). It identifies divergent spans among these candidates and creates new hypotheses, incrementally integrating spans that contribute the highest estimated quality according to the QE metric.
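
To make the combine-then-score loop concrete, here is a minimal greedy sketch. Everything in it, the function names, the word-level alignment via Python's difflib, and the hill-climbing stopping rule, is a simplification under stated assumptions, not the authors' exact algorithm:

```python
from difflib import SequenceMatcher

def qe_fusion_sketch(source, candidates, qe_score):
    """Greedy span fusion over a candidate pool.

    qe_score(source, hypothesis) -> float is assumed to be any
    reference-free QE metric (e.g., CometKiwi). Starting from the
    best-scoring candidate, swap in divergent spans from the other
    candidates whenever a swap raises the estimated quality.
    """
    best = max(candidates, key=lambda c: qe_score(source, c))
    improved = True
    while improved:
        improved = False
        base = best.split()
        for other in candidates:
            alt = other.split()
            # Word-level alignment exposes the spans where the two
            # hypotheses diverge ('replace', 'insert', 'delete').
            for op, i1, i2, j1, j2 in SequenceMatcher(None, base, alt).get_opcodes():
                if op == "equal":
                    continue
                variant = " ".join(base[:i1] + alt[j1:j2] + base[i2:])
                if qe_score(source, variant) > qe_score(source, best):
                    best, improved = variant, True
    return best
```

In the paper, candidates come from nucleus sampling (for LLMs) or epsilon sampling (for NLLB), and divergent spans are identified across the whole pool rather than pairwise; the sketch only conveys the incremental span-selection idea.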

Experimental Setup

The paper presents a rigorous evaluation framework encompassing multiple LLMs (PolyLM, XGLM, Llama2, Mistral, ALMA, and Tower) and multilingual NMT models (NLLB) across five language pairs, using the WMT22 and Flores-200 datasets. Performance is measured with BLEU, chrF, COMET, and BLEURT, with particular weight given to the neural metrics because they correlate better with human judgments.
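
For readers who want to reproduce the QE side of this setup, segment-level CometKiwi scores can be obtained with the unbabel-comet package. A minimal sketch; note that the wmt22-cometkiwi-da checkpoint is gated on Hugging Face, so it requires accepting the model license and logging in first:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# CometKiwi is reference-free: it scores (source, hypothesis) pairs.
model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

data = [
    {"src": "Der Vertrag wurde gestern unterzeichnet.",
     "mt": "The contract was signed yesterday."},
]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if available
print(output.scores)  # one quality estimate per segment
```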

Key Findings

Performance Improvements

QE-fusion consistently outperforms both traditional beam search and advanced reranking techniques, including Minimum Bayes Risk (MBR) decoding and QE-reranking. LLMs benefit most, because they produce more diverse candidate pools for the fusion step to draw on. The method generates novel translations, i.e. outputs present in none of the pooled candidates, in over half of the cases evaluated, indicating its ability to produce outputs that the model might not generate independently.
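
For contrast, the two strongest baselines are pure selection rules over the same pool. A schematic sketch with simplified, illustrative signatures: both return an existing candidate verbatim, whereas QE-fusion may return a string found in no candidate:

```python
def qe_rerank(source, candidates, qe_score):
    # QE-reranking: keep the single hypothesis the QE metric scores highest.
    return max(candidates, key=lambda h: qe_score(source, h))

def mbr_decode(candidates, utility):
    # Minimum Bayes Risk: keep the hypothesis with the highest total
    # utility (e.g., COMET) against the other candidates, which act
    # as pseudo-references.
    return max(
        candidates,
        key=lambda h: sum(utility(h, r) for r in candidates if r is not h),
    )
```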

Scalability

The authors empirically establish that QE-fusion scales linearly with the number of candidates, a critical factor given the computational expense associated with quality estimation metrics. This computational efficiency makes QE-fusion a practical choice for real-world applications without the need for retraining the underlying translation models.
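
A back-of-the-envelope cost model makes the difference visible (the notation and constants here are illustrative, not the paper's): sampling-based MBR evaluates every hypothesis against every other, while QE-fusion needs roughly one QE call per candidate plus a few per attempted span swap:

```python
def mbr_utility_calls(n):
    # Pairwise utility evaluations: quadratic in the pool size.
    return n * (n - 1)

def qe_fusion_qe_calls(n, variants_per_candidate=4):
    # One initial score per candidate plus a handful of variant scores;
    # assumes the number of divergent spans grows slowly with n.
    return n + variants_per_candidate * n

for n in (5, 25, 100, 200):
    print(f"n={n:>3}  MBR={mbr_utility_calls(n):>6}  QE-fusion~{qe_fusion_qe_calls(n):>5}")
```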

Theoretical and Practical Implications

Theoretically, QE-fusion challenges the conventional reranking paradigm by demonstrating that combining candidate spans can lead to superior translations. Practically, this approach circumvents the need for expensive retraining of LLMs, offering a more efficient path toward translation improvement. The results suggest potential applications beyond MT, such as enhancing general language generation tasks through integration with reward models from Reinforcement Learning from Human Feedback (RLHF).

Future Directions

Future research could focus on:

  1. Extending QE-fusion to Other Domains: Applying the combination strategy to diverse language generation tasks.
  2. Optimizations: Further reducing computational costs through advanced techniques such as pruning or model distillation.
  3. Handling Low-Resource Languages: Investigating the efficacy of QE-fusion in low-resource language pairs, possibly incorporating external linguistic resources to boost performance.

Conclusion

QE-fusion provides a robust way to exploit the diversity of model-generated candidates: rather than selecting one hypothesis, it assembles a better one from their spans. Its consistent gains over beam search and over strong reranking baselines, obtained without retraining the underlying models and at a cost that grows only linearly with the pool size, make it a scalable and practical route to higher-quality machine translation.
