
TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement (2402.16379v3)

Published 26 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have achieved impressive results in Machine Translation (MT). However, careful evaluations by humans reveal that the translations produced by LLMs still contain multiple errors. Importantly, feeding such error information back into the LLMs can lead to self-refinement and improved translation performance. Motivated by these insights, we introduce a systematic LLM-based self-refinement translation framework named TEaR (Translate, Estimate, and Refine), marking a significant step forward in this direction. Our findings demonstrate that: 1) our self-refinement framework successfully helps LLMs improve their translation quality across a wide range of languages, whether from high-resource languages to low-resource ones, and whether English-centric or centered around other languages; 2) TEaR exhibits superior systematicity and interpretability; 3) different estimation strategies yield varied impacts, directly affecting the effectiveness of the final corrections. Additionally, traditional neural translation models and evaluation models operate separately, each often limited to a single task, whereas general-purpose LLMs can undertake both tasks simultaneously. We further conduct cross-model correction experiments to investigate the potential relationship between the translation and evaluation capabilities of general-purpose LLMs. Our code and data are available at https://github.com/fzp0424/self_correct_mt
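The Translate, Estimate, and Refine loop described in the abstract can be sketched as a short pipeline. This is a minimal illustration, not the paper's implementation: the `llm` function is a hypothetical stand-in for any chat-completion call, and the prompt wording and the "no errors" stopping condition are assumptions for the sketch.

```python
def llm(prompt: str) -> str:
    # Stub so the sketch runs end to end; in practice this would call
    # a real LLM API (hypothetical stand-in, not the paper's prompts).
    if prompt.startswith("Translate"):
        return "Bonjour le monde"
    if prompt.startswith("Estimate"):
        return "No errors found"
    return prompt


def tear_translate(source: str, src_lang: str, tgt_lang: str,
                   max_rounds: int = 1) -> str:
    # 1) Translate: draft translation from the LLM.
    draft = llm(f"Translate from {src_lang} to {tgt_lang}: {source}")
    for _ in range(max_rounds):
        # 2) Estimate: ask the (same or another) LLM to assess the
        #    draft and report translation errors.
        feedback = llm(f"Estimate errors in this {tgt_lang} translation "
                       f"of '{source}': {draft}")
        if "no errors" in feedback.lower():
            break  # estimation found nothing left to fix
        # 3) Refine: feed the error report back to obtain a corrected draft.
        draft = llm(f"Refine the translation '{draft}' of '{source}' "
                    f"given these errors: {feedback}")
    return draft


print(tear_translate("Hello world", "English", "French"))
```

Because the estimate and refine stages are just further LLM calls, the same loop also supports the paper's cross-model setting, where a different model performs the estimation than the one that translated.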

Authors (10)
  1. Zhaopeng Feng (13 papers)
  2. Yan Zhang (954 papers)
  3. Hao Li (803 papers)
  4. Wenqiang Liu (18 papers)
  5. Jun Lang (5 papers)
  6. Yang Feng (230 papers)
  7. Jian Wu (314 papers)
  8. Zuozhu Liu (78 papers)
  9. Bei Wu (6 papers)
  10. Jiayu Liao (2 papers)
Citations (3)