
Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations (2404.07851v1)

Published 11 Apr 2024 in cs.CL and cs.AI

Abstract: Machine Translation (MT) remains one of the last NLP tasks where LLMs have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality, derived from Multidimensional Quality Metric (MQM) annotations. Working with LLaMA-2 models, we consider prompting strategies varying the nature of feedback provided and then fine-tune the LLM to improve its ability to exploit the provided guidance. Through experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU and COMET scores, although the benefits of fine-grained feedback are not clear. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation.

Enhancing Machine Translation Quality through Post-Editing with LLMs

Introduction to the Study

The integration of LLMs into Machine Translation (MT) has become a focal point of NLP research, with the aim of leveraging their generative capabilities to improve translation quality. This paper presents a methodology that combines LLMs' capabilities with supervised MT systems through a post-editing framework guided by external feedback on translation quality, using the LLaMA-2 model series as a case study.

Background and Related Work

While LLMs such as ChatGPT show promising results, dedicated supervised systems still outperform LLMs in many language pairs, if only marginally. These persistent gaps and the uneven performance across different LLMs have prompted researchers to investigate the complementary strengths of LLMs and supervised systems. This paper builds on several strands of research: the potential of LLMs for text refinement, MT error annotation practices that provide actionable feedback rather than aggregate scores, and automatic post-editing approaches that refine MT outputs. Notably, whereas existing work often relies on closed LLMs for post-editing, this paper uses open-source models, namely the 7B and 13B variants of LLaMA-2, to evaluate their efficacy in a post-editing capacity.

Methodological Approach

The paper explores two strategies: prompting and fine-tuning LLMs with feedback of varying granularity to enhance post-editing performance. The feedback categories include generic prompts, score-based prompts conveying overall MT quality scores, and fine-grained feedback providing detailed error annotations. The fine-grained feedback is further distinguished by the source of error annotations: human annotators versus automatic tools. This diversified approach allows a comprehensive examination of how the nature of feedback affects post-editing outcomes.
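To make the feedback conditions concrete, the sketch below illustrates how post-editing prompts of varying granularity might be assembled from MQM-style annotations. This is a minimal illustration under assumed field names and wording, not the paper's exact prompt templates.

```python
# Minimal sketch of prompt construction for MT post-editing with feedback of
# varying granularity. Templates and field names are illustrative assumptions,
# not the exact prompts used in the paper.

def build_prompt(source, mt_output, feedback_type, mqm_errors=None, score=None):
    """Assemble a post-editing prompt for an instruction-following LLM."""
    prompt = (
        "Improve the following machine translation of the source sentence.\n"
        f"Source: {source}\n"
        f"MT output: {mt_output}\n"
    )
    if feedback_type == "generic":
        prompt += "The translation may contain errors. Please post-edit it."
    elif feedback_type == "score":
        prompt += f"The translation received an overall quality score of {score}. Please post-edit it."
    elif feedback_type == "fine-grained":
        # Each MQM annotation marks an error span, a category, and a severity.
        error_lines = [
            f'- "{e["span"]}" ({e["category"]}, {e["severity"]})' for e in mqm_errors
        ]
        prompt += "The translation contains the following errors:\n" + "\n".join(error_lines)
        prompt += "\nPlease post-edit the translation to fix these errors."
    return prompt


# Example: fine-grained feedback from a single (hypothetical) MQM annotation
# for an English-German segment.
print(build_prompt(
    source="The project was completed yesterday.",
    mt_output="Das Projekt wird morgen abgeschlossen.",
    feedback_type="fine-grained",
    mqm_errors=[{"span": "wird morgen", "category": "accuracy/mistranslation", "severity": "major"}],
))
```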

Experimental Insights

Prompting Performance

In zero- and ten-shot scenarios, prompting with any feedback type generally improved translation quality as measured by TER, BLEU, and COMET across all evaluated language pairs, albeit with marginal gains in zero-shot settings. The improvement was more pronounced in the ten-shot setting, highlighting the potential of few-shot learning to bridge the performance gap between different model sizes. However, the anticipated advantage of fine-grained feedback over generic feedback was not clearly demonstrated in the results.
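For readers who want to reproduce this style of evaluation, the snippet below is a hedged sketch using the sacrebleu and unbabel-comet packages. The COMET checkpoint name and the toy data are assumptions, not necessarily the paper's exact evaluation setup.

```python
# Sketch of automatic evaluation with BLEU, TER, and COMET.
# Assumes the sacrebleu and unbabel-comet packages are installed; the COMET
# checkpoint name and example data are illustrative assumptions.
from sacrebleu.metrics import BLEU, TER
from comet import download_model, load_from_checkpoint

sources = ["这个计划还有待批准。"]
hypotheses = ["The plan still needs to be approved."]   # post-edited outputs
references = ["The plan has yet to be approved."]       # human references

# Corpus-level BLEU and TER (sacrebleu expects a list of reference streams).
bleu = BLEU().corpus_score(hypotheses, [references])
ter = TER().corpus_score(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  TER: {ter.score:.2f}")

# COMET scores the hypothesis against both source and reference.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
comet_out = comet_model.predict(data, batch_size=8, gpus=0)
print(f"COMET: {comet_out.system_score:.4f}")
```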

Fine-Tuning Efficacy

Instruction-based fine-tuning delivered a substantial improvement in translation quality over both the original translations and the prompted outputs. In particular, fine-tuning with fine-grained feedback yielded larger gains than generic feedback, suggesting that fine-tuning enables a more effective use of detailed error annotations. Moreover, multilingual fine-tuning slightly outperformed bilingual configurations, indicating potential benefits from cross-lingual learning.
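As a rough illustration of what instruction-based fine-tuning on feedback-augmented post-editing data might look like, the following sketch uses the Hugging Face transformers, peft, and datasets libraries with LoRA adapters on a LLaMA-2 checkpoint. The dataset path, prompt format, and hyperparameters are assumptions; the paper's exact training recipe may differ.

```python
# Hedged sketch: LoRA instruction fine-tuning of LLaMA-2 for MT post-editing.
# Dataset path, record fields, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with LoRA adapters so only a small fraction of parameters train.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Each record is assumed to hold a feedback-augmented prompt and the post-edited reference.
dataset = load_dataset("json", data_files="postedit_train.jsonl")["train"]

def tokenize(example):
    text = example["prompt"] + "\n" + example["post_edited"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-postedit-lora",
                           per_device_train_batch_size=4, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```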

Implications and Prospects

This research underscores the feasibility of employing moderately-sized, open-source LLMs for MT post-editing tasks, which invites a broader exploration across more diverse language settings and tasks. The findings advocate for the more extensive creation and use of fine-grained error annotations to train LLMs for post-editing, signifying the value of detailed feedback over aggregate scores or generic prompts.

Concluding Remarks

This paper demonstrates that LLMs, even of moderate size, can effectively enhance MT quality through post-editing when guided by external feedback. The superior performance of fine-tuning with fine-grained feedback points to future research directions, emphasizing the creation of richly annotated MT datasets and the assessment of automatic error annotation tools' reliability. The paper opens avenues for further leveraging the synergy between LLMs and supervised MT systems, aiming for continual improvements in translation accuracy and fluency.

Authors
  1. Dayeon Ki
  2. Marine Carpuat