
Adapting Large Language Models for Document-Level Machine Translation (2401.06468v4)

Published 12 Jan 2024 in cs.CL

Abstract: LLMs have significantly advanced various NLP tasks. Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning. This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs. We first investigate the impact of prompt strategies on translation performance and then conduct extensive experiments using two fine-tuning methods, three LLM backbones, and 18 translation tasks across nine language pairs. Our results show that specialized models can sometimes surpass GPT-4 in translation performance but still face issues like off-target translation due to error propagation in decoding. We provide an in-depth analysis of these LLMs tailored for DocMT, examining translation errors, discourse phenomena, strategies for training and inference, the data efficiency of parallel documents, recent test set evaluations, and zero-shot crosslingual transfer. Our findings highlight the strengths and limitations of LLM-based DocMT models and provide a foundation for future research.

Introduction to LLMs in Document-Level Translation

The potential of LLMs has been demonstrated consistently across a variety of NLP applications, with a strong track record in tasks such as text generation, summarization, and question answering. In document-level machine translation (DocMT), which seeks to maintain context and coherence across the sentences of a document during translation, these models have produced remarkable but sometimes inconsistent results. This summary reviews extensive research on adapting LLMs for DocMT across multiple language pairs, focusing on how models of different sizes compare under different fine-tuning techniques.
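
To make the document-level setting concrete, the following minimal sketch shows one way a translation prompt could carry preceding document context into the current sentence; the template wording and the `build_docmt_prompt` helper are illustrative assumptions, not the paper's exact prompt format.

```python
# Minimal sketch of a document-level translation prompt.
# The template and helper name are illustrative assumptions,
# not the exact prompt format used in the paper.

def build_docmt_prompt(context_sents, current_sent, src_lang="German", tgt_lang="English"):
    """Concatenate preceding document context with the current source sentence."""
    context = " ".join(context_sents)
    return (
        f"Translate the following {src_lang} document into {tgt_lang}, "
        f"keeping pronouns and terminology consistent with the context.\n\n"
        f"Context: {context}\n"
        f"{src_lang}: {current_sent}\n"
        f"{tgt_lang}:"
    )

prompt = build_docmt_prompt(
    ["Der Arzt betrat das Zimmer.", "Er begrüßte die Patientin."],
    "Sie antwortete leise.",
)
print(prompt)
```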

Exploring Fine-Tuning Strategies for Translation

Moderately-sized LLMs, those containing around 7 billion parameters, were fine-tuned using two approaches: Parameter-Efficient Fine-Tuning (PEFT) and Fully Fine-Tuning (FFT). These methods were assessed with an array of metrics designed to gauge translation quality. Despite strong performance on some tasks, the fine-tuned LLMs still faced challenges such as producing "off-target" translations, where the output is in the wrong language. Moreover, the paper examines the role of prompting strategies during the fine-tuning phase, showing that certain prompt structures can significantly enhance LLM translation capabilities.
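
For readers unfamiliar with the two fine-tuning regimes, the sketch below shows how a roughly 7B-parameter causal LM could be wrapped with LoRA adapters for PEFT using the Hugging Face transformers and peft libraries; the backbone name and hyperparameters are assumptions for illustration rather than the paper's reported configuration.

```python
# Hedged sketch: wrapping a causal LM with LoRA adapters for PEFT.
# The model name and hyperparameters are illustrative assumptions,
# not the exact setup reported in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # any ~7B backbone would fit the setting
tokenizer = AutoTokenizer.from_pretrained(base)  # used to build training examples
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension of the adapters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# Fully fine-tuning (FFT) would instead update all parameters of the base model
# with a standard optimizer loop over the parallel-document training data.
```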

Key Findings in Translation Performance

Comparing the translation quality of these fine-tuned LLMs against other state-of-the-art models yielded several key findings. Fine-tuned LLMs can surpass even GPT-4, one of the largest available models, on certain translation tasks. However, success is selective: in other scenarios the same models failed completely because of off-target translations. Notably, the smaller fine-tuned LLMs displayed fewer translation errors even when their metric scores were comparable to those of larger models. The two fine-tuning methods also differed in data efficiency; for instance, the FFT method required only about 1% of the full dataset to match the performance achieved with the whole set, while PEFT needed about 10%.
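
One practical way to quantify the off-target problem described above is to run a language identifier over the system outputs; the sketch below uses the langid package as an assumed detector, not the paper's own evaluation pipeline.

```python
# Hedged sketch: flagging off-target translations with a language identifier.
# Using `langid` here is an assumption for illustration; the paper's evaluation
# may rely on a different detection method.
import langid

def off_target_rate(hypotheses, target_lang="de"):
    """Fraction of system outputs whose detected language is not the intended target."""
    flagged = sum(1 for hyp in hypotheses if langid.classify(hyp)[0] != target_lang)
    return flagged / max(len(hypotheses), 1)

outputs = [
    "Der Hund schläft auf dem Sofa.",    # on-target German output
    "The dog is sleeping on the sofa.",  # off-target English output
]
print(f"off-target rate: {off_target_rate(outputs):.2f}")
```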

Advancements and Implications for DocMT

The findings also bear on how LLM-based systems compare with conventional document-level machine translation models. When evaluated on recently created test sets, the fine-tuned LLMs generalized better to out-of-domain text than conventional DocMT models. Furthermore, the paper found that base LLMs with task-specific supervised fine-tuning exhibit stronger zero-shot cross-lingual transfer than their instruction-tuned counterparts.

Taken together, the evidence suggests that fine-tuning LLMs on parallel documents can unlock sophisticated document-level translation abilities and yield clear improvements over existing DocMT models. Such models may be particularly advantageous for low-resource languages and could reshape translation approaches for diverse language pairs. The paper lays a solid foundation for ongoing research and development in machine translation, pointing the way toward more refined, contextually aware, and accurate translation systems.

Authors (5)
  1. Minghao Wu (31 papers)
  2. Thuy-Trang Vu (23 papers)
  3. Lizhen Qu (68 papers)
  4. George Foster (24 papers)
  5. Gholamreza Haffari (141 papers)
Citations (30)