A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models (2309.11674v2)

Published 20 Sep 2023 in cs.CL

Abstract: Generative LLMs have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially those with moderate model sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of these moderate LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced LLM-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation.

A Paradigm Shift in Machine Translation: Boosting Translation Performance of LLMs

The paper "A Paradigm Shift in Machine Translation: Boosting Translation Performance of LLMs" presents a refined approach for enhancing the translation capabilities of LLMs with a focus on modest-sized models (specifically, those with 7B or 13B parameters). Unlike traditional supervised encoder-decoder models, these LLMs have historically underperformed in translation tasks, particularly when not leveraging large, diverse datasets.

The proposed method, termed ALMA (Advanced LLM-based trAnslator), departs from the conventional reliance on vast parallel corpora. It introduces a two-stage fine-tuning paradigm: an initial fine-tuning step on non-English monolingual data, followed by targeted fine-tuning on a small set of high-quality parallel data. This process reduces the demand for parallel data and aims to exploit the inherent linguistic knowledge of LLMs more effectively.
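
To make the recipe concrete, below is a minimal, self-contained sketch of the two-stage procedure built on Hugging Face Transformers. The toy datasets, prompt wording, and hyperparameters are illustrative assumptions rather than the authors' exact configuration (the released setup may differ, for example in how the stage-2 loss is computed).

```python
# Minimal sketch of the two-stage recipe, written against Hugging Face
# Transformers. The toy datasets, prompt format, and hyperparameters are
# illustrative assumptions, not the authors' exact setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"          # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token              # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=512)

# Stage 1: continued causal-LM training on non-English monolingual text.
mono = Dataset.from_dict({"text": [
    "Guten Morgen, wie geht es dir?",
    "Maschinelle Übersetzung ist ein spannendes Forschungsfeld.",
]}).map(tokenize, batched=True, remove_columns=["text"])

# Stage 2: a small set of high-quality parallel pairs rendered as prompts
# (the prompt wording here is a placeholder, not the paper's exact template).
parallel = Dataset.from_dict({"text": [
    "Translate from German to English.\nGerman: Guten Morgen.\nEnglish: Good morning.",
]}).map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tok, mlm=False)  # labels = input ids

for out_dir, data, lr in [("stage1-monolingual", mono, 2e-5),
                          ("stage2-parallel", parallel, 1e-5)]:
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                             per_device_train_batch_size=1, learning_rate=lr)
    Trainer(model=model, args=args, train_dataset=data,
            data_collator=collator).train()
```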

Key Findings and Numerical Results

The paper demonstrates that applying this two-stage fine-tuning strategy to LLaMA-2 models yields significant translation improvements. Empirical evaluations show that the ALMA models improve on the base model's zero-shot translation performance by an average of more than 12 BLEU and 12 COMET points across 10 translation directions from the WMT'21 and WMT'22 test sets. These results are especially noteworthy given the model sizes: ALMA's 7B and 13B models outperform much larger systems such as GPT-3.5-text-davinci-003 and even NLLB-54B.
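
For context on how such scores are computed, the sketch below evaluates translations with sacreBLEU and the COMET-22 metric via the unbabel-comet package; the sentences are placeholders, and the paper's exact evaluation settings may differ.

```python
# Hedged sketch of a BLEU/COMET evaluation using the sacrebleu and
# unbabel-comet packages; the sentences below are placeholders.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["Guten Morgen."]
hypotheses = ["Good morning."]   # system outputs
references = ["Good morning."]   # human references

# Corpus-level BLEU (sacreBLEU expects a list of reference streams).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET-22 system score (reference-based metric from the WMT'22 shared task).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {comet_model.predict(data, batch_size=8, gpus=0).system_score:.4f}")
```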

Additionally, the research shows that fine-tuning on as little as 1B monolingual tokens can already achieve results comparable to the best existing systems, substantially reducing the compute and data required for training.

Implications and Future Directions

The two-stage fine-tuning paradigm not only offers a practical solution to the challenges smaller LLMs face in translation but also prompts a reconsideration of training approaches for other NLP tasks. The findings suggest that LLMs possess untapped cross-lingual potential that can be unlocked through strategic data usage rather than sheer volume.

From a theoretical standpoint, the work motivates further study of how linguistic proficiency is encoded within LLMs and how it can best be harnessed without extensive parallel datasets. Practically, the approach trims both training time and data requirements, broadening the accessibility and deployment scope of LLMs for diverse translation applications.

Looking ahead, this methodology could be generalized to other multilingual NLP tasks beyond translation, extending the applicability and efficiency of LLMs in resource-constrained environments.

In conclusion, this paper presents a well-founded method for advancing machine translation with LLMs, proposing a shift from data-heavy training approaches to more efficient and strategically fine-tuned methodologies.

Authors (4)
  1. Haoran Xu
  2. Young Jin Kim
  3. Amr Sharaf
  4. Hany Hassan Awadalla