A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models (2309.11674v2)

Published 20 Sep 2023 in cs.CL

Abstract: Generative LLMs have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially those with moderate model sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of these moderate LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced LLM-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation.

References (54)

Citations (34)

View on Semantic Scholar

Summary

The paper introduces ALMA, a two-stage fine-tuning method that first leverages non-English monolingual data before refining with high-quality parallel data.
It achieves an average boost of 12 BLEU/COMET points across 10 translation directions, outperforming larger models like GPT-3.5 and NLLB-54B.
The strategy reduces data and computational demands, unlocking the latent linguistic potential of smaller LLMs for efficient multilingual NLP.

A Paradigm Shift in Machine Translation: Boosting Translation Performance of LLMs

The paper "A Paradigm Shift in Machine Translation: Boosting Translation Performance of LLMs" presents a refined approach for enhancing the translation capabilities of LLMs with a focus on modest-sized models (specifically, those with 7B or 13B parameters). Unlike traditional supervised encoder-decoder models, these LLMs have historically underperformed in translation tasks, particularly when not leveraging large, diverse datasets.

The proposed method, termed ALMA (Advanced LLM-based Translator), departs from conventional reliance on vast parallel corpora. It introduces a bifurcated fine-tuning paradigm: an initial fine-tuning step on non-English monolingual data followed by a targeted fine-tuning on a small set of high-quality parallel data. This process mitigates the excessive demand for parallel data, aiming to exploit the inherent linguistic knowledge of LLMs more effectively.

Key Findings and Numerical Results

The paper demonstrates that applying this novel fine-tuning strategy to LLaMA-2 models results in significant translation improvements. Empirical evaluations show that the ALMA models surpass their zero-shot language translation performance by an impressive 12 BLEU and COMET score improvements on average, across 10 translation directions from the WMT'21 and WMT'22 datasets. These results are noteworthy considering the model size limitations; ALMA's 7B and 13B models outperform notably larger models such as GPT-3.5-text-davinci-003 and even NLLB-54B.

Additionally, the research underscores that merely training on 1B monolingual tokens could achieve results comparable to the best existing models, significantly optimizing computational efficiency—a significant breakthrough given the reduced resources required.

Implications and Future Directions

The two-stage fine-tuning paradigm not only offers a practical solution to the challenges posed by smaller LLMs in translation but also prompts a revisitation of training approaches for NLP tasks. The findings suggest that LLMs possess an untapped cross-linguistic potential that can be unlocked via strategic data usage rather than sheer volume.

From a theoretical standpoint, the work prompts further refinement in understanding how linguistic proficiency is encoded within LLMs and how it can be best harnessed without extensive datasets. Practically, this approach trims both training times and necessary data volumes, thereby enhancing the accessibility and deployment scope of LLMs for diverse translation applications.

Looking ahead, this methodology could potentially be generalized for other multilingual NLP tasks beyond translation, augmenting the overall applicability and efficiency of LLMs in global, resource-constrained environments.

In conclusion, this paper presents a well-founded method for advancing machine translation with LLMs, proposing a shift from data-heavy training approaches to more efficient and strategically fine-tuned methodologies.