Introduction
Advances in neural machine translation (MT) have largely been driven by transformer encoder-decoder architectures. More recently, decoder-only LLMs, such as the GPT series, have shown promising results across NLP tasks, including translation. However, a notable performance gap remains between moderate-sized LLMs (7B to 13B parameters) and both their larger counterparts and conventional translation models. The paper under discussion addresses this gap by examining the limitations of supervised fine-tuning (SFT) and introducing a novel training methodology.
The Problem with Supervised Fine-Tuning
The authors highlight a fundamental issue with the standard supervised fine-tuning (SFT) approach: it relies heavily on the quality of reference data. Even human-annotated datasets contain imperfections, and training models to mimic these flawed references can hamper performance and unintentionally cap a model's potential. The same reliance also limits how accurately reference-based metrics can evaluate translation quality, since a model's outputs are scored against the very references that may fall short.
A New Fine-Tuning Approach: Contrastive Preference Optimization (CPO)
To counter the shortcomings of SFT, the researchers introduce Contrastive Preference Optimization (CPO), a new fine-tuning approach. Rather than exclusively mimicking gold reference translations, CPO trains models to avoid generating translations that are merely "adequate but not perfect." Using a specially curated preference dataset derived from high-quality translations, CPO guides models to prefer superior translation options. With minimal additional parameters and training data, this method allows even a moderate-sized LLM to rival state-of-the-art models.
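To make the idea concrete, a preference-optimization objective of this kind typically combines a contrastive term, which rewards assigning higher likelihood to the preferred translation than to the dispreferred one, with a likelihood term on the preferred translation itself. The sketch below is a rough, reference-free illustration in that spirit, not the authors' exact formulation; the inputs are assumed to be sequence log-probabilities from the model, and `beta` is a hypothetical scaling hyperparameter:

```python
import math

def cpo_style_loss(logp_preferred: float, logp_dispreferred: float, beta: float = 0.1) -> float:
    """Illustrative CPO-style loss (not the paper's exact objective).

    logp_preferred / logp_dispreferred: the model's log-probabilities of the
    preferred and dispreferred translations for the same source sentence.
    """
    # Contrastive term: -log sigmoid(beta * margin), small when the model
    # already ranks the preferred translation well above the dispreferred one.
    margin = beta * (logp_preferred - logp_dispreferred)
    contrastive = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # Likelihood term: plain negative log-likelihood of the preferred
    # translation, which keeps the model anchored to good outputs.
    nll = -logp_preferred
    return contrastive + nll

# A model that favors the preferred translation incurs a lower loss:
good = cpo_style_loss(logp_preferred=-1.0, logp_dispreferred=-5.0)
bad = cpo_style_loss(logp_preferred=-5.0, logp_dispreferred=-1.0)
```

In practice such a loss would be computed over batches of tokenized sentence pairs inside a training loop; the scalar version above only shows the shape of the objective.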
Results and Insights
Applying CPO to the ALMA model produced a variant, ALMA-R, that matches or exceeds the performance of leading models, such as GPT-4 and WMT competition winners, on the WMT'21, WMT'22, and WMT'23 benchmarks. Importantly, these results were achieved with only an additional 12 million parameters (0.1% of the original model size) and 22,000 parallel sentences.
Conclusion
The paper raises essential questions about the efficacy of current fine-tuning methods for MT models and the quality of gold reference datasets. By training the ALMA model with CPO, the researchers successfully bridge the performance divide separating moderate-sized LLMs from their larger or more specialized counterparts. The findings underscore the potential of moderate-sized LLMs in machine translation, marking a pivotal step toward more efficient, high-performing models.