- The paper finds that larger dataset sizes, particularly those over 100k segments, drive substantial translation quality gains, with an average BLEU increase of 13.7 points over the baseline.
- The study employs fine-tuning with QLoRA on the Llama 3 8B Instruct model across five language pairs, using consistent preprocessing and evaluation metrics.
- Implications include enhanced performance in low-resource languages and improved domain-specific translation, guiding optimal data investment in translation workflows.
This paper empirically investigates the impact of training dataset size on the fine-tuning of LLMs for domain-specific machine translation (MT). Specifically, the paper employs the Llama 3 8B Instruct model to analyze the integration of translation memories (TMs) from a software organization into the fine-tuning process. The focus is on five translation directions: English to Brazilian Portuguese, Czech, German, Finnish, and Korean, covering varied resource levels. The research aims to identify the dataset size at which fine-tuning yields significant improvements in translation quality relative to the resource investment required.
Methodology
The paper relies on TMs sourced from an anonymous organization within the software sector. The datasets span five target languages and are preprocessed to eliminate duplicates, non-relevant data, and segments exceeding a predefined length. Additionally, datasets are aligned interlingually to ensure consistency across languages. Various dataset sizes are compiled, ranging from 1k to 207k segments, with specific splits for training, development, and testing. The Llama 3 8B Instruct model undergoes fine-tuning using QLoRA for efficient 4-bit quantization on four A100-SXM4-80GB GPUs.
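The cleaning steps described above can be sketched as a small pipeline: deduplication, removal of over-length segments, and carving out train/development/test splits. This is an illustrative reconstruction; the length threshold, split sizes, and `preprocess_tm` helper are assumptions, not the paper's exact values.

```python
# Illustrative TM preprocessing in the spirit of the paper's pipeline.
# NOTE: max_len, dev_size, and test_size are assumed values for
# demonstration, not those reported in the paper.
import random

def preprocess_tm(pairs, max_len=200, dev_size=1000, test_size=1000, seed=42):
    """pairs: list of (source, target) segment tuples from a TM export."""
    # Drop exact duplicates (and empty segments) while preserving order.
    seen, unique = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen and all(key):
            seen.add(key)
            unique.append(key)
    # Remove segments exceeding a predefined length (whitespace tokens).
    filtered = [(s, t) for s, t in unique
                if len(s.split()) <= max_len and len(t.split()) <= max_len]
    # Shuffle deterministically, then carve out dev and test splits;
    # the remainder becomes the training set.
    rng = random.Random(seed)
    rng.shuffle(filtered)
    dev = filtered[:dev_size]
    test = filtered[dev_size:dev_size + test_size]
    train = filtered[dev_size + test_size:]
    return train, dev, test
```

Varying the size of `train` (e.g. from 1k to 207k segments) while holding dev and test fixed mirrors the study's experimental design.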
Training prompts leverage a structured scheme and specific tokens, following Meta's Llama 3 documentation. Evaluation metrics include BLEU, chrF++, TER, and COMET, alongside human evaluation by professional translators. The baseline encompasses the out-of-the-box Llama 3 8B Instruct model, serving as a reference point for assessing fine-tuning efficacy.
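The structured prompt scheme can be sketched with Llama 3's documented special tokens (`<|begin_of_text|>`, `<|start_header_id|>`, `<|eot_id|>`). The system/user wording below is a hypothetical example, not the paper's exact prompt; only the token scheme follows Meta's documentation.

```python
# Minimal sketch of a Llama 3 Instruct translation prompt using the
# special-token scheme from Meta's documentation. The instruction
# wording is an assumed example, not the paper's actual prompt.
def build_translation_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    """Wrap one source segment in the Llama 3 Instruct chat format."""
    system = f"You are a professional translator from {src_lang} to {tgt_lang}."
    user = f"Translate the following segment into {tgt_lang}:\n{source}"
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # The prompt ends with an open assistant turn, so generation
        # (or the training target) continues with the translation.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
```

During fine-tuning, the reference translation followed by `<|eot_id|>` would be appended after the assistant header to form the full training sequence.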
Key Findings
The results demonstrate a clear positive correlation between dataset size and translation performance, evident in metrics such as BLEU and COMET. Models fine-tuned on larger datasets consistently outperformed those trained on smaller ones, which often deteriorated relative to the baseline. BLEU scores, for instance, registered an average increase of 13.7 points on datasets exceeding 100k segments relative to the baseline model. Moreover, directions with lower initial resource levels, such as Korean, exhibited substantial performance gains post-fine-tuning, suggesting that low-resource languages can benefit markedly from this approach despite weaker initial baselines.
Additionally, human evaluation revealed difficulties in handling ambiguous segments where the model lacked sufficient context, a limitation for real-world applicability. Nonetheless, the integration of TMs demonstrated potential for improving organizational translation workflows by tailoring models to domain-specific requirements.
Implications and Future Directions
This research underscores the efficacy of fine-tuning LLMs with substantial domain-specific datasets to enrich MT performance, particularly highlighting the utility of TMs for customizing translations to organizational standards. For lower resource languages, this approach offers an opportunity to attain significant improvements, compensating partly for the lack of extensive data during initial training phases.
The paper suggests future work could benefit from refined training configurations, particularly when working with smaller datasets, to avoid overfitting. Furthermore, deploying custom test sets would provide more robust insights into model performance in practical contexts. Systematic exploration of fine-tuning hyperparameters could also yield better optimization strategies, enhancing model performance across varied domains.
In summary, the findings contribute valuable insights into balancing dataset size and resource allocation to optimize MT using LLMs, potentially informing strategic decisions for leveraging TMs within specific domain applications.