How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes (2409.03454v2)

Published 5 Sep 2024 in cs.CL and cs.AI

Abstract: Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning LLMs, particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.

Summary

  • The paper finds that larger dataset sizes, particularly over 100k segments, drive significant translation improvements, reflected in an average BLEU score increase of 13.7 points over the baseline.
  • The study employs fine-tuning with QLoRA on the Llama 3 8B Instruct model across five language pairs, using consistent preprocessing and evaluation metrics.
  • Implications include enhanced performance in low-resource languages and improved domain-specific translation, guiding optimal data investment in translation workflows.

Overview of "How Much Data is Enough Data? Fine-Tuning LLMs for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes"

This paper empirically investigates the impact of training dataset sizes on the fine-tuning of LLMs for domain-specific machine translation (MT). Specifically, the paper employs the Llama 3 8B Instruct model to analyze the integration of translation memories (TMs) from a software organization in fine-tuning processes. The focus is on five translation directions: English to Brazilian Portuguese, Czech, German, Finnish, and Korean, covering varied resource levels. This research aims to identify the optimal dataset size for achieving significant improvements in translation quality while considering resource investment.

Methodology

The paper relies on TMs sourced from an anonymous organization within the software sector. The datasets span five target languages and are preprocessed to remove duplicates, irrelevant data, and segments exceeding a predefined length. Additionally, datasets are aligned interlingually to ensure consistency across languages. Training sets of various sizes are compiled, ranging from 1k to 207k segments, with dedicated splits for training, development, and testing. The Llama 3 8B Instruct model is fine-tuned with QLoRA, which combines 4-bit quantization with low-rank adapters for memory-efficient training, on four A100-SXM4-80GB GPUs.
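
The paper does not include its training code; the following is a minimal sketch of how a QLoRA setup of this kind is typically configured with the Hugging Face transformers, peft, and bitsandbytes libraries. The LoRA rank, target modules, and other hyperparameters are illustrative assumptions, not the authors' reported settings.

```python
# Minimal QLoRA fine-tuning sketch (illustrative; hyperparameters are assumptions,
# not the paper's exact configuration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# 4-bit quantization (the "Q" in QLoRA) so the 8B base model fits in GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; rank and alpha are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Only the small adapter matrices are updated during training, which is what makes fine-tuning separate models for each dataset size and language pair tractable on a handful of GPUs.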

Training prompts follow a structured template built from the special tokens specified in Meta's Llama 3 documentation. Evaluation relies on the automatic metrics BLEU, chrF++, TER, and COMET, complemented by human evaluation from professional translators. The out-of-the-box Llama 3 8B Instruct model serves as the baseline against which fine-tuning efficacy is assessed.
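
A sketch of both steps is shown below: building a translation prompt with the Llama 3 Instruct chat template (which inserts the special tokens from Meta's documentation) and scoring outputs with the automatic metrics. The system/user wording, the example segment, and the COMET checkpoint are illustrative assumptions, not the paper's exact choices.

```python
# Prompt construction and automatic evaluation sketch (illustrative assumptions).
from transformers import AutoTokenizer
import sacrebleu

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# apply_chat_template inserts the Llama 3 special tokens
# (<|begin_of_text|>, <|start_header_id|>, <|eot_id|>, ...) around each turn.
messages = [
    {"role": "system", "content": "You are a professional software translator."},
    {"role": "user", "content": "Translate the following English segment into German:\n"
                                "Click Save to apply your changes."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Corpus-level BLEU, chrF++ (word_order=2), and TER via sacrebleu.
hypotheses = ["Klicken Sie auf Speichern, um Ihre Änderungen zu übernehmen."]
references = [["Klicken Sie auf Speichern, um die Änderungen anzuwenden."]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(bleu.score, chrf.score, ter.score)

# COMET additionally needs the source segment; wmt22-comet-da is one commonly
# used checkpoint (an assumption here, not necessarily the paper's choice).
# from comet import download_model, load_from_checkpoint
# comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
# scores = comet_model.predict(
#     [{"src": "Click Save to apply your changes.",
#       "mt": hypotheses[0], "ref": references[0][0]}], batch_size=8
# )
```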

Key Findings

The results show a clear positive relationship between dataset size and translation performance, evident in metrics such as BLEU and COMET. Models trained on larger datasets consistently outperformed those trained on smaller ones, which often deteriorated relative to the baseline. BLEU scores, for instance, increased by an average of 13.7 points over the baseline for datasets exceeding 100k segments. Moreover, lower-resource languages such as Korean exhibited substantial gains after fine-tuning, suggesting that low-resource languages can benefit markedly from this approach despite weaker initial baselines.

Additionally, human evaluation revealed difficulties with ambiguous segments where the model lacked sufficient context, a limitation for real-world applicability. Nonetheless, the integration of TMs demonstrated clear potential for improving organizational translation workflows by tailoring models to domain-specific requirements.

Implications and Future Directions

This research underscores the efficacy of fine-tuning LLMs with substantial domain-specific datasets to enrich MT performance, particularly highlighting the utility of TMs for customizing translations to organizational standards. For lower resource languages, this approach offers an opportunity to attain significant improvements, compensating partly for the lack of extensive data during initial training phases.

The paper suggests that future work could refine training configurations, particularly to avoid overfitting when working with smaller datasets. Deploying custom test sets would also provide more robust insights into model performance in practical contexts, and further exploration of fine-tuning hyperparameters could yield better optimization strategies across varied domains.

In summary, the findings contribute valuable insights into balancing dataset size and resource allocation to optimize MT using LLMs, potentially informing strategic decisions for leveraging TMs within specific domain applications.
