Introduction
Machine Translation (MT) remains a cornerstone of Natural Language Processing, with the ultimate objective of converting text between human languages. The field has made substantial progress, particularly with the advent of Neural Machine Translation (NMT) and, more recently, LLMs. While historical benchmarks have long set the pace for MT research, the rapid evolution of LLMs has prompted a re-examination of the field's longstanding challenges. This paper harnesses LLMs to revisit those milestones, focusing on six core challenges that have defined MT's progression.
Experimental Setup
The authors' methodology centres on Llama2-7b, a 7-billion-parameter LLM accessible via HuggingFace. The model undergoes supervised fine-tuning with specific instruction formats to hone its translation capability, with a focus on German-to-English given the language pair's abundance in the pretraining data. Two strategies are compared: supervised fine-tuning on parallel data combined with the Alpaca dataset for instruction adherence, and continuous pretraining followed by fine-tuning on Alpaca. In parallel, encoder-decoder transformer models trained on datasets of various sizes with the Fairseq toolkit serve as baselines. The analysis also measures the impact of diverse data conditions and cross-domain translation tasks.
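The following is a minimal sketch of the supervised instruction-style fine-tuning described above, using HuggingFace's transformers and datasets libraries. The checkpoint name, prompt template, toy sentence pairs, and hyperparameters are illustrative assumptions rather than the paper's exact recipe.

```python
# A minimal sketch of Alpaca-style instruction fine-tuning for de->en translation
# with a LLaMA-family model via HuggingFace. The model name, prompt template,
# toy sentence pairs, and hyperparameters are illustrative assumptions only.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires Hub access approval

PROMPT = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\nTranslate the following German text into English.\n\n"
    "### Input:\n{src}\n\n### Response:\n{tgt}"
)

# Toy parallel data standing in for WMT-style de-en pairs.
pairs = [
    {"src": "Das Haus ist alt.", "tgt": "The house is old."},
    {"src": "Ich trinke gern Kaffee.", "tgt": "I like drinking coffee."},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

def to_features(example):
    # Serialise each pair into the instruction template and tokenize it.
    text = PROMPT.format(**example) + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(to_features, remove_columns=["src", "tgt"])

model = AutoModelForCausalLM.from_pretrained(MODEL)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-de-en-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    # The causal-LM collator copies input_ids into labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice the continuous-pretraining variant would run the same trainer on monolingual or raw parallel text before this instruction-tuning step.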
Challenges Revisited
The analysis explores the six MT challenges originally posited by Koehn and Knowles in 2017:
- Domain Mismatch: LLMs show improvements in addressing out-of-domain tasks, yet issues like terminology mismatch and hallucinations persist.
- Amount of Parallel Data: LLMs reduce reliance on bilingual data for major pretraining languages, suggesting an evolution in model training approach.
- Rare Word Prediction: Consistent difficulties arise in predicting infrequent words, a point of concern that remains unresolved.
- Translation of Long Sentences: LLMs translate long sentences effectively, demonstrating capability even at the document level, which marks substantial progress.
- Word Alignment: Traditional word alignment extraction from attention models doesn't apply to LLMs, posing interpretability challenges.
- Inference Efficiency: LLMs suffer significant latency during inference, a bottleneck for real-time translation applications; a rough timing sketch follows this list.
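The latency concern can be made concrete with a simple timing comparison: tokens generated per second for a compact encoder-decoder NMT model versus a decoder-only language model prompted to translate. The model checkpoints (Helsinki-NLP/opus-mt-de-en, gpt2 as a small stand-in) and the single-sentence protocol below are placeholders, not the paper's benchmark setup.

```python
# A rough sketch of comparing per-sentence decoding speed for a compact
# encoder-decoder NMT model versus a decoder-only LM prompted to translate.
# Model names and the timing protocol are placeholders, not the paper's setup.
import time

import torch
from transformers import (AutoModelForCausalLM, AutoModelForSeq2SeqLM,
                          AutoTokenizer)

def time_greedy_decode(model, tokenizer, text, max_new_tokens=64):
    """Greedy-decode `text` once and return (elapsed seconds, generated tokens)."""
    inputs = tokenizer(text, return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    # Decoder-only models echo the prompt in the output, so subtract its length;
    # encoder-decoder models emit only the target sequence.
    prompt_len = 0 if model.config.is_encoder_decoder else inputs["input_ids"].shape[-1]
    return elapsed, out.shape[-1] - prompt_len

src = "Der schnelle braune Fuchs springt über den faulen Hund."

# Compact encoder-decoder baseline (MarianMT de-en).
nmt_tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
nmt = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")
secs, n_tok = time_greedy_decode(nmt, nmt_tok, src)
print(f"MarianMT: {n_tok / secs:.1f} tokens/s")

# Decoder-only LM prompted for translation; gpt2 is a small stand-in here, and a
# LLaMA-scale checkpoint (e.g. meta-llama/Llama-2-7b-hf) would be far slower.
llm_tok = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = f"Translate German to English:\n{src}\nEnglish:"
secs, n_tok = time_greedy_decode(llm, llm_tok, prompt)
print(f"Decoder-only LM: {n_tok / secs:.1f} tokens/s")
```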
In addition to revisiting these six, the paper highlights three challenges specific to LLMs in translation: the inference-efficiency bottleneck noted above, translation of low-resource languages during the pretraining phase, and alignment of automatic evaluation methodologies with human judgment.
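As an illustration of the evaluation-alignment challenge, the sketch below scores translation hypotheses with a standard automatic metric (sentence-level chrF via sacrebleu) and correlates the scores with human ratings; all sentences and ratings are invented for illustration and are not data from the paper.

```python
# A hedged sketch of checking how well an automatic metric tracks human judgment:
# score each hypothesis with sentence-level chrF (sacrebleu) and correlate the
# scores with human adequacy ratings. All sentences and ratings below are
# invented placeholders, not data from the paper.
import sacrebleu
from scipy.stats import spearmanr

hypotheses = [
    "The house is old.",
    "I like to drink the coffee.",
    "He go to school yesterday.",
    "The weather is nice today.",
]
references = [
    "The house is old.",
    "I like drinking coffee.",
    "He went to school yesterday.",
    "The weather is nice today.",
]
human_scores = [5.0, 3.5, 2.5, 5.0]  # hypothetical direct-assessment-style ratings

metric_scores = [
    sacrebleu.sentence_chrf(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Segment-level Spearman correlation with human ratings: {rho:.2f}")
```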
Implications and Future Directions
The integration of LLMs into MT highlights both the continued relevance of past challenges and the emergence of new ones. Gains in handling long sentences and a reduced dependence on parallel data are offset by persistent domain mismatch and rare-word prediction issues. Further problems include inference latency and the disparity in pretraining resources across languages, underscoring the need for more balanced datasets. The paper also calls for closer examination of automatic evaluation methods to better align them with human judgment, a task that grows in importance as LLMs continue to evolve.
While LLMs herald a promising future for MT, the paper invites reflection on their practicality and interpretability. Both empirical and theoretical inquiry have the potential to improve the fidelity of machine translation and contribute to more nuanced, human-like language processing.