Analysis of Document-Level Machine Translation with LLMs
The paper explores the capabilities of LLMs, particularly GPT-3.5 and GPT-4, in document-level machine translation (MT), investigating their proficiency in discourse modeling. It examines three dimensions: the effects of context-aware prompts, a comparison against commercial MT systems and document-level NMT methods, and an analysis of the discourse knowledge encoded in LLMs.
Effects of Context-Aware Prompts
The research begins by examining how the design of context-aware prompts influences ChatGPT's performance in document-level translation, comparing three prompt formulations (P1, P2, and P3). It concludes that using multi-turn contexts without relying on sentence boundaries (P3) yields the best translation quality and discourse awareness. This finding underscores the importance of prompting strategy in leveraging the long-context capacity of LLMs for coherent, context-integrated translation.
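The paper's exact prompt templates are not reproduced here; as an illustration only, the sketch below contrasts sentence-by-sentence prompting with a single document-level prompt that drops sentence boundaries (the idea behind P3). The function names and chat-message format are assumptions, not the paper's code.

```python
# Illustrative (assumed) prompt builders for document-level MT; these only
# show the structural difference between per-sentence and whole-document
# prompting, not the paper's actual P1-P3 templates.

def sentence_level_prompts(sentences, src="Chinese", tgt="English"):
    """One independent chat prompt per sentence: no cross-sentence context."""
    return [
        [{"role": "user",
          "content": f"Translate this {src} sentence into {tgt}: {s}"}]
        for s in sentences
    ]

def document_level_prompt(sentences, src="Chinese", tgt="English"):
    """A single prompt over the whole document, with sentence boundaries
    dropped so the model can integrate discourse context (P3-style)."""
    text = " ".join(sentences)
    return [{"role": "user",
             "content": f"Translate the following {src} text into {tgt}: {text}"}]

doc = ["他把钥匙放在桌上。", "然后他离开了。"]
print(len(sentence_level_prompts(doc)))  # 2 separate prompts
print(len(document_level_prompt(doc)))   # 1 prompt covering the document
```

The per-sentence builder discards all discourse context (pronoun antecedents, lexical choices made earlier), which is exactly what the document-level formulation recovers.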
Comparative Analysis with Commercial and Advanced Translation Systems
The evaluation extends to benchmarking the translation performance of GPT-3.5 and GPT-4 against leading commercial translation systems such as Google Translate, DeepL, and Tencent TranSmart, as well as advanced document-level NMT methods like MR-Doc2Doc. Although commercial systems generally outperform LLMs on automatic metrics such as document-level BLEU (d-BLEU), GPT-4 exhibits superior performance in human evaluations, particularly for informal registers and domains such as Q&A and fiction. This contrast points to a nuanced capability of LLMs to capture discourse-level information and maintain coherence, one that traditional automatic metrics do not always reflect.
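Since the comparison leans on document-level BLEU (d-BLEU), a minimal sketch of the metric may help: d-BLEU concatenates each document's hypothesis and reference sentences and scores the result as one long sequence, so that n-grams spanning sentence boundaries count. The BLEU implementation below is a simplified single-reference version with crude smoothing, for illustration only; it is not the evaluation code used in the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Minimal single-reference BLEU with brevity penalty (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(hyp, n) & ngrams(ref, n)).values())
        total = max(sum(ngrams(hyp, n).values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smoothed to avoid log(0)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def d_bleu(hyp_sents, ref_sents):
    """Document-level BLEU: join all sentences, then score the whole document."""
    return bleu(" ".join(hyp_sents), " ".join(ref_sents))

reference = ["the cat sat", "on the mat"]
print(d_bleu(reference, reference))               # 1.0 for a perfect match
print(d_bleu(["the cat sat", "on a mat"], reference))  # between 0 and 1
```

Because the score is computed over the concatenated document, d-BLEU rewards n-gram continuity across sentence boundaries, yet it still cannot see discourse properties like pronoun consistency, which is one plausible source of the metric/human-judgment gap the paper reports.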
Probing Discourse Knowledge
Further analysis probes how well LLMs capture and utilize discourse-level phenomena such as deixis, lexical consistency, and ellipsis. Using contrastive testing and explanation-based probing, the paper finds that while GPT-3.5 trails document-enhanced methods like DocRepair, GPT-4 achieves notable improvements, suggesting a substantial gain in its ability to model discourse knowledge. The authors attribute this to the supervised fine-tuning and reinforcement learning from human feedback (RLHF) in GPT-4's training regime.
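In a typical contrastive-testing setup, a system scores a correct translation against contrastive variants that each violate one discourse phenomenon (e.g. a wrong pronoun for deixis), and is credited when the correct candidate scores highest. The sketch below illustrates that accuracy computation; the fixed scores are stand-in assumptions for a model's log-probabilities, and the instances are toy examples, not items from the paper's test sets.

```python
# Illustrative contrastive probing for discourse phenomena. A real probe would
# score each candidate with log P(candidate | context) under the model; the
# hard-coded scores below are stand-ins so the accuracy logic is concrete.

def contrastive_accuracy(test_set, score_fn):
    """Fraction of instances where the correct candidate outscores
    every contrastive (discourse-violating) alternative."""
    hits = sum(
        1 for context, correct, alternatives in test_set
        if all(score_fn(context, correct) > score_fn(context, alt)
               for alt in alternatives)
    )
    return hits / len(test_set)

# Toy deixis-style instances: (context, correct translation, contrastive variants).
toy_set = [
    ("Mary left early.", "She took her coat.", ["He took his coat."]),
    ("The report is done.", "It was sent today.", ["They were sent today."]),
]

# Stand-in log-probability scores (assumption, not model output).
toy_scores = {
    "She took her coat.": -1.2,
    "He took his coat.": -3.4,
    "It was sent today.": -2.0,
    "They were sent today.": -1.5,
}
score = lambda context, candidate: toy_scores[candidate]

print(contrastive_accuracy(toy_set, score))  # 0.5: only one instance probed correctly
```

This framing explains why probing can separate models that produce similar d-BLEU scores: it isolates whether the model actually prefers the discourse-consistent translation.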
Implications and Future Directions
The findings indicate that LLMs like GPT-4 have reached a level of competence that makes them strong contenders in document-level MT. Given the observed discrepancies between human judgments and automatic scores, the research stresses the need to refine evaluation techniques so they measure LLM capabilities more faithfully. The capacity of LLMs to model intricate discourse phenomena also invites a reevaluation of established MT frameworks and points to further research on their development and application, particularly in tasks demanding high contextual awareness and coherence.
In conclusion, the paper demonstrates that LLMs, through tailored prompting and advanced training techniques, are edging closer to high-quality document-level translation, bridging the gap between sentence-level systems and the need for cohesive, contextually accurate document translations. Greater transparency in training methodologies and continued exploration of better evaluation methods will be crucial to advancing LLMs for both practical and theoretical work in machine translation.