An Evaluation of GPT Models for Machine Translation
The paper "How Good Are GPT Models at Machine Translation?" presents a comprehensive evaluation of the multilingual capabilities of Generative Pre-trained Transformer (GPT) models for machine translation. This paper addresses the critical gap in the assessment of GPT models' efficacy in translation, benchmarking their performance against state-of-the-art research and commercial systems and analyzing prompt strategies, robustness to domain shifts, and document-level translation capacities.
Methodology and Experimental Setup
The researchers evaluated three prominent GPT models: ChatGPT, GPT-3.5 (text-davinci-003), and text-davinci-002. The experimental setup covered 18 language pairs, spanning high- and low-resource languages as well as non-English-centric directions, and used the latest WMT22 test sets, which distinguish natural source text from translated (translationese) origins. Translation quality was measured with the neural metrics COMET-22 and COMETkiwi alongside the traditional lexical metrics SacreBLEU and ChrF, at both the sentence and document level.
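To make the metric setup concrete, the following is a minimal scoring sketch, assuming the `sacrebleu` and `unbabel-comet` Python packages; the data variables and batch settings are illustrative, not the authors' evaluation code.

```python
# Minimal sketch: scoring one system output with the paper's metric families.
# Assumes: pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Der Hund bellt."]            # source sentences (illustrative)
hypotheses = ["The dog is barking."]     # system translations
references = ["The dog barks."]          # human references

# Lexical metrics: SacreBLEU and ChrF.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")

# Neural reference-based metric: COMET-22.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET-22:", comet.predict(data, batch_size=8, gpus=0).system_score)

# COMETkiwi is the reference-free variant: same call shape, but each item
# carries only "src" and "mt" (model: Unbabel/wmt22-cometkiwi-da).
```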
Key Findings
- Translation Performance:
  - GPT models demonstrated strong performance for high-resource languages, especially in zero-shot configurations, with ChatGPT and GPT-3.5 outperforming text-davinci-002.
  - Few-shot learning with high-quality demonstrations brought improvements, but the gains were limited when translating from English into other languages, and five-shot prompting often yielded only marginal benefits over zero-shot.
- Prompting Strategies and Document-level Translation:
  - High-quality prompts and relevance-driven example selection improved translation quality, though further gains from additional shots were inconsistent across language pairs (see the prompt-construction sketch after this list).
  - Document-level translation benefited from larger context windows, showing an advantage over sentence-level translation, especially once computational efficiency is taken into account.
- Domain Robustness and Hybrid Systems:
  - The GPT models displayed robustness across varied domains, with notable strength in conversational tasks.
  - A hybrid approach, combining GPT with traditional NMT systems, demonstrated significant quality enhancements in translation, indicating a pathway for future system integration that maximizes the strengths of both paradigms.
- Language Characteristics and Biases:
  - GPT translations were often more fluent and less monotonic (i.e., freer in word order) than their NMT counterparts, with a propensity for punctuation insertion and a higher rate of unaligned source words.
  - The analysis also showed that GPT handled noisy or ill-formed inputs better, an advantage in domains where parallel training data introduces biases.
- Beyond Translation - Multilingual Capabilities:
  - The paper extends the evaluation beyond translation to multilingual reasoning tasks, revealing weaker performance in non-Latin-script languages and suggesting that support varies with the model's training data distribution.
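To make the prompting findings concrete, here is a hedged sketch of the three prompt styles discussed above: zero-shot, few-shot with relevance-driven example selection, and document-level. The templates and the word-overlap selection heuristic are illustrative assumptions, not the paper's released prompts.

```python
# Illustrative prompt builders; templates and the selection heuristic are
# assumptions, not the paper's exact prompts.

def zero_shot_prompt(src: str, src_lang: str, tgt_lang: str) -> str:
    """Instruction-style zero-shot translation prompt."""
    return (f"Translate this sentence from {src_lang} to {tgt_lang}.\n"
            f"{src_lang}: {src}\n{tgt_lang}:")

def few_shot_prompt(src: str, src_lang: str, tgt_lang: str,
                    pool: list[tuple[str, str]], k: int = 5) -> str:
    """k-shot prompt. `pool` holds (source, reference) demonstration pairs;
    relevance-driven selection is approximated by source word overlap."""
    def relevance(pair: tuple[str, str]) -> int:
        return len(set(pair[0].lower().split()) & set(src.lower().split()))
    shots = sorted(pool, key=relevance, reverse=True)[:k]
    demos = "\n".join(f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in shots)
    return f"{demos}\n{src_lang}: {src}\n{tgt_lang}:"

def document_prompt(sentences: list[str], src_lang: str, tgt_lang: str) -> str:
    """Document-level prompt: one call over a multi-sentence window, so the
    model can use cross-sentence context (pronouns, terminology, cohesion)."""
    doc = "\n".join(sentences)
    return (f"Translate the following {src_lang} text to {tgt_lang}, "
            f"keeping one sentence per line.\n{doc}\n{tgt_lang} translation:")
```

Translating a window of sentences per request also amortizes per-call overhead, which is the computational-efficiency advantage of document-level translation noted above.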
Implications and Future Directions
The findings highlight the competitive translation ability of GPT models for high-resource languages while underlining the challenge of reaching equivalent performance for low-resource languages. Integrating GPT into hybrid systems offers a promising route to higher translation quality (a routing sketch follows below), with significant potential for task-specific optimization and efficiency gains. The paper also underscores the need for evaluation metrics that transcend simple lexical comparison and capture the nuanced context and fluency of GPT outputs.
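One plausible mechanization of such a hybrid is quality-estimation-routed fallback: keep the NMT hypothesis unless a reference-free QE model scores it below a threshold, then defer to GPT. The sketch below is an assumption about the mechanics, not the paper's implementation; `nmt_translate` and `gpt_translate` are placeholder callables and the threshold value is illustrative.

```python
# Hedged sketch of a QE-routed NMT/GPT hybrid. Assumes unbabel-comet;
# nmt_translate / gpt_translate are hypothetical placeholder callables.
from comet import download_model, load_from_checkpoint

# Reference-free QE model (COMETkiwi); gated on Hugging Face, license required.
qe_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def hybrid_translate(src: str, nmt_translate, gpt_translate,
                     threshold: float = 0.80) -> str:
    """Return the NMT hypothesis unless its reference-free QE score falls
    below `threshold` (illustrative value), then fall back to GPT."""
    nmt_hyp = nmt_translate(src)
    score = qe_model.predict([{"src": src, "mt": nmt_hyp}],
                             batch_size=1, gpus=0).system_score
    return nmt_hyp if score >= threshold else gpt_translate(src)
```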
Future research should focus on three central aspects: enhancing support for underrepresented languages, refining in-context learning strategies for more nuanced linguistic output, and developing more sophisticated fusion techniques for hybrid systems. Additionally, the relationship between translation capability and broader multilingual competence warrants deeper investigation if AI systems are to serve the full spectrum of languages equitably.