Large Language Models Are State-of-the-Art Evaluators of Translation Quality (2302.14520v2)

Published 28 Feb 2023 in cs.CL

Abstract: We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate nine versions of GPT models, including ChatGPT and GPT-4. We show that our method for translation quality assessment only works with GPT-3.5 and larger models. Comparing to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative LLMs for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

Assessment of Translation Quality Using LLMs: An Expert Overview

The paper "Large Language Models Are State-of-the-Art Evaluators of Translation Quality" by Kocmi and Federmann presents GEMBA, a novel metric for evaluating translation quality. GEMBA leverages generative LLMs, specifically models in the GPT family, to assess translations both with and without a reference translation. The paper methodically compares multiple GPT model variants, including GPT-3.5 and GPT-4, and concludes that only models from GPT-3.5 onwards are capable of effective quality assessment.

Core Contributions

  1. State-of-the-Art Performance: The research shows that GEMBA achieves state-of-the-art accuracy when evaluated against MQM-based human labels in the WMT22 Metrics shared task. It outperforms existing metrics at the system level across all three translation directions, namely English to German, English to Russian, and Chinese to English, demonstrating the efficacy of LLMs for translation quality assessment (a sketch of the system-level pairwise accuracy measure follows this list).
  2. Prompt Variants and Modes: The paper explores four prompt variants—GEMBA-DA, GEMBA-SQM, GEMBA-stars, and GEMBA-classes—each assessed in scenarios with and without reference translations. Notably, the least constrained prompt template yielded the highest performance, underscoring the importance of prompt design in LLM applications (a prompt-construction sketch also appears after this list).
  3. Evaluation Across Different GPT Models: A thorough investigation of various GPT models revealed that while smaller models like GPT-2 and Ada were ineffective, models from GPT-3.5 onwards showed significant potential. The superiority of GPT-4 in this task was particularly notable.
  4. Public Release for Reproducibility: All code, prompt templates, and corresponding scoring results have been made publicly available, fostering transparency and enabling further research and validation efforts by the AI research community.
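
The system-level accuracy referenced in point 1 is the pairwise accuracy used in the WMT Metrics evaluations (Kocmi et al., 2021): the fraction of system pairs whose ordering under the metric agrees with their ordering under the human (MQM-based) judgements. Below is a minimal, self-contained sketch of that computation; the function name and the example scores are illustrative, not taken from the paper.

```python
from itertools import combinations

def pairwise_accuracy(metric: dict[str, float], human: dict[str, float]) -> float:
    """Fraction of system pairs ranked the same way by the metric and by humans."""
    pairs = list(combinations(sorted(metric), 2))
    agree = sum(
        (metric[a] - metric[b]) * (human[a] - human[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical scores for three MT systems (higher is better for both).
metric_scores = {"sysA": 78.0, "sysB": 83.5, "sysC": 70.2}
human_scores = {"sysA": -3.1, "sysB": -2.4, "sysC": -5.0}  # e.g. negated MQM error scores
print(pairwise_accuracy(metric_scores, human_scores))  # -> 1.0 (rankings fully agree)
```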

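To make point 2 concrete, the sketch below constructs a GEMBA-DA-style zero-shot scoring prompt. Its wording paraphrases the template described in the paper; the exact templates for all four variants are in the authors' public release, and the helper function and example sentences here are purely illustrative.

```python
def gemba_da_prompt(source_lang: str, target_lang: str,
                    source: str, hypothesis: str,
                    reference: str | None = None) -> str:
    """Build a direct-assessment-style prompt asking for a 0-100 score.

    Passing reference=None yields the reference-free (quality estimation)
    mode; passing a reference yields the reference-based mode.
    """
    ref_clause = "" if reference is None else f'Reference ({target_lang}): "{reference}"\n'
    ref_mention = "with respect to the human reference " if reference else ""
    return (
        f"Score the following translation from {source_lang} to {target_lang} "
        f"{ref_mention}on a continuous scale from 0 to 100, where a score of "
        'zero means "no meaning preserved" and a score of one hundred means '
        '"perfect meaning and grammar".\n\n'
        f'Source ({source_lang}): "{source}"\n'
        f"{ref_clause}"
        f'Translation ({target_lang}): "{hypothesis}"\n'
        "Score:"
    )

# Illustrative usage with a made-up sentence pair.
print(gemba_da_prompt("English", "German",
                      "The weather is nice today.",
                      "Das Wetter ist heute schön."))
```
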
Implications and Future Research Directions

The implications of this research are multifaceted. Practically, the ability to automate translation quality assessment has promising applications in machine translation services, potentially reducing reliance on human evaluators. Theoretically, the work provides insights into the capabilities of LLMs, contributing to the broader discourse on their applicability beyond traditional tasks.

This paper paves the way for future enhancements, such as fine-tuning LLMs for translation evaluation tasks and exploring few-shot learning scenarios to potentially augment GEMBA's accuracy. Additionally, expansion into document-level evaluation could address existing gaps, taking advantage of the larger context windows that LLMs support.

Critiques and Considerations

While the paper convincingly demonstrates the power of LLMs in quality assessment, it is important to note potential constraints related to language diversity. The findings are supported by data from high-resource language pairs, and it would be vital to test these methods on low-resource languages, where performance could vary. Also, the models occasionally produce invalid or unparsable outputs, which calls for further work on prompt robustness and score parsing (a small parsing sketch follows).
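
As a hypothetical illustration of the parsing concern above: a metric wrapper must decide when a model reply counts as a valid score, so that unparsable replies can be counted as invalid, retried, or excluded rather than silently misused. The sketch below is one plausible way to do this, not the authors' implementation.

```python
import re

def parse_score(reply: str) -> float | None:
    """Return the first number in [0, 100] found in the reply, or None if invalid."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None

assert parse_score("Score: 87") == 87.0
assert parse_score("I cannot rate this translation.") is None
```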

The inability to definitively confirm that the evaluation data was excluded from the models' training sets highlights a broader challenge in LLM transparency. This underlines the necessity for rigorous experimental controls and data provenance in LLM research.

In summary, the paper by Kocmi and Federmann significantly advances the understanding and practical application of LLMs in translation quality assessment. By establishing a benchmark for future research in this domain, it invites the AI community to build upon these findings and explore the unresolved challenges revealed by this pioneering work.

Authors (2)
  1. Tom Kocmi (29 papers)
  2. Christian Federmann (9 papers)
Citations (267)