The paper conducts an in-depth evaluation of literary translation quality by developing a new annotated corpus, LitEval-Corpus, which consists of over 2,000 paragraph-level segments and approximately 13,000 sentences spanning four language pairs. The corpus pairs verified human translations (covering both classical and contemporary works) with outputs from nine diverse machine translation systems, including commercial models, transformer-based sentence-level models, and open-source LLMs of various sizes, with particular attention to comparing recent LLMs such as GPT-4o against earlier systems.
The evaluation framework is multifaceted and examines three human annotation schemes:
- Multidimensional Quality Metrics (MQM): An error-span-based method in which annotators mark error spans and assign category and severity labels following a fixed typology.
- Scalar Quality Metric (SQM): A Likert-type rating scale ranging from 0 to 6 that assesses overall quality, with particular emphasis on stylistic and aesthetic aspects.
- Best-Worst Scaling (BWS): A direct comparative approach that requires annotators to select the best and worst outputs among a subset of systems.
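As an illustration of how these schemes yield numeric scores, here is a minimal Python sketch. The MQM severity weights (minor = 1, major = 5, critical = 10), the absence of length normalization, and the system names are assumptions for illustration rather than details taken from the paper; BWS uses the standard best-minus-worst counting.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed MQM severity weights; the paper may use different weights or length normalization.
MQM_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_score(error_spans: List[Tuple[str, str]]) -> float:
    """Severity-weighted error penalty for one segment (more negative = worse).

    error_spans: (category, severity) pairs marked on the segment,
    e.g. [("Accuracy/Mistranslation", "major"), ("Style/Awkward", "minor")].
    """
    return -sum(MQM_WEIGHTS[severity] for _, severity in error_spans)

# SQM needs no aggregation: each segment simply receives a direct 0-6 rating.

def bws_scores(judgments: List[Tuple[str, str, List[str]]]) -> Dict[str, float]:
    """Standard BWS counting: (#best - #worst) / #appearances per system.

    judgments: (best_system, worst_system, systems_shown) per annotation tuple.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for b, w, systems in judgments:
        best[b] += 1
        worst[w] += 1
        shown.update(systems)
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}

# Toy usage with hypothetical systems:
print(mqm_score([("Accuracy/Mistranslation", "major"), ("Style/Awkward", "minor")]))  # -6.0
print(bws_scores([
    ("Human", "SentMT", ["Human", "GPT-4o", "SentMT"]),
    ("GPT-4o", "SentMT", ["Human", "GPT-4o", "SentMT"]),
]))  # e.g. {'Human': 0.5, 'GPT-4o': 0.5, 'SentMT': -1.0}
```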
Human evaluations are conducted by both student annotators (with basic linguistic and translation training) and professional translators with established publication records. Intra- and inter-annotator agreement analyses indicate that MQM and SQM reach moderate agreement (Kendall's tau of approximately 0.43 to 0.66), while BWS yields slightly better consistency (Cohen's kappa averaging around 0.57). Notably, student and professional judgments diverge: professional SQM prefers human translations at rates approaching 100% in certain language pairs (e.g., De-En and De-Zh), whereas under student MQM and SQM the preference for human translations hovers around 42%.
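For readers who want to reproduce agreement numbers of this kind, the snippet below is a small sketch using off-the-shelf implementations; the annotator scores and labels are invented for illustration and are not data from the paper.

```python
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired SQM ratings by two annotators on the same segments.
annotator_a = [5, 3, 6, 4, 2, 5, 4]
annotator_b = [4, 3, 6, 5, 2, 4, 4]
tau, p_value = kendalltau(annotator_a, annotator_b)  # tau-b handles tied ranks
print(f"Kendall's tau: {tau:.2f} (p = {p_value:.3f})")

# Hypothetical 'best' picks per BWS tuple, treated as categorical labels.
best_a = ["Human", "GPT-4o", "Human", "DeepL", "GPT-4o"]
best_b = ["Human", "GPT-4o", "GPT-4o", "DeepL", "GPT-4o"]
print(f"Cohen's kappa: {cohen_kappa_score(best_a, best_b):.2f}")
```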
The paper also benchmarks several automatic evaluation metrics. In particular, GEMBA-MQM—both in its original form and a variant adapted with literary-specific knowledge—is compared against metrics such as Prometheus 2, XCOMET-XL, and XCOMET-XXL. Key findings include:
- Correlation with Human Judgments: GEMBA-MQM consistently shows moderate segment-level correlation with human MQM scores. Despite this relative advantage over the other automatic metrics, it struggles to reliably differentiate high-quality human translations from the outputs of top-performing LLMs: GEMBA-MQM (Literary) favors human translations over those of the top LLM systems in only roughly 9.6% of cases, far below the near-94% preference indicated by professional SQM evaluations (a win-rate computation of this form is sketched after this list).
- Aspect Sensitivity: Correlations broken down by error category show that all the state-of-the-art metrics predominantly capture Accuracy-related errors and perform much worse on Fluency, Style, and Terminology. Accuracy errors correlate strongly between human and automatic scores, in select language pairs even more strongly than the overall scores do, while the correlations for the other dimensions essential to literary translation remain poor.
- Model Biases: A novel analysis of syntactic similarity and lexical diversity indicates that LLM outputs tend to be more literal, displaying higher syntactic similarity to the source text, and are less lexically diverse compared to human translations. Scatter plots of human MQM scores, syntactic similarity, and average pairwise lexical overlap reveal that human translations uniquely achieve high quality with lower syntactic similarity (approximately 0.21–0.23) and lower lexical overlap (around 18.9–23.0), suggesting that human translators introduce variability critical to literary expression.
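The preference (win-rate) numbers above, such as the 9.6% versus near-94% contrast, reduce to a simple per-segment comparison. The sketch below assumes per-segment scores where higher is better; the function name, field names, and toy data are hypothetical.

```python
from typing import Dict, List

def human_win_rate(segment_scores: List[Dict[str, float]],
                   human: str = "Human", llm: str = "GPT-4o") -> float:
    """Fraction of segments on which the human translation strictly outscores the top LLM.

    segment_scores: one dict per source segment mapping system name -> score
    under a single evaluation scheme (professional SQM, GEMBA-MQM, ...),
    with higher scores meaning better quality.
    """
    wins = sum(1 for seg in segment_scores if seg[human] > seg[llm])
    return wins / len(segment_scores)

# Toy data: a metric that prefers the human translation only half the time
# versus expert SQM ratings that prefer it every time.
metric_scores = [{"Human": -2.0, "GPT-4o": -1.0}, {"Human": -1.0, "GPT-4o": -3.0}]
expert_sqm = [{"Human": 6, "GPT-4o": 4}, {"Human": 5, "GPT-4o": 4}]
print(human_win_rate(metric_scores))  # 0.5
print(human_win_rate(expert_sqm))     # 1.0
```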
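The literalness and diversity analysis can be approximated with simple surface statistics. The sketch below computes the average pairwise Jaccard overlap between alternative translations of the same passage and a type-token ratio as a crude lexical-diversity proxy; both are illustrative stand-ins under assumed definitions rather than the paper's exact measures, and the paper's syntactic-similarity analysis would additionally require POS tags or parses, which are omitted here.

```python
from itertools import combinations
from typing import Dict, List

def pairwise_lexical_overlap(translations: List[str]) -> float:
    """Average pairwise Jaccard overlap (%) of word types across alternative
    translations of the same passage; higher means the outputs share more vocabulary."""
    token_sets = [set(t.lower().split()) for t in translations]
    overlaps = [100.0 * len(a & b) / len(a | b) for a, b in combinations(token_sets, 2)]
    return sum(overlaps) / len(overlaps)

def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity proxy: distinct tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# Hypothetical outputs for one source paragraph.
outputs: Dict[str, str] = {
    "Human":  "The rain fell in slow ribbons over the rooftops of the old town.",
    "GPT-4o": "The rain fell slowly in ribbons over the roofs of the old town.",
    "DeepL":  "The rain fell slowly in ribbons over the rooftops of the old town.",
}
print(round(pairwise_lexical_overlap(list(outputs.values())), 1))
print({name: round(type_token_ratio(text), 2) for name, text in outputs.items()})
```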
Additional numerical findings include:
- A clear performance gap: in certain scenarios even the best automatic metric trails human evaluators by approximately 32–40 percentage points, particularly in distinguishing human translations from top machine outputs.
- System rankings based on both human evaluations and automatic metrics consistently place professional human translations at the top, with GPT-4o second, followed by Google Translate and DeepL (or Qwen where applicable). The margin between human translations and GPT-4o is notably 1.8 points under professional SQM, underscoring the persistent gap in stylistic and creative quality.
In summary, the research provides a comprehensive framework for evaluating literary translation quality through an extensive, verified corpus and multiple evaluation methods. It systematically documents the limitations of established evaluation schemes such as MQM when applied to literary texts and highlights the need for more discriminative approaches (e.g., BWS and expert SQM) to capture aesthetic and stylistic subtleties. Moreover, the findings show that, despite significant advances in LLM performance, these systems generally produce translations that are more literal and less stylistically diverse than those crafted by professional human translators. This work offers a salient baseline for future metric development focused on the more challenging aspects of literary translation, including fluency, style, and terminology.