The paper conducts an in-depth evaluation of literary translation quality by developing a new annotated corpus, LitEval-Corpus, which consists of over 2,000 paragraph-level segments and approximately 13,000 sentences spanning four language pairs. The corpus pairs verified human translations (covering both classical and contemporary works) with outputs from nine diverse machine translation systems, including commercial models, transformer-based sentence-level models, and open-source LLMs of various sizes, with particular attention to comparing recent LLMs such as GPT-4o against earlier systems.
The evaluation framework is multifaceted and examines three human annotation schemes:
- Multidimensional Quality Metrics (MQM): An error-span-based method in which annotators mark error spans and assign category and severity labels following a fixed typology.
- Scalar Quality Metric (SQM): A Likert-type rating scale ranging from 0 to 6 that assesses overall quality, with particular emphasis on stylistic and aesthetic aspects.
- Best-Worst Scaling (BWS): A direct comparative approach that requires annotators to select the best and worst outputs among a subset of systems.
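As an illustration of how these schemes yield numeric scores, here is a minimal Python sketch. The MQM severity weights (minor = 1, major = 5, critical = 10), the absence of length normalization, and the system names are assumptions for illustration rather than details taken from the paper; BWS uses the standard best-minus-worst counting.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed MQM severity weights; the paper may use different weights or length normalization.
MQM_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_score(error_spans: List[Tuple[str, str]]) -> float:
    """Severity-weighted error penalty for one segment (more negative = worse).

    error_spans: (category, severity) pairs marked on the segment,
    e.g. [("Accuracy/Mistranslation", "major"), ("Style/Awkward", "minor")].
    """
    return -sum(MQM_WEIGHTS[severity] for _, severity in error_spans)

# SQM needs no aggregation: each segment simply receives a direct 0-6 rating.

def bws_scores(judgments: List[Tuple[str, str, List[str]]]) -> Dict[str, float]:
    """Standard BWS counting: (#best - #worst) / #appearances per system.

    judgments: (best_system, worst_system, systems_shown) per annotation tuple.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for b, w, systems in judgments:
        best[b] += 1
        worst[w] += 1
        shown.update(systems)
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}

# Toy usage with hypothetical systems:
print(mqm_score([("Accuracy/Mistranslation", "major"), ("Style/Awkward", "minor")]))  # -6.0
print(bws_scores([
    ("Human", "SentMT", ["Human", "GPT-4o", "SentMT"]),
    ("GPT-4o", "SentMT", ["Human", "GPT-4o", "SentMT"]),
]))  # e.g. {'Human': 0.5, 'GPT-4o': 0.5, 'SentMT': -1.0}
```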
Human evaluations are conducted by both student annotators (with basic linguistic and translation training) and professional translators with established publication records. Intra- and inter-annotator agreement analyses indicate that MQM and SQM reach moderate agreement (Kendall's tau of approximately 0.43 to 0.66), while BWS yields slightly better consistency (Cohen's kappa averaging around 0.57). Notably, student and professional judgments diverge: professional SQM prefers human translations at rates approaching 100% in certain language pairs (e.g., De-En and De-Zh), whereas under student MQM and SQM the preference for human translations hovers around 42%.
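For readers who want to reproduce agreement numbers of this kind, the snippet below is a small sketch using off-the-shelf implementations; the annotator scores and labels are invented for illustration and are not data from the paper.

```python
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired SQM ratings by two annotators on the same segments.
annotator_a = [5, 3, 6, 4, 2, 5, 4]
annotator_b = [4, 3, 6, 5, 2, 4, 4]
tau, p_value = kendalltau(annotator_a, annotator_b)  # tau-b handles tied ranks
print(f"Kendall's tau: {tau:.2f} (p = {p_value:.3f})")

# Hypothetical 'best' picks per BWS tuple, treated as categorical labels.
best_a = ["Human", "GPT-4o", "Human", "DeepL", "GPT-4o"]
best_b = ["Human", "GPT-4o", "GPT-4o", "DeepL", "GPT-4o"]
print(f"Cohen's kappa: {cohen_kappa_score(best_a, best_b):.2f}")
```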
The paper also benchmarks several automatic evaluation metrics. In particular, GEMBA-MQM—both in its original form and a variant adapted with literary-specific knowledge—is compared against metrics such as Prometheus 2, XCOMET-XL, and XCOMET-XXL. Key findings include:
- Correlation with Human Judgments: GEMBA-MQM consistently shows moderate segment-level correlation with human MQM scores. Despite this relative advantage over the other automatic metrics, it struggles to reliably differentiate high-quality human translations from the outputs of top-performing LLMs: GEMBA-MQM (Literary) favors human translations over those of the top LLM systems in only roughly 9.6% of cases, far below the near-94% preference indicated by professional SQM evaluations (a win-rate computation of this form is sketched after this list).
- Aspect Sensitivity: Correlations broken down by error category show that all the state-of-the-art metrics predominantly capture Accuracy-related errors and perform much worse on Fluency, Style, and Terminology. Accuracy errors correlate strongly between human and automatic scores, in select language pairs even more strongly than the overall scores do, while the correlations for the other dimensions essential to literary translation remain poor.
- Model Biases: A novel analysis of syntactic similarity and lexical diversity indicates that LLM outputs tend to be more literal, displaying higher syntactic similarity to the source text, and are less lexically diverse compared to human translations. Scatter plots of human MQM scores, syntactic similarity, and average pairwise lexical overlap reveal that human translations uniquely achieve high quality with lower syntactic similarity (approximately 0.21–0.23) and lower lexical overlap (around 18.9–23.0), suggesting that human translators introduce variability critical to literary expression.
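The preference (win-rate) numbers above, such as the 9.6% versus near-94% contrast, reduce to a simple per-segment comparison. The sketch below assumes per-segment scores where higher is better; the function name, field names, and toy data are hypothetical.

```python
from typing import Dict, List

def human_win_rate(segment_scores: List[Dict[str, float]],
                   human: str = "Human", llm: str = "GPT-4o") -> float:
    """Fraction of segments on which the human translation strictly outscores the top LLM.

    segment_scores: one dict per source segment mapping system name -> score
    under a single evaluation scheme (professional SQM, GEMBA-MQM, ...),
    with higher scores meaning better quality.
    """
    wins = sum(1 for seg in segment_scores if seg[human] > seg[llm])
    return wins / len(segment_scores)

# Toy data: a metric that prefers the human translation only half the time
# versus expert SQM ratings that prefer it every time.
metric_scores = [{"Human": -2.0, "GPT-4o": -1.0}, {"Human": -1.0, "GPT-4o": -3.0}]
expert_sqm = [{"Human": 6, "GPT-4o": 4}, {"Human": 5, "GPT-4o": 4}]
print(human_win_rate(metric_scores))  # 0.5
print(human_win_rate(expert_sqm))     # 1.0
```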
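The literalness and diversity analysis can be approximated with simple surface statistics. The sketch below computes the average pairwise Jaccard overlap between alternative translations of the same passage and a type-token ratio as a crude lexical-diversity proxy; both are illustrative stand-ins under assumed definitions rather than the paper's exact measures, and the paper's syntactic-similarity analysis would additionally require POS tags or parses, which are omitted here.

```python
from itertools import combinations
from typing import Dict, List

def pairwise_lexical_overlap(translations: List[str]) -> float:
    """Average pairwise Jaccard overlap (%) of word types across alternative
    translations of the same passage; higher means the outputs share more vocabulary."""
    token_sets = [set(t.lower().split()) for t in translations]
    overlaps = [100.0 * len(a & b) / len(a | b) for a, b in combinations(token_sets, 2)]
    return sum(overlaps) / len(overlaps)

def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity proxy: distinct tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# Hypothetical outputs for one source paragraph.
outputs: Dict[str, str] = {
    "Human":  "The rain fell in slow ribbons over the rooftops of the old town.",
    "GPT-4o": "The rain fell slowly in ribbons over the roofs of the old town.",
    "DeepL":  "The rain fell slowly in ribbons over the rooftops of the old town.",
}
print(round(pairwise_lexical_overlap(list(outputs.values())), 1))
print({name: round(type_token_ratio(text), 2) for name, text in outputs.items()})
```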
Additional numerical findings include:
- A clear performance gap: in certain scenarios even the best automatic metric trails human evaluators by approximately 32–40 percentage points, particularly in distinguishing human translations from top machine outputs.
- System rankings based on both human evaluations and automatic metrics consistently place professional human translations at the top, with GPT-4o second, followed by Google Translate and DeepL (or Qwen where applicable). The margin between human translations and GPT-4o is notably 1.8 points under professional SQM, underscoring the persistent gap in stylistic and creative quality.
In summary, the research provides a comprehensive framework for evaluating literary translation quality through an extensive, verified corpus and multiple evaluation methods. It systematically documents the limitations of established evaluation schemes such as MQM when applied to literary texts and highlights the need for more discriminative approaches (e.g., BWS and expert SQM) to capture aesthetic and stylistic subtleties. Moreover, the findings show that, despite significant advances in LLM performance, these systems generally produce translations that are more literal and less stylistically diverse than those crafted by professional human translators. This work offers a salient baseline for future metric development focused on the more challenging aspects of literary translation, including fluency, style, and terminology.