How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs (2410.18697v1)

Published 24 Oct 2024 in cs.CL and cs.AI

Abstract: Recent research has focused on literary machine translation (MT) as a new challenge in MT. However, the evaluation of literary MT remains an open problem. We contribute to this ongoing discussion by introducing LITEVAL-CORPUS, a paragraph-level parallel corpus comprising multiple verified human translations and outputs from 9 MT systems, which totals over 2k paragraphs and includes 13k annotated sentences across four language pairs, costing 4.5k Euro. This corpus enables us to (i) examine the consistency and adequacy of multiple annotation schemes, (ii) compare evaluations by students and professionals, and (iii) assess the effectiveness of LLM-based metrics. We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation: While Best-Worst Scaling (BWS) with students and Scalar Quality Metric (SQM) with professional translators prefer human translations at rates of ~82% and ~94%, respectively, MQM with student annotators prefers human professional translations over the translations of the best-performing LLMs in only ~42% of cases. While automatic metrics generally show a moderate correlation with human MQM and SQM, they struggle to accurately identify human translations, with rates of at most ~20%. Our overall evaluation indicates that human professional translations consistently outperform LLM translations, where even the most recent LLMs tend to produce more literal and less diverse translations compared to human translations. However, newer LLMs such as GPT-4o perform substantially better than older ones.

The paper conducts an in-depth evaluation of literary translation quality by developing a new annotated corpus, LITEVAL-CORPUS, which consists of over 2,000 paragraph-level segments and approximately 13,000 annotated sentences, spanning four language pairs. The corpus incorporates verified human translations (covering both classical and contemporary works) alongside outputs from nine diverse machine translation systems. These systems include commercial models, transformer-based sentence-level models, and open-source LLMs of various sizes, with particular attention paid to comparing outputs from recent LLMs such as GPT-4o with earlier systems.

The evaluation framework is multifaceted and examines three human annotation schemes:

  • Multidimensional Quality Metrics (MQM): An error-span-based method that follows specific categorization guidelines.
  • Scalar Quality Metric (SQM): A Likert-type rating scale ranging from 0 to 6 that assesses overall quality, with particular emphasis on stylistic and aesthetic aspects.
  • Best-Worst Scaling (BWS): A direct comparative approach that requires annotators to select the best and worst outputs among a subset of systems (a minimal scoring sketch follows this list).
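
To make the BWS aggregation concrete, here is a minimal Python sketch of the standard counting score (times chosen best minus times chosen worst, divided by times shown), in the spirit of Kiritchenko and Mohammad (2017). The paper may aggregate differently; the data format and system names below are purely illustrative.

```python
from collections import Counter

def bws_scores(annotations):
    """Best-Worst Scaling: score = (#best - #worst) / #appearances per system.

    Each annotation records the systems shown in one tuple and the
    annotator's best/worst picks (field names here are illustrative).
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for a in annotations:
        shown.update(a["shown"])
        best[a["best"]] += 1
        worst[a["worst"]] += 1
    return {sys: (best[sys] - worst[sys]) / shown[sys] for sys in shown}

# Toy usage with invented picks over two 4-way tuples.
example = [
    {"shown": ["human", "gpt-4o", "deepl", "qwen"], "best": "human",  "worst": "qwen"},
    {"shown": ["human", "gpt-4o", "deepl", "qwen"], "best": "gpt-4o", "worst": "deepl"},
]
print(bws_scores(example))
# {'human': 0.5, 'gpt-4o': 0.5, 'deepl': -0.5, 'qwen': -0.5}
```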

Human evaluations are conducted by both student annotators (with basic linguistic and translation training) and professional translators with established publication records. The intra- and inter-annotator agreement analyses indicate that while MQM and SQM produce moderate agreement levels (Kendall's tau of approximately 0.43 to 0.66), the BWS method tends to yield slightly better consistency (Cohen's kappa averaging around 0.57). Notably, discrepancies are observed between student and professional evaluations: professional SQM evaluations prefer human translations at rates approaching 100% in certain language pairs (e.g., De-En and De-Zh), in stark contrast to student MQM and SQM, where preferences for human translations hover around 42%.
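Agreement statistics of this kind can be computed with standard library calls. The sketch below assumes paired per-segment ratings and uses invented values; it does not reproduce the paper's actual data or its exact pairing of annotators.

```python
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Two annotators' SQM ratings (0-6 scale) for the same segments -- invented values.
sqm_annotator_a = [5, 3, 6, 2, 4, 5, 1]
sqm_annotator_b = [4, 3, 6, 1, 5, 5, 2]
tau, _ = kendalltau(sqm_annotator_a, sqm_annotator_b)  # rank correlation for scalar ratings
print(f"Kendall's tau: {tau:.2f}")

# Two annotators' "best system" picks per BWS tuple -- categorical labels, also invented.
best_picks_a = ["human", "gpt-4o", "human", "deepl", "human"]
best_picks_b = ["human", "human",  "human", "deepl", "gpt-4o"]
print(f"Cohen's kappa: {cohen_kappa_score(best_picks_a, best_picks_b):.2f}")
```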

The paper also benchmarks several automatic evaluation metrics. In particular, GEMBA-MQM—both in its original form and a variant adapted with literary-specific knowledge—is compared against metrics such as Prometheus 2, XCOMET-XL, and XCOMET-XXL. Key findings include:

  • Correlation with Human Judgments: GEMBA-MQM consistently demonstrates moderate segment-level correlation with human MQM scores. However, despite its relative superiority over other automatic methods, it struggles to reliably differentiate between high-quality human translations and outputs from top-performing LLMs. For example, GEMBA-MQM (Literary) only favors human translations over those from top LLM systems in roughly 9.6% of cases, which is significantly lower than the near-94% preference indicated by professional SQM evaluations.
  • Aspect Sensitivity: An analysis of correlations across error categories reveals that all the state-of-the-art metrics predominantly capture Accuracy-related errors while exhibiting much weaker performance when evaluating Fluency, Style, and Terminology. For instance, while Accuracy errors correlate strongly between human and automatic scores in select language pairs, the correlations for the other essential dimensions of literary translation remain poor.
  • Model Biases: A novel analysis of syntactic similarity and lexical diversity indicates that LLM outputs tend to be more literal, displaying higher syntactic similarity to the source text, and less lexically diverse than human translations. Scatter plots of human MQM scores, syntactic similarity, and average pairwise lexical overlap show that human translations uniquely achieve high quality with lower syntactic similarity (approximately 0.21–0.23) and lower lexical overlap (around 18.9–23.0), suggesting that human translators introduce variability critical to literary expression (a toy overlap computation follows this list).
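
As a rough illustration of the lexical-overlap analysis, the sketch below computes average pairwise unigram (Jaccard) overlap among several translations of the same source paragraph. This is only an assumed formulation: the paper's exact overlap definition, and its tree-kernel-based syntactic similarity (FastKASSIM), are not reproduced here, and the example sentences are invented.

```python
from itertools import combinations

def unigram_overlap(a: str, b: str) -> float:
    """Symmetric unigram (Jaccard) overlap between two texts, in percent."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 100.0 * len(ta & tb) / len(ta | tb) if ta and tb else 0.0

def avg_pairwise_overlap(translations: list[str]) -> float:
    """Average pairwise lexical overlap across translations of one source
    paragraph; lower values indicate more diverse wording."""
    pairs = list(combinations(translations, 2))
    return sum(unigram_overlap(a, b) for a, b in pairs) / len(pairs)

# Toy usage: three invented renderings of the same sentence.
print(avg_pairwise_overlap([
    "the old man walked slowly along the shore",
    "slowly the old man wandered down the beach",
    "the elderly man strolled along the waterline",
]))
```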

Additional numerical findings include:

  • A clear performance gap: in certain scenarios, even the best automatic metric trails human evaluators by approximately 32–40 percentage points, particularly in distinguishing human translations from top machine outputs (a sketch of such a preference-rate computation follows this list).
  • System rankings based on both human evaluations and automatic metrics consistently place professional human translations at the top, with GPT-4o second, followed by Google Translate and DeepL (or Qwen where applicable). The margin between human translations and GPT-4o is a notable 1.8 points in professional SQM, underscoring the persistent gap in stylistic and creative quality.
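
One way to operationalize the "preference for human translations" statistic that recurs throughout (roughly 94% for professional SQM versus at most about 20% for automatic metrics) is sketched below: the fraction of segments on which the human translation outscores every MT system. This is an assumed formulation; the paper may instead compare only against the single best LLM or treat ties differently, and all scores shown are made up.

```python
def human_preference_rate(scores: dict[str, list[float]], human: str = "human") -> float:
    """Fraction of segments on which the human translation outscores every
    MT system (higher score = better). All data below is invented."""
    systems = [s for s in scores if s != human]
    n = len(scores[human])
    wins = sum(
        1 for i in range(n)
        if all(scores[human][i] > scores[s][i] for s in systems)
    )
    return wins / n

# Toy usage with made-up per-segment SQM-style scores.
print(human_preference_rate({
    "human":  [5.8, 5.2, 4.9, 6.0],
    "gpt-4o": [5.1, 5.5, 4.0, 5.2],
    "deepl":  [4.0, 4.8, 4.2, 4.9],
}))  # 0.75 in this illustrative example
```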

In summary, the research provides a comprehensive framework for evaluating literary translation quality through an extensive, verified corpus and multiple evaluation methods. It systematically documents the limitations of standard schemes such as MQM when applied to literary texts and highlights the need for more nuanced approaches (e.g., BWS and expert SQM) to capture aesthetic and stylistic subtleties. Moreover, the findings show that, despite significant advances in LLM performance, these systems generally produce translations that are more literal and less stylistically diverse than those crafted by professional human translators. This work offers a useful baseline for future metric development focused on the more challenging aspects of literary translation, including fluency, style, and terminology.

References (63)
  1. AI@Meta. 2024. Llama 3 model card.
  2. Nabil Al-Awawdeh. 2021. Translation between creativity and reproducing an equivalent original text. Psychology and Education Journal, 58(1):2559–2564.
  3. Tower: An open multilingual large language model for translation-related tasks. Preprint, arXiv:2402.17733.
  4. Jonas Belouadi and Steffen Eger. 2023. ByGPT5: End-to-end style-conditioned poetry generation with token-free language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7364–7381, Toronto, Canada. Association for Computational Linguistics.
  5. Missing information, unresponsive authors, experimental flaws: The impossibility of assessing the reproducibility of previous human evaluations in NLP. In Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, pages 1–10, Dubrovnik, Croatia. Association for Computational Linguistics.
  6. Findings of the WMT 2023 shared task on quality estimation. In Proceedings of the Eighth Conference on Machine Translation, pages 629–653, Singapore. Association for Computational Linguistics.
  7. FastKASSIM: A fast tree kernel-based syntactic similarity metric. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 211–231, Dubrovnik, Croatia. Association for Computational Linguistics.
  8. Yanran Chen and Steffen Eger. 2023. Menli: Robust evaluation metrics from natural language inference. Transactions of the Association for Computational Linguistics, 11:804–825.
  9. Evaluating diversity in automatic poetry generation. arXiv preprint arXiv:2406.15267.
  10. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
  11. Impact of translation workflows with and without MT on textual characteristics in literary translation. In Proceedings of the 1st Workshop on Creative-text Translation and Technology, pages 57–64, Sheffield, United Kingdom. European Association for Machine Translation.
  12. Training and meta-evaluating machine translation evaluation metrics at the paragraph level. In Proceedings of the Eighth Conference on Machine Translation, pages 996–1013, Singapore. Association for Computational Linguistics.
  13. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res., 22(1).
  14. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
  15. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
  16. Ana Guerberof-Arenas and Antonio Toral. 2022. Creativity in translation: Machine translation as a constraint for literary texts. Translation Spaces, 11(2):184–212.
  17. xcomet: Transparent Machine Translation Evaluation through Fine-grained Error Detection. Transactions of the Association for Computational Linguistics, 12:979–995.
  18. Damien Hansen and Emmanuelle Esperança-Rodier. 2022. Human-adapted mt for literary texts: Reality or fantasy? In NeTTT 2022, pages 178–190.
  19. BlonDe: An automatic evaluation metric for document-level machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1550–1565, Seattle, United States. Association for Computational Linguistics.
  20. Marzena Karpinska and Mohit Iyyer. 2023. Large language models effectively leverage document-level context for literary translation, but critical errors persist. In Proceedings of the Eighth Conference on Machine Translation, pages 419–451.
  21. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535.
  22. Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470, Vancouver, Canada. Association for Computational Linguistics.
  23. Tom Kocmi and Christian Federmann. 2023. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation, pages 768–775, Singapore. Association for Computational Linguistics.
  24. Waltraud Kolb. 2023. ‘i am a bit surprised’: Literary translation and post-editing processes compared. In Computer-Assisted Literary Translation, pages 53–68. Routledge.
  25. Kollektive-Intelligenz. 2024. Ki – aber wie? Übersetzertag 2024.
  26. Christoph Leiter and Steffen Eger. 2024. Prexme! large scale prompt exploration of open source llms for machine translation and summarization evaluation. arXiv preprint arXiv:2406.18528.
  27. The eval4nlp 2023 shared task on prompting large language models as explainable metrics. arXiv preprint arXiv:2310.19792.
  28. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  29. LLMs as narcissistic evaluators: When ego inflates evaluation scores. In Findings of the Association for Computational Linguistics ACL 2024, pages 12688–12701, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
  30. Using a new analytic measure for the annotation and analysis of mt errors on real data. In Proceedings of the 17th Annual conference of the European Association for Machine Translation, pages 165–172.
  31. Mqm-ape: Toward high-quality error annotation predictors with automatic post-editing in llm translation evaluators. arXiv preprint arXiv:2409.14335.
  32. Lieve Macken. 2024. Machine translation meets large language models: Evaluating ChatGPT’s ability to automatically post-edit literary texts. In Proceedings of the 1st Workshop on Creative-text Translation and Technology, pages 65–81, Sheffield, United Kingdom. European Association for Machine Translation.
  33. Ruth Martin. 2023. Reflections from an ai translation slam: (wo)man versus machine.
  34. Evgeny Matusov. 2019. The challenges of using neural machine translation for literature. In Proceedings of the Qualities of Literary Machine Translation, pages 10–19, Dublin, Ireland. European Association for Machine Translation.
  35. Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282.
  36. State of what art? a call for multi-prompt llm evaluation. Transactions of the Association for Computational Linguistics, 12:933–949.
  37. Monika Pfundmeier. Redaktion: Dorrit Bartel Nina George, André Hansen. 2023. Anwendungen von fortgeschrittener informatik und generativer systeme im buchsektor.
  38. Magdalena Nizioł. 2024. Künstliche intelligenz – gefahr oder hilfe für literarische Übersetzung?
  39. Salute the classic: Revisiting challenges of machine translation in the age of large language models. arXiv preprint arXiv:2401.08350.
  40. S Patro. 2015. Normalization: A preprocessing stage. arXiv preprint arXiv:1503.06462.
  41. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  42. Llukan Puka. 2011. Kendall’s Tau, pages 713–715. Springer Berlin Heidelberg, Berlin, Heidelberg.
  43. Natália Resende and James Hadley. 2024. The translator’s canvas: Using LLMs to enhance poetry translation. In Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 178–189, Chicago, USA. Association for Machine Translation in the Americas.
  44. Whence the 3 percent?: How far have we come toward decentering america’s literary preference? Global Perspectives, 5(1):93034.
  45. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
  46. Bryan Stroube. 2003. Literary freedom: Project gutenberg. XRDS: Crossroads, The ACM Magazine for Students, 10(1):3–3.
  47. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  48. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
  49. Exploring document-level literary machine translation with parallel paragraphs from world literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9882–9902.
  50. Common flaws in running human evaluation experiments in NLP. Computational Linguistics, 50(2):795–805.
  51. Label Studio: Data labeling software. Open source software available from https://github.com/heartexlabs/label-studio.
  52. The riddle of (literary) machine translation quality. Tradumàtica tecnologies de la traducció, (21):129–159.
  53. Proceedings of the 1st Workshop on Creative-text Translation and Technology. European Association for Machine Translation, Sheffield, United Kingdom.
  54. Rob Voigt and Dan Jurafsky. 2012. Towards a literary machine translation: The role of referential cohesion. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, pages 18–25, Montréal, Canada. Association for Computational Linguistics.
  55. Findings of the WMT 2023 shared task on discourse-level literary translation: A fresh orb in the cosmos of LLMs. In Proceedings of the Eighth Conference on Machine Translation, pages 55–67, Singapore. Association for Computational Linguistics.
  56. (perhaps) beyond human translation: Harnessing multi-agent collaboration for translating ultra-long literary texts. arXiv preprint arXiv:2405.11804.
  57. GuoFeng: A benchmark for zero pronoun recovery and translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11266–11278, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  58. Gpt-4 vs. human translators: A comprehensive evaluation of translation quality across languages, domains, and expertise levels. arXiv preprint arXiv:2407.03658.
  59. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
  60. Ran Zhang and Steffen Eger. 2024. Llm-based multi-agent poetry generation in non-cooperative environments. arXiv preprint arXiv:2409.03659.
  61. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  62. Discoscore: Evaluating text generation with bert and discourse coherence. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3865–3883.
  63. More human than human: Llm-generated narratives outperform human-llm interleaved narratives. In Proceedings of the 15th Conference on Creativity and Cognition, pages 368–370.
Authors (3)
  1. Ran Zhang
  2. Wei Zhao
  3. Steffen Eger