LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering (2505.05423v3)

Published 8 May 2025 in cs.CL and cs.AI

Abstract: The impact of LLMs has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LiTransProQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LiTransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LiTransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LiTransProQA reaches human-level evaluation performance comparable to trained student evaluators. It shows broad applicability to open-source models like LLaMa3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations. The code and datasets are available under: https://github.com/zhangr2021/TransProQA.

Summary

Evaluation of a Literary Translation Metric Utilizing LLM-Based Question-Answering

The paper introduces an innovative approach for assessing literary translations with a question-answering framework based on LLMs. The primary motivation is to address the limitations of conventional machine translation (MT) evaluation metrics, which often emphasize technical accuracy over artistic integrity. The authors propose LiTransProQA, a reference-free, LLM-driven evaluation framework that emphasizes literary quality aspects such as cultural context, retention of literary devices, and the author's voice.

Methodology and Results

LiTransProQA stands out by leveraging input from professional literary translators to craft an evaluation metric tailored to literary nuance, a domain where traditional metrics such as BLEU and METEOR fall short. The framework is built around a question-answering paradigm and evaluates translations without requiring reference translations. Extensive evaluations indicate that LiTransProQA significantly outperforms current metrics, with gains of up to 0.07 in correlation measures (Acc-eq and Kendall's tau), and that it surpasses the best state-of-the-art metrics by over 15 points in adequacy assessments.
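
To make the question-answering paradigm concrete, the following is a minimal sketch of how a reference-free, QA-style scorer might be structured. The question texts, the ask_llm helper, and the yes/no scoring scheme are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Minimal sketch of a reference-free, QA-style literary translation scorer.
# The questions, the ask_llm() helper, and the yes/no scoring are illustrative
# assumptions; they are not the exact prompts or logic used by LiTransProQA.

QUESTIONS = [
    "Does the translation preserve the literary devices of the source (e.g., metaphor, rhythm)?",
    "Does the translation convey culture-specific references appropriately?",
    "Does the translation retain the author's voice and register?",
]

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to any instruction-following LLM (API or local)."""
    raise NotImplementedError

def qa_score(source: str, translation: str) -> float:
    """Score a translation as the fraction of questions answered 'yes'; no reference translation is needed."""
    yes = 0
    for q in QUESTIONS:
        prompt = (
            f"Source text:\n{source}\n\nTranslation:\n{translation}\n\n"
            f"{q} Answer strictly with 'yes' or 'no'."
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            yes += 1
    return yes / len(QUESTIONS)
```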

The model incorporates a structured question set developed in collaboration with professional translators, ensuring alignment with expert evaluation criteria. The paper also discusses the inadequacy of existing approaches for evaluating literary translations, which demand reinterpretation across cultural and linguistic contexts. Furthermore, the evaluation demonstrates that LiTransProQA is better at recognizing high-quality human translations relative to machine outputs, approaching the performance of trained student evaluators.
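
Building on the sketch above, translator-derived importance ratings could be folded into the aggregation roughly as follows. The weight values here are invented for illustration and do not come from the paper, which derives its weights from surveyed literary translators.

```python
# Sketch of weighting question answers by professional-translator importance ratings.
# The weights below are invented for illustration; the paper derives its weights
# from professional translator input.

from typing import Dict

def weighted_score(answers: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-question answers (1.0 = 'yes', 0.0 = 'no'), normalized by total weight."""
    total = sum(weights.values())
    return sum(weights[q] * answers[q] for q in weights) / total

# Example with hypothetical weights emphasizing authorial voice:
answers = {"devices": 1.0, "culture": 0.0, "voice": 1.0}
weights = {"devices": 1.0, "culture": 1.5, "voice": 2.0}
print(round(weighted_score(answers, weights), 3))  # 0.667
```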

Practical Implications and Future Directions

The impact of LiTransProQA extends beyond academic metric evaluation; it serves as a tool for understanding the interplay between cultural authenticity and translation quality in literary contexts. The framework is compatible with open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating accessibility and applicability for broader use without requiring proprietary LLMs. This accessibility is crucial for evaluating texts that require local processing due to copyright or ethical concerns.
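
For copyright-sensitive texts that must stay on local hardware, an open-weight model could back the scorer sketched earlier. The snippet below uses the Hugging Face transformers pipeline as one plausible way to do this; the model identifier, generation settings, and chat-style call (which assumes a recent transformers release) are assumptions, not the paper's configuration.

```python
# Sketch of backing the QA scorer with a locally hosted open-weight model.
# The model identifier and generation settings are illustrative assumptions;
# the chat-style pipeline call requires a recent transformers release.

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-32B-Instruct",  # assumed checkpoint; any local instruct model works
    device_map="auto",
)

def ask_llm(prompt: str) -> str:
    """Answer a single evaluation question with a locally running model."""
    messages = [{"role": "user", "content": prompt}]
    out = generator(messages, max_new_tokens=8, do_sample=False)
    return out[0]["generated_text"][-1]["content"]
```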

Looking forward, this research has implications for improving MT outputs by recalibrating systems to prioritize literary expression. It also motivates exploring LLM-based metrics in other creative domains and suggests that future models could be developed with refined capabilities to better capture literary and non-literal nuance.

Conclusion

LiTransProQA represents a significant advance in machine translation evaluation for the literary domain, presenting a framework that closely aligns with the judgments of human translators. By integrating professional translator insights as weights and demonstrating improvements over conventional metrics, the proposed framework underscores the potential of LLMs in literary translation evaluation. This research lays the groundwork for future exploration of AI-assisted literary evaluation and translation methodologies that prioritize preserving cultural and artistic quality.