
MAS-LitEval: Multi-Agent System for Literary Translation Quality Assessment (2506.14199v1)

Published 17 Jun 2025 in cs.CL

Abstract: Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works. To address this, we propose MAS-LitEval, a multi-agent system using LLMs to evaluate translations based on terminology, narrative, and style. We tested MAS-LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur's Court, generated by various LLMs, and compared it to traditional metrics. MAS-LitEval outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances. This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers.

Summary

  • The paper introduces a multi-agent system where specialized agents evaluate terminology, narrative, and style in literary translations.
  • Methodology integrates LLM APIs and spaCy to maintain global context by chunking texts while ensuring consistency across evaluations.
  • Experimental results show MAS-LitEval outperforms traditional metrics like BLEU and METEOR, reflecting closer alignment with human judgment.

MAS-LitEval: Multi-Agent System for Literary Translation Quality Assessment

Introduction

The paper "MAS-LitEval: Multi-Agent System for Literary Translation Quality Assessment" presents a novel framework designed to address the inherent complexities of evaluating literary translations. Unlike technical translation, literary translation necessitates preserving stylistic nuances and cultural subtleties, which traditional evaluation metrics such as BLEU and METEOR fail to capture effectively. The proposed MAS-LitEval system leverages a multi-agent structure incorporating LLMs to assess translation quality along three dimensions (terminology, narrative, and style), thereby offering a comprehensive tool for Translation Quality Assessment (TQA) in literary domains.

Methodology

System Architecture

MAS-LitEval deploys a multi-agent system where each agent is tasked with assessing a specific aspect of translation quality:

  • Terminology Consistency Agent: This agent uses Named Entity Recognition (NER) to ensure terms like character names remain consistent throughout the translation.
  • Narrative Perspective Consistency Agent: This agent maintains the narrative's integrity by ensuring that perspectives are consistently translated.
  • Stylistic Consistency Agent: This agent evaluates whether the tone, rhythm, and stylistic attributes of the source are faithfully reproduced in the translation.
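As a concrete illustration of the first agent's contract, the sketch below scores whether each named entity is rendered by a single form throughout a translation. The function name and data shapes are assumptions for illustration only; the paper's agent performs this check with spaCy NER and an LLM rather than a pre-collected mapping.

```python
def terminology_consistency(renderings: dict[str, list[str]]) -> float:
    """Fraction of source-text entities rendered by one consistent form.

    `renderings` maps each source entity (e.g. a character name) to the
    forms the translation uses for it, e.g. as collected by NER over the
    translated chunks. A hypothetical stand-in for the real agent.
    """
    if not renderings:
        return 1.0
    consistent = sum(1 for forms in renderings.values() if len(set(forms)) == 1)
    return consistent / len(renderings)

# "le renard" rendered as both "the fox" and "the Fox" counts as inconsistent:
score = terminology_consistency({
    "le petit prince": ["the little prince", "the little prince"],
    "le renard": ["the fox", "the Fox"],
})
# → 0.5
```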

The outputs from these agents are combined by a coordinator into an Overall Translation Quality Score (OTQS), which weights stylistic consistency most heavily, reflecting its central role in literary translation.
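The aggregation step can be sketched as a weighted sum. The specific weights below (0.25 / 0.25 / 0.50) are assumptions chosen so that style dominates, as the summary describes; the paper's exact coordinator weights are not given here.

```python
# Hypothetical coordinator weights: style counts for half the score.
WEIGHTS = {"terminology": 0.25, "narrative": 0.25, "style": 0.50}

def overall_score(agent_scores: dict[str, float]) -> float:
    """Combine per-agent scores in [0, 1] into a weighted OTQS."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(w * agent_scores[dim] for dim, w in WEIGHTS.items())

otqs = overall_score({"terminology": 0.95, "narrative": 0.90, "style": 0.85})
# 0.25*0.95 + 0.25*0.90 + 0.50*0.85 = 0.8875
```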

Implementation Strategy

MAS-LitEval is implemented in Python, utilizing spaCy for preprocessing and providing interfaces to LLMs via APIs. Texts are divided into manageable chunks, but the system maintains a holistic context by ensuring agents evaluate global consistency across the entire text. By capitalizing on the inherent strengths of various LLMs and distributing tasks across specialized agents, MAS-LitEval promises improved accuracy and efficiency over conventional metrics.
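A minimal version of the chunking step might look like the following. The chunk size and overlap are illustrative values, not figures from the paper, and the real system feeds each chunk (together with accumulated global context) to the LLM agents.

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split `text` into ~`size`-character chunks that overlap by `overlap`
    characters, so agents see shared context at the seams."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A 1000-character text yields three overlapping chunks under the defaults:
demo = chunk_text("".join(str(i % 10) for i in range(1000)))
```

Per-chunk findings (entity mentions, narrative voice, style notes) would then be merged across chunks so each agent scores global consistency over the whole text rather than isolated pieces.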

Experimental Evaluation

The effectiveness of MAS-LitEval was evaluated on translations of "The Little Prince" and "A Connecticut Yankee in King Arthur's Court" produced by a mix of closed-source and open-source LLMs. Its performance was benchmarked against traditional baselines (BLEU, METEOR, and ROUGE) along with WMT-KIWI, using human reference translations for comparative analysis.

The results show that MAS-LitEval significantly outperforms traditional methods. Closed-source models such as Claude-3.7-sonnet demonstrated superior performance in maintaining stylistic and narrative nuances, achieving an OTQS of up to 0.890, whereas open-source models showed limitations in capturing these complexities, as reflected in their lower scores.

Findings

MAS-LitEval excels in providing a nuanced evaluation of literary translations, surpassing the capabilities of existing metrics. The multi-agent approach enables a detailed analysis addressing various quality aspects, leading to evaluation results that align more closely with human judgment when compared to traditional metrics. The inclusion of stylistic evaluation is particularly noteworthy, as it captures literary qualities that BLEU and METEOR typically overlook.
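The point about lexical-overlap metrics is easy to demonstrate: a toy unigram-precision score (the building block of BLEU) rewards a near-verbatim rendering over a freer but equally faithful paraphrase. The example sentences are illustrative, not drawn from the paper's test data.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the core ingredient of BLEU-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "one sees clearly only with the heart"
literal = "one sees well only with the heart"                 # near-verbatim
paraphrase = "it is only with the heart that we see rightly"  # freer but faithful

# The paraphrase is heavily penalised despite being a valid translation:
assert unigram_precision(literal, reference) > unigram_precision(paraphrase, reference)
```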

Discussion

The MAS-LitEval framework underscores the limitations of traditional TQA metrics and highlights the potential of multi-agent systems built on LLMs for sophisticated evaluation tasks. The performance gap between closed- and open-source models also raises questions about the accessibility and deployment cost of the model resources required for such evaluation.

However, challenges persist, notably in the subjective nature of stylistic and narrative evaluations, which may be influenced by biases in LLM training data. Future refinements might involve human-in-the-loop systems to enhance calibration, as well as expanded datasets encompassing broader genres and styles to improve generalizability.

Conclusion

MAS-LitEval marks a significant advance in literary TQA by capturing the multifaceted aspects of translation quality that lexical-overlap metrics miss. Its multi-agent architecture offers a practical, scalable tool for literary translation evaluation, making it a useful asset for translators and researchers working to improve machine-assisted literary translation. Future efforts should focus on addressing its current limitations and extending it to a wider range of literary works, strengthening its modeling of cultural and stylistic fidelity.
