LitEval-Corpus: Literary Translation Assessment
- LitEval-Corpus is a curated parallel literary dataset with aligned Korean-English segments designed for evaluating nuanced translation quality in literary texts.
- The methodology integrates agent-based scoring that captures style and narrative consistency in an overall translation quality score (OTQS), which is compared against metrics such as BLEU and METEOR.
- It employs paragraph-level segmentation and automated preprocessing to maintain context and accurately assess translation fidelity.
The term LitEval-Corpus designates the composite parallel literary dataset employed in the MAS-LitEval framework for Translation Quality Assessment (TQA) of literary works rendered from Korean into English. While no distinct resource is published explicitly as "LitEval-Corpus," the evaluation protocol and dataset composition are detailed in the MAS-LitEval paper (Kim et al., 17 Jun 2025), which uses this specially curated corpus to benchmark both open-source and closed-source LLMs against professional human translations and established metrics such as BLEU, METEOR, and WMT-KIWI.
1. Composition and Scope
LitEval-Corpus consists of parallel Korean-English segments derived from two canonical literary works:
- The Little Prince (original French, Korean translation as source; ~5,000 words)
- A Connecticut Yankee in King Arthur's Court (original English, Korean translation as source; ~4,000 words)
To increase representativeness, additional parallel literary data from Project Gutenberg Korea and Project Gutenberg were incorporated. Texts are segmented at the paragraph and sentence-pair level, allowing evaluation of individual and aggregate translation features.
| Work | # Paragraphs | # Sentence Pairs | Avg. Sentences/Para (Src) | Avg. Sentences/Para (Tgt) |
|---|---|---|---|---|
| The Little Prince (Kr–En) | 274 | 1812 | 6.6 | 7.0 |
| A Connecticut Yankee... (Kr–En) | 205 | 2545 | 12.2 | 12.8 |
All source segments were translated by various LLMs and professional translators, preserving paragraph alignment for context-sensitive assessment.
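As a rough illustration of how such a paragraph-aligned record might be represented, the sketch below defines a simple container for one source/target paragraph pair. The schema (field names such as `src_sentences` and `translator`) is an assumption for exposition, not the published data format.

```python
from dataclasses import dataclass


@dataclass
class AlignedParagraph:
    """One paragraph-aligned unit of a parallel literary corpus (illustrative schema)."""
    work: str                   # e.g. "The Little Prince (Kr-En)"
    paragraph_id: int           # position within the work
    src_sentences: list[str]    # Korean source sentences
    tgt_sentences: list[str]    # English target sentences (LLM or human translation)
    translator: str = "unknown" # model name or "human"

    @property
    def sentence_pairs(self) -> list[tuple[str, str]]:
        # Pair sentences positionally; real literary alignment may be many-to-many.
        return list(zip(self.src_sentences, self.tgt_sentences))


# Example record shaped like the corpus statistics above (placeholder content).
para = AlignedParagraph(
    work="The Little Prince (Kr-En)",
    paragraph_id=1,
    src_sentences=["어린 왕자는 여우를 만났다.", "여우가 말했다."],
    tgt_sentences=["The little prince met the fox.", "The fox spoke."],
    translator="gpt-4o",
)
print(len(para.sentence_pairs))  # -> 2
```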
2. Linguistic Features and Corpus Structure
The corpus is structured to retain the stylistic and narrative richness of literary texts, with full-document context preserved across 4096-token preprocessing chunks. Each chunk maintains global context for:
- Recurring named entities (for terminology consistency)
- Narrative voice and perspective (for narrator alignment)
- Stylistic elements such as tone, rhythm, and literary devices
The prevalence of long paragraphs, multi-sentence structures, and distinctive phraseology in literary discourse ensures that evaluation probes translation nuances beyond lexico-semantic correspondence.
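A minimal sketch of the 4096-token chunking described above, using a whitespace word count as a stand-in for the actual tokenizer (which the source does not specify); paragraphs are kept whole so that local narrative context survives chunk boundaries, while global state (entities, narrator) is carried across chunks by the agents.

```python
def chunk_paragraphs(paragraphs: list[str], max_tokens: int = 4096) -> list[list[str]]:
    """Group whole paragraphs into chunks of at most ~max_tokens.

    Token counts are approximated by whitespace splitting here; the actual
    framework presumably uses the evaluator model's tokenizer. Paragraphs are
    never split, so a single oversized paragraph forms its own chunk.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # crude token proxy
        if current and current_len + n > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append(current)
    return chunks
```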
3. Annotation and Preprocessing
Annotation within LitEval-Corpus is not manual in the traditional sense but is constructed via alignment and automatic segmentation:
- Source language (Korean) and target language (English) sentences/paragraphs are mapped in a parallel structure.
- Preprocessing is performed using spaCy for named entity recognition (critical for terminology evaluation).
- Text chunks (4096 tokens) are used to align evaluation intervals, while agents in the MAS-LitEval system maintain cross-chunk global state (see the sketch below).
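An illustrative sketch of the entity inventory such spaCy-based preprocessing implies; the specific pipeline (`en_core_web_sm` for the English target side) and entity types are assumptions, since the source names only spaCy itself, and the Korean side would need a separate pipeline (e.g. `ko_core_news_sm`).

```python
from collections import Counter

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp_en = spacy.load("en_core_web_sm")


def entity_inventory(paragraphs: list[str]) -> Counter:
    """Count named-entity surface forms across paragraphs, so a terminology
    agent can flag drift (e.g. 'the little prince' vs. 'Little Prince')."""
    counts: Counter = Counter()
    for doc in nlp_en.pipe(paragraphs):
        for ent in doc.ents:
            if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}:
                counts[ent.text] += 1
    return counts
```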
Unlike morphologically annotated corpora such as Curras + Baladi (Haff et al., 2022), LitEval-Corpus carries no explicit morphological or syntactic annotation layer; "annotation" here refers to the segmentation and alignment that enable multi-agent analysis.
4. Application in MAS-LitEval
LitEval-Corpus serves as the evaluation bed for the MAS-LitEval multi-agent system, which benchmarks translation outputs as follows:
- Reference-free evaluation: MAS-LitEval compares source and LLM-generated output without requiring human references, except for baseline comparison with traditional metrics.
- Multi-agent scoring: Terminology, narrative, and stylistic consistency agents compute scores $S_{\text{term}}$, $S_{\text{narr}}$, $S_{\text{style}} \in [0, 1]$. The overall translation quality score (OTQS) is computed as the weighted sum
  $$\text{OTQS} = w_{\text{term}} S_{\text{term}} + w_{\text{narr}} S_{\text{narr}} + w_{\text{style}} S_{\text{style}},$$
  with weights $w_{\text{term}}$, $w_{\text{narr}}$, $w_{\text{style}}$ chosen so that style is prioritized (see the sketch after this list).
- Global tracking: Agents capture inter-paragraph and cross-sentential phenomena, such as name drift or narrative voice inconsistency, leveraging the full parallel segment structure of the corpus.
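A minimal sketch of the OTQS aggregation under the stated prioritization of style; the numeric weights below are placeholders chosen only to make style the largest component, since the exact values are not reproduced here.

```python
def otqs(s_term: float, s_narr: float, s_style: float,
         w_term: float = 0.3, w_narr: float = 0.3, w_style: float = 0.4) -> float:
    """Overall translation quality score as a weighted sum of agent scores.

    Agent scores are assumed to lie in [0, 1]; the default weights are
    placeholders (style weighted highest), and the weights are assumed to
    sum to 1 so the result stays in [0, 1].
    """
    assert abs(w_term + w_narr + w_style - 1.0) < 1e-9, "weights should sum to 1"
    return w_term * s_term + w_narr * s_narr + w_style * s_style


# Example: strong terminology and narrative consistency, weaker style.
print(round(otqs(0.95, 0.90, 0.80), 3))  # -> 0.875
```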
5. Performance Benchmarking and Metric Comparison
LitEval-Corpus enables model comparison on both coarse and fine-grained levels. Evaluation results reported in the MAS-LitEval paper (Kim et al., 17 Jun 2025) include:
| Model | Type | Work | BLEU | METEOR | ROUGE-1 | ROUGE-L | WMT-KIWI | OTQS |
|---|---|---|---|---|---|---|---|---|
| claude-3.7-sonnet | Closed | LP | 0.28 | 0.65 | 0.55 | 0.45 | 0.87 | 0.890 |
| gpt-4o | Closed | LP | 0.30 | 0.67 | 0.57 | 0.47 | 0.85 | 0.875 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
The high OTQS (up to 0.890) illustrates the corpus's effectiveness for benchmarking translation fidelity, particularly for literary nuances not captured by BLEU or METEOR.
Correlation analysis shows that OTQS aligns more closely with WMT-KIWI (0.93) than with BLEU (0.62) or METEOR (0.70), suggesting that LitEval-Corpus supports a more nuanced assessment of literary translation.
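The kind of correlation check behind these figures can be sketched with SciPy's Pearson correlation; the per-model score lists below are placeholders shaped like the table above, not the paper's data.

```python
from scipy.stats import pearsonr

# Placeholder per-(model, work) scores; NOT the paper's reported values.
otqs_scores = [0.890, 0.875, 0.810, 0.790]
kiwi_scores = [0.87, 0.85, 0.80, 0.78]
bleu_scores = [0.28, 0.30, 0.22, 0.20]

for name, other in [("WMT-KIWI", kiwi_scores), ("BLEU", bleu_scores)]:
    r, p = pearsonr(otqs_scores, other)
    print(f"OTQS vs {name}: r={r:.2f} (p={p:.3f})")
```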
6. Corpus Limitations and Prospects
LitEval-Corpus covers two literary works, enriched with public-domain sources, and focuses on paragraph-aligned parallel Korean-English data. Its limited coverage (prose only; no poetry or drama) constrains generalizability across broader literary genres. Stylistic interpretation by LLM-based agents remains subjective, and no explicit human annotation or inter-annotator consensus exists.
A plausible implication is that expansion to a more diverse corpus may improve metric robustness and enable direct modeling of other literatures and source languages. The absence of a distinct downloadable LitEval-Corpus resource means all insights are derived from the specific experimental protocol rather than a standardized dataset.
7. Impact and Accessibility
LitEval-Corpus functions as the empirical foundation for reference-free, multi-agent literary translation assessment, supporting scalable evaluation across different LLM outputs and potentially extensible to other genres and language pairs. By enabling document-level and stylistic evaluation, it advances methodological standards for literary TQA beyond string-based similarity metrics.
No direct public availability portal for LitEval-Corpus is specified, and release of the full corpus as a standalone resource is pending future work, in contrast to resources like Curras + Baladi (Haff et al., 2022), which are publicly accessible at portal.sina.birzeit.edu/curras.
In summary, LitEval-Corpus is the parallel, paragraph-aligned Korean-English dataset used in MAS-LitEval for multidimensional assessment of literary translation quality, facilitating both agent-based analysis and benchmarking of modern LLM translation models. Its structure and function exemplify emerging paradigms in corpus design for high-fidelity evaluation of creative and literary translation.