The paper presents a technical framework for integrating long chain-of-thought (CoT) reasoning into neural machine translation (MT) for literature, focusing on the challenging task of translating figurative language such as similes and metaphors. The work is structured around both data synthesis and model fine-tuning, with key contributions that can be summarized as follows:
Data Collection and Pre-processing
- Literature Mining:
- The approach begins by extracting approximately 577.6K sentences from over 400 public-domain books sourced from Project Gutenberg.
- A filtering process first selects sentences of appropriate length, then employs an LLM (specifically Qwen2.5-72B-Instruct) to identify sentences that contain similes or metaphors.
- The selected sentences then undergo a literal-translation check: if the LLM-generated literal translation is judged inadequate for native-speaker comprehension, the sentence is marked as requiring deeper reasoning and retained. This filtering yields approximately 63K pre-selected sentences.
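The two-stage filter can be sketched as follows. Note this is a minimal illustration: the keyword heuristic and the always-true adequacy check below are trivial stand-ins for the Qwen2.5-72B-Instruct prompts the paper actually uses, and the length bounds are assumed values.

```python
def contains_figurative_language(sentence: str) -> bool:
    # Hypothetical stand-in: the paper prompts an LLM for this judgment.
    return " like " in sentence or " as if " in sentence

def literal_translation_is_inadequate(sentence: str) -> bool:
    # Hypothetical stand-in for the literal-translation adequacy check;
    # here we assume every figurative sentence needs deeper reasoning.
    return True

def select_sentences(sentences, min_len=5, max_len=80):
    """Length filter -> figurative-language filter -> adequacy filter."""
    kept = []
    for s in sentences:
        n = len(s.split())
        if not (min_len <= n <= max_len):
            continue
        if not contains_figurative_language(s):
            continue
        if literal_translation_is_inadequate(s):
            kept.append(s)
    return kept

sentences = [
    "Her smile was like sunshine after a week of rain.",
    "He went home.",
    "The committee approved the budget without further discussion yesterday.",
]
kept = select_sentences(sentences)
```

Only the first sentence survives: the second fails the length filter and the third contains no figurative marker.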
Multi-Agent Framework for Long Thought Synthesis
- The core of the methodology is a multi-agent system that simulates an iterative translation process. This system comprises three distinct agents:
- Translator:
- Initially performs a word-level translation by identifying key tokens within the source sentence and aligning them with corresponding target language counterparts.
- Generates a preliminary complete translation (denoted t_0) based on both the source sentence and the bilingual keyword pairs.
- Advisor:
- Reviews each translation iteration by providing detailed feedback (f_k) aimed at refining the semantic and cultural fidelity of the translation.
- Evaluator:
- Assigns an overall quality score (s_k) to the translation at each iteration based on pre-defined evaluation criteria.
- Iterative Refinement:
- Starting from the preliminary output, the system enters a refinement loop where the translator uses the previous translation, advisor feedback, and evaluator score to generate an improved translation.
- The process terminates when the translation score exceeds a preset threshold or when the maximum number of iterations is reached.
- Post-processing:
- To enhance the fluency and readability of the long thought process, the synthesized chain-of-thought is reformulated using GPT-4o, resulting in self-reflective descriptions that encapsulate the iterative reasoning path.
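The iterative refinement loop above can be sketched as follows. The three toy agents are deterministic stand-ins for the LLM-backed Translator, Advisor, and Evaluator; the threshold and iteration cap are assumed values, not the paper's.

```python
def refine(source, translator, advisor, evaluator, threshold=90, max_iters=5):
    """Iterate: translate -> advisor feedback f_k -> evaluator score s_k,
    stopping once the score reaches the threshold or max_iters is hit.
    Returns the final translation and the (t_k, f_k, s_k) trace that
    would later be reformulated into a long chain of thought."""
    trace = []
    translation = translator(source)
    for _ in range(max_iters):
        feedback = advisor(source, translation)
        score = evaluator(source, translation)
        trace.append((translation, feedback, score))
        if score >= threshold:
            break
        translation = translator(source, prev=translation, feedback=feedback)
    return translation, trace

# Toy agents: each revision appends "+", and scores rise 60 -> 75 -> 92.
def toy_translator(source, prev=None, feedback=None):
    return "draft" if prev is None else prev + "+"

def toy_advisor(source, translation):
    return f"improve cultural fidelity of '{translation}'"

def make_toy_evaluator(scores):
    it = iter(scores)
    return lambda source, translation: next(it)

final, trace = refine("He has a heart of stone.",
                      toy_translator, toy_advisor,
                      make_toy_evaluator([60, 75, 92]))
```

With these stubs the loop runs three iterations and stops when the score 92 crosses the threshold of 90, leaving a three-step trace.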
Data Statistics and Training Setup
- The refined process yields 22,264 long thought MT samples, with each sample containing an average of over 500 tokens in the "thought" segment, analogous to previous chain-of-thought datasets developed for math and coding tasks.
- The data is split into training, validation, and testing sets, facilitating the supervised fine-tuning (SFT) of two model variants: DRT-o1-7B and DRT-o1-14B.
- The two models are built on the Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct backbones, respectively, and fine-tuned using DeepSpeed with ZeRO-3 optimization. Training runs on 8×NVIDIA A100 GPUs with a learning rate of 1e-5 for 3 epochs, totaling 70 GPU hours for the 7B model and 124 GPU hours for the 14B model.
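For reference, the reported training setup can be gathered into a single configuration sketch; the key names are illustrative and do not reflect the authors' actual launch scripts.

```python
# Hyperparameters as reported in the paper, collected in one place.
sft_config = {
    "backbones": {
        "DRT-o1-7B": "Qwen2.5-7B-Instruct",
        "DRT-o1-14B": "Qwen2.5-14B-Instruct",
    },
    "framework": "DeepSpeed (ZeRO-3)",
    "hardware": "8x NVIDIA A100",
    "learning_rate": 1e-5,
    "epochs": 3,
    "gpu_hours": {"DRT-o1-7B": 70, "DRT-o1-14B": 124},
}
```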
Experimental Evaluation
- Metrics:
- The evaluation uses BLEU for n-gram overlap, together with reference-free CometKiwi and reference-based CometScore for semantic adequacy.
- Results:
- DRT-o1-7B demonstrates significant performance improvements, achieving an increase of approximately 8.26 BLEU, 1.31 CometKiwi, and 3.36 CometScore over its Qwen2.5-7B-Instruct baseline.
- Similarly, DRT-o1-14B shows an enhancement of roughly 7.33 BLEU, 0.15 CometKiwi, and 1.66 CometScore compared to its Qwen2.5-14B-Instruct counterpart.
- Notably, DRT-o1-7B also outperforms a larger baseline (QwQ-32B-Preview) by 7.82 BLEU and 1.46 CometScore, underscoring the benefits of incorporating long thought in translation tasks.
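As a concrete reference for the n-gram metric, here is a minimal single-sentence BLEU sketch. The paper would use a standard implementation (e.g., sacrebleu); the zero-precision smoothing below is a simplification of the usual smoothing methods.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Single-reference, sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
        precisions.append(overlap / max(sum(hyp_ng.values()), 1))
    # Crude smoothing: avoid log(0) when an n-gram order has no matches.
    precisions = [p if p > 0 else 1e-9 for p in precisions]
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0, and any divergence from the reference lowers the score, which is why the +7-8 BLEU gains above indicate substantially closer surface overlap with the references.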
Relation to Previous Work
- The paper positions its contributions alongside established O1-like models, which have traditionally focused on mathematical and coding tasks requiring complex reasoning. By adapting the chain-of-thought (CoT) paradigm for MT, the proposed framework bridges a gap in applying long reasoning sequences to tasks where cultural and semantic nuances (such as those in literary translation) challenge direct translation techniques.
- The multi-agent iterative refinement strategy deviates from methods like Monte Carlo Tree Search (MCTS) and data distillation used in previous studies, instead explicitly modeling feedback loops that simulate human-like revision processes.
Conclusion
The paper presents a methodologically rigorous approach to enhancing MT for literature by leveraging long chain-of-thought reasoning. The formulation of a multi-agent system comprising translation, critical feedback, and evaluation enables the synthesis of detailed reasoning paths that guide translation refinements, yielding significant improvements on standard translation quality metrics. This work meaningfully extends chain-of-thought methodologies to the translation domain, particularly under challenging conditions where figurative language demands an additional layer of reasoning.