DocMT: Document-Level Machine Translation

Updated 10 December 2025
  • Document-Level Machine Translation is a specialized field that translates full documents by incorporating global context to resolve issues like coreference, lexical cohesion, and consistent terminology.
  • It employs various techniques such as context-aware encoders, hierarchical attention, memory modules, and in-context prompting to achieve document coherence and fluency.
  • Evaluation with metrics such as d-BLEU, document-level COMET, and specialized discourse test suites shows that document-level methods often yield only modest gains on conventional scores while substantially improving overall translation quality and coherence.

Document-Level Machine Translation (DocMT) is the field of machine translation concerned with translating coherent units of text (documents), rather than treating sentences as independent inputs. This regime aims to address discourse-level phenomena such as coreference resolution, lexical cohesion, consistent terminology, and the preservation of discourse markers—challenges unaddressable by sentence-level MT systems that ignore broader context.

1. Motivations, Definitions, and Discourse Phenomena

DocMT is motivated by pervasive discourse phenomena that exist in natural language documents. Pronouns often refer to antecedents in previous sentences, lexical choices for polysemous words depend on discourse context, and terminological as well as stylistic consistency only makes sense at document scope (Maruf et al., 2019). The formal objective is to model the conditional probability

P(\mathbf{Y}|\mathbf{X},C)

where \mathbf{X} = \{x^1,\dots,x^K\} is the source document, \mathbf{Y} = \{y^1,\dots,y^K\} is the target document, and C denotes the available document context (preceding/following sentences, the entire document, or even external knowledge) (Maruf et al., 2019). A typical left-to-right decomposition is

P(\mathbf{Y}|\mathbf{X}) = \prod_{j=1}^K P(y^j|x^j,C^j)

with C^j denoting local or global context.
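
To make the decomposition concrete, the following is a minimal Python sketch; `score_sentence` is a hypothetical stand-in for any model that returns log P(y^j | x^j, C^j) and is not tied to a specific system.

```python
# Minimal sketch of the left-to-right DocMT decomposition above.
# `score_sentence` is a hypothetical scorer returning log P(y^j | x^j, C^j).
from typing import Callable, Sequence


def doc_log_prob(
    src_doc: Sequence[str],
    tgt_doc: Sequence[str],
    score_sentence: Callable[[str, str, Sequence[str]], float],
    context_size: int = 1,
) -> float:
    """Sum per-sentence conditional log-probs over the document."""
    assert len(src_doc) == len(tgt_doc)
    total = 0.0
    for j, (x_j, y_j) in enumerate(zip(src_doc, tgt_doc)):
        context = tgt_doc[max(0, j - context_size):j]  # C^j: preceding target sentences
        total += score_sentence(y_j, x_j, context)     # log P(y^j | x^j, C^j)
    return total
```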

Discourse errors in MT manifest as incorrect pronoun gender/number, broken referential chains, inconsistent translation of entities or technical terms, and disfluent or incoherent text (Maruf et al., 2019, Kim et al., 2019). Addressing these requires context-aware modeling beyond the reach of classic sentence-level NMT.

2. Core Modeling Techniques

A variety of neural DocMT architectures have been proposed:

(a) Context-Aware Encoders and Decoders

  • Multi-source or multi-encoder architectures process the current sentence and the context sentence(s) in parallel, fusing representations via gating or hierarchical attention (Maruf et al., 2019, Kim et al., 2019, Macé et al., 2019); a minimal gated-fusion sketch follows this list.
  • Hierarchical models extract sentence embeddings and combine them for global context.
  • G-Transformer introduces "group-attention," restricting attention scope across sentence boundaries to avoid local minima in attention learning and scaling to full documents (Bao et al., 2021).
  • Memory-augmented NMT integrates external source and target memories via memory networks or recurrent memory modules, supporting global context flow throughout translation (Maruf et al., 2017, Feng et al., 2022).
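
As a minimal illustration of the gated fusion mentioned above, the following PyTorch sketch mixes current-sentence and context encodings; the module and tensor shapes are illustrative assumptions, not a specific published architecture.

```python
# Illustrative gated fusion of current-sentence and context representations,
# as used in multi-encoder DocMT architectures (sketch, not a published model).
import torch
import torch.nn as nn


class GatedContextFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_cur: torch.Tensor, h_ctx: torch.Tensor) -> torch.Tensor:
        # h_cur, h_ctx: [batch, seq_len, d_model] encodings of the current
        # sentence and the (attended) context sentence(s).
        g = torch.sigmoid(self.gate(torch.cat([h_cur, h_ctx], dim=-1)))
        return g * h_cur + (1.0 - g) * h_ctx  # element-wise gated mixture
```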

(b) Structured Decoding/Inference

  • Incremental strategies (such as forced decoding with neighboring sentence prefixes) ensure complete translation and smooth transitions, with memory modules for consistent terminology (Guo et al., 15 Jan 2025).
  • Agentic and graph-based frameworks (e.g. GRAFT) first segment documents into discourse units, build a dependency DAG for context flow, and rely on LLM-based agents to manage segmentation, edge determination, memory extraction, and translation, propagating entity/pronoun/phrase mappings and connectives (Dutta et al., 4 Jul 2025).
  • Two-pass systems first generate sentence-wise translations, then refine the full document with post-editing models or automatic rewriters for coherence (Junczys-Dowmunt, 2019, Unanue et al., 2020, Bao et al., 2023); a schematic sketch follows this list.
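
A schematic of the two-pass approach, assuming hypothetical `translate_sentence` and `refine_document` callables (not concrete APIs from the cited systems):

```python
# Schematic two-pass pipeline: translate sentence-by-sentence, then rewrite
# the draft document for coherence. Both callables are hypothetical stand-ins.
from typing import Callable, List


def two_pass_translate(
    src_sentences: List[str],
    translate_sentence: Callable[[str], str],
    refine_document: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    # Pass 1: context-free sentence translations (the draft).
    draft = [translate_sentence(s) for s in src_sentences]
    # Pass 2: a document-level rewriter sees the whole source and draft at once
    # and fixes pronouns, terminology, and discourse connectives.
    return refine_document(src_sentences, draft)
```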

(c) In-Context Learning and Prompting with LLMs

  • Context-aware prompting (CAP) uses the model's own attention to dynamically select relevant context, summarizes it, retrieves few-shot demonstrations from an external datastore, and composes prompts accordingly (Cui et al., 11 Jun 2024).
  • Multi-knowledge fusion generates and integrates summarization and entity translation knowledge to guide LLM document translation, fusing outputs using quality estimation (Liu et al., 15 Mar 2025).
  • Whole-document prompting for instruction-tuned LLMs (e.g. a full DOC prompt) yields better coherence and accuracy than sentence batching, even without document-level fine-tuning (Sun et al., 28 Oct 2024); a minimal prompting sketch follows this list.
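
A minimal sketch of whole-document prompting; the prompt wording and the `llm_generate` callable are illustrative assumptions rather than the exact prompts used in the cited work.

```python
# Minimal sketch of whole-document prompting for an instruction-tuned LLM.
# `llm_generate` is a hypothetical text-generation callable.
from typing import Callable, List


def doc_prompt(src_sentences: List[str], src_lang: str, tgt_lang: str) -> str:
    doc = "\n".join(src_sentences)
    return (
        f"Translate the following {src_lang} document into {tgt_lang}. "
        "Keep pronouns, terminology, and style consistent across sentences, "
        "and return one translated sentence per line.\n\n" + doc
    )


def translate_document(
    src_sentences: List[str],
    llm_generate: Callable[[str], str],
    src_lang: str = "German",
    tgt_lang: str = "English",
) -> List[str]:
    output = llm_generate(doc_prompt(src_sentences, src_lang, tgt_lang))
    return [line for line in output.splitlines() if line.strip()]
```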

(d) Data-Driven and Hybrid Approaches

  • Back-translation at the document level (DocBT) creates synthetic document-aligned bitext from monolingual corpora, achieving substantial gains that often rival or exceed architecture-heavy models (Ma et al., 2021); a construction sketch follows this list.
  • Posteriors from document-level data augmentation models inject alternative plausible targets to smooth the training distribution and reduce overfitting (Bao et al., 2023).
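
A rough sketch of how a DocBT pseudo-document pair can be constructed, assuming a hypothetical target-to-source `back_translate` model:

```python
# Rough sketch of document-level back-translation (DocBT): back-translate a
# target-language monolingual document sentence-by-sentence, preserving order,
# so the synthetic source forms a document-aligned pair with the original.
from typing import Callable, List, Tuple


def build_docbt_pair(
    tgt_document: List[str],
    back_translate: Callable[[str], str],  # hypothetical target->source MT
) -> Tuple[List[str], List[str]]:
    synthetic_src = [back_translate(y) for y in tgt_document]
    # (synthetic source document, authentic target document)
    return synthetic_src, tgt_document
```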

3. Training Objectives, Data Regimes, and Model Adaptation

DocMT training utilizes both parallel and monolingual data:

  • Hierarchical and memory-based models are trained on pseudo-likelihood or multi-task objectives, e.g., jointly optimizing translation and context prediction losses (Zhang et al., 2020, Maruf et al., 2017).
  • Domain adaptation and parameter-efficient fine-tuning (LoRA) are crucial for LLMs when targeting DocMT, achieving data efficiency and reducing overfitting (Wu et al., 12 Jan 2024).
  • Zero-shot transfer is feasible via multilingual modeling: document-level data in a teacher language pair is mixed with sentence-level data in the student pair within a single Transformer, with careful tuning of the batch proportion (commonly p ≈ 0.3–0.5) for optimal transfer (Zhang et al., 2021); a sampling sketch follows this list.
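
An illustrative sampling sketch for the batch-mixing scheme; the dataset objects and the value of p are assumptions for illustration.

```python
# Illustrative batch mixing for zero-shot document-level transfer: with
# probability p, draw a document-level batch from the teacher language pair;
# otherwise draw a sentence-level batch from the student pair.
import random
from typing import Iterator, List, Sequence


def mixed_batches(
    teacher_doc_batches: Sequence[List[str]],
    student_sent_batches: Sequence[List[str]],
    p: float = 0.3,
    steps: int = 1000,
    seed: int = 0,
) -> Iterator[List[str]]:
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < p:
            yield rng.choice(teacher_doc_batches)   # document-level, teacher pair
        else:
            yield rng.choice(student_sent_batches)  # sentence-level, student pair
```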

Supervised DocMT is bottlenecked by the availability of parallel, document-aligned corpora. Synthetic document alignment (via sentence-level back-translation aggregated into pseudo-documents) and monolingual document augmentation (pretraining, LLMs for fusion, etc.) are widely used for improving data coverage and robustness (Petrick et al., 2023, Ma et al., 2021).

4. Handling Document Structure: Context Granularity and Flow

Architectures vary in context granularity and the method for context selection:

  • Minimal context (typically a single preceding sentence) achieves most of the available BLEU/TER gains; additional sentences rarely add value and may even degrade performance due to attention dilution or overparameterization (Kim et al., 2019, Macé et al., 2019).
  • Bypassing exhaustive context modeling, approaches such as named-entity or content-word filtering of the context realize nearly the full benefit with only a fraction of the tokens, yielding computational savings (Kim et al., 2019); see the filtering sketch after this list.
  • Group-locality bias (G-Transformer), document synthetic embeddings (average-pooled word vectors), and explicit dynamic memory (Learn-to-Remember Transformer) provide tractable ways to process arbitrarily long documents (Bao et al., 2021, Macé et al., 2019, Feng et al., 2022).
  • Modular pipelines (GRAFT) with segmentation followed by graph-based context propagation explicitly align system context flow with linguistic discourse structure (Dutta et al., 4 Jul 2025).
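
A sketch of content-word/entity filtering of the preceding sentence using spaCy; the model name and the POS set are illustrative assumptions, and the spaCy model must be installed separately.

```python
# Sketch of content-word context filtering: keep only content words and
# named entities from the previous sentence before feeding it to the
# context encoder. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "NUM"}


def filter_context(prev_sentence: str) -> str:
    doc = nlp(prev_sentence)
    kept = [tok.text for tok in doc if tok.pos_ in CONTENT_POS or tok.ent_type_]
    return " ".join(kept)
```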

5. Evaluation Protocols and Metrics

Standard MT metrics like BLEU and chrF are mostly insensitive to document-level phenomena. Recent DocMT research therefore employs a portfolio of document-aware metrics:

  • Document-level BLEU (d-BLEU) and document-COMET provide better coverage of cohesion and adequacy, with s-COMET for segment-level quality (Guo et al., 15 Jan 2025); a short d-BLEU sketch follows the table below.
  • Specialized metrics for lexical term consistency (e.g., LTCR), discourse rewards (lexical cohesion LC, coherence COH), zero-pronoun translation accuracy (ZPT), and coreference accuracy are used for detailed analysis (Guo et al., 15 Jan 2025, Unanue et al., 2020, Cui et al., 11 Jun 2024).
  • Contrastive test suites (PROTEST, Müller/Bawden sets) directly probe pronoun and discourse connective accuracy (Maruf et al., 2019).
  • Direct human assessment, and more recently GPT-4 as an automatic judge for multi-dimensional scoring (fluency, accuracy, coherence), becomes necessary at document scope, given BLEU's poor correlation with discourse quality (Sun et al., 28 Oct 2024).

A selection of principal evaluation metrics is summarized below:

| Metric | Measures | Best For |
| --- | --- | --- |
| BLEU / d-BLEU | n-gram overlap | General adequacy, not discourse |
| COMET / d-COMET | Neural adequacy/quality | Document-level quality |
| LC / COH | Lexical cohesion, semantic coherence | Discourse-level phenomena |
| Pronoun/consistency tests | Coreference, cohesion | Contextual accuracy |
| LTCR | Term consistency | Terminological fidelity |
| LLM-as-a-judge | Fluency, accuracy, coherence | Human-like evaluation |
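
As referenced above, a minimal d-BLEU computation concatenates each document's sentences into one segment before scoring with corpus BLEU; the sketch below uses sacrebleu and assumes document boundaries are given as lists of sentences.

```python
# Minimal sketch of d-BLEU: join each document's sentences into a single
# segment and score with corpus BLEU via sacrebleu.
from typing import List

import sacrebleu


def d_bleu(sys_docs: List[List[str]], ref_docs: List[List[str]]) -> float:
    hyps = [" ".join(doc) for doc in sys_docs]
    refs = [" ".join(doc) for doc in ref_docs]
    return sacrebleu.corpus_bleu(hyps, [refs]).score
```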

6. Empirical Findings and Practical Lessons

  • Gains of context-aware architectures over sentence-level baselines are often small in BLEU (0.5–2.1 points) but are magnified on targeted evaluation sets and in human assessments, especially for English-centric directions or discourse-specific errors (Junczys-Dowmunt, 2019, Maruf et al., 2019).
  • Most context-induced improvements are not directly interpretable as manipulation of specific discourse phenomena; many arise through implicit regularization or reduced overfitting (Kim et al., 2019).
  • Minimal context (1–2 sentences, or just the content words/entities thereof) delivers almost the entire benefit at a fraction of the model and compute cost (Kim et al., 2019).
  • Document-level back-translation (DocBT) is a data-driven baseline that is difficult to outperform; context-aware DocMT approaches must exceed it on document consistency or discourse-targeted metrics to justify their added complexity (Ma et al., 2021).
  • Explicit discourse rewards in RL-style training, targeted memory mechanisms, and discourse-structured pipelines (e.g., GRAFT) push the state of the art on cohesion and coherence, at the cost of increased inference latency and complexity (Unanue et al., 2020, Feng et al., 2022, Dutta et al., 4 Jul 2025); a toy reward sketch follows this list.
  • LLMs, with appropriate prompting and memory mechanisms (CAP, multi-knowledge fusion, agentic workflows), now close much of the gap versus traditional architectures for DocMT, especially in fluency and consistency, though off-target translation and prompt sensitivity remain unresolved (Guo et al., 15 Jan 2025, Liu et al., 15 Mar 2025, Cui et al., 11 Jun 2024, Wu et al., 12 Jan 2024).
  • Automatic metrics may disadvantage whole-document strategies: BLEU, in particular, fails to reward document-level improvements (coherence, cohesive use of pronouns, term consistency) and should be supplemented by multitier or LLM-based human-aligned protocols (Sun et al., 28 Oct 2024).
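
A toy sketch of the discourse-reward idea mentioned above, mixing a document-level quality reward with lexical-cohesion and coherence terms; the reward functions and weights are illustrative assumptions, not the exact formulation of the cited work.

```python
# Toy sketch of an RL-style reward combining document-level quality with
# discourse rewards (lexical cohesion LC and coherence COH). All callables
# and weights are hypothetical placeholders.
from typing import Callable, List


def discourse_reward(
    hyp_doc: List[str],
    ref_doc: List[str],
    quality: Callable[[List[str], List[str]], float],
    lexical_cohesion: Callable[[List[str]], float],
    coherence: Callable[[List[str]], float],
    alpha: float = 0.1,
    beta: float = 0.1,
) -> float:
    return (
        quality(hyp_doc, ref_doc)
        + alpha * lexical_cohesion(hyp_doc)
        + beta * coherence(hyp_doc)
    )
```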

7. Open Challenges and Future Directions

  • Data scarcity: authentic, document-aligned parallel corpora are rare outside a handful of domains and languages; further work in unsupervised and weakly supervised DocMT is essential (Maruf et al., 2019).
  • Modeling: scalable global architectures that integrate efficient attention, hierarchical or graph-structured context, and LLMs for truly cross-sentence modeling without collapsing into local optima remain unsolved (Bao et al., 2021, Dutta et al., 4 Jul 2025).
  • Evaluation: BLEU and similar metrics underestimate discourse-level gains, while current document-focused metrics lack standardization. LLM-as-a-judge pipelines and more comprehensive targeted test suites are increasingly necessary (Sun et al., 28 Oct 2024).
  • Human parity: human-parity claims at the sentence level do not carry over to the document level; open evaluation domains (conversational MT, literary, low-resource) are still underexplored (Maruf et al., 2019).
  • Integration: Modular agentic frameworks (GRAFT), context-aware memory, in-context learning, and plug-and-play LM fusion for contextual reranking are promising mechanisms for future DocMT engines (Dutta et al., 4 Jul 2025, Petrick et al., 2023, Cui et al., 11 Jun 2024).

A plausible implication is that future DocMT will likely feature hybrid pipelines: parameter-efficiently fine-tuned LLMs orchestrated with specialized agents for segmentation, context extraction, and translation (as in GRAFT), with rich memory and graph-based context flow, and evaluated via multidimensional and human-aligned criteria. The path to robust, truly discourse-aware MT remains a central technical challenge in the machine translation research community.
