Document-Level Simplification
- Document-level simplification is the process of automatically transforming multi-sentence texts to improve readability while preserving meaning and discourse coherence.
- It leverages diverse corpora and advanced methodologies such as pretraining, hierarchical models, and LLM-guided pipelines to handle cross-sentence edits including reordering and anaphora resolution.
- Key challenges include balancing simplicity with content fidelity, ensuring logical flow, and addressing data scarcity and computational demands for robust evaluation.
Document-level simplification (DLS) is the automatic transformation of an entire multi-sentence text to improve readability while preserving essential content, discourse coherence, and overall document structure. Unlike sentence-level simplification, DLS explicitly requires handling cross-sentence operations, such as information reordering, addition, deletion, and the maintenance of discourse relations and anaphora across paragraphs and chapters (Sun et al., 2021, Qiang et al., 12 Feb 2025).
1. Definition and Scope of Document-Level Simplification
DLS involves producing a simplified document from a complex source document. The output must meet several criteria:
- High fluency and grammaticality at the paragraph and document level;
- Faithfulness to the source’s semantic content (preservation of facts and relations);
- Significantly increased readability (lexical and syntactic);
- Coherence across discourse units, preserving or enhancing logical flow (Qiang et al., 12 Feb 2025, Sun et al., 2021).
This task subsumes and extends sentence-level simplification: DLS supports sentence and paragraph splitting/merging, content reorganization, anaphora resolution, global compression (removal of entire sentences/paragraphs), cross-sentence paraphrasing, and discourse marker manipulation (Laban et al., 2023, Zhong et al., 2019).
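To make this scope concrete, the following is a small, hypothetical data model for recording document-level edit operations; the operation names loosely mirror the list above and do not follow the schema of any particular corpus.

```python
# Hypothetical representation of document-level edit operations; names mirror
# the operations listed above, not the annotation scheme of SWiPE or any other corpus.
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class EditOp(Enum):
    SPLIT = "sentence_split"
    MERGE = "sentence_merge"
    REORDER = "content_reorder"
    DELETE = "global_compression"            # remove whole sentences/paragraphs
    PARAPHRASE = "cross_sentence_paraphrase"
    ANAPHORA = "anaphora_resolution"
    DISCOURSE = "discourse_marker_change"

@dataclass
class DocumentEdit:
    op: EditOp
    source_span: Tuple[int, int]   # sentence index range in the complex document
    target_span: Tuple[int, int]   # sentence index range in the simplified document

# Example: sentences 3-5 of the source were compressed into sentence 2 of the output.
edit = DocumentEdit(EditOp.MERGE, source_span=(3, 5), target_span=(2, 2))
```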
2. Corpora and Resources for Document-Level Simplification
A major bottleneck in DLS research has been the scarcity and complexity of parallel corpora annotated at the document level. Key resources include:
| Corpus | Language | Size | Annotations | Key Features |
|---|---|---|---|---|
| D-Wikipedia | English | ~143,000 doc pairs | Complex–simple docs | Automatic & human eval |
| SWiPE | English | ~145,000 doc pairs | Fine-grained edit ops | 19 edit categories; NLI |
| Newsela | English | >1,500 articles | Multi-level rewrites | Grade-based, professional |
| DEPLAIN | German | ~600–1,000 docs | Manual sentence/doc alignment | News & web, CEFR levels |
| SAMER | Arabic | 15 novels × 3 levels | Multi-level, word readability levels | Readability-leveled lexicon |
- D-Wikipedia focuses on complex/simple Wikipedia lead sections with detailed human ratings (Sun et al., 2021).
- SWiPE aligns full revision histories from English and Simple English Wikipedia, annotating over 40,000 specific edits in 19 categories (lexical, syntactic, discourse, semantic, non-simplification) (Laban et al., 2023).
- Newsela provides professional multi-level manual simplifications of news articles with detailed readability grading.
- DEPLAIN targets German and provides professional news and web documents aligned at sentence and document granularity, supporting both manual and automatic alignment (Stodden et al., 2023).
- SAMER builds Arabic multi-level simplification data with rigorous word/document readability labels (Alhafni et al., 2024).
These corpora enabled the development of both supervised and unsupervised DLS models and the establishment of task-appropriate evaluation metrics.
3. System Architectures and Methodologies
Pretraining and Fine-Tuning Paradigms
Large pre-trained seq2seq models (BART, T5, mBART) dominate DLS. Key advances include:
- SimpleBART leverages continued pre-training with a simplicity-aware masking objective, using simple texts (SimpleWiki, Newsela) and ordinary texts with complex→simple span replacement. This induces strong document-scale simplification ability without architecture changes (Sun et al., 2023); a rough sketch of the masking idea follows this list.
- SimDoc introduces multi-objective training, jointly optimizing simplification, readability, and discourse coherence by integrating external coherence classifiers and per-document readability labels, thus explicitly targeting global document properties (Vásquez-Rodríguez et al., 2024).
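A minimal sketch of the simplicity-aware masking idea mentioned for SimpleBART, assuming a frequency-based notion of word complexity (`common_vocab` is a hypothetical set of frequent words); the actual pre-training recipe, including the complex→simple span replacement on ordinary texts, is more involved (Sun et al., 2023).

```python
# Sketch: preferentially mask "complex" tokens so that denoising pre-training on
# simple corpora biases the model toward infilling with accessible vocabulary.
# `common_vocab` is a hypothetical frequency lexicon, not part of SimpleBART itself.
import random
from typing import List, Set

def mask_for_simplicity(tokens: List[str], common_vocab: Set[str],
                        mask_token: str = "<mask>", rate: float = 0.3) -> List[str]:
    masked = []
    for tok in tokens:
        is_complex = tok.lower() not in common_vocab
        if is_complex and random.random() < rate:
            masked.append(mask_token)   # the target remains the original simple text
        else:
            masked.append(tok)
    return masked

# Usage: build (masked_input, original_text) pairs from SimpleWiki/Newsela articles
# and continue pre-training a seq2seq model such as BART on them.
```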
Planning, Progressive, and Multi-Stage Approaches
Recognizing limits in single-step generation, several approaches decompose DLS:
- Progressive, hierarchical systems (ProgDS) simulate human editing passes: discourse segmentation and subheading assignment → topic-level sentence/paragraph simplification → lexical-level rewriting. This decomposition matches human strategies for coherent, controlled simplification (Fang et al., 7 Jan 2025).
- LLM-guided pipelines use LLMs to first generate concise summaries (semantic scaffolds) and then produce a summary-aligned simplification, improving global coherence for long scientific documents (Marturi et al., 15 Aug 2025).
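A hedged sketch of this summarize-then-simplify pattern, assuming only a generic `call_llm(prompt) -> str` wrapper around whichever model is available; the prompts are illustrative and are not those used by Marturi et al. (15 Aug 2025).

```python
def llm_guided_simplify(document: str, call_llm) -> str:
    # Stage 1: a concise summary serves as a semantic scaffold.
    summary = call_llm(
        "Summarize the key points of the following document in plain language:\n\n"
        + document
    )
    # Stage 2: simplify the full document while staying aligned with the scaffold.
    return call_llm(
        "Rewrite the document below for a general audience. Keep every fact that "
        "appears in the summary, preserve the order of ideas, and use short "
        "sentences and common words.\n\nSummary:\n" + summary
        + "\n\nDocument:\n" + document
    )
```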
Context-Aware and Plan-Guided Models
- Context-aware architectures (e.g., ConBART, LEDpara) encode explicit inter-sentence and paragraph context during simplification. This is achieved either by processing larger text units (paragraphs) or introducing additional cross-attention over dynamic document windows, ameliorating coherence and anaphora issues that plague sentence-by-sentence approaches (Cripwell et al., 2023).
- Plan-guided pipelines use a first-stage planner that predicts a per-sentence operation (copy, rephrase, split, or delete), followed by a controlled simplification generator, further enhancing alignment and global consistency (Cripwell et al., 2023, Cripwell et al., 2024).
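A schematic version of the plan-then-generate loop, assuming hypothetical `predict_operation` (planner) and `simplify_with_control` (controlled generator) callables; the actual systems train a dedicated classifier and a control-token-conditioned seq2seq model (Cripwell et al., 2023).

```python
from typing import Callable, List

OPS = ("copy", "rephrase", "split", "delete")

def plan_guided_simplify(sentences: List[str],
                         predict_operation: Callable,
                         simplify_with_control: Callable) -> List[str]:
    output: List[str] = []
    for i, sent in enumerate(sentences):
        # The planner sees the sentence plus a small window of document context.
        context = sentences[max(0, i - 2): i + 3]
        op = predict_operation(sent, context)       # one of OPS
        if op == "delete":
            continue                                # drop peripheral content
        if op == "copy":
            output.append(sent)                     # keep verbatim
        else:
            # "rephrase" and "split" are handled by the controlled generator,
            # which may return one or several simplified sentences.
            output.extend(simplify_with_control(sent, op, context))
    return output
```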
Decoding and Selection
- Minimum Bayes Risk with Optimal Transport (MBR-OT) extends sentence-level utility functions to document level using Wasserstein distances between sets of generated sentences, better capturing structural reordering and non-aligned phenomena characteristic of DLS (Jinnai, 29 May 2025).
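A toy sketch of MBR decoding with a document-level utility. The assignment-based `doc_dist` below (padded one-to-one sentence matching with a crude token-overlap dissimilarity) is only a stand-in for the Wasserstein distance and learned sentence utilities used in MBR-OT (Jinnai, 29 May 2025).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from typing import List

def sent_dist(a: str, b: str) -> float:
    # Placeholder dissimilarity: 1 - token Jaccard overlap; swap in a learned metric.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def doc_dist(doc_a: List[str], doc_b: List[str]) -> float:
    # Pad to a square cost matrix (unmatched sentences pay maximal cost),
    # then solve a one-to-one assignment as a rough optimal-transport proxy.
    n = max(len(doc_a), len(doc_b))
    cost = np.ones((n, n))
    for i, sa in enumerate(doc_a):
        for j, sb in enumerate(doc_b):
            cost[i, j] = sent_dist(sa, sb)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

def mbr_select(candidates: List[List[str]], pseudo_refs: List[List[str]]) -> List[str]:
    # Pick the candidate whose expected distance to sampled pseudo-references is lowest.
    risks = [np.mean([doc_dist(c, r) for r in pseudo_refs]) for c in candidates]
    return candidates[int(np.argmin(risks))]
```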
4. Evaluation Metrics and Analysis
Traditional simplification metrics (SARI, BLEU) are adapted for documents, with document-level variants additionally penalizing over-deletion, verbosity, and loss of structure:
| Metric | Definition & Adaptation | Purpose |
|---|---|---|
| D-SARI | SARI with length and sentence-count penalties | Document-scale SARI (Sun et al., 2021) |
| FKGL/FRE | Document-wide readability formulas | Reading grade/ease (Vásquez-Rodríguez et al., 2024) |
| COH_out | External classifier on coherence | Discourse assessment |
| SLE_doc/ESLE_doc | Simplicity-level estimator | Match to target grade |
| BARTScore, BLEU | Document-wide faithfulness/fluency | Fluency/factuality |
Reference-based and reference-less metrics are both used; recent work stresses the need to separately analyze simplicity vs. faithfulness, given their tradeoff (more aggressive simplifiers may drop content) (Cripwell et al., 2024). D-SARI and COH_out were found to align better with human judgments of overall quality than basic SARI or BLEU.
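For reference, the document-wide readability scores in the table can be computed directly from the standard Flesch-Kincaid formulas; the sketch below uses a rough vowel-group syllable heuristic, so values may deviate slightly from tools such as textstat.

```python
import re
from typing import Tuple

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(document: str) -> Tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    words = re.findall(r"[A-Za-z']+", document)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)    # words per sentence
    spw = syllables / max(len(words), 1)         # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    fre = 206.835 - 1.015 * wps - 84.6 * spw     # Flesch Reading Ease
    return fkgl, fre
```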
5. Empirical Results and State of the Art
Recent strong DLS systems, as evaluated on D-Wikipedia, Newsela, or domain-specific scientific corpora, show marked improvements:
- On D-Wikipedia, SimpleBART achieves 41.64 D-SARI, outperforming fine-tuned BART and plan-guided baselines (+1.8 D-SARI over BART) (Sun et al., 2023).
- LLM pipelines, including GPT-4o and Llama3.1-70B, demonstrate that closed- and open-source LLMs have caught up with or outperformed prior non-LLM and smaller transformer models, achieving the highest simplicity and lowest FKGL scores (Qiang et al., 12 Feb 2025).
- Hierarchical/progressive systems like ProgDS, especially with CoT or iterative passes, lead on D-SARI/SARI by notable margins (e.g., up to ~46.89 SARI on Newsela-A) and nearly match human references in coherence and simplicity judgments (Fang et al., 7 Jan 2025).
- SimDoc’s multi-signal optimization yields up to 50.09 D-SARI and FKGL ~3.2 on Newsela, with explicit improvements in both readability and discourse coherence (Vásquez-Rodríguez et al., 2024). Sentence-level baselines lag by over 20 D-SARI points.
- For low-resource languages, DEPLAIN and SAMER introduced new doc-scale corpora, and baseline doc-level models outperform sentence-level ones in SARI and readability, showing better handling of document structure (Stodden et al., 2023, Alhafni et al., 2024).
A representative cross-section of model performance is summarized below:
| Model | D-SARI (↑) | FKGL (↓) | Coherence/COH (↑) | Dataset |
|---|---|---|---|---|
| SimpleBART | 41.64 | — | — | D-Wikipedia |
| GPT-4o | 41.96 | 5.46 | — | Newsela (200) |
| LEDpara (context) | 83.0* | — | — | Newsela-auto |
| T5-large (SimDoc, FT) | 50.09 | 3.24 | — | NewselaS |
| mBART (DEPLAIN, doc) | 44.56 | — | — | DEPLAIN-APA |
*Metric scales may differ; LEDpara reports SARI, others D-SARI; direct metric comparison must account for this.
6. Key Scientific and Technological Challenges
The main technical challenges that distinguish DLS from sentence simplification include:
- Document-scale content selection: accurate identification of redundant or peripheral content for deletion (Zhong et al., 2019, Fang et al., 7 Jan 2025).
- Discourse and coherence modeling: maintaining logical flow, causal or temporal relations, resolving anaphora, and ensuring sentence ordering is both comprehensible and faithful (Cripwell et al., 2023, Vásquez-Rodríguez et al., 2024).
- Evaluation: balancing aggressive simplification (readability, simplicity) with meaning preservation and factual correctness—trade-offs visible in both automatic and human assessments (Cripwell et al., 2024).
- Data scarcity: scalability of annotated corpora across genres, languages, and domains; alignment and annotation quality, especially for high-variance transformations (Laban et al., 2023, Stodden et al., 2023).
- Computational load: DLS models may require higher context windows, hierarchical or multi-stage architectures, and expensive decoding approaches such as MBR-OT (Jinnai, 29 May 2025).
7. Open Problems and Future Directions
Several avenues for future work are identified:
- Multi-level controllability: generating simplification at different granularity/readability levels, possibly tailored to individual user profiles (Qiang et al., 12 Feb 2025, Vásquez-Rodríguez et al., 2024, Alhafni et al., 2024).
- Dynamic evaluation and iterative prompting: integrating automatic metrics like SARI and COH_out inline for iterative model refinement (Marturi et al., 15 Aug 2025, Vásquez-Rodríguez et al., 2024).
- Broader and cross-lingual resources: extending corpora, models, and evaluation to non-English and low-resource settings; leveraging web-scale harvesting, as for DEPLAIN (Stodden et al., 2023, Alhafni et al., 2024).
- Finer-grained coherence and discourse objectives: moving towards differentiable, learning-based discourse reward or penalty signals (Vásquez-Rodríguez et al., 2024).
- Hybrid and modular architectures: combining retrieval, planning, and progressive generation, or fusing oracle edit plans into LLM pipelines (Laban et al., 2023, Fang et al., 7 Jan 2025).
- Robustness and factuality: addressing error cases—hallucination, over/under-compression, loss of document tone—via external knowledge or plan-guided generation.
DLS now benefits from the convergence of data resources, transformer pretraining, and LLM scaling. Contemporary document-level models set a strong baseline, but their results underscore the need for explicit, interpretable targeting of both readability and coherence, as well as principled evaluation frameworks that respect the complexity of long-form language transformations.