Document-Level Simplification
- Document-level simplification is the process of automatically transforming multi-sentence texts to improve readability while preserving meaning and discourse coherence.
- It leverages diverse corpora and advanced methodologies such as pretraining, hierarchical models, and LLM-guided pipelines to handle cross-sentence edits including reordering and anaphora resolution.
- Key challenges include balancing simplicity with content fidelity, ensuring logical flow, and addressing data scarcity and computational demands for robust evaluation.
Document-level simplification (DLS) is the automatic transformation of an entire multi-sentence text to improve readability while preserving essential content, discourse coherence, and overall document structure. Unlike sentence-level simplification, DLS explicitly requires handling cross-sentence operations, such as information reordering, addition, deletion, and the maintenance of discourse relations and anaphora across paragraphs and chapters (Sun et al., 2021, Qiang et al., 12 Feb 2025).
1. Definition and Scope of Document-Level Simplification
DLS involves producing a simplified document from a complex source document. The output must meet several criteria:
- High fluency and grammaticality at the paragraph and document level;
- Faithfulness to the source’s semantic content (preservation of facts and relations);
- Significantly increased readability (lexical and syntactic);
- Coherence across discourse units, preserving or enhancing logical flow (Qiang et al., 12 Feb 2025, Sun et al., 2021).
This task subsumes and extends sentence-level simplification: DLS supports sentence and paragraph splitting/merging, content reorganization, anaphora resolution, global compression (removal of entire sentences/paragraphs), cross-sentence paraphrasing, and discourse marker manipulation (Laban et al., 2023, Zhong et al., 2019).
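To make this scope concrete, the following is a small, hypothetical data model for recording document-level edit operations; the operation names loosely mirror the list above and do not follow the schema of any particular corpus.

```python
# Hypothetical representation of document-level edit operations; names mirror
# the operations listed above, not the annotation scheme of SWiPE or any other corpus.
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class EditOp(Enum):
    SPLIT = "sentence_split"
    MERGE = "sentence_merge"
    REORDER = "content_reorder"
    DELETE = "global_compression"            # remove whole sentences/paragraphs
    PARAPHRASE = "cross_sentence_paraphrase"
    ANAPHORA = "anaphora_resolution"
    DISCOURSE = "discourse_marker_change"

@dataclass
class DocumentEdit:
    op: EditOp
    source_span: Tuple[int, int]   # sentence index range in the complex document
    target_span: Tuple[int, int]   # sentence index range in the simplified document

# Example: sentences 3-5 of the source were compressed into sentence 2 of the output.
edit = DocumentEdit(EditOp.MERGE, source_span=(3, 5), target_span=(2, 2))
```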
2. Corpora and Resources for Document-Level Simplification
A major bottleneck in DLS research has been the scarcity and complexity of parallel corpora annotated at the document level. Key resources include:
| Corpus | Language | Size | Annotations | Key Features |
|---|---|---|---|---|
| D-Wikipedia | English | ~143,000 doc pairs | Complex–simple docs | Automatic & human eval |
| SWiPE | English | ~145,000 doc pairs | Fine-grained edit ops | 19 edit categories; NLI |
| Newsela | English | >1,500 articles | Multi-level rewrites | Grade-based, professional |
| DEPLAIN | German | ~600–1,000 docs | Manual sentence/doc alignment | News & web, CEFR levels |
| SAMER | Arabic | 15 novels × 3 levels | Multi-level, word readability levels | Readability-leveled lexicon |
- D-Wikipedia focuses on complex/simple Wikipedia lead sections with detailed human ratings (Sun et al., 2021).
- SWiPE aligns full revision histories from English and Simple English Wikipedia, annotating over 40,000 specific edits in 19 categories (lexical, syntactic, discourse, semantic, non-simplification) (Laban et al., 2023).
- Newsela provides professional multi-level manual simplifications of news articles with detailed readability grading.
- DEPLAIN targets German and provides professional news and web documents aligned at sentence and document granularity, supporting both manual and automatic alignment (Stodden et al., 2023).
- SAMER builds Arabic multi-level simplification data with rigorous word/document readability labels (Alhafni et al., 2024).
These corpora enabled the development of both supervised and unsupervised DLS models and the establishment of task-appropriate evaluation metrics.
3. System Architectures and Methodologies
Pretraining and Fine-Tuning Paradigms
Large pre-trained seq2seq models (BART, T5, mBART) dominate DLS. Key advances include:
- SimpleBART leverages continued pre-training with a simplicity-aware masking objective, using simple texts (SimpleWiki, Newsela) and ordinary texts with complex→simple span replacement. This induces strong document-scale simplification ability without architecture changes (Sun et al., 2023); a rough sketch of the masking idea follows this list.
- SimDoc introduces multi-objective training, jointly optimizing simplification, readability, and discourse coherence by integrating external coherence classifiers and per-document readability labels, thus explicitly targeting global document properties (Vásquez-Rodríguez et al., 2024).
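A minimal sketch of the simplicity-aware masking idea mentioned for SimpleBART, assuming a frequency-based notion of word complexity (`common_vocab` is a hypothetical set of frequent words); the actual pre-training recipe, including the complex→simple span replacement on ordinary texts, is more involved (Sun et al., 2023).

```python
# Sketch: preferentially mask "complex" tokens so that denoising pre-training on
# simple corpora biases the model toward infilling with accessible vocabulary.
# `common_vocab` is a hypothetical frequency lexicon, not part of SimpleBART itself.
import random
from typing import List, Set

def mask_for_simplicity(tokens: List[str], common_vocab: Set[str],
                        mask_token: str = "<mask>", rate: float = 0.3) -> List[str]:
    masked = []
    for tok in tokens:
        is_complex = tok.lower() not in common_vocab
        if is_complex and random.random() < rate:
            masked.append(mask_token)   # the target remains the original simple text
        else:
            masked.append(tok)
    return masked

# Usage: build (masked_input, original_text) pairs from SimpleWiki/Newsela articles
# and continue pre-training a seq2seq model such as BART on them.
```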
Planning, Progressive, and Multi-Stage Approaches
Recognizing limits in single-step generation, several approaches decompose DLS:
- Progressive, hierarchical systems (ProgDS) simulate human editing passes: discourse segmentation and subheading assignment → topic-level sentence/paragraph simplification → lexical-level rewriting. This decomposition matches human strategies for coherent, controlled simplification (Fang et al., 7 Jan 2025).
- LLM-guided pipelines use LLMs to first generate concise summaries (semantic scaffolds) and then produce a summary-aligned simplification, improving global coherence for long scientific documents (Marturi et al., 15 Aug 2025).
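A hedged sketch of this summarize-then-simplify pattern, assuming only a generic `call_llm(prompt) -> str` wrapper around whichever model is available; the prompts are illustrative and are not those used by Marturi et al. (15 Aug 2025).

```python
def llm_guided_simplify(document: str, call_llm) -> str:
    # Stage 1: a concise summary serves as a semantic scaffold.
    summary = call_llm(
        "Summarize the key points of the following document in plain language:\n\n"
        + document
    )
    # Stage 2: simplify the full document while staying aligned with the scaffold.
    return call_llm(
        "Rewrite the document below for a general audience. Keep every fact that "
        "appears in the summary, preserve the order of ideas, and use short "
        "sentences and common words.\n\nSummary:\n" + summary
        + "\n\nDocument:\n" + document
    )
```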
Context-Aware and Plan-Guided Models
- Context-aware architectures (e.g., ConBART, LEDpara) encode explicit inter-sentence and paragraph context during simplification. This is achieved either by processing larger text units (paragraphs) or introducing additional cross-attention over dynamic document windows, ameliorating coherence and anaphora issues that plague sentence-by-sentence approaches (Cripwell et al., 2023).
- Plan-guided pipelines use a first-stage planner that predicts a per-sentence operation (copy, rephrase, split, or delete), followed by a controlled simplification generator, further enhancing alignment and global consistency (Cripwell et al., 2023, Cripwell et al., 2024).
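A schematic version of the plan-then-generate loop, assuming hypothetical `predict_operation` (planner) and `simplify_with_control` (controlled generator) callables; the actual systems train a dedicated classifier and a control-token-conditioned seq2seq model (Cripwell et al., 2023).

```python
from typing import Callable, List

OPS = ("copy", "rephrase", "split", "delete")

def plan_guided_simplify(sentences: List[str],
                         predict_operation: Callable,
                         simplify_with_control: Callable) -> List[str]:
    output: List[str] = []
    for i, sent in enumerate(sentences):
        # The planner sees the sentence plus a small window of document context.
        context = sentences[max(0, i - 2): i + 3]
        op = predict_operation(sent, context)       # one of OPS
        if op == "delete":
            continue                                # drop peripheral content
        if op == "copy":
            output.append(sent)                     # keep verbatim
        else:
            # "rephrase" and "split" are handled by the controlled generator,
            # which may return one or several simplified sentences.
            output.extend(simplify_with_control(sent, op, context))
    return output
```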
Decoding and Selection
- Minimum Bayes Risk with Optimal Transport (MBR-OT) extends sentence-level utility functions to document level using Wasserstein distances between sets of generated sentences, better capturing structural reordering and non-aligned phenomena characteristic of DLS (Jinnai, 29 May 2025).
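A toy sketch of MBR decoding with a document-level utility. The assignment-based `doc_dist` below (padded one-to-one sentence matching with a crude token-overlap dissimilarity) is only a stand-in for the Wasserstein distance and learned sentence utilities used in MBR-OT (Jinnai, 29 May 2025).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from typing import List

def sent_dist(a: str, b: str) -> float:
    # Placeholder dissimilarity: 1 - token Jaccard overlap; swap in a learned metric.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def doc_dist(doc_a: List[str], doc_b: List[str]) -> float:
    # Pad to a square cost matrix (unmatched sentences pay maximal cost),
    # then solve a one-to-one assignment as a rough optimal-transport proxy.
    n = max(len(doc_a), len(doc_b))
    cost = np.ones((n, n))
    for i, sa in enumerate(doc_a):
        for j, sb in enumerate(doc_b):
            cost[i, j] = sent_dist(sa, sb)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

def mbr_select(candidates: List[List[str]], pseudo_refs: List[List[str]]) -> List[str]:
    # Pick the candidate whose expected distance to sampled pseudo-references is lowest.
    risks = [np.mean([doc_dist(c, r) for r in pseudo_refs]) for c in candidates]
    return candidates[int(np.argmin(risks))]
```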
4. Evaluation Metrics and Analysis
Traditional simplification metrics (SARI, BLEU) are adapted for documents, with document-level variants additionally penalizing over-deletion, verbosity, and loss of structure:
| Metric | Definition & Adaptation | Purpose |
|---|---|---|
| D-SARI | SARI with length and sentence-count penalties | Document-scale SARI (Sun et al., 2021) |
| FKGL/FRE | Document-wide readability formulas | Reading grade/ease (Vásquez-Rodríguez et al., 2024) |
| COH_out | External classifier on coherence | Discourse assessment |
| SLE_doc/ESLE_doc | Simplicity-level estimator | Match to target grade |
| BARTScore, BLEU | Document-wide faithfulness/fluency | Fluency/factuality |
Reference-based and reference-less metrics are both used; recent work stresses the need to separately analyze simplicity vs. faithfulness, given their tradeoff (more aggressive simplifiers may drop content) (Cripwell et al., 2024). D-SARI and COH_out were found to align better with human judgments of overall quality than basic SARI or BLEU.
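For reference, the document-wide readability scores in the table can be computed directly from the standard Flesch-Kincaid formulas; the sketch below uses a rough vowel-group syllable heuristic, so values may deviate slightly from tools such as textstat.

```python
import re
from typing import Tuple

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(document: str) -> Tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    words = re.findall(r"[A-Za-z']+", document)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)    # words per sentence
    spw = syllables / max(len(words), 1)         # syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    fre = 206.835 - 1.015 * wps - 84.6 * spw     # Flesch Reading Ease
    return fkgl, fre
```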
5. Empirical Results and State of the Art
Recent strong DLS systems, as evaluated on D-Wikipedia, Newsela, or domain-specific scientific corpora, show marked improvements:
- On D-Wikipedia, SimpleBART achieves 41.64 D-SARI, outperforming fine-tuned BART and plan-guided baselines (+1.8 D-SARI over BART) (Sun et al., 2023).
- LLM pipelines, including GPT-4o and Llama3.1-70B, demonstrate that closed- and open-source LLMs have caught up with or outperformed prior non-LLM and smaller transformer models, achieving the highest simplicity and lowest FKGL scores (Qiang et al., 12 Feb 2025).
- Hierarchical/progressive systems like ProgDS, especially with CoT or iterative passes, lead on D-SARI/SARI by notable margins (e.g., up to ~46.89 SARI on Newsela-A) and nearly match human references in coherence and simplicity judgments (Fang et al., 7 Jan 2025).
- SimDoc’s multi-signal optimization yields up to 50.09 D-SARI and FKGL ~3.2 on Newsela, with explicit improvements in both readability and discourse coherence (Vásquez-Rodríguez et al., 2024). Sentence-level baselines lag by over 20 D-SARI points.
- For low-resource languages, DEPLAIN and SAMER introduced new doc-scale corpora, and baseline doc-level models outperform sentence-level ones in SARI and readability, showing better handling of document structure (Stodden et al., 2023, Alhafni et al., 2024).
A representative cross-section of model performance is summarized below:
| Model | D-SARI (↑) | FKGL (↓) | Coherence/COH (↑) | Dataset |
|---|---|---|---|---|
| SimpleBART | 41.64 | — | — | D-Wikipedia |
| GPT-4o | 41.96 | 5.46 | — | Newsela (200) |
| LEDpara (context) | 83.0* | — | — | Newsela-auto |
| T5-large (SimDoc, FT) | 50.09 | 3.24 | — | NewselaS |
| mBART (DEPLAIN, doc) | 44.56 | — | — | DEPLAIN-APA |
*Metric scales may differ; LEDpara reports SARI, others D-SARI; direct metric comparison must account for this.
6. Key Scientific and Technological Challenges
The main technical challenges that distinguish DLS from sentence simplification include:
- Document-scale content selection: accurate identification of redundant or peripheral content for deletion (Zhong et al., 2019, Fang et al., 7 Jan 2025).
- Discourse and coherence modeling: maintaining logical flow, causal or temporal relations, resolving anaphora, and ensuring sentence ordering is both comprehensible and faithful (Cripwell et al., 2023, Vásquez-Rodríguez et al., 2024).
- Evaluation: balancing aggressive simplification (readability, simplicity) with meaning preservation and factual correctness—trade-offs visible in both automatic and human assessments (Cripwell et al., 2024).
- Data scarcity: scalability of annotated corpora across genres, languages, and domains; alignment and annotation quality, especially for high-variance transformations (Laban et al., 2023, Stodden et al., 2023).
- Computational load: DLS models may require higher context windows, hierarchical or multi-stage architectures, and expensive decoding approaches such as MBR-OT (Jinnai, 29 May 2025).
7. Open Problems and Future Directions
Several avenues for future work are identified:
- Multi-level controllability: generating simplification at different granularity/readability levels, possibly tailored to individual user profiles (Qiang et al., 12 Feb 2025, Vásquez-Rodríguez et al., 2024, Alhafni et al., 2024).
- Dynamic evaluation and iterative prompting: integrating automatic metrics like SARI and COH_out inline for iterative model refinement (Marturi et al., 15 Aug 2025, Vásquez-Rodríguez et al., 2024).
- Broader and cross-lingual resources: extending corpora, models, and evaluation to non-English and low-resource settings; leveraging web-scale harvesting, as for DEPLAIN (Stodden et al., 2023, Alhafni et al., 2024).
- Finer-grained coherence and discourse objectives: moving towards differentiable, learning-based discourse reward or penalty signals (Vásquez-Rodríguez et al., 2024).
- Hybrid and modular architectures: combining retrieval, planning, and progressive generation, or fusing oracle edit plans into LLM pipelines (Laban et al., 2023, Fang et al., 7 Jan 2025).
- Robustness and factuality: addressing error cases—hallucination, over/under-compression, loss of document tone—via external knowledge or plan-guided generation.
DLS now benefits from the convergence of data resources, transformer pretraining, and LLM scaling. Contemporary document-level models set a strong baseline, but their results underscore the need for explicit, interpretable targeting of both readability and coherence, as well as principled evaluation frameworks that respect the complexity of long-form language transformations.