EDU-based Context Compressor
- EDU-based Context Compressor is a framework that decomposes documents into Elementary Discourse Units (EDUs) to preserve fine-grained semantic structure.
- It employs a 'structure-then-select' paradigm by constructing hierarchical EDU trees and greedily selecting relevant subtrees under strict token budgets.
- Empirical results demonstrate substantial improvements in LLM performance and efficiency, outperforming flat token-pruning methods on long-context tasks.
The EDU-based Context Compressor is an explicit context compression framework designed for long-document tasks in natural language processing, targeting the efficient and faithful selection of document spans most relevant to a user query, subject to strict input constraints for downstream LLMs. The approach is grounded in the decomposition of text into Elementary Discourse Units (EDUs), forming a hierarchical, source-traceable representation that preserves both document structure and fine-grained semantics. Compression is achieved via a “structure-then-select” paradigm: documents are parsed into a relational EDU tree, then relevant subtrees are greedily selected and linearized to form the final compressed context under a hard token budget. This methodology offers substantial improvements in LLM performance across diverse long-context tasks, surpassing alternatives that rely on flat token pruning or latent vector encodings (Zhou et al., 16 Dec 2025).
1. Problem Motivation and Formalization
Scaling LLMs to lengthy inputs is impeded by inference costs, input length limits, and noise introduction. Given a document or corpus $D$, a query $q$, and a context budget $B$ (in tokens), the objective is to generate an extractive summary of $D$ supporting $q$-relevance while remaining within $B$. The compression is modeled as an optimization:

$$S^{*} = \arg\max_{S \subseteq \mathcal{E}} \mathrm{Rel}(S, q) \quad \text{s.t.} \quad \sum_{e \in S} \mathrm{len}(e) \le B,$$

where $\mathcal{E}$ is the complete set of EDUs extracted from $D$, $\mathrm{Rel}(S, q)$ quantifies how well selection $S$ supports answering $q$, and $\sum_{e \in S} \mathrm{len}(e)$ aggregates the token counts of all selected EDUs. Unlike flat context selection mechanisms, this framework imposes explicit structural modeling and provenance at the unit of discourse.
2. LingoEDU: Explicit EDU Tree Construction
The core of the method is LingoEDU, a supervised, explicit decomposition pipeline producing a hierarchical EDU tree $T$; a concrete sketch of this representation follows the list. The stages are:
- Segmentation: The document is split into atomic EDUs $e_1, \dots, e_n$, each characterized by its text, character span, and unique index, respectively.
- Tree Induction: A supervised model learns a mapping from the EDU sequence to $T$, yielding nodes $v = (h, d, [i, j])$, with $h$ as an abstractive heading, $d$ as hierarchical depth, and $[i, j]$ as the exact EDU interval covered.
- Discourse Relations: Edges in $T$ denote relations (e.g., elaboration, contrast), safeguarding both macro-structural integrity and micro-level dependencies.
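A minimal sketch of how such a tree might be represented, assuming plain Python dataclasses; the field names mirror the definitions above rather than the paper's Augmented Markdown schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EDU:
    text: str                       # atomic discourse unit
    char_span: Tuple[int, int]      # character offsets in the source document
    index: int                      # unique EDU index

@dataclass
class TreeNode:
    heading: str                    # abstractive heading h
    depth: int                      # hierarchical depth d
    edu_interval: Tuple[int, int]   # [i, j]: interval of EDU indices covered
    relation: str = "elaboration"   # discourse relation on the incoming edge
    children: List["TreeNode"] = field(default_factory=list)
```

Because every node carries an explicit EDU interval, any selected subtree can be traced back to exact character spans in the source document.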
To enable large-scale supervision despite the lack of gold-standard trees, a Solver–Critic loop synthesizes labeled training instances via LLM-aided structural extraction and segmentation, followed by consistency-checked abstraction. The model is then fine-tuned, first on 100,000 machine-synthesized examples and then on thousands of manually annotated instances, using an Augmented Markdown schema to ensure both traceability and interpretability.
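The paper's loop is not reproduced here; the following is a hedged sketch of the Solver–Critic pattern, with `solver_llm` and `critic_llm` as hypothetical callables standing in for the LLM-aided extraction and consistency check:

```python
def synthesize_training_pair(document, solver_llm, critic_llm, max_rounds=3):
    """Return a consistency-checked (document, tree) pair, or None if rejected."""
    for _ in range(max_rounds):
        tree = solver_llm(document)           # propose segmentation + hierarchy
        verdict = critic_llm(document, tree)  # check spans/abstracts for consistency
        if verdict["consistent"]:
            return document, tree             # accept as a silver-standard label
    return None                               # discard instances that never pass
```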
3. Relevance Ranking and Budgeted Selection
Given the document tree, each node (or its corresponding subtree) is a candidate unit for retrieval. The framework employs a lightweight neural relevance scorer $r(q, v)$ over the candidate text

$$x_v = h \oplus s,$$

where $h$ is the node’s abstract title, $s$ is a representative snippet from the underlying span, and $\oplus$ denotes concatenation. The query and candidate are embedded into vectors, and $r(q, v)$ is computed via a multilayer perceptron that utilizes elementwise and concatenated representations. The scoring network is optimized with a pairwise hinge loss to enforce large margins between more and less relevant nodes.
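A sketch of such a scorer in PyTorch, assuming the query and candidate text $h \oplus s$ have already been embedded into fixed-size vectors by some encoder (the exact encoder, feature set, and dimensions are assumptions, not the paper's specification):

```python
import torch
import torch.nn as nn

class RelevanceScorer(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        # input features: [q ; c ; q ⊙ c] — concatenated plus elementwise product
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, q: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([q, c, q * c], dim=-1)
        return self.mlp(feats).squeeze(-1)      # one scalar score per (q, v) pair

def pairwise_hinge_loss(pos: torch.Tensor, neg: torch.Tensor, margin: float = 1.0):
    # enforce r(q, v_more) >= r(q, v_less) + margin for each training pair
    return torch.clamp(margin - (pos - neg), min=0).mean()
```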
Selection under budget is performed greedily: nodes are ranked by $r(q, v)$ and added sequentially until the token count would exceed $B$, with the interval annotations $[i, j]$ guaranteeing retrieval of full contiguous spans. The selected spans are then re-ordered by document position to restore discourse flow before concatenation, ensuring local and global coherence.
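A sketch of this selection step, under the assumption that each node carries its relevance score, token count, text, and document position (illustrative field names, not the paper's):

```python
def greedy_select(nodes, budget: int) -> str:
    """Greedy budgeted selection followed by discourse-order linearization."""
    selected, used = [], 0
    for node in sorted(nodes, key=lambda n: n["score"], reverse=True):
        if used + node["tokens"] > budget:
            break                               # next addition would exceed B
        selected.append(node)
        used += node["tokens"]
    selected.sort(key=lambda n: n["position"])  # restore original discourse flow
    return "\n".join(n["text"] for n in selected)
```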
4. Structural Evaluation: The StructBench Benchmark
To assess structural fidelity, the StructBench dataset comprises 248 manually annotated documents spanning various domains, languages, and lengths (up to 50,000 words). Evaluation metrics include Tree-Edit Distance (TED, lower is better, reflecting micro-level structural mismatches) and Document-Level Accuracy (DLA, the fraction of documents with perfect backbone recovery).
| Method | TED (↓) | DLA (↑) |
|---|---|---|
| GPT-4.0 | 8.53 | 36.29% |
| Claude-4 | 7.98 | 41.53% |
| LingoEDU (ours) | 5.67 | 46.77% |
The LingoEDU approach outperforms large foundation models while maintaining low inference cost and latency ($0.0007$ USD/doc, ≈1.2 s/doc). Ablations reveal that using only interval indices (excluding abstracts) markedly degrades DLA (33.47%), and performance remains robust even with sharply reduced training supervision. This suggests that abstracted textual information and explicit structure are critical for accurate discourse modeling (Zhou et al., 16 Dec 2025).
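For intuition, TED counts the node insertions, deletions, and relabelings needed to transform a predicted tree into the reference tree. A toy illustration using the open-source `zss` package (Zhang–Shasha algorithm), chosen here purely for exposition; the benchmark's actual TED configuration is not specified:

```python
from zss import Node, simple_distance   # pip install zss

# predicted backbone misses one subsection present in the gold tree
predicted = Node("doc").addkid(Node("intro")).addkid(Node("methods"))
gold = (Node("doc")
        .addkid(Node("intro"))
        .addkid(Node("methods").addkid(Node("ablation"))))

print(simple_distance(predicted, gold))  # 1 — a single insertion is required
```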
5. Downstream Efficacy in LLM-Driven Tasks
Integration of EDU-based compression yields pronounced gains in several long-context and reasoning benchmarks:
- LongBench: On GPT-4.1, HotpotQA accuracy rises from 65.83 to 70.11 (+6.50%), and on Gemini-2.5-Pro from 35.20 to 40.46 (+14.94%). Summarization and few-shot learning tasks display consistent relative improvements.
- Deep Search and Noisy Web QA: Using DeepSeek-R1 on HLE, accuracy improves from 9.0 to 13.6 (+51.11%), and for Qwen3-235B on BrowseComp-ZH from 8.5 to 12.8 (+50.59%). Structured compression enhances model robustness against layout noise and enables strong performance even for closed-source models such as GPT-5 and Gemini-3-Pro (gains in the range of 8–15%).
These results demonstrate that compressing via explicit, structure-aware EDUs preserves or enhances downstream information utility relative to raw LLM performance, particularly under hard context limitations.
6. Trade-offs, Limitations, and Analytical Insights
The EDU-based Context Compressor balances several desiderata:
- Structural Fidelity: Anchoring selections to source-indexed EDUs and modeling discourse relations mitigate common errors such as hallucinated sections, hierarchy collapse, and omission of detail seen in baseline LLM parsers.
- Coherence Preservation: Span boundaries coincide with natural discourse divides, preserving local semantic continuity and avoiding fragmentary or incoherent extractions typical of sentence-level pruning.
- Efficiency: Empirically, context length is reduced by 50–80%, with better downstream performance and lower computational cost than generic summarization or latent compression pipelines. The approach does not require proprietary APIs or latent embeddings, ensuring compatibility and transparency.
A key ablation shows that omission of textual abstracts undermines semantic anchoring and sharply reduces accuracy, indicating that both structure and interpreted content are necessary for high-fidelity compression.
Structural error analysis of alternative methods reveals systematic deficiencies—e.g., hallucination, hierarchy collapse—where LingoEDU’s supervised tree-construction maintains source traceability and semantic integrity. This suggests that explicit, hierarchy-aware context compression is critical for scaling LLM reasoning to complex, long-form inputs.
7. Broader Implications and Position Among Compression Frameworks
The EDU-based approach distinguishes itself from both flat token-pruning and latent vector encoding by providing an explicit, loss-aware, and semantically interpretable intermediate structure, a property crucial for academic, legal, and scientific document processing where both traceability and context-respecting summarization are paramount. It is fully compatible with closed-source APIs, and its methodology avoids hallucination-prone generation and implicit encoding, offering operational transparency (Zhou et al., 16 Dec 2025).
A plausible implication is that such explicit structural frameworks could serve as a foundation for future scalable context management architectures, task-agnostic long document understanding, and broad-coverage summarization pipelines, particularly where regulatory or interpretability constraints are binding.