Hierarchical Document Summarization

Updated 13 April 2026

Hierarchical document summarization is a method that models internal and cross-document structure to produce coherent summaries for long or multi-source texts.
Techniques include recursive merging, neural layered encoders, and graph-based models that improve extractive and abstractive performance by leveraging text structure.
Applications span news, scientific, legal, and governmental documents where these methods boost ROUGE scores and enhance factual precision in summaries.

Hierarchical document summarization refers to a class of methods that explicitly model the internal structure, the cross-document structure, or both, of long or multi-document inputs to produce focused, coherent summaries. Hierarchy may be imposed through recursive merging pipelines, multi-level neural architectures, or explicit discourse and concept representations, and has been reported to improve both extractive and abstractive summarization of long or multi-source content in domains including news, scientific, biomedical, legal, and government documents.

1. Hierarchical Summarization Paradigms

Hierarchical summarization strategies can be categorized by where and how the hierarchy is introduced:

Hierarchical Combination of Extracts: Early frameworks extend efficient single-document extractive summarizers to multi-document settings by recursively summarizing at multiple levels. For example, a base system (such as KP-Centrality) produces fixed-length summaries for each document in a cluster, which are then merged in one of two ways: by concatenating the intermediate summaries and re-summarizing (single-layer hierarchy), or by stepwise merging in a waterfall (cascade), where the merged summary from one pair is recursively merged with each additional document (Marujo et al., 2015).
Encoder-Decoder Models with Explicit Document Structure: Neural architectures capture document-internal hierarchy via nested encoders (e.g., words-to-sentences, sentences-to-sections) and/or by embedding section boundaries and category information. Hierarchical inductive biases can also be injected into Transformer attention through learned biases informed by section trees or cross-block organization (Cao et al., 2022, Shen et al., 2023).
Graph-based Hierarchical Modeling: Recent models employ hierarchical graphs (e.g., sentences and sections as nodes, with different edge types for intra-section and cross-section links), sometimes augmented with contrastive learning or latent directed trees to represent latent discourse and support information propagation and global–local context fusion (Zhang et al., 2023, Zhao et al., 2024, Qiu et al., 2022).
Hierarchical Concept and Topic Maps: Some approaches construct hierarchical concept DAGs or topic trees, with edges reflecting details/subtopics, using clustering or open information extraction, enabling summarization as a selection/pruning problem over such maps, potentially guided by user preferences in personalized settings (Ghodratnama et al., 2023).
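
The two merging schemes described under "Hierarchical Combination of Extracts" can be sketched as follows. This is a minimal illustration, not the pipeline of Marujo et al. (2015): the `summarize` function is a hypothetical frequency-based stand-in for the base summarizer (it is not KP-Centrality), and the function names `single_layer` and `waterfall` are chosen here for readability.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=2):
    """Toy base summarizer (assumption, NOT KP-Centrality): keep the sentences
    whose words are most frequent in the input, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    return " ".join(sentences[i] for i in sorted(ranked[:max_sentences]))


def single_layer(docs, max_sentences=2):
    """Summarize each document, concatenate the results, summarize once more."""
    intermediate = [summarize(d, max_sentences) for d in docs]
    return summarize(" ".join(intermediate), max_sentences)


def waterfall(docs, max_sentences=2):
    """Fold documents in one at a time: join the running summary with the
    next document's summary, then re-summarize the pair."""
    merged = ""
    for doc in docs:
        pair = (merged + " " + summarize(doc, max_sentences)).strip()
        merged = summarize(pair, max_sentences)
    return merged
```

Both functions return a purely extractive summary of at most `max_sentences` sentences. In the waterfall variant the order of `docs` matters, because documents merged later are compared against an already condensed summary.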

2. Principal Algorithms and Formalisms

Classical Extractive Hierarchies

The KP-Centrality framework defines a support-set for each passage (sentence or keyphrase) as all other passages/phrases above an adaptive similarity threshold. Sentence importance is quantified by the number of support-sets in which it appears; keyphrases are extracted via supervised methods. In the cluster summary scenario, intermediate summaries form new condensed inputs to the same pipeline, enforcing hierarchical information distillation (Marujo et al., 2015).
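
The support-set idea can be illustrated with a short sketch. Two choices below are assumptions made for illustration and are not taken from Marujo et al. (2015): similarity is bag-of-words cosine, and the "adaptive" threshold for each passage is the mean of its similarities to all other passages. Keyphrase extraction is omitted.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector as a Counter of lowercase tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def support_sets(passages):
    """For each passage, the indices of other passages whose similarity
    exceeds that passage's own threshold (assumed: its mean similarity)."""
    vecs = [bow(p) for p in passages]
    result = []
    for i, vi in enumerate(vecs):
        sims = {j: cosine(vi, vj) for j, vj in enumerate(vecs) if j != i}
        threshold = sum(sims.values()) / len(sims) if sims else 0.0
        result.append({j for j, s in sims.items() if s > threshold})
    return result


def centrality(passages):
    """Importance of a passage = number of support sets it appears in."""
    sets = support_sets(passages)
    return [sum(1 for s in sets if i in s) for i in range(len(passages))]


scores = centrality([
    "flood hits city",
    "city flood rescue",
    "rescue teams flood city",
    "stock market rallies",
])
print(scores)  # the off-topic last passage appears in no support set
```

In a hierarchical setting, the top-ranked passages of each document would become the input passages of the next level.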

Neural Hierarchical Architectures

Layered Recurrent/Attention Models: Bidirectional RNN or Transformer encoders at both sentence and document levels, often with self-attention, generate hierarchical representations. For instance, HSSAS uses word-level BiLSTM + attention, followed by a sentence-level BiLSTM + attention, and then a summary membership classifier combined with content, salience, novelty, and position features (Al-Sabahi et al., 2018). Analogous templates emerge in GRU- or LSTM-based extractors and joint scoring–selection architectures (Zhou et al., 2018).
Transformer-Based Hierarchies: Hierarchical transformers combine intra-block (local) and inter-block (global) self-attention; for instance, “start-of-document” (SOD) tokens enable inter-document attention routing with BART backbones. Decoder cross-attention may be scaled by document importance inferred from SOD activations, enforcing cross-document salience in the generated summary (Shen et al., 2023).
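
The layered structure shared by these encoders (words pooled into sentence vectors, sentence vectors pooled into a document vector) can be sketched without any deep-learning library. This is a structural illustration only: the recurrent or Transformer layers are omitted, and the fixed `word_query` and `sent_query` vectors are hypothetical stand-ins for learned attention parameters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each vector by softmax(dot(vector, query)) and sum them."""
    scores = [sum(v_d * q_d for v_d, q_d in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def encode_document(doc, word_query, sent_query):
    """doc is a list of sentences; each sentence is a list of word vectors.
    Level 1 pools words into sentence vectors, level 2 pools sentences
    into a single document vector."""
    sent_vecs = [attention_pool(words, word_query)[0] for words in doc]
    doc_vec, sent_weights = attention_pool(sent_vecs, sent_query)
    return doc_vec, sent_vecs, sent_weights
```

The sentence-level weights returned by `encode_document` play the role that a trained extractor would turn into summary-membership scores, typically after combining them with further features.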

Structure-Aware Attention

Injecting hierarchical biases into the attention computation, for example through bias tables indexed by (path length, level difference) in the source section tree, directly alters attention distributions so that models can exploit section and subsection relations. HIBRIDS demonstrates that even lightweight bias layers matching the document structure allow full Transformers to generate explicit question–summary hierarchies or improved long-form summaries (Cao et al., 2022).
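
A bias table of this kind can be sketched as follows. The code is a simplified illustration and not the HIBRIDS implementation: the section tree is given as a hypothetical `parent` dictionary, the bias values are hand-set rather than learned, and the sign convention for the level difference (depth of the attended section minus depth of the attending section) is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ancestors(node, parent):
    """Chain from a node up to the root, starting with the node itself."""
    chain = [node]
    while parent[node] is not None:
        node = parent[node]
        chain.append(node)
    return chain


def depth(node, parent):
    """Number of edges between a node and the root."""
    return len(ancestors(node, parent)) - 1


def path_length(a, b, parent):
    """Number of edges on the tree path between sections a and b."""
    steps_from_a = {n: i for i, n in enumerate(ancestors(a, parent))}
    for j, n in enumerate(ancestors(b, parent)):
        if n in steps_from_a:  # lowest common ancestor
            return steps_from_a[n] + j
    raise ValueError("sections are not in the same tree")


def biased_attention(logits, token_section, parent, bias_table):
    """Add a bias looked up by (path length, level difference) to every
    attention logit, then normalize each row with softmax."""
    rows = []
    for i, row in enumerate(logits):
        biased = []
        for j, logit in enumerate(row):
            a, b = token_section[i], token_section[j]
            key = (path_length(a, b, parent), depth(b, parent) - depth(a, parent))
            biased.append(logit + bias_table.get(key, 0.0))
        rows.append(softmax(biased))
    return rows
```

With a table such as `{(0, 0): 2.0}`, tokens attend more strongly to tokens in their own section even when the raw logits are identical. In a trained model the table entries would be parameters, one per attention head.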

Graph Neural and Hypergraph Models

Hierarchical discourse graphs comprise nodes for sentences, sections, and optionally a global supernode, with separate intra-section, inter-section, and cross-level edges. Message passing alternates between intra-section GATs, aggregation to the section level, and inter-section GATs, with section context fusing into sentence representations (Zhang et al., 2023). Hypergraph models represent each section as a hyperedge, enabling self-attention over sets of related sentences, and hierarchical bottom-up processing (words→sentences→sections) (Zhao et al., 2024).
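
The alternating message-passing schedule on a hierarchical discourse graph can be sketched in a few lines. This is a deliberately simplified illustration and not the model of Zhang et al. (2023): uniform averaging replaces the graph attention (GAT) layers, there are no learned weights, and fusion is plain vector addition.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def hierarchical_pass(sections):
    """sections: list of sections, each a list of sentence vectors.
    Returns sentence vectors enriched with section and document context."""
    # 1. Intra-section step: mix each sentence with its section mean.
    intra = []
    for sec in sections:
        sec_mean = mean(sec)
        intra.append([mean([s, sec_mean]) for s in sec])
    # 2. Aggregate sentences into one vector per section.
    sec_vecs = [mean(sec) for sec in intra]
    # 3. Inter-section step: mix each section with the document mean.
    doc_mean = mean(sec_vecs)
    sec_ctx = [mean([v, doc_mean]) for v in sec_vecs]
    # 4. Fuse the section context back into every sentence of that section.
    return [[add(s, ctx) for s in sec] for sec, ctx in zip(intra, sec_ctx)]


fused = hierarchical_pass([
    [[1.0, 0.0], [3.0, 0.0]],  # section 0 with two sentences
    [[0.0, 2.0]],              # section 1 with one sentence
])
```

After one pass, the sentences of section 0 carry a non-zero second component that originates in section 1, which is the global-to-local information flow the graph models aim for.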

Multi-Document Hierarchies and Clustering

In multi-document summarization, hierarchical clustering (e.g., via top-down K-means) organizes documents into a class tree; at each node, sentences are scored for “commonality” (similarity to the node centroid) and “specificity” (dissimilarity to the complement), balancing coverage and diversity (Ma, 2023). Hierarchical recursive merging and bottom-up aggregation schemes are also used in clinical summarization and long-form pipeline designs (Hsu et al., 2025, Ou et al., 2025).
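
The commonality and specificity scores at a single tree node can be sketched as follows. The definitions of the two terms follow the description above; the linear combination with a weight `lam`, the use of cosine similarity, and the function names are assumptions for illustration rather than the formulation of Ma (2023).

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def node_score(sentence_vec, node_vecs, complement_vecs, lam=0.5):
    """Score a sentence at one node of the cluster tree.
    commonality: similarity to the centroid of the node's own sentences.
    specificity: dissimilarity to the centroid of everything outside the node.
    The weighted sum with `lam` is an assumed way to combine the two."""
    commonality = cosine(sentence_vec, centroid(node_vecs))
    if complement_vecs:
        specificity = 1.0 - cosine(sentence_vec, centroid(complement_vecs))
    else:
        specificity = 1.0  # root node: nothing to be distinguished from
    return lam * commonality + (1.0 - lam) * specificity


node = [[1.0, 0.0], [1.0, 0.2]]
complement = [[0.0, 1.0]]
typical = node_score([1.0, 0.0], node, complement)
generic = node_score([0.7, 0.7], node, complement)
```

Here `typical` exceeds `generic` because the second sentence is almost as close to the complement as it is to the node, so it says little that is specific to this cluster.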

3. Applications, Datasets, and Evaluation

Hierarchical summarization methods are motivated and evaluated across diverse domains and data regimes:

News and Multi-Document Clusters: DUC (2002–2004, 2007), TAC 2009, and Multi-News are standard benchmarks; summaries are evaluated using ROUGE-1/2/L.
Scientific and Medical Reports: PubMed, arXiv, LongSumm, and CHIME datasets challenge models through both document length (often >5,000 tokens) and the complexity of structured input. Human expert evaluation (clarity, coverage, factuality, coherence) complements ROUGE and BERTScore in these settings (Hsu et al., 2025).
Government and Legal Texts: Datasets such as GovReport-QS (question-summary hierarchies in US government reports) and Multi-LexSum (legal opinions) test the scalability and truthfulness of hierarchical systems (Cao et al., 2022, Ou et al., 2025).
Personalized Summaries: RL-driven hierarchical concept maps are optimized for individual user utility, as measured via adaptive query loops and preference learning (Ghodratnama et al., 2023).
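
Since ROUGE-1/2/L is the common metric across these benchmarks, a minimal sketch of its core computation may help. It uses whitespace tokenization and a plain F1 score; the official ROUGE toolkit additionally supports stemming, stopword removal, multiple references, and a recall-weighted F-measure, none of which is reproduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(precision, recall):
    """Harmonic mean, defined as 0 when both inputs are 0."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between candidate and reference summaries."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall, f1(precision, recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L from the longest common subsequence, reported as plain F1."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    return precision, recall, f1(precision, recall)


p, r, f = rouge_n("the cat sat", "the cat sat on the mat", n=1)
print(p, r)  # 1.0 0.5
```

Reported scores in the literature are usually corpus-level averages of such per-summary values, so small implementation differences (tokenization, stemming) can shift results by a noticeable margin.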

4. Salient Empirical Results and Analysis

Quantitative evaluation across benchmarks demonstrates robust, sometimes state-of-the-art, performance for hierarchical methods:

Extractive Multi-Document Summarization: Single-layer and waterfall hierarchical KP-Centrality outperform prior extractive baselines by up to +5% ROUGE-1 on DUC 2007 and +100% relative ROUGE-2 on TAC 2009 (Marujo et al., 2015).
Neural Hierarchical Encoders: Hierarchically-structured self-attention models (HSSAS) achieve +4.7 F1 ROUGE-1 over strong baselines on DUC 2002 and state-of-the-art extractive results on CNN/DailyMail (Al-Sabahi et al., 2018).
Transformer Hierarchies: BART+HED outperforms vanilla BART by up to +3.1 R-1 and +1.8 R-L on multiple MDS datasets (Shen et al., 2023); HiStruct+ yields SOTA extractive ROUGE gains (+4.90 to +6.75 R-1 on PubMed/arXiv), with improvements scaling with the prominence of hierarchy in the corpus (Ruan et al., 2022).
Graph-based Models: CHANGES on PubMed achieves R-1=46.43, R-2=21.17, R-L=41.58, matching or exceeding prior extractive and GNN-enhanced models (Zhang et al., 2023). HAESum reports the highest ROUGE scores for extractive scientific summarization in its cohort (e.g., 48.77/22.44/43.83 on PubMed) (Zhao et al., 2024).
Faithfulness and Hallucination: Context-aware merging with extractive evidence (Extract-Support) increases fact-precision by ∼10 points over vanilla hierarchical merging; replacing or supporting intermediate digests with source passages balances faithfulness and abstractive coverage (Ou et al., 2025).

5. Strengths, Limitations, and Theoretical Insights

Strengths:

Hierarchical paradigms improve salience coverage, reduce redundancy, and preserve structural coherence in long or multi-source summaries.
Recursive digesting and re-encoding (summarize–then–summarize structure) filters noisy details, enforces coverage of main subtopics, and facilitates abstraction.
Plug-and-play, structure-aware methods facilitate domain transfer and adaptation to novel corpora, given accurate structural markup or robust parsing (Marujo et al., 2015, Cao et al., 2022).

Limitations:

Extractive implementations cannot paraphrase or merge information at the phrase level, leading to possible redundancy or missed fusions (Marujo et al., 2015).
Most methods rely on gold or reliably parsed structural cues, so informal or unstructured text may remain challenging (Cao et al., 2022).
End-to-end, joint optimization across levels is rare; information can be lost between hierarchical stages, and sensitivity to hyperparameters is observed.
Long-context scaling remains a bottleneck because some Transformer-based encoders still use full attention masks, and very long documents remain challenging for neural implementations (Shen et al., 2023).

Theoretical insights underscore that explicitly modeling local–global, intra- and inter-section (or cluster), and discourse or topical relations matches the way most documents are organized. This inductive bias is credited with improved generalization, faithfulness, and human usability compared with flat, non-hierarchical models (Zhao et al., 2024, Zhang et al., 2023, Qiu et al., 2022, Hsu et al., 2025).

6. Extensions and Future Directions

Deeper or More Granular Hierarchies: Expanding current 2–3 level models to include paragraphs, sub-sentences, fine-grained rhetorical units, or explicit graph/graph-neural modules allows finer information control and coverage.
Hybrid Extractive–Abstractive Pipelines: Abstractive summarization at the top-most level of the hierarchy can mitigate redundancy, and augmenting it with source evidence helps limit hallucination and preserve factual accuracy (Ou et al., 2025).
Personalization and User-Adaptive Summarization: Integration of RL-driven utility models with structural hierarchies enables adaptive, user-aligned summarization workflows (Ghodratnama et al., 2023).
Generalization and Evaluation: Extension to new domains (clinical, technical, informal texts), better automated evaluation metrics for coverage/faithfulness, and hierarchical consistency (especially for multi-document and domain-expert contexts) remain active challenges (Hsu et al., 2025).

Hierarchical document summarization thus forms a critical and evolving foundation for scalable, interpretable, and high-quality summarization, with active research spanning recursive pipelines, neural hierarchies, graph-based discourse modeling, abstractive–extractive hybrids, and personalized concept architectures.