
SummaRuNNer: Hierarchical RNN for Summarization

Updated 4 January 2026
  • The paper demonstrates that the hierarchical bi-GRU architecture combined with interpretable feature scoring achieves state-of-the-art extractive summarization on news datasets.
  • It employs sequential binary classification and a logistic scoring function to assess content, salience, novelty, and positional biases in sentence selection.
  • The model introduces dual training paradigms—oracle-based extractive and end-to-end abstractive training—enhancing both performance and interpretability through per-feature visualizations.

SummaRuNNer is a recurrent neural network-based sequence model for extractive summarization of documents. It re-casts the extractive summarization problem as sequential binary classification over sentences, using a hierarchical architecture of bidirectional Gated Recurrent Units (GRUs) at the word and sentence levels, and integrates interpretable feature-based scoring for content, salience, novelty, and positional biases. The architecture enables both standard extractive supervision with oracle-generated labels and a novel abstractive training paradigm utilizing only human-written abstracts. SummaRuNNer achieves performance that is comparable to, or better than, state-of-the-art neural extractive models on multiple summarization benchmarks, and offers unique interpretability through per-feature contribution visualizations (Nallapati et al., 2016).

1. Hierarchical Model Architecture

SummaRuNNer's architecture consists of two stacked bi-directional GRU layers: a word-level bi-GRU operating over the tokens within each sentence, and a sentence-level bi-GRU that processes the sequence of sentence embeddings representing the whole document. The model sequentially classifies each sentence for inclusion in the summary.

For a document $d$ with $N_d$ sentences:

  1. Word-level bi-GRU: for sentence $i$, each token $t$ is passed through a forward and a backward GRU. The hidden states $\{h^w_{i,t}\}$ are average-pooled to obtain a fixed-length sentence embedding $x_i$.
  2. Sentence-level bi-GRU: the sentence embeddings $\{x_i\}$ are processed bidirectionally to yield sentence representations $h_i = [h_i^f, h_i^b]$.
  3. Document vector: The document context vector is

$$d = \tanh\left( W_d \, \frac{1}{N_d} \sum_{j=1}^{N_d} [h_j^f, h_j^b] + b_d \right)$$

where $W_d$ and $b_d$ are learnable parameters.
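The pooling steps above can be sketched in NumPy. This is an illustrative sketch only: the shapes, the random stand-ins for the bi-GRU outputs, and the weight initialization are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4        # per-direction GRU size (illustrative)
n_sents = 3

# Step 1: word-level bi-GRU states for each sentence (variable lengths),
# average-pooled into fixed-length sentence embeddings x_i.
word_states = [rng.standard_normal((t, 2 * hidden)) for t in (5, 7, 4)]
x = np.stack([s.mean(axis=0) for s in word_states])     # (n_sents, 2*hidden)

# Step 2: stand-in for the sentence-level bi-GRU outputs h_i = [h_f, h_b].
h = rng.standard_normal((n_sents, 2 * hidden))

# Step 3: document vector d = tanh(W_d * mean_j h_j + b_d).
W_d = rng.standard_normal((2 * hidden, 2 * hidden))
b_d = np.zeros(2 * hidden)
d = np.tanh(W_d @ h.mean(axis=0) + b_d)
```

The tanh keeps every component of the document vector inside $(-1, 1)$, so it combines on a common scale with the sentence representations in the later scoring step.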

The GRU recurrence is as follows (for a generic input $x_j$ and previous hidden state $h_{j-1}$):

$$\begin{align*}
u_j &= \sigma( W_{ux} x_j + W_{uh} h_{j-1} + b_u ) \\
r_j &= \sigma( W_{rx} x_j + W_{rh} h_{j-1} + b_r ) \\
h'_j &= \tanh( W_{hx} x_j + W_{hh} ( r_j \odot h_{j-1} ) + b_h ) \\
h_j &= (1 - u_j) \odot h'_j + u_j \odot h_{j-1}
\end{align*}$$

where $\sigma$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication (Nallapati et al., 2016).
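A minimal NumPy sketch of this recurrence, with randomly initialized weights standing in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W):
    """One GRU step following the update-gate/reset-gate equations above."""
    u = sigmoid(W["ux"] @ x + W["uh"] @ h_prev + W["bu"])        # update gate u_j
    r = sigmoid(W["rx"] @ x + W["rh"] @ h_prev + W["br"])        # reset gate r_j
    h_tilde = np.tanh(W["hx"] @ x + W["hh"] @ (r * h_prev) + W["bh"])
    return (1 - u) * h_tilde + u * h_prev                        # interpolate old/new

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.standard_normal((d_h, d_in)) for k in ("ux", "rx", "hx")}
W.update({k: rng.standard_normal((d_h, d_h)) for k in ("uh", "rh", "hh")})
W.update({k: np.zeros(d_h) for k in ("bu", "br", "bh")})

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):   # run the cell over a 5-step sequence
    h = gru_step(x, h, W)
```

Because the candidate state $h'_j$ passes through tanh and $h_j$ is a convex combination of $h'_j$ and $h_{j-1}$, every hidden component stays within $(-1, 1)$ when the state is initialized at zero.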

2. Sentence Scoring and Selection Mechanism

Each sentence is assigned a probability of inclusion in the summary using a logistic classifier with a composite scoring function that incorporates multiple human-interpretable features:

$$P(y_j = 1 \mid h_j, s_j, d) = \sigma\big( W_c h_j + h_j^T W_s d - h_j^T W_r \tanh(s_j) + W_{ap} p^a_j + W_{rp} p^r_j + b \big)$$

where:

  • $W_c h_j$ scores intrinsic sentence content.
  • $h_j^T W_s d$ measures sentence-to-document salience.
  • $-h_j^T W_r \tanh(s_j)$ penalizes redundancy via novelty with respect to the accumulated summary representation $s_j$.
  • $p^a_j$ and $p^r_j$ are learned embeddings for absolute and relative sentence position; $W_{ap}$ and $W_{rp}$ are their associated weights.
  • $b$ is a scalar bias.

The summary vector $s_j$ is a running, probability-weighted sum of all prior sentence representations:

$$s_j = \sum_{i<j} h_i \, P(y_i = 1 \mid h_i, s_i, d)$$

The absence of an explicit summary-length regularizer means sentence extraction at inference is typically governed by selecting top-scoring sentences up to a desired length (Nallapati et al., 2016).
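The scoring recursion and a simple top-k inference policy can be sketched as follows. Random weights stand in for learned parameters, and the position-embedding terms are omitted for brevity; this is an illustration of the mechanism, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_sentences(h, d, W_c, W_s, W_r, b):
    """Sequentially score sentences; s accumulates the weighted summary vector."""
    n, dim = h.shape
    probs = np.empty(n)
    s = np.zeros(dim)
    for j in range(n):
        content = W_c @ h[j]                    # intrinsic content score
        salience = h[j] @ W_s @ d               # sentence-to-document affinity
        novelty = -(h[j] @ W_r @ np.tanh(s))    # redundancy penalty
        probs[j] = sigmoid(content + salience + novelty + b)
        s += probs[j] * h[j]                    # running summary representation s_j
    return probs

rng = np.random.default_rng(0)
n, dim = 6, 4
h = rng.standard_normal((n, dim))          # stand-in sentence representations
d = np.tanh(h.mean(axis=0))                # stand-in document vector
W_c = rng.standard_normal(dim)
W_s = rng.standard_normal((dim, dim))
W_r = rng.standard_normal((dim, dim))
probs = score_sentences(h, d, W_c, W_s, W_r, b=0.0)

# Inference: keep the k top-scoring sentences, restored to document order.
k = 2
chosen = sorted(np.argsort(probs)[-k:])
```

Note that because $s_j$ only accumulates over sentences already scored, the classification is strictly sequential: each decision conditions on the soft summary built so far.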

3. Training Paradigms

SummaRuNNer supports both extractive and end-to-end abstractive training regimes:

  • Extractive training: As most summarization corpora only provide abstractive (not extractive) references, labels $y_j$ are generated by a greedy oracle that selects document sentences maximizing ROUGE F1 with respect to the reference summary, adding one at a time until no further improvement is possible. The training objective is the binary cross-entropy loss over sentences:

$$L_{\text{extractive}} = - \sum_{d=1}^{N} \sum_{j=1}^{N_d} \Big[ y_j^d \log P(y_j^d = 1) + (1 - y_j^d) \log\big(1 - P(y_j^d = 1)\big) \Big]$$

  • Abstractive (end-to-end) training: A GRU decoder is conditioned on $s_{-1}$, the summary representation after the last sentence has been processed, and trained to maximize the likelihood of the reference summary's words under a negative log-likelihood loss.
  • Optimization: Word embeddings are 100-dimensional word2vec vectors; each GRU direction uses 200 hidden units. The vocabulary is capped at 150K words; documents are truncated to 100 sentences and sentences to 50 tokens. Training uses Adadelta, batch size 64, gradient clipping, and early stopping on validation loss (Nallapati et al., 2016).
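The greedy oracle's label generation can be sketched as below. Unigram F1 overlap is used here as a simple stand-in for the paper's ROUGE F1; the document and reference strings are purely illustrative.

```python
from collections import Counter

def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 -- a crude proxy for ROUGE-1 F1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def greedy_oracle(sentences, reference):
    """Greedily add the sentence with the best F1 gain; stop when none helps."""
    ref = reference.split()
    selected, picked_tokens, best = set(), [], 0.0
    while len(selected) < len(sentences):
        gains = [(unigram_f1(picked_tokens + s.split(), ref), i)
                 for i, s in enumerate(sentences) if i not in selected]
        score, i = max(gains)
        if score <= best:          # no sentence improves the score: stop
            break
        best = score
        selected.add(i)
        picked_tokens += sentences[i].split()
    return [1 if i in selected else 0 for i in range(len(sentences))]

doc = ["the cat sat on the mat",
       "stocks fell sharply today",
       "the cat is on the mat"]
labels = greedy_oracle(doc, "the cat sat on the mat")
```

The stopping rule matters: the oracle terminates as soon as adding any remaining sentence fails to raise the score, so the resulting binary labels are sparse, which is what makes sequential binary classification a workable training signal.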

4. Interpretability and Feature Visualization

A distinctive attribute of SummaRuNNer is the direct interpretability of its sentence selection decisions. Each scoring term in the inclusion probability (content, salience, novelty, absolute/relative position) is independently human-readable, enabling per-sentence visualizations via normalized feature scores. This supports fine-grained analysis and model auditing; for example, the feature "heat-map" (cf. Figure 1 of the original work) shows for each sentence the breakdown of contributions from each abstract feature. Consequently, each selection can be directly correlated with its underlying rationale, addressing a key challenge in neural summarization (Nallapati et al., 2016).
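One way such a per-sentence breakdown could be computed is to normalize the magnitudes of the pre-sigmoid contributions into shares; note this normalization scheme is an assumption for illustration, not necessarily the one used for the paper's heat-map.

```python
import numpy as np

def feature_breakdown(terms):
    """Normalize per-feature logit contributions into shares of total magnitude."""
    names, values = zip(*terms.items())
    mags = np.abs(np.array(values, dtype=float))
    total = mags.sum()
    shares = mags / total if total > 0 else mags
    return dict(zip(names, shares))

# Hypothetical pre-sigmoid contributions for one sentence.
breakdown = feature_breakdown(
    {"content": 1.2, "salience": 0.8, "novelty": -0.5, "abs_pos": 0.3, "rel_pos": 0.2}
)
```

Rendering these shares per sentence, row by row, yields exactly the kind of feature heat-map the original work presents.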

5. Experimental Evaluation and Comparative Analysis

SummaRuNNer was evaluated on three datasets: DailyMail, CNN/DailyMail, and the out-of-domain DUC 2002.

The following tables summarize reported ROUGE results under various evaluation settings:

DailyMail, 75-byte recall:

Model              ROUGE-1    ROUGE-2    ROUGE-L
Lead-3             21.9       7.2        11.6
LReg(500)          18.5       6.9        10.2
Cheng et al. '16   22.7       8.5        12.5
SummaRuNNer-abs    23.8       9.6        13.3
SummaRuNNer        26.2±0.4   10.8±0.3   14.4±0.3

CNN/DailyMail, full-length F1:

Model              ROUGE-1    ROUGE-2    ROUGE-L
Lead-3             39.2       15.7       35.5
Nallapati et al.   35.4       13.3       32.6
SummaRuNNer-abs    37.5       14.5       33.4
SummaRuNNer        39.6±0.2   16.2±0.2   35.3±0.2

On DUC 2002, SummaRuNNer is competitive but not state-of-the-art, which the authors attribute to domain shift and the use of a generic model (Nallapati et al., 2016).

6. Key Contributions and Limitations

Major contributions of SummaRuNNer include:

  • Demonstrating that a simple, hierarchical RNN-based sequence model matches or surpasses the previous state-of-the-art in neural extractive summarization on large-scale English news datasets.
  • Introducing end-to-end abstractive training for an extractive architecture, removing dependency on noisy oracle labels; however, extractive training achieves higher ROUGE in practice.
  • Providing interpretable, decompositional sentence selection criteria, enabling user-facing auditability uncommon among neural models of its time.

Limitations include noise in the greedy oracle's extractive labels, underperformance of the end-to-end abstractive variant relative to extractive training, a drop in performance when transferring across domains (notably to DUC 2002), and the lack of an explicit summary-length regularizer. The authors suggest that improved joint training or pre-training strategies and better domain adaptation remain open problems (Nallapati et al., 2016).

7. Relation to Later Work and Comparative Landscape

SummaRuNNer was a precursor to subsequent extractive summarization models integrating scoring and selection in a joint neural framework. Later systems, such as NeuSum, explicitly learn to score and select sentences in a single neural loop and outperform SummaRuNNer by directly modeling the incremental ROUGE gain of candidate sentences given the current partial summary (Zhou et al., 2018). SummaRuNNer's structured feature-based decomposition and sequence-labeling perspective influenced the design of interpretable and hierarchical architectures adopted in later summarization research.
