
SummaRuNNer: Hierarchical RNN for Summarization

Updated 4 January 2026
  • The paper demonstrates that the hierarchical bi-GRU architecture combined with interpretable feature scoring achieves state-of-the-art extractive summarization on news datasets.
  • It employs sequential binary classification and a logistic scoring function to assess content, salience, novelty, and positional biases in sentence selection.
  • The model introduces dual training paradigms—oracle-based extractive and end-to-end abstractive training—enhancing both performance and interpretability through per-feature visualizations.

SummaRuNNer is a recurrent neural network-based sequence model for extractive summarization of documents. It re-casts the extractive summarization problem as sequential binary classification over sentences, using a hierarchical architecture of bidirectional Gated Recurrent Units (GRUs) at the word and sentence levels, and integrates interpretable feature-based scoring for content, salience, novelty, and positional biases. The architecture enables both standard extractive supervision with oracle-generated labels and a novel abstractive training paradigm utilizing only human-written abstracts. SummaRuNNer achieves performance that is comparable to, or better than, state-of-the-art neural extractive models on multiple summarization benchmarks, and offers unique interpretability through per-feature contribution visualizations (Nallapati et al., 2016).

1. Hierarchical Model Architecture

SummaRuNNer's architecture consists of two stacked bi-directional GRU layers: a word-level bi-GRU operating over the tokens within each sentence, and a sentence-level bi-GRU that processes the sequence of sentence embeddings representing the whole document. The model sequentially classifies each sentence for inclusion in the summary.

For a document $d$ with $N_d$ sentences:

  1. Word-level bi-GRU: for sentence $i$, each token $t$ is passed through a forward and a backward GRU. The hidden states $\{h^w_{i,t}\}$ are average-pooled to obtain a fixed-length sentence embedding $x_i$.
  2. Sentence-level bi-GRU: the sentence embeddings $\{x_i\}$ are processed bidirectionally to yield sentence representations $h_i = [h_i^f, h_i^b]$.
  3. Document vector: The document context vector is

$$d = \tanh\left( W_d \, \frac{1}{N_d} \sum_{j=1}^{N_d} [h_j^f, h_j^b] + b_d \right)$$

where $W_d$ and $b_d$ are learnable parameters.
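The pooling steps above can be sketched in NumPy. This is an illustrative sketch only: the shapes, the random stand-ins for the bi-GRU outputs, and the weight initialization are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4        # per-direction GRU size (illustrative)
n_sents = 3

# Step 1: word-level bi-GRU states for each sentence (variable lengths),
# average-pooled into fixed-length sentence embeddings x_i.
word_states = [rng.standard_normal((t, 2 * hidden)) for t in (5, 7, 4)]
x = np.stack([s.mean(axis=0) for s in word_states])     # (n_sents, 2*hidden)

# Step 2: stand-in for the sentence-level bi-GRU outputs h_i = [h_f, h_b].
h = rng.standard_normal((n_sents, 2 * hidden))

# Step 3: document vector d = tanh(W_d * mean_j h_j + b_d).
W_d = rng.standard_normal((2 * hidden, 2 * hidden))
b_d = np.zeros(2 * hidden)
d = np.tanh(W_d @ h.mean(axis=0) + b_d)
```

The tanh keeps every component of the document vector inside $(-1, 1)$, so it combines on a common scale with the sentence representations in the later scoring step.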

The GRU recurrence is as follows (for a generic input $x_j$ and previous hidden state $h_{j-1}$):

$$\begin{align*}
u_j &= \sigma( W_{ux} x_j + W_{uh} h_{j-1} + b_u ) \\
r_j &= \sigma( W_{rx} x_j + W_{rh} h_{j-1} + b_r ) \\
h'_j &= \tanh( W_{hx} x_j + W_{hh} ( r_j \odot h_{j-1} ) + b_h ) \\
h_j &= (1 - u_j) \odot h'_j + u_j \odot h_{j-1}
\end{align*}$$

where $\sigma$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication (Nallapati et al., 2016).
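A minimal NumPy sketch of this recurrence, with randomly initialized weights standing in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W):
    """One GRU step following the update-gate/reset-gate equations above."""
    u = sigmoid(W["ux"] @ x + W["uh"] @ h_prev + W["bu"])        # update gate u_j
    r = sigmoid(W["rx"] @ x + W["rh"] @ h_prev + W["br"])        # reset gate r_j
    h_tilde = np.tanh(W["hx"] @ x + W["hh"] @ (r * h_prev) + W["bh"])
    return (1 - u) * h_tilde + u * h_prev                        # interpolate old/new

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = {k: rng.standard_normal((d_h, d_in)) for k in ("ux", "rx", "hx")}
W.update({k: rng.standard_normal((d_h, d_h)) for k in ("uh", "rh", "hh")})
W.update({k: np.zeros(d_h) for k in ("bu", "br", "bh")})

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):   # run the cell over a 5-step sequence
    h = gru_step(x, h, W)
```

Because the candidate state $h'_j$ passes through tanh and $h_j$ is a convex combination of $h'_j$ and $h_{j-1}$, every hidden component stays within $(-1, 1)$ when the state is initialized at zero.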

2. Sentence Scoring and Selection Mechanism

Each sentence is assigned a probability of inclusion in the summary using a logistic classifier with a composite scoring function that incorporates multiple human-interpretable features:

$$P(y_j = 1 \mid h_j, s_j, d) = \sigma\big( W_c h_j + h_j^T W_s d - h_j^T W_r \tanh(s_j) + W_{ap} p^a_j + W_{rp} p^r_j + b \big)$$

where:

  • $W_c h_j$ scores intrinsic sentence content.
  • $h_j^T W_s d$ measures sentence-to-document salience.
  • $-h_j^T W_r \tanh(s_j)$ penalizes redundancy via novelty with respect to the accumulated summary representation $s_j$.
  • $p^a_j$ and $p^r_j$ are learned embeddings for absolute and relative sentence position; $W_{ap}$ and $W_{rp}$ are their associated weights.
  • $b$ is a scalar bias.

The summary vector $s_j$ is a running, probability-weighted sum of all prior sentence representations:

$$s_j = \sum_{i<j} h_i \, P(y_i = 1 \mid h_i, s_i, d)$$

The absence of an explicit summary-length regularizer means sentence extraction at inference is typically governed by selecting top-scoring sentences up to a desired length (Nallapati et al., 2016).
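The scoring recursion and a simple top-k inference policy can be sketched as follows. Random weights stand in for learned parameters, and the position-embedding terms are omitted for brevity; this is an illustration of the mechanism, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_sentences(h, d, W_c, W_s, W_r, b):
    """Sequentially score sentences; s accumulates the weighted summary vector."""
    n, dim = h.shape
    probs = np.empty(n)
    s = np.zeros(dim)
    for j in range(n):
        content = W_c @ h[j]                    # intrinsic content score
        salience = h[j] @ W_s @ d               # sentence-to-document affinity
        novelty = -(h[j] @ W_r @ np.tanh(s))    # redundancy penalty
        probs[j] = sigmoid(content + salience + novelty + b)
        s += probs[j] * h[j]                    # running summary representation s_j
    return probs

rng = np.random.default_rng(0)
n, dim = 6, 4
h = rng.standard_normal((n, dim))          # stand-in sentence representations
d = np.tanh(h.mean(axis=0))                # stand-in document vector
W_c = rng.standard_normal(dim)
W_s = rng.standard_normal((dim, dim))
W_r = rng.standard_normal((dim, dim))
probs = score_sentences(h, d, W_c, W_s, W_r, b=0.0)

# Inference: keep the k top-scoring sentences, restored to document order.
k = 2
chosen = sorted(np.argsort(probs)[-k:])
```

Note that because $s_j$ only accumulates over sentences already scored, the classification is strictly sequential: each decision conditions on the soft summary built so far.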

3. Training Paradigms

SummaRuNNer supports both extractive and end-to-end abstractive training regimes:

  • Extractive training: As most summarization corpora only provide abstractive (not extractive) references, labels $y_j$ are generated by a greedy oracle that selects document sentences maximizing ROUGE F1 with respect to the reference summary, adding one at a time until no further improvement is possible. The training objective is the binary cross-entropy loss over sentences:

$$L_{\text{extractive}} = - \sum_{d=1}^{N} \sum_{j=1}^{N_d} \Big[ y_j^d \log P(y_j^d = 1) + (1 - y_j^d) \log\big(1 - P(y_j^d = 1)\big) \Big]$$

  • Abstractive (end-to-end) training: A GRU decoder is conditioned on $s_{-1}$, the summary representation after the last sentence has been processed, and trained to maximize the likelihood of the reference summary's words under a negative log-likelihood loss.
  • Optimization: Word embeddings are 100-dimensional word2vec vectors; each GRU direction uses 200 hidden units. The vocabulary is capped at 150K words; documents are truncated to 100 sentences and sentences to 50 tokens. Training uses Adadelta, batch size 64, gradient clipping, and early stopping on validation loss (Nallapati et al., 2016).
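The greedy oracle's label generation can be sketched as below. Unigram F1 overlap is used here as a simple stand-in for the paper's ROUGE F1; the document and reference strings are purely illustrative.

```python
from collections import Counter

def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 -- a crude proxy for ROUGE-1 F1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def greedy_oracle(sentences, reference):
    """Greedily add the sentence with the best F1 gain; stop when none helps."""
    ref = reference.split()
    selected, picked_tokens, best = set(), [], 0.0
    while len(selected) < len(sentences):
        gains = [(unigram_f1(picked_tokens + s.split(), ref), i)
                 for i, s in enumerate(sentences) if i not in selected]
        score, i = max(gains)
        if score <= best:          # no sentence improves the score: stop
            break
        best = score
        selected.add(i)
        picked_tokens += sentences[i].split()
    return [1 if i in selected else 0 for i in range(len(sentences))]

doc = ["the cat sat on the mat",
       "stocks fell sharply today",
       "the cat is on the mat"]
labels = greedy_oracle(doc, "the cat sat on the mat")
```

The stopping rule matters: the oracle terminates as soon as adding any remaining sentence fails to raise the score, so the resulting binary labels are sparse, which is what makes sequential binary classification a workable training signal.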

4. Interpretability and Feature Visualization

A distinctive attribute of SummaRuNNer is the direct interpretability of its sentence selection decisions. Each scoring term in the inclusion probability (content, salience, novelty, absolute/relative position) is independently human-readable, enabling per-sentence visualizations via normalized feature scores. This supports fine-grained analysis and model auditing; for example, the feature "heat-map" (cf. Figure 1 of the original work) shows for each sentence the breakdown of contributions from each abstract feature. Consequently, each selection can be directly correlated with its underlying rationale, addressing a key challenge in neural summarization (Nallapati et al., 2016).
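One way such a per-sentence breakdown could be computed is to normalize the magnitudes of the pre-sigmoid contributions into shares; note this normalization scheme is an assumption for illustration, not necessarily the one used for the paper's heat-map.

```python
import numpy as np

def feature_breakdown(terms):
    """Normalize per-feature logit contributions into shares of total magnitude."""
    names, values = zip(*terms.items())
    mags = np.abs(np.array(values, dtype=float))
    total = mags.sum()
    shares = mags / total if total > 0 else mags
    return dict(zip(names, shares))

# Hypothetical pre-sigmoid contributions for one sentence.
breakdown = feature_breakdown(
    {"content": 1.2, "salience": 0.8, "novelty": -0.5, "abs_pos": 0.3, "rel_pos": 0.2}
)
```

Rendering these shares per sentence, row by row, yields exactly the kind of feature heat-map the original work presents.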

5. Experimental Evaluation and Comparative Analysis

SummaRuNNer was evaluated on three datasets: DailyMail, CNN/DailyMail, and the out-of-domain DUC 2002.

The following tables summarize reported ROUGE results under various evaluation settings:

DailyMail, 75-byte recall:

Model              ROUGE-1    ROUGE-2    ROUGE-L
Lead-3             21.9       7.2        11.6
LReg(500)          18.5       6.9        10.2
Cheng et al. '16   22.7       8.5        12.5
SummaRuNNer-abs    23.8       9.6        13.3
SummaRuNNer        26.2±0.4   10.8±0.3   14.4±0.3

CNN/DailyMail, full-length F1:

Model              ROUGE-1    ROUGE-2    ROUGE-L
Lead-3             39.2       15.7       35.5
Nallapati et al.   35.4       13.3       32.6
SummaRuNNer-abs    37.5       14.5       33.4
SummaRuNNer        39.6±0.2   16.2±0.2   35.3±0.2

On DUC 2002, SummaRuNNer is competitive but not state-of-the-art, which the authors attribute to domain shift and the use of a generic model (Nallapati et al., 2016).

6. Key Contributions and Limitations

Major contributions of SummaRuNNer include:

  • Demonstrating that a simple, hierarchical RNN-based sequence model matches or surpasses the previous state-of-the-art in neural extractive summarization on large-scale English news datasets.
  • Introducing end-to-end abstractive training for an extractive architecture, removing dependency on noisy oracle labels; however, extractive training achieves higher ROUGE in practice.
  • Providing interpretable, decompositional sentence selection criteria, enabling user-facing auditability uncommon among neural models of its time.

Limitations include noise in the greedy oracle's extractive labels, underperformance of the end-to-end abstractive variant relative to extractive training, a drop in performance when transferring across domains (notably to DUC 2002), and the lack of an explicit summary-length regularizer. The authors suggest that improved joint training or pre-training strategies and better domain adaptation remain open problems (Nallapati et al., 2016).

7. Relation to Later Work and Comparative Landscape

SummaRuNNer was a precursor to subsequent extractive summarization models integrating scoring and selection in a joint neural framework. Later systems, such as NeuSum, explicitly learn to score and select sentences in a single neural loop and outperform SummaRuNNer by directly modeling the incremental ROUGE gain of candidate sentences given the current partial summary (Zhou et al., 2018). SummaRuNNer's structured feature-based decomposition and sequence-labeling perspective influenced the design of interpretable and hierarchical architectures adopted in later summarization research.
