Hierarchical Attention Networks
- Hierarchical Attention Networks are neural architectures that encode multi-level compositional structures using recursively applied attention mechanisms.
- They integrate local features (e.g., words or frames) with global context, achieving state-of-the-art results in document classification, video segmentation, and more.
- Model extensions like sparsemax and multilingual variants enhance interpretability and cross-lingual transfer while maintaining competitive performance.
Hierarchical Attention Networks (HANs) are neural architectures designed to process inputs with compositional, multi-level structure by applying attention mechanisms recursively at multiple granularities. HANs were originally introduced for document classification, where they encode local compositionality (words to sentences) as well as global compositionality (sentences to documents). Subsequently, HANs have been extended to video action segmentation, extractive summarization, multilingual modeling, speaker identification, and neural machine translation, providing systematic approaches to model local saliency and global structure simultaneously. This entry provides an exhaustive technical treatment of HANs, covering foundational equations, architectural principles, major variants, and empirical findings in their principal application domains.
1. Hierarchical Attention Mechanisms: Mathematical Formulation
HANs organize the encoding process into recursive levels, each corresponding to a natural compositional boundary in the input (e.g., word ↦ sentence ↦ document; frame ↦ segment ↦ video). At each level, a recurrent subnetwork contextualizes the representations from the previous level, and an attention mechanism aggregates them into a context-sensitive embedding.
Two-level HAN for text:
- Word-level encoding:
  - For each word embedding $x_{it}$ (the $t$-th word of sentence $i$), a bidirectional RNN (typically GRU or LSTM) outputs hidden states $h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]$.
  - Attention scores: $u_{it} = \tanh(W_w h_{it} + b_w)$, $\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t'} \exp(u_{it'}^{\top} u_w)}$, yielding the sentence vector $s_i = \sum_t \alpha_{it} h_{it}$, where $u_w$ is a learned word-level context vector.
- Sentence-level encoding:
  - Sentence vectors $s_i$ are input to a bidirectional RNN yielding hidden states $h_i$, aggregated by analogous attention: $u_i = \tanh(W_s h_i + b_s)$, $\alpha_i = \frac{\exp(u_i^{\top} u_s)}{\sum_{i'} \exp(u_{i'}^{\top} u_s)}$, $v = \sum_i \alpha_i h_i$.
- Classification (multi-label): $p = \sigma(W_c v + b_c)$ with elementwise sigmoids; the single-label case uses $p = \operatorname{softmax}(W_c v + b_c)$.
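A minimal PyTorch sketch of this two-level scheme follows; the module and parameter names (AttentionPool, HAN, the hidden and embedding sizes) are illustrative assumptions, not taken from any cited implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Additive attention against a learned context vector u (as in the equations above)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                # W, b
        self.context = nn.Parameter(torch.randn(dim))  # u

    def forward(self, h):                              # h: (batch, steps, dim)
        u = torch.tanh(self.proj(h))                   # u_t = tanh(W h_t + b)
        alpha = F.softmax(u @ self.context, dim=1)     # attention weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)    # weighted sum of hidden states

class HAN(nn.Module):
    def __init__(self, vocab_size: int, emb: int = 100, hid: int = 50, n_classes: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.word_rnn = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * hid)
        self.sent_rnn = nn.GRU(2 * hid, hid, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * hid)
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, docs):                           # docs: (batch, n_sents, n_words) token ids
        b, n_s, n_w = docs.shape
        h_w, _ = self.word_rnn(self.embed(docs.view(b * n_s, n_w)))
        s = self.word_attn(h_w).view(b, n_s, -1)       # sentence vectors s_i
        h_s, _ = self.sent_rnn(s)
        v = self.sent_attn(h_s)                        # document vector v
        return self.classifier(v)                      # logits; sigmoid/softmax applied in the loss

# e.g.: HAN(vocab_size=10000)(torch.randint(0, 10000, (2, 6, 20)))
```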
For video or sequence data, the structure is analogous; the granularity simply maps from frames to segments to global video. In action segmentation, for example, frame features are encoded with a frame-level LSTM, pooled by attention, then segment embeddings are encoded with a segment-level LSTM and pooled by segment-level attention to yield the video embedding (Gammulle et al., 2020).
2. HAN Variants and Model Extensions
HANs have been adapted and extended in several key ways, primarily through domain-specific modification of the attention functions, level structure, and parameter sharing:
2.1. HAN for Videos
In the action segmentation context (Gammulle et al., 2020), HANs exploit:
- Frame-level encoder: CNN features → frame-level LSTM → attention pooling to form each segment embedding.
- Segment-level encoder: segment-level LSTM → attention pooling → global video embedding.
- Hierarchical decoder: decodes back from the global embedding to per-frame predictions via segment- and frame-level LSTMs.
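A schematic sketch of the encoding path, assuming precomputed per-frame CNN features; the class names, the feature dimension 512, and the hidden size 256 are illustrative assumptions rather than the configuration of Gammulle et al.:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnPool(nn.Module):
    """Additive attention pooling against a learned context vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.ctx = nn.Parameter(torch.randn(dim))

    def forward(self, h):                              # h: (batch, steps, dim)
        alpha = F.softmax(torch.tanh(self.proj(h)) @ self.ctx, dim=1)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)

class VideoHANEncoder(nn.Module):
    """Frame-level LSTM + attention pooling per segment, then segment-level
    LSTM + attention pooling over the whole video."""
    def __init__(self, feat_dim: int = 512, hid: int = 256):
        super().__init__()
        self.frame_lstm = nn.LSTM(feat_dim, hid, batch_first=True)
        self.frame_attn = AttnPool(hid)
        self.seg_lstm = nn.LSTM(hid, hid, batch_first=True)
        self.seg_attn = AttnPool(hid)

    def forward(self, frames):                         # frames: (batch, n_seg, t_frames, feat_dim)
        b, n, t, d = frames.shape
        h_f, _ = self.frame_lstm(frames.view(b * n, t, d))
        seg = self.frame_attn(h_f).view(b, n, -1)      # one embedding per segment
        h_s, _ = self.seg_lstm(seg)
        return self.seg_attn(h_s), seg                 # global embedding v, segment embeddings
```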
2.2. HAN Variants for Document Classification
- Pruned HAN (HPAN): Discards low-attention words/sentences (those falling below a tuned threshold) and renormalizes the remaining weights (Ribeiro et al., 2020).
- Sparsemax HAN (HSAN): Replaces softmax with sparsemax, which often zeroes out many attention weights, promoting sparsity and interpretability (see the sketch after this list).
- Both variants provide negligible empirical gains on IMDB, though HSAN may yield a more interpretable distribution.
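Both modifications are small and local to the attention step. Below is a self-contained sparsemax (following Martins & Astudillo, 2016, which HSAN adopts) plus a hypothetical pruning helper; the threshold value 0.05 is illustrative. Either can replace `F.softmax` in an attention-pooling module such as the one sketched in Section 1.

```python
import torch

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Euclidean projection of scores onto the probability simplex
    (Martins & Astudillo, 2016); many outputs are exactly zero."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype).view(shape)
    cumsum = z_sorted.cumsum(dim)                      # running sums of sorted scores
    support = (1 + k * z_sorted) > cumsum              # entries kept in the support
    k_z = support.to(z.dtype).sum(dim=dim, keepdim=True)
    tau = ((z_sorted * support.to(z.dtype)).sum(dim=dim, keepdim=True) - 1) / k_z
    return torch.clamp(z - tau, min=0.0)               # sparse probability vector

def prune_and_renormalize(alpha: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """HPAN-style hard thresholding of attention weights followed by renormalization."""
    kept = torch.where(alpha >= tau, alpha, torch.zeros_like(alpha))
    return kept / kept.sum(dim=-1, keepdim=True).clamp_min(1e-12)
```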
2.3. Multilingual HANs
- Parameter sharing: Word/sentence encoders and/or attention weights are shared among languages; crucial to use pre-aligned multilingual embeddings for effective cross-lingual transfer (Pappas et al., 2017).
- Joint multi-task objective: the per-language losses are summed and optimized jointly, so that document representations become language-agnostic (a minimal sketch follows this list).
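A minimal sketch of one joint training step under these assumptions: `shared_han` is a shared encoder mapping token-id tensors to document vectors (e.g., the HAN of Section 1 without its classifier), and `heads` holds one hypothetical per-language classification layer each; none of these names come from Pappas et al.:

```python
import torch
import torch.nn.functional as F

def multilingual_step(shared_han, heads, batches):
    """One joint multi-task step: the encoder and its attention are shared across
    languages; only the per-language classifier heads differ.
    batches: dict mapping language code -> (docs, labels)."""
    total = torch.zeros(())
    for lang, (docs, labels) in batches.items():
        v = shared_han(docs)                           # language-agnostic document vector
        total = total + F.cross_entropy(heads[lang](v), labels)
    return total                                       # summed loss, backpropagated jointly
```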
2.4. HANs for Structured Summarization and Other Modalities
- Bidirectional context-aware HANs: Introduce context vectors that inject information from adjacent sentences directly into word-level attention, improving document-level feature induction (Remy et al., 2019).
- Extensive applications: Speaker identification (Shi et al., 2020), extractive forum summarization (Tarnpradab et al., 2018), and document-level NMT (Miculicich et al., 2018).
3. Architectures and Training Dynamics
3.1. Core Architectural Elements
- RNNs (LSTM or GRU) are typical at each level; their hidden states serve as the basis for attention computation.
- For video, CNNs are used for initial feature extraction at the lowest level; in audio, TDNNs are also used before the RNNs (Shi et al., 2020).
- Attention weights are computed as similarity of a hidden layer’s output to a learned context vector, using softmax or sparsemax.
3.2. Decoding in Generative HANs
- Encoder-decoder HANs have been employed in sequence prediction tasks (e.g., video action segmentation), with a mirrored hierarchical decoder initialized by the global context embedding.
- Decoding proceeds top-down: segment-level RNN reconstructs segment-wise embeddings, each of which is expanded by a frame-level decoder.
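A sketch of such a top-down decoder, sized to pair with the encoder sketch above; the class name, the zero-initialized cell state, and the choice to feed the global embedding as a constant input at every segment step are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Segment-level LSTM unrolled from the global embedding emits one embedding
    per segment; a frame-level LSTM expands each into per-frame class logits."""
    def __init__(self, hid: int = 256, n_classes: int = 10):
        super().__init__()
        self.seg_cell = nn.LSTMCell(hid, hid)
        self.frame_lstm = nn.LSTM(hid, hid, batch_first=True)
        self.head = nn.Linear(hid, n_classes)

    def forward(self, v, n_segments: int, t_frames: int):  # v: (batch, hid)
        h, c = v, torch.zeros_like(v)              # initialize from the global embedding
        logits = []
        for _ in range(n_segments):
            h, c = self.seg_cell(v, (h, c))        # reconstruct the next segment embedding
            frames_in = h.unsqueeze(1).repeat(1, t_frames, 1)  # broadcast over frame steps
            h_f, _ = self.frame_lstm(frames_in)
            logits.append(self.head(h_f))          # per-frame predictions for this segment
        return torch.cat(logits, dim=1)            # (batch, n_segments * t_frames, n_classes)
```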
3.3. Loss Functions and Optimization
- Classification: cross-entropy (categorical for single-label, binary for multi-label).
- Sequence segmentation: per-frame cross-entropy over target classes.
- End-to-end training is standard: all parameters (recurrent, attention, and classifier weights) are trained jointly with Adam, SGD, or RMSProp (minimal loss examples follow).
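Minimal PyTorch examples of these three losses; all shapes are illustrative:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)                         # (batch, n_classes)
single = F.cross_entropy(logits, torch.randint(0, 5, (8,)))  # single-label (categorical)
multi = F.binary_cross_entropy_with_logits(                  # multi-label (binary per class)
    logits, torch.randint(0, 2, (8, 5)).float())

frame_logits = torch.randn(8, 400, 5)              # (batch, frames, n_classes)
frame_targets = torch.randint(0, 5, (8, 400))
per_frame = F.cross_entropy(frame_logits.transpose(1, 2), frame_targets)  # per-frame CE
```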
4. Empirical Results, Benefits, and Limitations
4.1. Performance Across Domains
- Video action segmentation (Gammulle et al., 2020): HAN achieved state-of-the-art accuracy on MERL Shopping, 50 Salads, and GT Egocentric datasets, outperforming single-scale attention models.
- Document classification: Pruned and sparsemax HANs matched or slightly underperformed standard HAN on IMDB (Ribeiro et al., 2020).
- Multilingual classification: HANs with shared parameters outperformed monolingual HANs, especially in low-resource regimes, while reducing parameter scaling (Pappas et al., 2017).
- Extractive summarization: HANs with redundancy removal surpassed both unsupervised and supervised baselines in ROUGE metrics (Tarnpradab et al., 2018).
- NMT: Document-level HANs integrated as additional context encoders improved BLEU by up to 1.8 points over a strong Transformer baseline, with optimal results from combining HAN at both encoder and decoder (Miculicich et al., 2018).
4.2. Table of Key HAN Variants and Empirical Observations
| Variant/Domain | Structural Novelty | Empirical Outcome |
|---|---|---|
| Video Segmentation | Hierarchical encoder/decoder, frame/segment attention | SOTA on multi-view datasets |
| Sparsemax HAN | Sparsemax in place of softmax | ≈1.4 pp below HAN (IMDB) |
| Pruned HAN | Hard-threshold attention | Matches HAN (IMDB) |
| Multilingual HAN | Enc/attention sharing | Superior in multi-lingual/low-resource settings |
| Context-aware HAN | Bidirectional sentence context in attention | Consistently higher accuracy on multi-class text classification |
4.3. Analysis of Hierarchy vs. Single-Scale
- Hierarchical design allows local encoders to specialize in fine-grained structure (e.g., frames/words), while upper levels focus on longer-range dependencies.
- Flat attention over all units is less efficient for dependency modeling and can dilute signal from transient but critical events or tokens.
- HANs reduce over-segmentation errors in temporal video segmentation and achieve parameter efficiency in cross-lingual document classification.
5. Implementation Considerations and Hyperparameters
- Recurrent block size: hidden sizes are domain-dependent, with distinct settings reported for video segmentation, text, and audio models.
- Pooling windows: audio inputs are segmented with M=20 frames per segment and a hop of H=10; video inputs are divided into N segments of T frames each.
- Attention context vector dimension: Typically matches RNN hidden size.
- Optimization: Adam or RMSProp with standard learning rates (1e-3 or 1e-4).
- Batch sizes: Typical ranges are 32 (summarization) to 128 (video).
- Regularization: Dropout is widely used except where batch normalization is dominant (audio HANs).
- Preprocessing: For multilingual HANs, embeddings must be pre-trained and aligned (multi-CCA).
- Redundancy removal: especially important for summarization; a bigram-filtering algorithm raises ROUGE-1 F from 33.7% to 37.8% (Tarnpradab et al., 2018). A sketch of such a filter follows this list.
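A greedy bigram-overlap filter in the spirit of that step; the function name, the sentence budget, and the 0.5 overlap threshold are illustrative assumptions, not the exact algorithm of Tarnpradab et al.:

```python
def select_nonredundant(ranked_sentences, max_sents=5, overlap_thresh=0.5):
    """Greedy extractive selection: walk sentences in attention-score order and
    skip any whose bigrams mostly repeat already-selected content.
    ranked_sentences: list of token lists, best-scored first."""
    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))
    chosen, seen = [], set()
    for sent in ranked_sentences:
        bg = bigrams(sent)
        if bg and len(bg & seen) / len(bg) > overlap_thresh:
            continue                       # too much overlap with the summary so far
        chosen.append(sent)
        seen |= bg
        if len(chosen) == max_sents:
            break
    return chosen
```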
6. Application Domains and Prospective Developments
HANs are broadly applicable wherever inputs possess an inherent recursive hierarchy:
- Video analysis: Capturing multiscale temporal structure, egocentric perspectives, long-range dependencies (Gammulle et al., 2020).
- Text understanding: Sentiment, topic, and relation classification, especially in long or multi-lingual documents.
- Sequence labeling: Speaker identification from spectrograms, and full-document NMT via cross-sentence context transfer (Miculicich et al., 2018).
- Summarization: Supporting extractive selection of thread posts or editorial fragments, with competitive performance to extractive and supervised baselines (Tarnpradab et al., 2018).
Notably, hierarchical attention is especially beneficial as input length and complexity scale: “two-stage compression + attention preserves both local fine-grained dynamics (frame-level) and global action context (segment-level), yielding stronger temporal dependency modeling and fewer over-segmentation errors than single-scale or fixed-window methods” (Gammulle et al., 2020).
This suggests broader adoption of HANs in long-sequence, highly nested domains, and motivates research into even deeper or more flexible hierarchical structures, including adaptive or learned hierarchy boundaries.
7. Limitations, Trade-offs, and Interpretability
- HANs introduce extra computational overhead from the attention-pooling stages, but remain lightweight relative to end-to-end deep stacks.
- Pruned and sparsemax variants offer increased sparsity and potentially greater interpretability but require careful tuning or incur mild drops in performance for some tasks (Ribeiro et al., 2020).
- Shared-parameter HANs provide strong cross-lingual regularization but depend on high-quality mutual alignment in the embedding space (Pappas et al., 2017).
- Bidirectional/context-aware modifications yield modest computational increases (+23–40% per iteration) yet deliver consistent accuracy improvements for complex document understanding (Remy et al., 2019).
- In generative and sequence prediction domains, hierarchical decoding can be more challenging to train stably.
- Empirically, context injection becomes more beneficial as the downstream task requires modeling deeper semantic or discourse relations.
In summary, Hierarchical Attention Networks are a principled framework for leveraging multiscale compositional structure via recursively organized attention, validated across a broad spectrum of modalities, with systematically quantifiable gains over single-scale and non-hierarchical alternatives.