Hierarchical Multiscale RNNs
- HM-RNNs are sequence models that learn latent hierarchical structures using discrete boundary detectors to manage FLUSH, UPDATE, and COPY operations.
- They operate on multiple timescales across stacked layers, enabling direct modeling of linguistic, symbolic, or temporal hierarchy in data.
- Empirical evaluations in language and handwriting tasks show HM-RNNs achieve competitive performance with improved computational efficiency over standard RNNs.
Hierarchical Multiscale Recurrent Neural Networks (HM-RNNs) refer to a class of sequence models explicitly designed to learn latent hierarchical structures by operating at multiple timescales within stacked recurrent neural network layers. Unlike conventional RNNs, which process every time step uniformly across all layers, HM-RNNs employ discrete boundary detectors and variable-timestep update schemes, enabling direct modeling of linguistic, symbolic, or temporal hierarchies from unsegmented, sequential data—most notably in character-level language modeling and handwriting modeling (Chung et al., 2016, Hwang et al., 2016).
1. Architecture and Hierarchical Segmentation
The canonical HM-RNN architecture consists of stacked recurrent layers, each associated with a binary boundary variable $z^\ell_t$ that determines, at every time step $t$ and layer $\ell$, whether a segment boundary is detected.
- Layer operation: The bottom layer ($\ell = 1$) processes every input symbol. Each higher layer is selectively updated only when it receives a "summary" from the layer below, which signifies the end of a detected segment.
- Segment boundaries: Boundary detectors trigger FLUSH, UPDATE, or COPY operations per layer:
- FLUSH: On $z^\ell_{t-1} = 1$, the layer resets its internal memory after propagating its state summary to the layer above.
- UPDATE: If $z^\ell_{t-1} = 0$ and $z^{\ell-1}_t = 1$, the layer incorporates the newly completed segment from below via a standard LSTM update.
- COPY: If both boundary variables are zero, the state is carried over unchanged.
- Data-driven hierarchy: The network learns segmentations and timescales through task loss only, without explicit segmentation supervision (Chung et al., 2016).
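The per-layer choice among the three operations can be sketched as a minimal dispatch function (a hypothetical helper for illustration; the gate computations are omitted):

```python
def select_op(z_prev_self, z_below):
    """Choose the HM-RNN operation for one layer at one time step.

    z_prev_self: this layer's boundary detection at the previous step (0 or 1)
    z_below:     the lower layer's boundary detection at the current step
    """
    if z_prev_self == 1:
        return "FLUSH"   # boundary detected: reset memory, summary went upward
    elif z_below == 1:
        return "UPDATE"  # segment ended below: absorb its summary
    else:
        return "COPY"    # no boundary anywhere: carry state unchanged
```

Note that FLUSH takes precedence over UPDATE: a layer that has just closed its own segment restarts before considering new input from below.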
2. Mathematical Mechanisms
The HM-RNN core is an HM-LSTM cell parameterized as follows:
- State variables: For each layer $\ell$ at time $t$, we define a cell state $c^\ell_t$, a hidden state $h^\ell_t$, and a binary boundary variable $z^\ell_t$.
- Boundary detector: The gate and boundary pre-activations at layer $\ell$ are computed jointly from recurrent, top-down, and bottom-up inputs:
$\left[\begin{array}{c} f^\ell_t \\ i^\ell_t \\ o^\ell_t \\ g^\ell_t \\ \tilde z^\ell_t \end{array}\right] = \left(\sigma,\sigma,\sigma,\tanh,\mathrm{hard}\,\sigma\right)\left[s^{(\ell)}_{\mathrm{rec}} + s^{(\ell)}_{\mathrm{top\text{-}down}} + s^{(\ell)}_{\mathrm{bottom\text{-}up}} + b^{(\ell)}\right],$
where the top-down input is gated by $z^\ell_{t-1}$ and the bottom-up input by $z^{\ell-1}_t$. The boundary pre-activation $\tilde z^\ell_t$ is squashed via a hard sigmoid whose slope is annealed toward a step function during training.
- Discrete boundary assignment: $z^\ell_t = \mathbf{1}\!\left[\tilde z^\ell_t > 0.5\right]$ (deterministic step) or $z^\ell_t \sim \mathrm{Bernoulli}(\tilde z^\ell_t)$ (stochastic sampling).
- Cell update logic:
- COPY ($z^\ell_{t-1} = 0$, $z^{\ell-1}_t = 0$): $c^\ell_t = c^\ell_{t-1}$, $h^\ell_t = h^\ell_{t-1}$.
- UPDATE ($z^\ell_{t-1} = 0$, $z^{\ell-1}_t = 1$): $c^\ell_t = f^\ell_t \odot c^\ell_{t-1} + i^\ell_t \odot g^\ell_t$, $h^\ell_t = o^\ell_t \odot \tanh(c^\ell_t)$.
- FLUSH ($z^\ell_{t-1} = 1$): $c^\ell_t = i^\ell_t \odot g^\ell_t$, $h^\ell_t = o^\ell_t \odot \tanh(c^\ell_t)$.
- Objective: Standard negative log-likelihood for discrete symbols and mixture density loss for real-valued trajectories (Chung et al., 2016).
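A single cell step under these rules can be sketched in NumPy, with the gate activations assumed to be precomputed from the pre-activation sums above (function and argument names are hypothetical):

```python
import numpy as np

def hm_lstm_step(c_prev, h_prev, f, i, o, g, z_prev_self, z_below):
    """One layer/time-step of the COPY/UPDATE/FLUSH cell logic.

    f, i, o: sigmoid gate activations; g: tanh candidate (all precomputed).
    z_prev_self: this layer's boundary at t-1; z_below: lower layer's at t.
    """
    if z_prev_self == 1:            # FLUSH: restart the segment from scratch
        c = i * g
    elif z_below == 1:              # UPDATE: absorb the finished segment below
        c = f * c_prev + i * g
    else:                           # COPY: state passes through untouched
        return c_prev, h_prev
    h = o * np.tanh(c)
    return c, h
```

Note that COPY returns the previous states directly, so no gate arithmetic is performed at all on copied steps; this is the source of the efficiency gains discussed below.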
3. Training Procedures and Optimization
Training HM-RNNs involves handling non-differentiable operations due to binary boundaries:
- Straight-through estimator: The forward pass uses the hard step function for $z^\ell_t$, while backward gradients are computed as if the (annealed) hard sigmoid had been used.
- Slope annealing: The hard sigmoid slope parameter is gradually increased to approximate a true step function during training to stabilize the learning of boundaries.
- Optimization: Adam optimizer with gradient norm clipping and layer normalization on all gates.
- Backpropagation: Gradients propagate only through active segments—COPY operations do not require updates—making the computation more efficient (Chung et al., 2016).
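The straight-through trick with slope annealing can be illustrated in NumPy: the forward value is a hard 0/1 boundary, while the surrogate gradient is that of the annealed hard sigmoid $\mathrm{hard}\,\sigma(x) = \max(0, \min(1, (ax+1)/2))$ with slope $a$ (helper names are illustrative, not from the papers):

```python
import numpy as np

def hard_sigmoid(x, slope):
    # Annealed hard sigmoid: as `slope` grows, this approaches a step at x = 0.
    return np.clip((slope * x + 1.0) / 2.0, 0.0, 1.0)

def boundary_forward_backward(x, slope):
    """Straight-through boundary: hard step forward, soft gradient backward."""
    soft = hard_sigmoid(x, slope)
    z = (soft > 0.5).astype(float)   # forward: discrete 0/1 boundary
    # backward: derivative of the hard sigmoid (slope/2 in its linear region),
    # used *as if* it had produced z -- the straight-through approximation
    grad = np.where(np.abs(slope * x) < 1.0, slope / 2.0, 0.0)
    return z, grad
```

Increasing `slope` over training shrinks the linear region, so the surrogate gradient converges toward the (zero almost everywhere) gradient of the true step function.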
4. Empirical Performance and Qualitative Behavior
HM-RNNs have been evaluated in both character-level language modeling and handwriting sequence modeling:
- Character-level benchmarks:
- Penn Treebank (test BPC ≈ 1.24),
- Text8 (BPC ≈ 1.29, state of the art at time of publication),
- enwik8 (BPC ≈ 1.32, tying then-best neural models).
- Handwriting (IAM-OnDB): Achieved higher average log-likelihood (1167) compared to standard LSTM (1081).
- Segment boundary interpretation: The lowest layer’s boundaries align closely with word boundaries; higher layers fire at phrase or multi-character n-gram boundaries.
- Efficiency: The approach yields approximately 60% fewer RNN updates than equivalent-depth standard LSTMs (Chung et al., 2016).
- Comparative results: In WSJ language modeling, a mono-clock LSTM (4×512, 7.4M params) yields word-level PPL 93.3, while an HLSTM-B (4×512, 8.5M params) achieves 73.6. In real-time ASR, switching from a mono-clock (WER ≈ 7.85%) to HLSTM-B CLM (WER ≈ 7.79%) reduces parameters by ~70% (Hwang et al., 2016).
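The reported update savings follow from simple arithmetic: if higher layers fire only at segment boundaries, most of their steps are COPY operations. A sketch with hypothetical per-layer boundary rates (assumptions for illustration, not the papers' measurements):

```python
def update_fraction(layer_rates):
    """Fraction of cell updates actually executed across all layers,
    relative to a standard stacked RNN that updates every layer every step."""
    return sum(layer_rates) / len(layer_rates)

# Example: layer 1 updates at every step; layers 2 and 3 fire at roughly
# 20% and 5% of steps (hypothetical word- and phrase-like boundary rates).
frac = update_fraction([1.0, 0.20, 0.05])  # ~0.42, i.e. ~58% fewer updates
```

Under these assumed rates the three-layer model executes roughly 42% of the updates a uniform stack would, in the same ballpark as the ~60% reduction reported above.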
5. The Multiscale Clocks and Modular Hierarchy
A related but distinct hierarchy is implemented via explicit multiscale clocks and resets (as in HLSTM-B):
- Levels and clocks: Level 1 (character-level) runs at every step; level 2 (word-level) updates only on word boundaries (spaces or special tokens).
- Reset mechanism: Char-level is explicitly reset upon every word boundary, focusing short-term dynamics per word and allocating inter-word context to the upper module.
- Update formalism (with a binary word-boundary indicator $b_t$):
- Char-level: $h^{\mathrm{c}}_t = \mathrm{LSTM}^{\mathrm{c}}\!\left(x_t,\ (1 - b_{t-1})\, h^{\mathrm{c}}_{t-1}\right)$, where multiplication by $(1 - b_{t-1})$ encodes the reset after a boundary.
- Word-level: $h^{\mathrm{w}}_t = b_t\, \mathrm{LSTM}^{\mathrm{w}}\!\left(h^{\mathrm{c}}_t,\ h^{\mathrm{w}}_{t-1}\right) + (1 - b_t)\, h^{\mathrm{w}}_{t-1}$.
- Gradient propagation: Differentiable clock/reset mechanism allows end-to-end training with truncated BPTT, maintaining a character-level output distribution (Hwang et al., 2016).
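The clocked two-level scheme reduces to a character loop in which the word module ticks, and the char module resets, only on boundary symbols. A minimal sketch with the state-update functions abstracted away as hypothetical callables:

```python
def run_two_level(text, char_step, word_step, h_char0, h_word0):
    """Drive a char-level module at every step; tick the word-level module and
    reset the char-level module only at word boundaries (spaces here)."""
    h_char, h_word = h_char0, h_word0
    for ch in text:
        h_char = char_step(ch, h_char, h_word)   # runs at every character
        if ch == " ":                            # word boundary detected
            h_word = word_step(h_char, h_word)   # word-level clock ticks
            h_char = h_char0                     # explicit char-level reset
    return h_char, h_word
```

The char module still receives `h_word` at every step, which is how inter-word context reaches the character-level output distribution despite the per-word resets.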
6. Comparisons and Theoretical Significance
Representative RNN architectures with distinct approaches to multi-timescale and boundary modeling include:
| Model | Boundary/Timescale Learning | Segmentation Supervision |
|---|---|---|
| HM-RNN/HLSTM-B | Learned, data-driven discrete signals | None (unsupervised) |
| Hierarchical RNN | Fixed-length, externally segmented | Required |
| Clockwork RNN | Pre-set, exponentially spaced clocks | None |
| Standard LSTM | Soft, per-unit gating | None (no explicit segmentation) |
The significance of HM-RNNs lies in their:
- Hierarchically-adaptive segmentation and update rates along the input sequence
- Computational efficiency due to sparse upper-layer updates
- Interpretability of learned segment boundaries
- Mitigation of long-range credit assignment challenges via timescale abstraction
However, challenges include reliance on biased gradient estimators (such as straight-through) to train the discrete boundaries, the lack of explicit control over how many boundaries are detected, and greater implementation complexity relative to conventional RNNs (Chung et al., 2016).
7. Practical and Methodological Considerations
- Model selection: Proper initialization of the boundary detector's slope and the annealing rate is important for stable segment discovery.
- Generalization: There is no guarantee the discovered boundaries will correspond to semantically meaningful units in all domains; empirical segment-to-unit alignment varies.
- Transfer to other domains: The architecture generalizes to both discrete and continuous sequences; for example, character-level language modeling and online handwriting synthesis (Chung et al., 2016).
- Parameter efficiency: Empirically, hierarchical multiscale architectures match or surpass the performance of significantly larger single-timescale models, particularly notable in low-resource or parameter-constrained settings (Hwang et al., 2016).
The HM-RNN family encapsulates a structural prior for hierarchical, multi-timescale processing in sequence modeling, combining theoretical insights on compositionality with empirical advances on long-range dependency learning and resource efficiency.