Hierarchical Attentive Memory

Updated 6 February 2026
  • HAM is a neural network memory architecture that employs a full binary tree structure for efficient, logarithmic memory access through learned top-down attention.
  • It integrates with sequential models like LSTMs using trainable operators such as JOIN, SEARCH, and WRITE to update hierarchical summaries and perform precise memory operations.
  • Empirical results show HAM's superior performance on algorithmic tasks, generalizing to sequences far longer than those seen in training on tasks such as sorting, and accurately emulating classical data structures like stacks and queues.

Hierarchical Attentive Memory (HAM) constitutes a neural network memory architecture characterized by a hierarchical, tree-structured organization of memory cells, enabling sublinear complexity for memory access via learned top-down attention. Introduced as an augmentation to sequential models such as LSTMs, HAM has demonstrated the capacity to efficiently learn algorithmic tasks, emulate classic data structures, and scale memory usage to large contexts without a commensurate increase in parameter count or inference cost (Andrychowicz et al., 2016).

1. Tree-Based Memory Organization and Structural Encoding

HAM realizes its memory as a full binary tree with $n$ leaves, where $n$ is a power of two no less than the longest pattern required. The tree comprises $2n-1$ nodes: the set of leaves $L$ serves as the addressable memory cells, and each internal node $e$ maintains a $d$-dimensional vector $h_e \in \mathbb{R}^d$ summarizing its descendants. Internal node summaries are produced by a trainable "JOIN" MLP, $h_e = \text{JOIN}(h_{\ell(e)}, h_{r(e)})$, where $\ell(e)$ and $r(e)$ are the left and right children of $e$. The memory tree is initialized by embedding the input sequence into the leaves, followed by recursive bottom-up computation of the internal summaries.
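As an illustrative sketch, the bottom-up initialization can be written in a few lines of Python. Here JOIN is a hand-written elementwise-max stand-in for the trainable JOIN MLP, and leaf vectors are given directly rather than produced by the EMBED network; both are assumptions of this sketch, not the paper's implementation.

```python
def join(left, right):
    # Stand-in for the trainable JOIN MLP: an elementwise-max summary.
    return [max(a, b) for a, b in zip(left, right)]

def build_tree(leaves):
    """Store a power-of-two number of leaf vectors in a full binary tree.

    The 2n-1 nodes are kept in a flat heap layout: node i has children
    2i+1 and 2i+2, and the last n slots hold the leaves.
    """
    n = len(leaves)
    assert n > 0 and n & (n - 1) == 0, "number of leaves must be a power of two"
    tree = [None] * (2 * n - 1)
    tree[n - 1:] = [list(v) for v in leaves]   # leaves occupy the tail
    for i in range(n - 2, -1, -1):             # bottom-up summary pass
        tree[i] = join(tree[2 * i + 1], tree[2 * i + 2])
    return tree

tree = build_tree([[1, 0], [0, 2], [3, 1], [0, 0]])
print(tree[0])   # root summary: [3, 2]
```

The flat heap layout makes the parent/child arithmetic explicit and keeps the initialization an $O(n)$ pass, matching the single bottom-up computation described above.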

2. Hierarchical Attention and Efficient Access Mechanism

Memory access in HAM is conducted via top-down attention. Given a query $q$ (typically the hidden state $h_{\mathrm{LSTM}}$ of a controller LSTM), a path is selected from the root to a leaf through $O(\log n)$ decisions, each parameterized by a SEARCH MLP that returns a probability $p_e = \text{SEARCH}(h_e, q) \in [0, 1]$ describing whether to traverse left or right. Each attention decision is sampled from a Bernoulli distribution during training and made deterministically according to whether $p_e > 0.5$ at inference. This process yields $O(\log n)$ memory access cost per read or write, a marked gain over the $O(n)$ complexity of flat attention.
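The descent itself is a short loop. In this sketch, SEARCH is a hand-written dot-product-plus-sigmoid score standing in for the learned SEARCH MLP, and the tree is the flat heap-layout array (both are assumptions of the illustration, not the paper's networks):

```python
import math

def search(summary, query):
    # Stand-in for the SEARCH MLP: probability of taking the RIGHT child,
    # here a logistic squashing of a dot product.
    dot = sum(s * q for s, q in zip(summary, query))
    return 1.0 / (1.0 + math.exp(-dot))

def attend(tree, query):
    """Descend from the root to a leaf with the deterministic
    inference-time rule (go right iff p > 0.5): O(log n) decisions."""
    i = 0
    while 2 * i + 1 < len(tree):           # stop once i is a leaf
        p_right = search(tree[i], query)
        i = 2 * i + 2 if p_right > 0.5 else 2 * i + 1
    return i                               # index of the selected leaf

# A toy tree in heap layout: root, two internal summaries, four leaves.
tree = [[3, 2], [1, 2], [3, 1], [1, 0], [0, 2], [3, 1], [0, 0]]
print(attend(tree, [1, 0]))    # → 6
print(attend(tree, [-1, 0]))   # → 3
```

During training the line choosing the child would instead sample from a Bernoulli distribution with parameter `p_right`, which is what makes the addressing stochastic and REINFORCE-trainable.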

3. Read–Write Phases, Update Dynamics, and Differentiability

After the attention traversal, the selected leaf's hidden state $h_a$ is read out and made available to the controller as input. The controller then produces an updated hidden state, and the leaf is written via a highway-style WRITE MLP, $h_a \gets \text{WRITE}(h_a, h_{\mathrm{LSTM}})$, in which a learned gate combines the prior value with the update. Subsequently, the ancestor summaries on the path from the leaf to the root are recomputed with the JOIN operator to maintain correct memory aggregation. A fully differentiable relaxation, DHAM, spreads attention weight across all leaves by computing path probabilities and linearly combining cell states, but this relaxation incurs $O(n)$ cost per access and exhibits weaker out-of-distribution generalization.
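The write phase can be sketched under the same flat heap layout: overwrite the attended leaf, then walk back up to the root refreshing each ancestor. The WRITE gate is replaced by a plain overwrite and JOIN by an elementwise max, both hand-written stand-ins for the learned networks.

```python
def join(left, right):
    # Elementwise-max stand-in for the trainable JOIN MLP.
    return [max(a, b) for a, b in zip(left, right)]

def write_and_refresh(tree, leaf, value):
    """Overwrite the attended leaf, then recompute JOIN summaries along
    the leaf-to-root path: O(log n) updates in heap layout."""
    tree[leaf] = list(value)           # WRITE stand-in: plain overwrite
    i = leaf
    while i > 0:
        i = (i - 1) // 2               # move to the parent
        tree[i] = join(tree[2 * i + 1], tree[2 * i + 2])

tree = [[3, 2], [1, 2], [3, 1], [1, 0], [0, 2], [3, 1], [0, 0]]
write_and_refresh(tree, 6, [5, 5])
print(tree[0])   # root summary now reflects the new leaf: [5, 5]
```

Only the $O(\log n)$ ancestors of the written leaf are touched, which is what keeps a full read-write cycle logarithmic in the memory size.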

4. Integration with Sequential Controllers and Training Procedure

HAM is operated in conjunction with an LSTM, which acts as a controller to issue query vectors and receive memory contents. All constituent operators—EMBED, JOIN, SEARCH, WRITE, and the LSTM itself—are trained end-to-end. The discrete memory addressing (hard attention) is optimized via the REINFORCE gradient estimator, applying entropy regularization and variance reduction. At test time, all stochastic choices in attention are replaced with deterministic argmax decisions.
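For a single left/right choice, the score-function (REINFORCE) gradient with a baseline can be written in closed form. The derivative below is the standard result for a Bernoulli policy parameterized by a sigmoid logit; it illustrates the estimator rather than reproducing code from the paper.

```python
def reinforce_grad_logit(p, went_right, reward, baseline):
    """Gradient estimate of the expected reward w.r.t. the pre-sigmoid
    logit z of one Bernoulli attention decision:

        (R - b) * d/dz log pi(a | z),  with  d/dz log pi = a - sigmoid(z).

    The baseline b is the variance reducer; entropy regularization would
    contribute a separate bonus term in the full objective."""
    action = 1.0 if went_right else 0.0
    return (reward - baseline) * (action - p)

# Decisions on trajectories that beat the baseline get their taken branch
# reinforced; decisions on worse-than-baseline trajectories are suppressed.
print(reinforce_grad_logit(p=0.7, went_right=True, reward=1.0, baseline=0.5))
```

Summing this quantity over the $O(\log n)$ decisions of each traversal gives the gradient signal for the SEARCH parameters, while all other operators (EMBED, JOIN, WRITE, the LSTM) receive ordinary backpropagated gradients.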

5. Empirical Results in Learning Algorithms

HAM-augmented LSTMs have been evaluated on tasks requiring algorithmic reasoning, including sequence reversal, binary search, merging, sorting, and addition. HAM achieves zero or near-zero test error at the training length $n = 32$ on all tasks, and generalizes to inputs up to four times longer (e.g., $n = 128$) with negligible error on all tasks except addition. Notably, for sequence sorting, HAM is the first neural model shown to learn a $\Theta(n \log n)$-time algorithm that generalizes well to sequences much longer than those used during training (Andrychowicz et al., 2016).

Error rates (lower is better):

Task      LSTM    LSTM + Flat Attn    LSTM + HAM (train)    LSTM + HAM (test, longer)
Reverse   73%     0%                  0%                    0%
Search    62%     0.04%               0.12%                 1.68%
Merge     88%     16%                 0%                    2.48%
Sort      99%     25%                 0.04%                 0.24%
Add       39%     0%                  0%                    100%

The table illustrates superior performance and generalization for all algorithmic tasks except addition; addition remains challenging due to long-range dependencies.

6. Emulation of Classic Data Structures

HAM can be decoupled from the LSTM controller and driven directly by external commands to function as a stack, a FIFO queue, or a priority queue: PUSH and POP operations map directly to memory access and manipulation, with the raw command supplied in place of the controller's query. Trained on sequences of up to $32$ operations, HAM achieves near-zero test error and generalizes to sequences of $128$ operations. The stack and the queue are simulated without error, while the priority queue achieves test and generalization errors below $0.2\%$ (Andrychowicz et al., 2016).

Structure        Test Error    Generalization Error
Stack            0%            0%
Queue            0%            0%
Priority queue   0.08%         0.2%
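The stack behavior that HAM converges to can be spelled out as a few lines of reference code. This hand-coded version bypasses the learned SEARCH/WRITE networks entirely and only describes the target input-output behavior: hard attention effectively maintains a leaf index that acts as a stack pointer.

```python
def run_stack(commands, n=8):
    """Reference behavior for the stack task: the 'attended' leaf index
    moves right on PUSH and left on POP."""
    leaves = [None] * n       # the addressable leaf cells
    top = 0                   # the attended leaf index
    out = []
    for op, arg in commands:
        if op == "PUSH":
            leaves[top] = arg          # WRITE to the attended leaf
            top += 1
        elif op == "POP":
            top -= 1
            out.append(leaves[top])    # READ the attended leaf
    return out

print(run_stack([("PUSH", 1), ("PUSH", 2), ("POP", None), ("POP", None)]))
# → [2, 1]
```

HAM is not given this program; it must discover the equivalent addressing discipline from input-output examples alone, which is what makes the near-zero errors in the table above notable.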

7. Significance and Context within Hierarchical Memory Models

HAM exemplifies a class of neural architectures exploiting hierarchical organization to address limitations of flat attention in terms of complexity and scaling. Unlike alternative hierarchical memory modules such as Hierarchical Memory Networks (HMN) for answer selection—where coarse-to-fine attention is performed over sentence and word-level memories to improve the selection of rare or unknown words (Xu et al., 2016)—HAM leverages tree-structured memory to achieve logarithmic access efficiency and enable the learning of algorithmic behavior from weak supervision. This places HAM as a foundational mechanism for neural models that require both sample and computational efficiency in tasks demanding complex, structured memory access. A plausible implication is the broader applicability of tree-based memory architectures in learned algorithmic reasoning and neural data structure emulation.

References (2)
