Grid LSTM Neural Networks

Updated 22 June 2026
  • Grid LSTM is a neural architecture that arranges LSTM cells along multidimensional grids to enable recurrent computation across time, depth, and spatial dimensions.
  • It improves gradient propagation by integrating gates along multiple axes, addressing vanishing gradient issues seen in traditional stacked LSTMs.
  • Empirical studies show Grid LSTM outperforms conventional models in tasks like integer addition, sequence memorization, and translation with fewer parameters.

Grid LSTM refers to a family of neural architectures that arrange Long Short-Term Memory (LSTM) cells along multidimensional grids, enabling learnable recurrent computation along multiple axes including time, depth, and spatial dimensions. This structure allows Grid LSTM networks to process vectors, sequences, and higher-dimensional arrays such as images, achieving unified deep and sequential computation. By placing LSTM cells along each axis at every grid location, Grid LSTM addresses limitations of traditional deep (stacked) LSTMs, notably the vanishing gradient problem through depth and the inability to natively model multidimensional dependencies.

1. Mathematical Formulation

A standard LSTM cell at time $t$ takes input $x_t$, previous hidden state $h_{t-1} \in \mathbb{R}^d$, and previous cell state $m_{t-1} \in \mathbb{R}^d$. With $I$ a projection matrix that maps the input to $\mathbb{R}^d$, the update equations are:

$$H = \begin{bmatrix} I x_t \\ h_{t-1} \end{bmatrix} \in \mathbb{R}^{2d}$$

$$g^u = \sigma(W^u H),\quad g^f = \sigma(W^f H),\quad g^o = \sigma(W^o H),\quad g^c = \tanh(W^c H)$$

$$m_t = g^f \odot m_{t-1} + g^u \odot g^c,\quad h_t = \tanh(g^o \odot m_t)$$
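
The single-cell update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `lstm_step` and `init_params`, the dictionary layout of the parameters, and the uniform initialization range are all assumptions made for the example. Like the equations, it has no bias terms.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_params(d, input_dim, seed=0):
    """Hypothetical initializer: small uniform weights, no biases."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {"I": mat(d, input_dim)}          # projects x_t to R^d
    for name in ("Wu", "Wf", "Wo", "Wc"):
        params[name] = mat(d, 2 * d)           # each gate reads H in R^{2d}
    return params

def lstm_step(params, x_t, h_prev, m_prev):
    """One LSTM update: returns (h_t, m_t)."""
    H = matvec(params["I"], x_t) + h_prev      # stack I x_t on top of h_{t-1}
    g_u = [sigmoid(z) for z in matvec(params["Wu"], H)]
    g_f = [sigmoid(z) for z in matvec(params["Wf"], H)]
    g_o = [sigmoid(z) for z in matvec(params["Wo"], H)]
    g_c = [math.tanh(z) for z in matvec(params["Wc"], H)]
    m_t = [f * m + u * c for f, m, u, c in zip(g_f, m_prev, g_u, g_c)]
    h_t = [math.tanh(o * m) for o, m in zip(g_o, m_t)]
    return h_t, m_t
```

Calling `lstm_step(init_params(4, 3), [1.0, 0.5, -0.5], [0.0] * 4, [0.0] * 4)` returns a hidden state and a cell state, each of length 4.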

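Grid LSTM generalizes this to $N$ dimensions. At each grid point, the block receives $N$ pairs of hidden and cell states $\{h_i, m_i\}_{i=1}^N$, and the hidden states are concatenated:

$$H = \begin{bmatrix} h_1 \\ \vdots \\ h_N \end{bmatrix} \in \mathbb{R}^{Nd}$$

For each dimension $i \in \{1, \dots, N\}$, it computes:

$$g_i^u = \sigma(W_i^u H),\quad g_i^f = \sigma(W_i^f H),\quad g_i^o = \sigma(W_i^o H),\quad g_i^c = \tanh(W_i^c H)$$

$$m_i' = g_i^f \odot m_i + g_i^u \odot g_i^c,\quad h_i' = \tanh(g_i^o \odot m_i')$$

The $N$ output pairs $\{h_i', m_i'\}_{i=1}^N$ are then routed along the $N$ grid axes (Kalchbrenner et al., 2015).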
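
As a rough sketch of the block described in this section, the following plain-Python code concatenates the incoming hidden states and applies one gated update per axis, each with its own weights. The function names (`grid_lstm_block`, `lstm_transform`, `init_weights`), the weight layout, and the initialization are assumptions made for illustration and are not taken from the paper's code.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_weights(n_dims, d, seed=0):
    """Hypothetical initializer: one set of four gate matrices per axis."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_dims * d)] for _ in range(d)]
    return [{gate: mat() for gate in "ufoc"} for _ in range(n_dims)]

def lstm_transform(W, H, m):
    """Gated update of one axis: memory m, shared input H, axis weights W."""
    g_u = [sigmoid(z) for z in matvec(W["u"], H)]
    g_f = [sigmoid(z) for z in matvec(W["f"], H)]
    g_o = [sigmoid(z) for z in matvec(W["o"], H)]
    g_c = [math.tanh(z) for z in matvec(W["c"], H)]
    m_new = [f * mi + u * c for f, mi, u, c in zip(g_f, m, g_u, g_c)]
    h_new = [math.tanh(o * mi) for o, mi in zip(g_o, m_new)]
    return h_new, m_new

def grid_lstm_block(weights, hs, ms):
    """N-dimensional block: N incoming (h_i, m_i) pairs in, N pairs out."""
    H = [v for h in hs for v in h]             # concatenate h_1..h_N, length N*d
    outputs = [lstm_transform(weights[i], H, ms[i]) for i in range(len(hs))]
    new_hs = [h for h, _ in outputs]
    new_ms = [m for _, m in outputs]
    return new_hs, new_ms
```

Every axis reads the same concatenated vector `H` but keeps its own memory vector and its own gate weights, which is what lets each output pair be sent along a different grid axis.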

2. Multidimensional Network Structure

In an $N$-dimensional Grid LSTM, the network forms a grid indexed by coordinates $(j_1, \dots, j_N)$, where each location contains a Grid LSTM block maintaining $N$ hidden and cell states, one per axis. The update at each location applies the cell update equations above, once per axis, using axis-specific weights. This lets information propagate temporally, across layers ("depth"), and spatially.
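
One way to picture the routing is as an evaluation schedule over grid coordinates. The sketch below is an illustration under stated assumptions: the helper name `grid_schedule` is hypothetical, and a simple lexicographic visiting order stands in for whatever ordering a real implementation would use. It lists, for each grid location, the neighbouring location that supplies its incoming state along each axis.

```python
from itertools import product

def grid_schedule(shape):
    """Yield (coord, predecessors) so that every predecessor is visited first.

    predecessors[axis] is the coordinate one step back along that axis, or
    None at the grid boundary, where an initial state must be supplied.
    """
    for coord in product(*(range(size) for size in shape)):
        preds = []
        for axis, idx in enumerate(coord):
            if idx == 0:
                preds.append(None)
            else:
                preds.append(coord[:axis] + (idx - 1,) + coord[axis + 1:])
        yield coord, preds
```

For a 2D grid of shape `(2, 3)`, the block at `(1, 2)` receives its axis-0 state from `(0, 2)` and its axis-1 state from `(1, 1)`. Lexicographic order is valid because each predecessor precedes its successor in that order.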

Architecturally, axes can have semantic interpretations: time (sequence), depth (stacked layers), or spatial dimensions (e.g., $x$ and $y$ for images). Weight sharing along an axis imposes invariance (such as convolutional or translational symmetry), while dimensions where gating is not desired can employ simple affine transformations instead of full LSTM gating (Kalchbrenner et al., 2015).
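
The effect of weight sharing on model size can be made concrete with a small count. This sketch assumes the bias-free formulation of Section 1, ignores any input projection, and assumes that an untied network uses a separate set of block weights at each depth level. The function names are hypothetical.

```python
def block_weight_count(n_dims, d):
    """Gate weights in one block: N axes x 4 gate matrices, each d x (N*d)."""
    return n_dims * 4 * d * (n_dims * d)

def stack_weight_count(n_dims, d, depth, tied):
    """Total gate weights for `depth` blocks along the depth axis."""
    per_block = block_weight_count(n_dims, d)
    return per_block if tied else per_block * depth
```

Under these assumptions, an 18-layer 2D network with `d = 100` has 160,000 gate weights when tied and 2,880,000 when untied, since tying reuses one block's weights at every layer.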

3. Training Methodology

Grid LSTM networks are trained by back-propagation through time and space, minimizing a task-dependent cross-entropy (negative log-likelihood) objective. The reported experiments use Adam with a learning rate of 0.001 and batch sizes tailored to the problem (e.g., 15 for algorithmic tasks, 100 for sequence/language modeling, 128 for images, 1 for translation). For character models, back-propagation is truncated every 50 steps over sequences of length up to 10,000. Dropout is optional (e.g., 0.5 in translation tasks) (Kalchbrenner et al., 2015).
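
The truncation scheme for character models can be illustrated by splitting a long sequence into fixed-length windows, with gradients flowing only inside each window. The helper below is a hypothetical sketch of that bookkeeping and does not perform any training itself.

```python
def tbptt_windows(seq_len, window=50):
    """Return (start, end) index pairs covering the sequence in order.

    Gradients would be back-propagated within each window only, while the
    recurrent state is carried forward from one window to the next.
    """
    return [(start, min(start + window, seq_len))
            for start in range(0, seq_len, window)]
```

A sequence of 10,000 characters with a 50-step window yields 200 windows, the last one spanning indices 9,950 to 10,000.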

4. Empirical Performance and Applications

Grid LSTM has been empirically evaluated on algorithmic, language modeling, and sequence-to-sequence translation tasks:

  • 15-digit integer addition: A tied-weight 2-LSTM (2D Grid LSTM) with 18 layers reaches >99% accuracy in fewer than 0.55M samples, whereas a 1-layer Stacked LSTM reaches 51% accuracy after 5M samples.
  • Sequence memorization: A tied 2-LSTM (43 layers) achieves 100% accuracy in fewer than 0.15M examples, whereas Stacked LSTM (at any depth above 16) fails to exceed 50%.
  • Wikipedia character prediction: On the 100M-character Wikipedia benchmark, a tied 2-LSTM attains 1.47 bits-per-character (bpc), improving over Stacked LSTM (1.67 bpc), MRNN (1.60 bpc), and GFRNN (1.58 bpc), with fewer parameters (8.8M) (Kalchbrenner et al., 2015).
  • Translation ("Reencoder" model): A 3D Grid LSTM can scan the source sequence, generate the target, and reencode in a deep fashion. On Chinese–English IWSLT BTEC, a 3-LSTM ensemble achieves a BLEU-4 of 42.4/60.2, outperforming the CDEC phrase-based baseline (41.0/58.9) (Kalchbrenner et al., 2015). Implicit attention arises as a byproduct of the 2D/3D modeling.
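
For readers comparing the character-prediction numbers with losses reported in other units: bits-per-character is the average per-character negative log-likelihood expressed in base 2. The small helpers below (hypothetical names) convert a loss measured in nats to bpc, and bpc to per-character perplexity.

```python
import math

def nats_to_bpc(nll_nats):
    """Convert an average per-character loss in nats to bits per character."""
    return nll_nats / math.log(2)

def bpc_to_perplexity(bpc):
    """Per-character perplexity corresponding to a bits-per-character score."""
    return 2.0 ** bpc
```

For instance, 1.47 bpc corresponds to a per-character perplexity of about 2.77, while 1.67 bpc corresponds to about 3.18.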

5. Architectural Advantages and Limitations

Advantages:

  • Enhanced gradient flow along both depth and temporal/spatial axes, enabling robust very deep or long-horizon models.
  • Weight tying yields strong invariance properties.
  • $N$-way cell update circumvents the memory overhead of standard multidimensional LSTMs.
  • Empirically outperforms conventional stacked LSTM architectures on sequence tasks, language modeling, and translation (Kalchbrenner et al., 2015).

Limitations:

  • Computational cost scales with both the number of dimensions and the size of the grid (e.g., $O(nm)$ in 2D versus $O(n+m)$ for a sequence-to-sequence encoder–decoder, for source length $n$ and target length $m$).
  • Increased memory requirements, as each grid cell maintains multiple memory states.
  • Selection of dimensions to gate versus use simple transformations must be carefully tuned to prevent over-parameterization or instability.

A plausible implication is that while Grid LSTM is broadly useful for multidimensional data, its scalability is constrained by these resource demands (Kalchbrenner et al., 2015).
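
A back-of-the-envelope comparison makes these resource demands concrete. The sketch assumes that a 2D translation grid evaluates one block per (source position, target position) pair, that an encoder–decoder takes one recurrent step per source token and one per target token, and that each block stores one hidden and one cell vector of size `d` per axis. All function names are hypothetical.

```python
def grid_block_evals(source_len, target_len):
    """Blocks evaluated by a 2D grid spanning source x target positions."""
    return source_len * target_len

def encoder_decoder_steps(source_len, target_len):
    """Recurrent steps taken by a sequential encoder followed by a decoder."""
    return source_len + target_len

def state_floats_per_block(n_dims, d):
    """Stored values per block: one hidden and one cell vector per axis."""
    return 2 * n_dims * d
```

With a 20-token source and a 25-token target, the grid evaluates 500 blocks against 45 encoder–decoder steps, and a 3D block with `d = 100` holds 600 state values.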

6. Recent Developments and Extensions

The Recurrent Independent Grid LSTM (RigLSTM) extends Grid LSTM by making several modifications:

  • RigLSTM uses a grid of independent LSTM cells with dynamic activation: only a subset of cells update per step, chosen by relevance scores. Input features and hidden-state selection are also sparsified dynamically.
  • At each time step, the raw input is mapped to several “views,” and each cell attends to the most relevant views and previous cell states. Updates proceed via standard LSTM gates, followed by a context-aware soft-attention blend of the previous and new hidden states, weighted by softmax scores parameterized by the input context.
  • This design leads to sparser computation, promoting modular specialization and improved generalization to changes in test-time conditions.
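
The two mechanisms described in the list above, sparse activation and soft blending, can be sketched generically as follows. This is not the RigLSTM formulation itself: the top-k selection rule, the two-way softmax blend, and every name in the code are assumptions chosen to illustrate the idea, and the relevance scores and blend logits are taken as given inputs rather than computed from attention.

```python
import math

def softmax(scores):
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_active(relevance, k):
    """Indices of the k cells with the highest relevance scores."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return set(ranked[:k])

def blend(h_prev, h_new, logits):
    """Convex combination of old and new hidden states from two logits."""
    w_prev, w_new = softmax(logits)
    return [w_prev * p + w_new * n for p, n in zip(h_prev, h_new)]

def sparse_step(h_prev_cells, h_new_cells, relevance, k, blend_logits):
    """Blend the new state into the k most relevant cells; keep the rest unchanged."""
    active = top_k_active(relevance, k)
    out = []
    for i, (h_prev, h_new) in enumerate(zip(h_prev_cells, h_new_cells)):
        if i in active:
            out.append(blend(h_prev, h_new, blend_logits[i]))
        else:
            out.append(list(h_prev))
    return out
```

Cells outside the active set keep their previous hidden state untouched, which is where the sparser computation in this sketch comes from.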

Empirical results show RigLSTM improving on both Grid LSTM and other dynamic recurrent architectures (e.g., RIM): higher MNIST and CIFAR accuracy at larger input resolutions (RigLSTM ≈59.6% on MNIST at 24×24 vs. Grid LSTM ≈20.8%), lower error on bounce video prediction (BCE loss ≈30.5 for RigLSTM vs. ≈38.4 for Grid LSTM), and stronger RL and sequence-copying performance under domain shift (Wang et al., 2023).

7. Significance and Context Within Sequential Modeling

Grid LSTM, by merging deep and sequential recurrence in a single architecture, addresses core limitations of both shallow and deep RNN designs. The introduction of multidimensional gating yields improved gradient propagation and enables joint modeling of spatial and sequential dependencies. The architecture is distinguished from traditional multidimensional LSTM by its avoidance of exponential memory scaling, and from vanilla stacked LSTM by its robust performance in long or deep regimes.

Subsequent extensions such as RigLSTM show the grid concept continuing to evolve toward greater modularity, interpretability, and transfer robustness, all key concerns in sequential learning. Within the broader RNN literature, Grid LSTM and its descendants have provided a template for unifying deep, modular, and multiaxial computation across a variety of modalities (Kalchbrenner et al., 2015; Wang et al., 2023).
