
Convolutional LSTM Networks

Updated 16 November 2025
  • Convolutional LSTM is a neural network architecture that replaces standard affine gate operations with convolutional ones to effectively capture spatial and temporal dependencies.
  • It has been successfully applied in precipitation nowcasting, video analysis, and biomedical imaging, yielding state-of-the-art performance across these domains.
  • Advanced variants like dual-stream, multi-kernel, and reduced-gate models enhance efficiency and adaptability for diverse spatiotemporal modeling tasks.

A Convolutional Long Short-Term Memory (ConvLSTM) network is a recurrent neural architecture that generalizes the fully connected LSTM to spatiotemporal data, replacing the dense affine transformations in gate computations with local convolutional operations. This modification enables the network to jointly exploit spatial and temporal dependencies—critical for structured inputs such as image sequences, video, gridded sensor streams, or spatially indexed multivariate time series. ConvLSTM was introduced primarily for precipitation nowcasting (Shi et al., 2015), but subsequent research has applied and extended the core concept across computational physics, medical imaging, video understanding, speech, sequence prediction, and scientific data assimilation.

1. Core Mathematical Formalism and Gate Structure

In a standard LSTM, gate updates apply weight matrices to (possibly flattened) vectors, modeling global correlations. ConvLSTM reformulates these computations with convolutional kernels, preserving local spatial layout and yielding spatial memory units at each “pixel” or “node” location. In the canonical formulation (Shi et al., 2015), the input $X_t \in \mathbb{R}^{C \times H \times W}$, hidden state $H_{t-1} \in \mathbb{R}^{K \times H \times W}$, and cell state $C_{t-1} \in \mathbb{R}^{K \times H \times W}$ are all three-dimensional tensors over the spatial grid. The update equations are:

$$
\begin{aligned}
i_t &= \sigma\bigl(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\bigr) \\
f_t &= \sigma\bigl(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\bigr) \\
\tilde{C}_t &= \tanh\bigl(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\bigr) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \\
o_t &= \sigma\bigl(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\bigr) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$

  • $*$: convolution over the spatial grid
  • $\circ$: Hadamard (element-wise) product
  • $\sigma$: sigmoid activation
  • $W_{\cdot}$: learnable kernel weights (input-to-gate, hidden-to-gate, and optionally peephole), typically $k \times k$ spatial windows
  • $b_{\cdot}$: bias terms (broadcast over space)
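The update above can be sketched in a minimal NumPy implementation. This is an illustrative single-step cell, not a reference implementation: peephole terms ($W_{ci}$, $W_{cf}$, $W_{co}$) are omitted for brevity, the convolution is the unflipped cross-correlation conventional in deep learning, and all names and shapes are this sketch's own choices.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same'-padded cross-correlation: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1], x.shape[2]
    out = np.zeros((c_out, H, W))
    for i in range(k):
        for j in range(k):
            # accumulate each kernel tap as a weighted, shifted copy of the input
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, Wx, Wh, b):
    """One ConvLSTM update without peephole terms.
    Wx: (4K, C, k, k), Wh: (4K, K, k, k), b: (4K, 1, 1); gate order i, f, candidate, o."""
    K = h.shape[0]
    z = conv2d_same(x, Wx) + conv2d_same(h, Wh) + b  # all four gates in one pass
    i = sigmoid(z[0 * K:1 * K])
    f = sigmoid(z[1 * K:2 * K])
    g = np.tanh(z[2 * K:3 * K])                       # candidate C~_t
    o = sigmoid(z[3 * K:4 * K])
    c_new = f * c + i * g                             # C_t = f_t o C_{t-1} + i_t o C~_t
    h_new = o * np.tanh(c_new)                        # H_t = o_t o tanh(C_t)
    return h_new, c_new
```

Because every gate is a same-padded convolution, the hidden and cell states keep the input's $H \times W$ layout, which is exactly the property that distinguishes ConvLSTM from a flattened FC-LSTM.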

Multi-kernel, higher-order, reduced-gate, and graph-structured generalizations replace the * operators or the gate connectivity patterns but preserve this underlying framework (Agethen et al., 2019, Su et al., 2020, Elsayed et al., 2018, Kim et al., 20 Nov 2024).

2. Model Architectures and Design Variations

2.1 Stacking and Network Topologies

ConvLSTM units are used as building blocks in both encoder–decoder and deep hierarchical stacks:

  • Encoder–Forecasting: Used for video prediction or radar nowcasting; observed frames are encoded via ConvLSTM layers, generating hidden states that serve as initial conditions for a forecasting stack that outputs future frames (Shi et al., 2015).
  • Multi-Scale Integration: Placing ConvLSTM blocks at multiple spatial resolutions (e.g., in U-Nets, ResNets) allows the model to capture fine and coarse spatiotemporal dynamics (Arbelle et al., 2018, Nabavi et al., 2018, Courtney et al., 2019).
  • Dual-Stream and Bidirectional: Dual ConvLSTM networks process different feature resolutions or directions in parallel before fusing their outputs (Ye et al., 2020, Nabavi et al., 2018).
  • Graph-Structured Convolution: Generalizations embed graph convolution in place of spatial convolutions when the underlying topology is not a regular grid (Kim et al., 20 Nov 2024).

2.2 Parameter Reduction and Efficiency

  • Reduced-Gate ConvLSTM (rgcLSTM): Collapses multiple gates to a single “module gate,” reducing parameter count by ≈40% and training time by 25–39% on spatiotemporal prediction tasks. Empirical performance is preserved or improved versus standard ConvLSTM, making rgcLSTM favorable for hardware-constrained scenarios (Elsayed et al., 2018).
  • Tensor-Train Decomposition: Exploits low-rank temporal correlations in higher-order ConvLSTM transition kernels, reducing cubic to linear scaling with respect to the transition order and the number of parameters (Su et al., 2020).
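A back-of-the-envelope parameter count makes the gate-reduction trade-off concrete. The helper below is a simplified count of kernel and bias weights only (peephole weights and any implementation-specific terms are ignored, so it yields a 50% saving when four gate blocks collapse to two; the ≈40% figure reported for rgcLSTM reflects the full architecture, which this sketch does not model):

```python
def convlstm_params(c_in, n_hidden, k, n_convs=4, bias=True):
    """Kernel (+ bias) parameter count for one ConvLSTM cell with n_convs
    convolutional gate blocks (input, forget, candidate, output in the
    standard cell). Peephole weights are ignored in this toy count."""
    per_block = k * k * (c_in + n_hidden) * n_hidden  # input-to-gate + hidden-to-gate kernels
    if bias:
        per_block += n_hidden
    return n_convs * per_block

standard = convlstm_params(64, 64, 3)             # four gate blocks: 295,168 params
reduced = convlstm_params(64, 64, 3, n_convs=2)   # one shared module gate + candidate
savings = 1 - reduced / standard                  # 0.5 under this simplified count
```

The count scales linearly in the number of gate blocks and quadratically in channel width, which is why both gate reduction and low-rank kernel compression attack the dominant terms.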

3. Empirical Performance and Applications

3.1 Benchmark Results

ConvLSTM and its variants have established state-of-the-art or highly competitive results across a variety of domains:

  • Precipitation Nowcasting: Two ConvLSTM layers (each 64 channels, 3×3 kernels) achieved rainfall-MSE = 1.420, CSI = 0.577, FAR = 0.195, POD = 0.660, Correlation = 0.908—outperforming FC-LSTM and operational ROVER (Shi et al., 2015).
  • Network Traffic Estimation: 2D Conv-LSTM layer (f=8, 3×3 kernels) achieved the lowest test MSE ≈ 0.141 (Γ=15) among CNN, LSTM, and hybrid baselines, excelling at joint cross-feature and temporal correlations, particularly during saturation (Waczynska et al., 2021).
  • Video Analysis: Multi-kernel ConvLSTM (3×3 and 5×5, with 1×1 fusion) delivered 74.09% top-1 accuracy on UCF-101—+2.8% over the best single kernel; 97.46% on a 17-class I3D-based UCF subset (Agethen et al., 2019). Deep ConvLSTM models reached 85.2% (state-of-the-art) on LRW lipreading with LRS2 pretraining (Courtney et al., 2019).
  • Wildland Fire Modeling: ConvLSTM (10 layers, 20 filters each) reached JSC ≈ 0.85 on the hardest “wind-slope” dataset (step 50), with 20–30% lower error than a CNN baseline (Burge et al., 2020).
  • Biomedical Sequence Tasks: Encoder-side ConvLSTM integration in U-Net yields SEG=0.874 on challenging cell segmentation tasks; convolutional-LSTM outperformed all prior sequence-only methods for protein subcellular localization, achieving accuracy 0.902 on MultiLoc (Arbelle et al., 2018, Sønderby et al., 2015).

3.2 Ensemble Methods

In time series forecasting (e.g., power grid frequency), independent Conv1D+LSTM (“ConvLSTM”) models per unit (building) can be ensembled for global predictions. A weighted sum of building-level predictions achieved MSE=0.001400, outperforming all single-building models (Sathe et al., 2023).
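One generic way to obtain such ensemble weights is a least-squares fit of per-model weights on held-out predictions; this sketch illustrates that idea with synthetic data and makes no claim to reproduce the cited paper's exact weighting scheme:

```python
import numpy as np

def fit_ensemble_weights(preds, target):
    """preds: (n_models, n_samples) validation predictions; target: (n_samples,).
    Returns the weights minimizing MSE of the weighted sum of predictions."""
    w, *_ = np.linalg.lstsq(preds.T, target, rcond=None)
    return w

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
target = np.sin(2 * np.pi * t)
# three noisy stand-ins for per-building model predictions
preds = np.stack([target + 0.1 * rng.standard_normal(200) for _ in range(3)])
w = fit_ensemble_weights(preds, target)
mse_ensemble = np.mean((w @ preds - target) ** 2)
mse_best_single = min(np.mean((p - target) ** 2) for p in preds)
```

Since each single model corresponds to a one-hot weight vector inside the least-squares search space, the fitted ensemble can never do worse than the best individual model on the fitting data.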

3.3 Comparative Insights

  • ConvLSTM generally surpasses LSTM, CNN, and CNN–LSTM hybrids for spatiotemporal sequence modeling, particularly for long-range prediction, structured sequence inputs, and heterogeneous spatial layouts (Shi et al., 2015, Waczynska et al., 2021).
  • Simple 1D-convolution + LSTM “ConvLSTM” hybrids may sometimes be used, though these lack the spatial structure of full ConvLSTM architectures (Sathe et al., 2023).
  • 3D convolutional alternatives grow the temporal receptive field only in proportion to spatial downsampling, limiting adaptivity; ConvLSTM, by contrast, maintains full sequence context at each scale (Courtney et al., 2019).

4. Implementation Practices and Limitations

4.1 Training Protocols

  • Loss functions are task-dependent: per-pixel (cross-entropy for segmentation/nowcasting), per-frame MSE or MAE for regression/forecasting, or cross-entropy for classification.
  • Common optimizers: Adam (often LR ≈ 1e-4...1e-3), RMSProp; typical use of gradient clipping and, where needed, early stopping on validation loss (Shi et al., 2015, Burge et al., 2020, Waczynska et al., 2021).
  • Regularization is often implicit via parameter sharing in convolutions; explicit dropout is less common and, in some cases, degrades performance (Burge et al., 2020).
  • Data augmentation (spatial and temporal) is widely used in vision and biomedical pipelines (Arbelle et al., 2018, Nabavi et al., 2018, Courtney et al., 2019).
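The task-dependent losses above reduce to a few lines of NumPy; the two sketches below (per-pixel binary cross-entropy for segmentation/nowcasting masks and per-frame MSE for regression targets) are generic formulations, with function names and the clipping constant chosen here for illustration:

```python
import numpy as np

def pixelwise_bce(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over all frames and pixels.
    pred, target: arrays of shape (T, H, W), pred in (0, 1)."""
    p = np.clip(pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def frame_mse(pred, target):
    """Per-frame mean squared error for regression/forecasting targets."""
    return np.mean((pred - target) ** 2)
```

For a uniformly uninformative prediction of 0.5 against an all-ones mask, `pixelwise_bce` evaluates to ln 2 ≈ 0.693, the usual chance-level BCE baseline.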

4.2 Complexity Analysis

  • ConvLSTM parameter count scales with channel count, kernel size, and number of layers. Stacking ConvLSTM layers increases spatial and temporal receptive field but raises computational and memory requirements.
  • Efficiency can be improved by reducing gate count (Elsayed et al., 2018) or via tensor-train compression (Su et al., 2020).
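The receptive-field growth from stacking can be estimated with a standard back-of-the-envelope formula: each $k \times k$ convolution widens the field by $k-1$, and the recurrence applies one convolution per layer per time step. This is a rough upper bound of this author's choosing, not a figure from the cited papers; exact growth depends on the unrolled computation graph.

```python
def convlstm_receptive_field(k, n_layers, n_steps):
    """Approximate spatial receptive field (in pixels, one dimension) of a
    stack of n_layers ConvLSTM layers with k x k kernels unrolled over
    n_steps time steps: each conv application adds (k - 1)."""
    return 1 + (k - 1) * n_layers * n_steps

convlstm_receptive_field(3, 2, 10)  # two 3x3 layers, 10 steps -> 41
```

The formula makes explicit why ConvLSTM can cover large spatial contexts with small kernels: the temporal unrolling itself multiplies the effective field.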

4.3 Documented Gaps

  • Some applied works employ “ConvLSTM” to describe architectures consisting of a CNN or Conv1D front-end followed by a standard LSTM (non-convolutional), which does not model spatial recurrence within the LSTM cell. The difference is crucial for true spatiotemporal modeling (Sathe et al., 2023).
  • Key implementation details—such as kernel sizes, number of channels, or batch sizes—are occasionally omitted or set via family resemblance to reference models (e.g., Shi et al.), requiring replication studies to infer optimal settings (Sathe et al., 2023, Waczynska et al., 2021).

5. Extensions and Advanced Variants

  • Multi-Kernel ConvLSTM: Embeds parallel sets of kernels (e.g., 3×3 and 5×5 within each gate) to enable multi-scale spatiotemporal adaptation. Flow-driven attention can guide kernel specialization, yielding gains of 1–3% on large-scale video classification (Agethen et al., 2019).
  • Higher-Order Memory: Retains and combines multiple past hidden states using learned convolutions whose kernels are parameterized by a tensor-train decomposition; this structure achieves state-of-the-art prediction while reducing parameter and FLOP counts (Su et al., 2020).
  • Graph-Structured ConvLSTM: Replaces Euclidean spatial convolutions with adjacency-weighted aggregations (e.g., double-hop line-graph convolutions), extending LSTM gates to irregular topologies such as power grids; achieves improved coverage and interval precision in dynamic line rating for power systems (Kim et al., 20 Nov 2024).
  • Reduced-Gate Architectures: Eliminates redundancy by collapsing gate functions, preserving prediction accuracy under tight resource constraints, e.g., achieving 0.0090 MSE and 0.924 SSIM versus baseline MSE=0.0110, SSIM=0.915 in video prediction (Elsayed et al., 2018).

6. Representative Applications and Practical Outcomes

ConvLSTM variants serve in a diverse range of domains:

| Application Domain | Notable Result / Metric | Reference |
| --- | --- | --- |
| Precipitation nowcasting | Rainfall-MSE: 1.42 (ConvLSTM), beats ROVER/FC-LSTM | (Shi et al., 2015) |
| Wildland fire simulation | JSC: 0.85 at step 50, ~20–30% lower error vs CNN | (Burge et al., 2020) |
| Network traffic prediction | Test MSE: 0.141 (ConvLSTM, best), low variance | (Waczynska et al., 2021) |
| Video action recognition | Top-1: 74.09% (multi-kernel + fusion), UCF-101 | (Agethen et al., 2019) |
| Biomedical cell segmentation | SEG: 0.874 (EncLSTM), top-2 in CTC challenge | (Arbelle et al., 2018) |
| Protein subcellular localization | Accuracy: 0.902 (ensemble) | (Sønderby et al., 2015) |
| Power grid frequency (ensemble) | MSE: 0.001400 (ConvLSTM ensemble) | (Sathe et al., 2023) |
| Probabilistic forecasting (DLR) | –4.9 pp ACE, –2.6 pp PINAW, fewest params | (Kim et al., 20 Nov 2024) |

ConvLSTM thereby constitutes a foundational primitive for deep spatiotemporal processing, underlying progressive advances in environmental modeling, biomedicine, scientific simulation, and large-scale time series forecasting. Ongoing research deploys its principles to structured and irregular domains, emphasizing parameter efficiency, expressive power, and robust sequence modeling.

7. Conceptual Significance and Limitations

ConvLSTM injects local, parameter-shared inductive biases into recurrent architectures, substantially reducing overfitting risk, parameter count, and sample complexity versus fully connected RNNs for spatially structured data. It generalizes naturally to multi-scale stacking, graph convolutions, and memory-compressed variants. However, models remain data- and computation-intensive, particularly in deep stacks or with large spatial grids; they may underperform on very small training cohorts or where spatiotemporal structure is weak (Sathe et al., 2023, Mathew et al., 2023). Further, implementations labeled as “ConvLSTM” may diverge in architectural detail, emphasizing the need for precise description of gate computation when reporting empirical results. The architecture’s flexibility continues to foster innovative design for domain-specific challenges in sequence learning and spatiotemporal inference.
