ConvGRU: Convolutional Gated Recurrent Unit
- ConvGRU is a recurrent cell that uses convolution operations to capture spatial and temporal dependencies in high-dimensional structured data.
- By replacing dense matrix multiplications with local convolutions, ConvGRU reduces parameter counts and enhances efficiency for tasks like video segmentation and action recognition.
- Empirical studies show ConvGRU improves performance metrics, such as F-measure and IoU, while enabling faster convergence in spatio-temporal modeling.
A Convolutional Gated Recurrent Unit (ConvGRU) is a recurrent neural network cell designed to model spatio-temporal dependencies in high-dimensional structured data such as videos, spatial sensor grids, or feature maps. It is a direct variant of the Gated Recurrent Unit (GRU), where all vector-matrix multiplications in the standard GRU are replaced by convolutional operations, thus preserving spatial locality and dramatically reducing parameter count compared to fully connected alternatives. ConvGRUs have demonstrated strong empirical performance for several sequence modeling domains, including video segmentation, action recognition, and spatio-temporal forecasting (Valipour et al., 2016, Ballas et al., 2015, Siam et al., 2016, Zhao et al., 2022, Yang et al., 2024).
1. Mathematical Formulation of ConvGRU
The ConvGRU cell transforms the standard GRU equations from dense matrix products to convolutional operations. Let $x_t \in \mathbb{R}^{C_x \times H \times W}$ denote the input feature map at time $t$, and $h_{t-1} \in \mathbb{R}^{C_h \times H \times W}$ be the hidden state from the previous step. The ConvGRU recurrence is given by:

$$z_t = \sigma(W_z * x_t + U_z * h_{t-1} + b_z)$$
$$r_t = \sigma(W_r * x_t + U_r * h_{t-1} + b_r)$$
$$\tilde{h}_t = \phi(W_h * x_t + U_h * (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where:
- $*$ denotes 2D convolution (stride 1, padding chosen to preserve $H \times W$)
- $\odot$ is the element-wise (Hadamard) product
- $\sigma$ denotes the sigmoid function
- $\phi$ is typically ReLU or tanh
- $W_z, W_r, W_h \in \mathbb{R}^{C_h \times C_x \times k \times k}$; $U_z, U_r, U_h \in \mathbb{R}^{C_h \times C_h \times k \times k}$; $b_z, b_r, b_h \in \mathbb{R}^{C_h}$
This formulation maintains the spatial structure, as every operation is performed locally (via convolution) rather than through global mixing as in fully connected GRUs (Valipour et al., 2016, Ballas et al., 2015, Siam et al., 2016, Jung et al., 2017, Yang et al., 2024).
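The recurrence described above can be sketched directly in NumPy. This is a minimal illustrative implementation, not code from the cited papers: the `conv2d_same` helper, the small random kernel scale, and the tanh candidate activation are assumptions made for the sketch.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same' 2D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    H, W = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))   # zero-pad spatial dims
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for u in range(k):
            for v in range(k):
                # contract over input channels at this kernel offset
                out[o] += np.tensordot(w[o, :, u, v], xp[:, u:u + H, v:v + W], axes=1)
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class ConvGRUCell:
    """Minimal ConvGRU cell implementing the gated recurrence (tanh candidate)."""
    def __init__(self, c_in, c_h, k=3, seed=0):
        rng = np.random.default_rng(seed)
        def kern(ci):
            return 0.1 * rng.standard_normal((c_h, ci, k, k))
        self.Wz, self.Uz = kern(c_in), kern(c_h)   # update gate
        self.Wr, self.Ur = kern(c_in), kern(c_h)   # reset gate
        self.Wh, self.Uh = kern(c_in), kern(c_h)   # candidate state
        self.bz = self.br = self.bh = np.zeros((c_h, 1, 1))

    def step(self, x, h):
        z = sigmoid(conv2d_same(x, self.Wz) + conv2d_same(h, self.Uz) + self.bz)
        r = sigmoid(conv2d_same(x, self.Wr) + conv2d_same(h, self.Ur) + self.br)
        h_tilde = np.tanh(conv2d_same(x, self.Wh) + conv2d_same(r * h, self.Uh) + self.bh)
        return (1.0 - z) * h + z * h_tilde

# Online use: only the hidden state is carried from frame to frame.
cell = ConvGRUCell(c_in=2, c_h=4)
h = np.zeros((4, 8, 8))
rng = np.random.default_rng(1)
for t in range(5):
    x = rng.standard_normal((2, 8, 8))   # stand-in per-frame feature map
    h = cell.step(x, h)
assert h.shape == (4, 8, 8)
```

Because every operation is a spatially local convolution or element-wise product, the hidden state keeps the same $H \times W$ layout as the input at every step.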
2. Cell Architecture, Parameterization, and Computational Characteristics
The architecture of a ConvGRU cell is defined by the dimensionalities and arrangements of its inputs, hidden states, convolution kernels, and activation functions. Each gate (reset and update) is computed by summing the results of two parallel convolutions (input and hidden-to-gate), adding a bias, and applying the sigmoid. The candidate state convolution incorporates the reset gate via pointwise product with the previous hidden state.
Typical settings:
- Kernel size $k \times k$, most commonly $3 \times 3$ (optionally larger, e.g., $5 \times 5$, in early layers)
- Padding to preserve spatial dimensions
- Output channels in the range of 32 to 256, matching downstream task demands
- ReLU activation used for candidate state for faster convergence
- All weights are shared across spatial locations (via convolutions), yielding on the order of $3k^2(C_x + C_h)C_h$ parameters per cell, several orders of magnitude smaller than the fully connected alternative, whose cost scales as $O((HWC)^2)$ per gate (Valipour et al., 2016, Ballas et al., 2015, Siam et al., 2016)
The following table highlights comparative parameter scaling:
| Architecture | Parameter Count (per gate) | Spatial Preservation |
|---|---|---|
| Fully Connected GRU | $O((HWC)^2)$ | No (global mixing) |
| ConvGRU | $O(k^2(C_x + C_h)C_h)$ | Yes (local, $k \times k$ kernel) |
ConvGRU thus enables deep spatio-temporal modeling on feature maps of substantial spatial size (Valipour et al., 2016).
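The scaling contrast in the table can be made concrete with a small back-of-the-envelope calculation. The specific sizes below (14×14 feature maps, 256 input and hidden channels, 3×3 kernels) are illustrative assumptions, not values taken from the cited papers.

```python
# Per-gate parameter count: ConvGRU vs. fully connected GRU.
# Assumed (illustrative) sizes:
H, W = 14, 14            # spatial dimensions of the feature map
c_in, c_h, k = 256, 256, 3

# ConvGRU: one k x k kernel per (input-to-gate, hidden-to-gate) pair,
# shared across all spatial locations.
conv_params = k * k * (c_in + c_h) * c_h

# FC GRU: dense maps over flattened (H*W*C) vectors,
# input-to-gate plus hidden-to-gate.
fc_params = (H * W * c_in) * (H * W * c_h) + (H * W * c_h) ** 2

print(conv_params)                   # ~1.2M
print(fc_params)                     # ~5.0B
print(fc_params // conv_params)      # ratio: thousands of times smaller
```

Even at these modest sizes the dense alternative needs billions of parameters per gate, which is why fully connected recurrence is impractical on spatial feature maps.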
3. Integration in Spatio-Temporal Architectures
ConvGRU units are typically inserted into convolutional neural network stacks to augment temporal modeling, especially in tasks such as video segmentation (Valipour et al., 2016, Siam et al., 2016), video-level representation learning (Ballas et al., 2015), and spatio-temporal graph modeling (Zhao et al., 2022, Dong et al., 2024, Yang et al., 2024).
Key integration strategies:
- Video segmentation: ConvGRU receives a temporal sequence of convolutional feature maps from a "backbone" CNN (e.g., VGG-F up to conv5) and produces hidden states which are further processed (e.g., via 1×1 conv, upsampling) to generate per-frame segmentations (Valipour et al., 2016, Siam et al., 2016).
- Action recognition: Multi-level percepts from a pretrained CNN (multiple depths) are passed to parallel or stacked ConvGRUs to capture fine-to-coarse motion (Ballas et al., 2015).
- Graph-structured data: The convolution operator in ConvGRU is replaced by spatial graph convolutions or diffusion convolutions for spatial-temporal graph data, as detailed in DGCGRU and DCGRU models (Zhao et al., 2022, Dong et al., 2024).
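A minimal sketch of that substitution for a single update gate, using a symmetrically normalized adjacency matrix in place of the 2D convolution. All names and sizes here are illustrative, and real DCGRU/DGCGRU gates (diffusion steps, dual graphs) differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c_in, c_h = 5, 3, 4                        # nodes, input / hidden channels

# Random symmetric adjacency with self-loops, normalized as
# A_hat = D^{-1/2} (A + I) D^{-1/2}
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)                        # make the graph undirected
np.fill_diagonal(A, 1.0)                      # add self-loops
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))

X = rng.standard_normal((N, c_in))            # node features at time t
H = np.zeros((N, c_h))                        # previous hidden state
Wz = 0.1 * rng.standard_normal((c_in, c_h))   # input-to-gate weights
Uz = 0.1 * rng.standard_normal((c_h, c_h))    # hidden-to-gate weights
bz = np.zeros(c_h)

# Update gate with a graph convolution in place of the 2D convolution:
# z = sigmoid(A_hat X Wz + A_hat H Uz + bz)
z = 1.0 / (1.0 + np.exp(-(A_hat @ X @ Wz + A_hat @ H @ Uz + bz)))
assert z.shape == (N, c_h)
```

The reset gate and candidate state are transformed the same way, so the GRU gating logic is untouched; only the spatial operator changes.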
ConvGRUs operate in online mode by carrying the hidden state forward across frames, supporting efficient real-time inference without requiring full video sequences (Valipour et al., 2016). For batch (unrolled) training, the cell supports backpropagation through time over temporal windows.
4. Applications and Empirical Performance
Video segmentation: ConvGRU-based architectures (RFC-VGG, RFC-LeNet) achieve consistent 2–5 point absolute increases in F-measure and IoU over non-recurrent FCNs on SegTrack V2, DAVIS, Synthia, and Cityscapes (Valipour et al., 2016, Siam et al., 2016). SegTrack V2 F-measure improves from 72.54% (FCN) to 77.67% (ConvGRU), and Synthia mean IoU from 75.5% to 81.2%.
Action recognition and captioning: ConvGRU stacked on multi-scale CNN percepts yields +3.4% gain versus fully connected GRUs on UCF-101 action recognition and notable BLEU/METEOR/CIDEr score gains on YouTube2Text video captioning (Ballas et al., 2015).
Contextual video understanding: ConvGRU with adaptive detrending (AD) or combinations with batch/layer normalization enables faster convergence (up to 30–50% speedup) and boosts top-1 accuracy on object-action and object-action-modifier video datasets by 1–3 percentage points over ConvGRU without AD (Jung et al., 2017).
Bird's-eye-view segmentation: Substituting 3D CNNs with ConvGRU or Geo-ConvGRU in BEV semantic segmentation improves IoU and PQ, outperforming state-of-the-art approaches such as ST-P3 and FIERY (Yang et al., 2024).
Graph-based spatio-temporal forecasting: Diffusion and double-graph convolutional GRUs (DCGRU, DGCGRU) extend ConvGRU to sensor and road network graphs, enabling competitive or superior accuracy for traffic and pedestrian volume prediction (Dong et al., 2024, Zhao et al., 2022).
5. Extensions and Recent Innovations
ConvGRU generalizes to various types of convolution:
- Spatial 2D convolution: Standard for video/image features (Valipour et al., 2016, Ballas et al., 2015, Siam et al., 2016)
- Graph convolution: For graph-structured spatio-temporal data, e.g., DGCGRU with double graph convolutional gates fusing distance-based and self-adaptive spatial dependencies (Zhao et al., 2022)
- Diffusion convolution: For modeling random-walk style influences in traffic/pedestrian graphs (Dong et al., 2024)
Adaptive Detrending (AD): Provides a temporal normalization strategy by treating the hidden state update as an exponential moving average and subtracting the modeled trend, yielding improved learning dynamics and consistency, especially when paired with batch or layer normalization (Jung et al., 2017).
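The idea can be illustrated with a scalar sketch: an exponential moving average acts as the trend model and is subtracted from the incoming signal. The fixed rate `beta` and the plain subtraction are simplifying assumptions for illustration, not Jung et al.'s exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
# Stand-in input: a slow linear drift plus noise
signal = 0.01 * np.arange(T) + 0.1 * rng.standard_normal(T)

beta = 0.05                  # EMA rate of the trend model (assumed fixed here)
trend = 0.0
detrended = np.empty(T)
for t in range(T):
    trend = (1.0 - beta) * trend + beta * signal[t]   # EMA as trend estimate
    detrended[t] = signal[t] - trend                  # subtract modeled trend

# Despite the drift, the detrended series stays near zero mean.
assert abs(detrended[T // 2:].mean()) < abs(signal[T // 2:].mean())
```

The GRU update $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ has exactly this moving-average form, which is what motivates treating the hidden state as a trend model.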
Geographical masking: Geo-ConvGRU uses a spatial mask to suppress updates in unobserved BEV regions, further stabilizing temporal fusion in tasks with missing or occluded inputs (Yang et al., 2024).
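The masking mechanics reduce to an element-wise blend between the fresh update and the previous hidden state. This sketch only illustrates the mechanism; the exact placement of the mask inside Geo-ConvGRU may differ from this simplification.

```python
import numpy as np

H, W = 4, 6
# Binary spatial mask: 1 = observed BEV cell, 0 = unobserved/occluded
m = np.zeros((H, W))
m[:, :3] = 1.0                       # left half observed, right half not

h_prev = np.full((H, W), 0.5)        # previous hidden state
h_new = np.full((H, W), 0.9)         # freshly computed recurrent update

# Masked update: unobserved cells simply retain their previous state
h = m * h_new + (1.0 - m) * h_prev

assert np.all(h[:, :3] == 0.9)       # observed cells take the new value
assert np.all(h[:, 3:] == 0.5)       # unobserved cells are unchanged
```

Suppressing updates this way prevents unobserved regions from injecting noise into the temporal fusion.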
6. Comparisons, Limitations, and Practical Considerations
Comparison with fully connected GRU: ConvGRU offers a drastic parameter reduction: for a hidden state of spatial size $H \times W$ with $C_h$ channels, the FC-GRU would require $O((HWC_h)^2)$ parameters per gate, while ConvGRU uses only $k^2(C_x + C_h)C_h$, orders of magnitude smaller (Valipour et al., 2016). ConvGRUs preserve 2D topology and efficiently capture local motion, which fully connected GRUs discard.
Comparison with ConvLSTM: In seizure detection, ConvLSTM outperforms ConvGRU on specificity and false alarm rate, attributed to LSTM's explicit memory cell. However, ConvGRU trains 10% faster and is less prone to overfitting due to lower complexity (Golmohammadi et al., 2018).
Initialization and regularization: Orthogonal kernel initialization is optimal for stable training, especially for long sequences. Regularization strategies include combined penalties and moderate dropout (not excessive on kernel weights), critical for preventing overfitting (Golmohammadi et al., 2018).
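One common recipe for orthogonal initialization of a convolution kernel is to QR-decompose a Gaussian matrix and reshape the orthonormal rows into kernel form. This is a standard construction, not necessarily the exact scheme used in the cited work.

```python
import numpy as np

def orthogonal_kernel(c_out, c_in, k, seed=0):
    """Kernel of shape (c_out, c_in, k, k) whose flattened rows are orthonormal.

    Assumes c_out <= c_in * k * k, so a full orthonormal row set exists.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((c_in * k * k, c_out))
    q, r = np.linalg.qr(a)               # q has orthonormal columns
    q = q * np.sign(np.diag(r))          # fix QR sign ambiguity
    return q.T.reshape(c_out, c_in, k, k)

w = orthogonal_kernel(8, 8, 3)
flat = w.reshape(8, -1)
# Rows are orthonormal: flat @ flat.T is the identity
assert np.allclose(flat @ flat.T, np.eye(8))
```

Orthonormal hidden-to-hidden kernels keep the recurrent Jacobian well-conditioned at initialization, which is why they help with long sequences.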
Training regimes: ConvGRUs are compatible with BPTT and sequence-to-sequence frameworks, can be unrolled over fixed or variable-length temporal windows, and support both sample-level and batch-level normalization and detrending.
7. Outlook and Research Directions
ConvGRU has proven broadly adaptable to spatio-temporal sequence modeling in visual and sensor domains. Ongoing research includes:
- Further fusion with Transformer modules for hybrid spatial-temporal reasoning, especially in the context of long-term dependency modeling (Yang et al., 2024)
- Enhanced normalization techniques, including robust detrending and adaptive gating for long-sequence stability (Jung et al., 2017)
- Graph and diffusion convolution generalizations for arbitrary spatial topology (Zhao et al., 2022, Dong et al., 2024)
- Domain-specific masking and gated attention for handling missingness and spatial priors (e.g., in BEV or occlusion settings) (Yang et al., 2024)
ConvGRU continues to serve as a foundational spatio-temporal building block for modern deep learning architectures requiring both locality and memory, with empirically verified benefits across a range of video, vision, and spatio-temporal forecasting applications.