Complex Temporal Alignment GRU
- CTA-GRU is a complex-valued recurrent unit that aligns spatiotemporal features by fusing low-light video frames with high-temporal-resolution event data.
- It employs bidirectional processing and complex convolutions to integrate modality-specific cues, enhancing deblurring performance in challenging environments.
- Quantitative evaluations demonstrate that its design significantly improves PSNR compared to static or real-valued alternatives, addressing temporal misalignments effectively.
A Complex Temporal Alignment GRU (CTA-GRU) is a recurrent neural module designed for precise spatiotemporal alignment and continuous fusion of multi-modal sequential data: specifically, complex-valued representations that encode complementary signals such as low-light video frames (real part) and corresponding high-temporal-resolution event streams (imaginary part). Introduced in the context of low-light video deblurring, where video signals must be jointly restored from degradation by both poor illumination and motion blur, the CTA-GRU generalizes the standard gated recurrent unit (GRU) by extending its internal arithmetic and gating mechanisms to the complex domain and by employing bidirectional temporal processing for enhanced context aggregation and temporal alignment (Zhong et al., 18 Nov 2025).
1. Formulation and Functional Overview
CTA-GRU is introduced in the CompEvent architecture for holistic video restoration in challenging low-light conditions. At each time step $t$, the model receives a pair of inputs: an RGB frame $I_t$ and a synchronous event image $E_t$. Channel-wise, these are fused into a single complex-valued tensor

$$Z_t = \phi_{\mathrm{rgb}}(I_t) + i\,\phi_{\mathrm{ev}}(E_t),$$

where $\phi_{\mathrm{rgb}}$ and $\phi_{\mathrm{ev}}$ are per-modality feature extractors. The recurrent CTA-GRU aligns and integrates these features over time, leveraging bidirectional passes so that each output encodes both forward and backward temporal contexts, yielding $H_t = [\overrightarrow{H}_t,\ \overleftarrow{H}_t]$ (Zhong et al., 18 Nov 2025).
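As a concrete illustration, this fusion step can be sketched in a few lines of PyTorch. The single-convolution encoders `phi_rgb` and `phi_ev`, the feature width `C`, and the one-channel event representation are illustrative assumptions rather than specifics from the paper:

```python
import torch
import torch.nn as nn

C = 64  # assumed feature width

# Stand-ins for the per-modality feature extractors (not the paper's encoders)
phi_rgb = nn.Conv2d(3, C, kernel_size=3, padding=1)  # RGB frame -> real part
phi_ev  = nn.Conv2d(1, C, kernel_size=3, padding=1)  # event image -> imaginary part

I_t = torch.randn(1, 3, 128, 128)  # low-light RGB frame at time t
E_t = torch.randn(1, 1, 128, 128)  # synchronous event image at time t

# Z_t = phi_rgb(I_t) + i * phi_ev(E_t): one complex tensor carrying both modalities
Z_t = torch.complex(phi_rgb(I_t), phi_ev(E_t))
print(Z_t.dtype, Z_t.shape)  # torch.complex64 torch.Size([1, 64, 128, 128])
```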
2. Architectural Structure and Core Equations
The architecture comprises a per-frame embedding block, followed by a bidirectional complex-valued GRU applied iteratively over time. The primary recurrence formulas, with all operations in the complex domain, are as follows:
- Reset gate: $r_t = \sigma_c\left(\mathcal{W}_r * [Z_t,\, H_{t-1}]\right)$
- Update gate: $z_t = \sigma_c\left(\mathcal{W}_z * [Z_t,\, H_{t-1}]\right)$
- Candidate hidden state: $\tilde{H}_t = \tanh_c\left(\mathcal{W}_h * [Z_t,\, r_t \odot H_{t-1}]\right)$
- Final hidden state: $H_t = (1 - z_t) \odot H_{t-1} + z_t \odot \tilde{H}_t$
Here, $\mathcal{W}_r$, $\mathcal{W}_z$, and $\mathcal{W}_h$ denote complex-valued convolutional layers with learnable real and imaginary parts; $\sigma_c$ and $\tanh_c$ apply the corresponding nonlinearity separately to the real and imaginary parts. Channel concatenation is denoted $[\cdot,\cdot]$, and $\odot$ is the elementwise complex product. The recurrent pass is performed in both temporal orders, then concatenated at each position to form the temporally aligned feature $H_t$ (Zhong et al., 18 Nov 2025).
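The recurrence maps directly onto a compact module. The following is a minimal PyTorch sketch, not the authors' implementation: the complex convolution is realized with two real convolutions (matching the definition in Section 4 below), channel widths are arbitrary, and biases and normalization layers are omitted for brevity:

```python
import torch
import torch.nn as nn

class CConv2d(nn.Module):
    """Complex 3x3 conv: W*Z = (W_re*Z_re - W_im*Z_im) + i(W_re*Z_im + W_im*Z_re)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.re = nn.Conv2d(cin, cout, 3, padding=1, bias=False)
        self.im = nn.Conv2d(cin, cout, 3, padding=1, bias=False)

    def forward(self, z):
        return torch.complex(self.re(z.real) - self.im(z.imag),
                             self.re(z.imag) + self.im(z.real))

def split_act(fn, z):
    """Apply a real nonlinearity separately to the real and imaginary parts."""
    return torch.complex(fn(z.real), fn(z.imag))

class CTAGRUCell(nn.Module):
    """One direction of the CTA-GRU recurrence (hypothetical naming)."""
    def __init__(self, cz, ch):
        super().__init__()
        self.conv_r = CConv2d(cz + ch, ch)  # reset gate
        self.conv_z = CConv2d(cz + ch, ch)  # update gate
        self.conv_h = CConv2d(cz + ch, ch)  # candidate state

    def forward(self, z_t, h_prev):
        x = torch.cat([z_t, h_prev], dim=1)            # channel concat [Z_t, H_{t-1}]
        r = split_act(torch.sigmoid, self.conv_r(x))   # reset gate r_t
        u = split_act(torch.sigmoid, self.conv_z(x))   # update gate z_t
        x2 = torch.cat([z_t, r * h_prev], dim=1)       # [Z_t, r_t ⊙ H_{t-1}]
        h_tilde = split_act(torch.tanh, self.conv_h(x2))
        return (1 - u) * h_prev + u * h_tilde          # H_t
```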
3. Complex-valued Operations and Temporal Alignment Mechanisms
By encoding the RGB features into the real domain and event features into the imaginary domain, CTA-GRU enables explicit and continuous interaction via complex algebra. The reset and update gates, operating over the concatenation $[Z_t,\, H_{t-1}]$, yield gating signals sensitive to both motion (via events) and appearance/content (via RGB):
- The reset gate can selectively suppress outdated features in $H_{t-1}$ wherever the event signal (imaginary component) indicates abrupt scene changes.
- The update gate adaptively controls the blend between new fused features and prior state, modulating integration according to both appearance and rapid motion cues.
This complex gating, combined with bidirectional recurrence, achieves robust temporal alignment by leveraging both historical and future information at each location. A plausible implication is that such alignment resolves temporal lags and misregistration caused by non-invertible motion or rapid scene transitions, which standard real-valued or unidirectional mechanisms struggle to address (Zhong et al., 18 Nov 2025).
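A toy computation makes this cross-modal coupling concrete. Because gates and states are complex, a single elementwise product mixes the real (appearance) and imaginary (event) components, something two decoupled real-valued pathways cannot do; the values below are arbitrary:

```python
import torch

# Gate g and hidden state h as complex scalars: real part ~ RGB/appearance cue,
# imaginary part ~ event/motion cue (arbitrary illustrative values).
g = torch.complex(torch.tensor(0.8), torch.tensor(0.3))
h = torch.complex(torch.tensor(1.0), torch.tensor(2.0))

# (a+ib)(c+id) = (ac - bd) + i(ad + bc): each output part depends on BOTH inputs,
# so event evidence modulates appearance features and vice versa.
print(g * h)  # tensor(0.2000+1.9000j)
```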
4. Implementation Details and Pseudocode
All major operations in CTA-GRU are implemented via complex-valued 2D convolutions ($3 \times 3$ kernel, stride $1$, padding $1$). The complex convolutions are defined as

$$\mathcal{W} * Z = (W_{\mathrm{re}} * Z_{\mathrm{re}} - W_{\mathrm{im}} * Z_{\mathrm{im}}) + i\,(W_{\mathrm{re}} * Z_{\mathrm{im}} + W_{\mathrm{im}} * Z_{\mathrm{re}}),$$

where $W_{\mathrm{re}}$ and $W_{\mathrm{im}}$ are the real and imaginary parts of the kernel.
Activation functions ($\sigma_c$, $\tanh_c$) are split elementwise over real and imaginary components. Hidden sizes on the order of $128$ channels are typical. Complex Layer Normalization may be used for training stability.
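The paper's exact normalization is not reproduced here; one simple assumed variant is to normalize the real and imaginary parts independently with real-valued norms:

```python
import torch
import torch.nn as nn

class SplitComplexNorm(nn.Module):
    """Split-style complex normalization (an assumed simple variant)."""
    def __init__(self, channels):
        super().__init__()
        # GroupNorm with one group normalizes over (C, H, W) per sample
        self.norm_re = nn.GroupNorm(1, channels)
        self.norm_im = nn.GroupNorm(1, channels)

    def forward(self, z):
        return torch.complex(self.norm_re(z.real), self.norm_im(z.imag))
```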
Bidirectional processing is performed for each block over temporally adjacent frames. Representative forward pseudocode is:
```
H_fwd[0] = 0
for t in 1..T:
    X   = concat(Z[t], H_fwd[t-1])        # complex concat
    r_t = sigmoid_c(CConv_r(X))           # reset gate
    z_t = sigmoid_c(CConv_u(X))           # update gate
    X2  = concat(Z[t], r_t ⊙ H_fwd[t-1])
    h_tilde  = tanh_c(CConv_h(X2))        # candidate state
    H_fwd[t] = (1 - z_t) ⊙ H_fwd[t-1] + z_t ⊙ h_tilde
```
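The backward pass mirrors this loop in reverse temporal order, and the two state sequences are concatenated channel-wise at each position. A sketch of the bidirectional driver, assuming a complex GRU cell with signature `cell(z_t, h_prev) -> h_t` (e.g., the `CTAGRUCell` sketched in Section 2) and assuming separate cells per direction:

```python
import torch

def bidirectional_cta_gru(cell_fwd, cell_bwd, Z, ch):
    """Z: list of T complex tensors (B, Cz, H, W); returns T tensors (B, 2*ch, H, W)."""
    B, _, H, W = Z[0].shape
    h = torch.zeros(B, ch, H, W, dtype=torch.cfloat)  # H_fwd[0] = 0
    fwd = []
    for z_t in Z:                      # forward temporal order
        h = cell_fwd(z_t, h)
        fwd.append(h)
    h = torch.zeros_like(h)            # zero initial state for the backward pass
    bwd = [None] * len(Z)
    for t in reversed(range(len(Z))):  # backward temporal order
        h = cell_bwd(Z[t], h)
        bwd[t] = h
    # channel-wise concatenation of forward and backward contexts at each step
    return [torch.cat([f, b], dim=1) for f, b in zip(fwd, bwd)]
```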
5. Quantitative Evidence and Comparative Evaluation
Ablation experiments on the RELED benchmark assess the direct benefit of CTA-GRU. Full CompEvent with the complex-valued CTA-GRU achieves PSNR $32.51$ dB. Omitting the temporal GRU yields $30.87$ dB ($-1.64$ dB), while replacing the GRU with simple cross-frame concatenation attains $31.93$ dB ($-0.58$ dB). Replacing complex-valued gates and convolutions with real counterparts further decreases PSNR by $1.12$ dB.
These results substantiate that:
- Complex-valued, bidirectional temporal modeling yields a substantial performance boost over static or non-learned temporal aggregation.
- The advantage is not solely due to increased parameterization; rather, the architecture's explicit modeling of cross-modal interactions and temporal dependencies is critical (Zhong et al., 18 Nov 2025).
6. Limitations and Extensions
CTA-GRU incurs increased compute and memory costs, as complex-valued operations double the number of real-valued multiply-accumulate operations and feature channels. Fixed-size convolutional kernels ($3 \times 3$) may limit motion compensation when frame misalignments are large; deformable or dilated complex convolutions are prospective improvements. Replacing the GRU with a complex-valued LSTM could afford even richer temporal gating at the cost of higher complexity. Integrating explicit cross-modal complex self-attention or migrating to continuous-time complex RNNs represents a promising direction for handling asynchronous, high-rate event streams (Zhong et al., 18 Nov 2025).
7. Contextualization within Temporal Alignment GRU Variants
CTA-GRU stands in contrast to other recurrent temporal alignment modules, such as:
- Bidirectional gated architectures in vision-language tasks, e.g., the GRU-driven temporal cross-attention in SurgAnt-ViVQA, which achieves state-of-the-art BLEU-4 for surgical anticipation via bidirectional GRU recurrence and fine-grained adaptive gating on fused visual-language representations, but employs standard real-valued GRUs and attention modules (Dhake et al., 5 Nov 2025).
- Continuous-Time and Task-Synchronized GRUs, designed to handle irregular event timing via explicit modeling of temporal decay and integration, yet not providing holistic multi-modal or complex-valued fusion (Mozer et al., 2017; Lukoševičius et al., 2022).
CTA-GRU is therefore distinct in its explicit, trainable complex-valued interleaving of spatial, channelwise, and temporal alignment, specifically targeted at multi-modal fusion for video restoration under challenging conditions.
References:
- "CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring" (Zhong et al., 18 Nov 2025)
- "SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention" (Dhake et al., 5 Nov 2025)
- "Discrete Event, Continuous Time RNNs" (Mozer et al., 2017)
- "Task-Synchronized Recurrent Neural Networks" (Lukoševičius et al., 2022)