Temporal Difference Block for Sequence Modeling

Updated 8 December 2025

Temporal Difference Block (TDB) is a modular neural architecture that explicitly encodes dynamic temporal changes by integrating local differences with global sequence modeling.
It leverages methods like 3D temporal-difference convolution and transformer-based differencing to enhance tasks in video analysis, medical imaging, and multimodal retrieval.
The block’s integration of explicit temporal differencing improves performance metrics, demonstrating enhanced accuracy and efficiency across diverse sequence modeling applications.

The Temporal Difference Block (TDB) is a modular neural architecture designed to explicitly encode dynamic changes between temporal states for sequence modeling tasks in computer vision and medical imaging. By introducing inductive biases through temporal differencing, TDBs enable efficient representation of local and global temporal dependencies and have demonstrated measurable performance gains in tasks such as remote photoplethysmography (rPPG), video-text retrieval, longitudinal MRI segmentation, and video super-resolution (Luo et al., 18 Sep 2024, Fang et al., 2021, Rokuss et al., 20 Sep 2024, Xiao et al., 2023).

1. Architectural Motivation and General Principles

The principal motivation behind TDB design is to enable explicit modeling of rapid local changes ("temporal differences") and spatial-temporal correlations across sequential data. In video and temporal medical imaging:

CNNs have limited receptive fields and cannot integrate long-range dependencies.
Transformers are global but inefficient for lengthy sequences and struggle to capture high-frequency inter-frame changes.
State-space models (SSMs), and selective scan architectures (e.g., Mamba), scale better for long-range dependencies but require preconditioning on local differences.

TDBs integrate local differencing operators with global sequence modeling, especially bidirectional state-space layers or temporal transformers. This sharpened focus on change detection and feature fusion affords TDB-equipped networks superior robustness to artifact and noise and improved physiologically relevant signal extraction.

2. Mathematical Formulation and Variants

2.1 Temporal Difference Convolution (PhysMamba)

The TDB in "PhysMamba" (Luo et al., 18 Sep 2024) comprises a 3D temporal-difference convolution (TDC) defined as:

$\text{TDC}(x)\,(p_0) = \sum_{p_n\in \mathcal R} w(p_n)\,x(p_0 + p_n) + \theta\left(-x(p_0)\right)\sum_{p_n\in\mathcal R'}w(p_n)$

Where:

$x \in \mathbb{R}^{B \times C \times T \times H \times W}$ is the per-frame feature map,
$\mathcal R$ indexes a $3 \times 3 \times 3$ cube around location $p_0$ ,
$\mathcal R' \subset \mathcal R$ are neighboring temporal offsets,
$w(\cdot)$ are learnable weights,
$\theta \in [0,1]$ balances the pure difference term.

The output is further processed via batch normalization and ReLU.

2.2 Transformer-Based Differencing (CLIP2Video)

In video-text retrieval (Fang et al., 2021), the TDB computes adjacent frame differences:

$\Delta f^t = f_f^t - f_f^{t-1}, \quad t = 1, \dots, m-1$

These are normalized via a transformer attention layer with positional and type embeddings, then interleaved with frame tokens and passed through a temporal transformer. Only frame tokens are retained for global pooling.

2.3 Difference Weighting in Longitudinal Segmentation

In longitudinal MS lesion segmentation (Rokuss et al., 20 Sep 2024), the Difference Weighting Block applies:

$D = x_n - x_p \ A = \text{InstanceNorm}(D) \ x'_n = x_n \odot A + x_n$

Where $x_n$ and $x_p$ are corresponding encoder features from follow-up and baseline volumes. The weighted difference modulates the current features and propagates them through the network.

2.4 Local and Global Temporal Difference Modules (LGTD)

For satellite video super-resolution (Xiao et al., 2023), S-TDM and L-TDM modules extract short-term RGB differences for local motion ( $D_s$ ) and cross-frame global differences for long-term compensation ( $F_l$ ), using multi-scale fusion and residual aggregation.

3. Data Flow and Block-Level Pipeline

The functional pipeline of a TDB typically consists of:

Extraction of per-frame/spatial features by encoders or pretrained backbones.
Calculation of temporal differences between neighboring frames (or timepoints).
Processing difference maps/tokens via normalization, attention, or learned gating (InstanceNorm, transformer layers, etc.).
Fusion of differenced features back into the main sequence via aggregation, gating, and residual connections.
Optionally, integration with global sequence models (SSMs, transformers) or further fusion via attention or alignment blocks.

For PhysMamba (Luo et al., 18 Sep 2024), the block-level sequence is:

Stage	Operation	Output Shape
[1]	3×3×3 TDC + temporal difference	$B,C,T,H,W$
[2]	BatchNorm → ReLU	$B,C,T,H,W$
[3]	Flatten to sequence	$B,L,C$ ( $L=T\cdot H\cdot W$ )
[4]	LayerNorm → Linear → split ( $x$ , $z$ )	$B,L,2\cdot E\cdot C$
[5]	Bi-Mamba forward/reverse scans + gating	$B,L,C$
[6]	LayerNorm → reshape	$B,C,T,H,W$
[7]	Channel Attention	$B,C,T,H,W$

4. Application Domains and Integration Strategies

TDBs have been instantiated in architectures across several domains:

Facial-video rPPG (PhysMamba): TDB modules sit at the core of a dual-stream SlowFast architecture, enabling multi-scale fusion of physiological features. Temporal differences are amplified locally, then modeled globally, resulting in state-of-the-art accuracy for vital-sign estimation (Luo et al., 18 Sep 2024).
Video-Text Retrieval (CLIP2Video): TDBs inject explicit motion tokens and difference-level attention between frame embeddings, improving retrieval metrics by aligning actions and textual queries via motion-aware representations (Fang et al., 2021).
Longitudinal MRI Segmentation: TDBs act at skip connection fusion points, using instance-normalized differences between baseline and follow-up U-Net features to boost lesion detection and segmentation scores (Rokuss et al., 20 Sep 2024).
Video Super-Resolution (LGTD): S-TDM and L-TDM modules capture local motion and global temporal structure from RGB differences in video frames, with specialized difference compensation units and hybrid long-short attention blocks for reconstruction (Xiao et al., 2023).

5. Empirical Impact and Ablation Results

Across multiple tasks, the inclusion of TDB leads to quantifiable improvements:

PhysMamba: Ablation on UBFC-rPPG reveals that TDB stages are essential—removing temporal differencing worsens MAE from 0.54 to 0.68 bpm; omitting the global Mamba scan raises MAE to 0.63 bpm. The model achieves high efficiency (0.56 M parameters, 47.3 G MACs at $128^3$ input) and robust vital-sign tracking over long sequences (Luo et al., 18 Sep 2024).
CLIP2Video: Full TDB increases top-1 text-to-video retrieval R@1 on MSR-VTT by 0.6 points compared to raw subtraction or MLP difference encoding. Combined with temporal alignment, gains reach +1.1 R@1 versus baseline architectures (Fang et al., 2021).
Longitudinal Difference Weighting: On the Ljubljana MS dataset, TDB-augmented segmentation yields a +1.45% Dice increase and +2.01% lesion-F1 improvement over single-timepoint baselines; on ISBI 2015, Dice and F1 improvements are sustained (Rokuss et al., 20 Sep 2024).
LGTD: S-TDM and L-TDM modules provide complementary local/global motion cues, with hybrid attention boosting spatial consistency and reconstruction accuracy in satellite video super-resolution (Xiao et al., 2023).

6. Implementation and Hyperparameter Summary

The instantiation of TDBs is architecture-specific but typically features:

Temporal difference kernel sizes (e.g., 3×3×3 convolution for PhysMamba).
Learnable weights for difference amplification ( $\theta$ parameter).
Normalization layers (BatchNorm, InstanceNorm) for stable optimization.
Joint use with global sequence models (Mamba, Temporal Transformer).
Minimal parameter expansion, enabling efficiency and scalability (e.g., PhysMamba: expansion factor $E=2$ ).
Domain-specific preprocessing (affine registration in MRI, windowed sampling in video).

Hyperparameters are tuned to balance local context amplification with global integration, e.g., frame window length $T=128$ , kernel size, instance normalization, and expansion factors (Luo et al., 18 Sep 2024, Fang et al., 2021, Rokuss et al., 20 Sep 2024, Xiao et al., 2023).

The Temporal Difference Block has emerged as a generalizable module for encoding temporal change, adaptable across vision, medical, and multimodal retrieval tasks. Its utility centers on explicit differencing, effective gating, and fusion of spatio-temporal cues, consistently advancing performance and efficiency wherever local and global temporal context is critical.