Temporal Dynamic Modeling (TDM) Block
- The Temporal Dynamic Modeling (TDM) block is an architectural unit designed to capture multi-scale temporal dependencies through Cycle operators and residual fusion.
- The block integrates a Multi-scale Temporal Interaction (MTI) module to process 4-D tensors efficiently, achieving higher accuracy while keeping computational costs low.
- Widely used in action recognition, motion prediction, and traffic event anticipation, it offers a lightweight alternative to transformer-based methods.
Temporal Dynamic Modeling (TDM) blocks are architectural units designed to enable fine-grained, multi-scale temporal representation learning in sequential data. Originating in the context of skeleton-based action recognition, notably within the "TSkel-Mamba" framework (Liu et al., 12 Dec 2025), the TDM block acts as a plug-and-play component that captures temporal dependencies and cross-channel interactions over varying temporal ranges. TDM blocks are closely associated with Multi-scale Temporal Interaction (MTI) modules, which implement the core multi-scale fusion within TDM and are also found in related temporal architectures for motion prediction (Lebailly et al., 2020) and traffic event anticipation (Wu et al., 23 Sep 2025).
1. Architectural Role and Data Flow
In systems like TSkel-Mamba (Liu et al., 12 Dec 2025), the TDM block forms one of two principal subcomponents within each Hybrid Transformer–Mamba (HTM) layer, the other being a Spatial Transformer (ST) for per-frame, per-joint spatial modeling. The TDM block processes 4-D tensors $X \in \mathbb{R}^{B \times C \times T \times V}$ (batch, channels, frames, joints). After an initial layer normalization, a convolution, batch normalization, and ReLU activation reduce the channel count to $C'$, yielding $X' \in \mathbb{R}^{B \times C' \times T \times V}$. The MTI module then applies multiple scale-specific Cycle operators, fusing their outputs via residual addition to produce $X_{\mathrm{MTI}}$. Temporal-only scanning forms forward and backward sequences for subsequent Mamba state-space modeling, followed by gating, linear projection, and temporal pooling to complete the temporal dynamics transformation.
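A minimal PyTorch sketch of this pre-MTI data flow, assuming a 1×1 convolution for the channel reduction (the kernel size, layer widths, and class name are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class TDMFrontEnd(nn.Module):
    """Illustrative TDM pre-processing: LayerNorm, then a 1x1 convolution
    with BatchNorm and ReLU that reduces channels C -> C'."""

    def __init__(self, c_in: int, c_reduced: int):
        super().__init__()
        self.norm = nn.LayerNorm(c_in)              # normalizes the channel axis
        self.reduce = nn.Conv2d(c_in, c_reduced, kernel_size=1)
        self.bn = nn.BatchNorm2d(c_reduced)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, V) -- batch, channels, frames, joints.
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.act(self.bn(self.reduce(x)))    # X': (B, C', T, V)

x = torch.randn(2, 256, 64, 25)
print(TDMFrontEnd(256, 64)(x).shape)                # torch.Size([2, 64, 64, 25])
```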
2. Multi-scale Temporal Interaction (MTI) Mechanics
The MTI module, incorporated within TDM, is designed to explicitly encode temporal context at multiple scales. For each temporal location $t$, joint $v$, and channel $c$, the core operation is:
- For each kernel size $k$ in a scale set $\mathcal{K}$, compute a "Cycle" operator, which applies channelwise temporal offsets that cycle through the kernel window: $\delta_k(c) = (c \bmod k) - \lfloor k/2 \rfloor$.
- The Cycle operator constructs the output for each scale as
$$Y_k(c, t, v) = X'\big(c,\; t + \delta_k(c),\; v\big),$$
with $c \in \{1, \dots, C'\}$ and $t \in \{1, \dots, T\}$. Boundary indices falling outside $[1, T]$ are managed via zero-padding or clamping.
- The outputs are summed across scales and added as a residual to the original feature: $X_{\mathrm{MTI}} = X' + \sum_{k \in \mathcal{K}} Y_k$.
This operation enables TDM to robustly capture and propagate local, mid-range, and long-range temporal patterns, and to facilitate the cross-channel temporal interactions essential for complex action recognition.
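Under the offset rule above, the Cycle operator reduces to a channelwise temporal shift with zero-padded boundaries. A minimal PyTorch sketch, assuming that offset rule and an illustrative scale set `(3, 5, 7)` (the paper's actual kernel set is not reproduced here):

```python
import torch

def cycle_operator(x: torch.Tensor, k: int) -> torch.Tensor:
    """Channelwise temporal shift for kernel size k with zero-padding.

    x: (B, C, T, V). Channel c is read at offset
    delta_k(c) = (c mod k) - k // 2, cycling over [-k//2, ..., k//2].
    """
    B, C, T, V = x.shape
    out = torch.zeros_like(x)
    for c in range(C):
        d = (c % k) - k // 2
        if d == 0:
            out[:, c] = x[:, c]
        elif d > 0:                          # read ahead in time: shift left
            out[:, c, : T - d] = x[:, c, d:]
        else:                                # read behind in time: shift right
            out[:, c, -d:] = x[:, c, : T + d]
    return out

def mti(x: torch.Tensor, scales=(3, 5, 7)) -> torch.Tensor:
    """Multi-scale Temporal Interaction: residual sum of Cycle outputs.

    `scales` is hypothetical; any set of odd kernel sizes works here.
    """
    return x + sum(cycle_operator(x, k) for k in scales)

x = torch.randn(2, 64, 64, 25)
print(mti(x).shape)  # torch.Size([2, 64, 64, 25])
```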
3. Integration with Mamba State Space Model and Spatial Transformer
After MTI processing, $X_{\mathrm{MTI}}$ is reshaped per-joint and per-time into parallel forward and backward temporal streams. Each stream undergoes linear projection into two branches:
- one branch for causal modeling via 1D convolution, SiLU activation, and Mamba state-space modeling (the SSM branch);
- the other as a gating branch (SiLU activation).
The Mamba block models long-range, channelwise temporal dependencies, while the gating mechanism modulates the SSM output. The final fusion concatenates the forward and (time-flipped) backward results, then applies LayerNorm and temporal pooling to match the output temporal resolution. This pipeline preserves compatibility with the underlying SSM recurrence structure and keeps all temporal modeling differentiable and lightweight relative to attention-based alternatives.
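A schematic PyTorch sketch of this scan-and-gate pipeline, assuming each per-joint sequence has shape (B·V, T, dim) and abstracting the state-space model behind pluggable `ssm_fwd`/`ssm_bwd` modules (any (B, T, C) → (B, T, C) sequence model, e.g. a Mamba layer, can be substituted; all names and layer sizes here are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSSMGate(nn.Module):
    """Bidirectional scan with a gated SSM per direction."""

    def __init__(self, dim: int, ssm_fwd: nn.Module, ssm_bwd: nn.Module):
        super().__init__()
        self.ssm_fwd, self.ssm_bwd = ssm_fwd, ssm_bwd
        self.proj_in = nn.Linear(dim, 2 * dim)      # SSM branch + gate branch
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=2)  # causal
        self.norm = nn.LayerNorm(2 * dim)
        self.proj_out = nn.Linear(2 * dim, dim)

    def _branch(self, s: torch.Tensor, ssm: nn.Module) -> torch.Tensor:
        u, g = self.proj_in(s).chunk(2, dim=-1)     # each (B, T, dim)
        # Causal 1D conv over time (truncate right pad), SiLU, then the SSM.
        u = self.conv(u.transpose(1, 2))[..., : s.shape[1]].transpose(1, 2)
        u = ssm(F.silu(u))
        return u * F.silu(g)                        # gate modulates SSM output

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: per-joint temporal sequence of shape (B*V, T, dim).
        f = self._branch(s, self.ssm_fwd)                   # forward stream
        b = self._branch(s.flip(1), self.ssm_bwd).flip(1)   # time-flipped backward
        return self.proj_out(self.norm(torch.cat([f, b], dim=-1)))
```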
4. Computational Complexity and Design Trade-offs
Under the shift formulation above, each Cycle operator within MTI is nearly free: every output element requires a single offset read, for a cost on the order of $C' \cdot T \cdot V$ operations per scale. The SSM-based Mamba processing that follows dominates overall cost, scaling linearly in sequence length (roughly $O(C' \cdot N \cdot T)$ per joint for state size $N$). Under typical settings, the MTI module contributes only a small fraction of overall compute and parameter count. For a seven-layer TSkel-Mamba, the total parameter count ($2.4$M) and FLOPs ($8.2$G) remain much lower than spatial-temporal Transformer counterparts (e.g., $12.1$M, $259$G in ST-TR), while accuracy is higher (+5%) on human skeleton action benchmarks (Liu et al., 12 Dec 2025).
The choice of the Cycle kernel set $\mathcal{K}$ is empirically important: multi-scale variants (several distinct kernel sizes) outperform single-scale variants (one kernel size) in both accuracy and robustness. Kernel size and count directly trade off receptive field, model expressivity, and computational efficiency.
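A back-of-envelope estimate, with every value chosen hypothetically for illustration (none are reported settings from the paper), shows why the Cycle operators stay cheap relative to the SSM:

```python
# Hypothetical settings: reduced channels C', frames T, joints V,
# SSM state size N, and an illustrative scale set K.
C, T, V, N = 64, 64, 25, 16
K = (3, 5, 7)

mti_ops = len(K) * C * T * V   # one shifted read-add per scale and location
ssm_ops = 6 * C * N * T * V    # rough linear-scan estimate, O(C*N*T) per joint

print(f"MTI ops: {mti_ops / 1e6:.2f}M, SSM ops: {ssm_ops / 1e6:.2f}M, "
      f"MTI share: {mti_ops / (mti_ops + ssm_ops):.1%}")
# -> MTI ops: 0.31M, SSM ops: 9.83M, MTI share: 3.0%
```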
5. Comparison and Relationship to Other Multi-scale Temporal Blocks
TDM with MTI generalizes many other temporal fusion designs:
- The Temporal Inception Module (TIM) (Lebailly et al., 2020) likewise employs multiple temporal convolutional branches with varying kernel sizes for trajectory encoding. TIM splits inputs into multiple subsequences and applies Conv1Ds of different kernel sizes per branch, concatenating the results to form rich, multi-temporal embeddings prior to GCN-based motion prediction. Empirically, proportional kernel sizing and multi-horizon subsequences improve both short- and long-term forecasts (a minimal sketch of this multi-branch pattern appears after this list).
- In the MsFIN framework for traffic event anticipation (Wu et al., 23 Sep 2025), the Multi-scale Temporal Interaction module constructs three temporal contexts (short, mid, long) by pooling over different time windows. Each pooled vector is fused with instantaneous features via an MLP, and multiple Transformer blocks model inter-object interactions and temporal causality. Multi-scale post-fusion combines the outputs for comprehensive risk assessment (this multi-window pooling pattern is also sketched below).
A key property unifying these designs is that multi-scale temporal perception—whether via Cycle operators (TDM), multi-branch convolutions (TIM), or multi-window pooling with Transformer fusion (MsFIN)—systematically outperforms single-scale temporal modeling, especially in asynchronous or hierarchical temporal scenarios.
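For concreteness, minimal PyTorch sketches of both patterns, with all kernel sizes, window lengths, and layer widths invented for illustration rather than taken from the cited papers:

```python
import torch
import torch.nn as nn

class InceptionTemporal(nn.Module):
    """TIM-style multi-branch temporal convolution: one Conv1d per kernel
    size, outputs concatenated along channels (sizes are illustrative)."""

    def __init__(self, c_in: int, c_branch: int, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(c_in, c_branch, k, padding=k // 2) for k in kernels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) -> (B, len(kernels) * c_branch, T)
        return torch.cat([b(x) for b in self.branches], dim=1)

class MultiWindowPool(nn.Module):
    """MsFIN-style context: mean-pool over short/mid/long trailing windows,
    fuse each pooled context with the instantaneous feature via an MLP."""

    def __init__(self, dim: int, windows=(4, 16, 64)):
        super().__init__()
        self.windows = windows
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); windows longer than T fall back to the full sequence.
        last = x[:, -1]                               # instantaneous feature
        ctx = [x[:, -w:].mean(dim=1) for w in self.windows]
        return torch.stack(
            [self.fuse(torch.cat([last, c], dim=-1)) for c in ctx], dim=1
        )                                             # (B, len(windows), C)
```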
6. Empirical Performance and Ablation Results
Experimental investigations isolate the contribution of MTI (within TDM), which accounts for an absolute Top-1 accuracy improvement of at least $0.6$ points over single-scale temporal Mamba on NTU120 X-Sub (Liu et al., 12 Dec 2025). The full TDM block (MTI + bidirectional Mamba) adds a further gain over baseline temporal convolutions, with roughly one-third of that gain attributable to the MTI. Inference-time overhead is minimal, adding only marginal GFLOPs and parameters per HTM layer relative to the TCN baseline.
The advantage of multi-scale temporal fusion is consistent across domains. In video action recognition, replacing shallow, single-scale temporal modules with CTI (a multi-perception, attention-fused temporal block) in the TSI framework improves Top-1 accuracy on Something-Something V1, with larger gains in combined settings (Su et al., 2021). In motion prediction, the multi-branch TIM achieves lower MPJPE than DCT-based or fixed-scale convolutional alternatives (Lebailly et al., 2020). MsFIN's multi-scale module outperforms single-scale feature extractors in both correctness and earliness of prediction on the DAD/DADA traffic datasets (Wu et al., 23 Sep 2025).
7. Variants and Broader Impact
While the TDM block as deployed in TSkel-Mamba targets skeleton-based action recognition, analogous multi-scale temporal modules (MTI, TIM, CTI) are increasingly adopted in diverse sequential modeling fields including motion forecasting, accident anticipation, and video understanding. Recent SSM-based designs like TDM offer lower complexity and greater efficiency compared to Transformer-based modules, while maintaining or exceeding their accuracy through explicit, parameter-efficient multi-scale interaction. A plausible implication is that future architectures for long-horizon sequence modeling in vision and robotics will increasingly incorporate TDM-style multi-scale temporal fusion in both feed-forward and recurrent paradigms.
References:
- "TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition" (Liu et al., 12 Dec 2025)
- "Motion Prediction Using Temporal Inception Module" (Lebailly et al., 2020)
- "MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation" (Wu et al., 23 Sep 2025)
- "TSI: Temporal Saliency Integration for Video Action Recognition" (Su et al., 2021)