MambaTAD: Efficient Temporal Action Detection
- MambaTAD is a framework for temporal action detection that leverages structured state-space models for long-range temporal analysis.
- It integrates the Diagonal-Masked Bidirectional State-Space module, state-space temporal adapter, and global feature fusion head to efficiently localize actions in untrimmed videos.
- The design achieves competitive mean average precision with lower computational cost, supporting various backbones and TAD benchmarks.
MambaTAD is a temporal action detection (TAD) framework that integrates structured state-space models (SSMs) with architectural and algorithmic innovations to address the challenges inherent in localizing actions with temporal boundaries in untrimmed videos. MambaTAD utilizes a Diagonal-Masked Bidirectional State-Space (DMBSS) module for long-range temporal modeling, a state-space temporal adapter (SSTA) for efficient end-to-end training, and a global feature fusion head to facilitate multi-scale, global context aggregation. This design delivers high accuracy, competitive computational efficiency, and robust generalization across multiple TAD benchmarks (Lu et al., 22 Nov 2025).
1. Temporal Action Detection: Problem Formulation and Challenges
TAD aims to predict both the label and the start/end timestamps of action instances in continuous, untrimmed video sequences. Given an input feature sequence $X = \{x_t\}_{t=1}^{T}$, $x_t \in \mathbb{R}^{D}$, the goal is to detect a set of intervals $\{(t^{i}_{s}, t^{i}_{e}, c^{i})\}_{i=1}^{N}$ specifying the boundaries and class labels for each action.
Several core challenges define the TAD problem domain:
- Long-range dependency: Many actions span hundreds or thousands of frames, necessitating models that can efficiently capture distant temporal correlations.
- Temporal-context decay in SSMs: Causal (unidirectional) SSMs, including standard Mamba, suffer from vanishing memory, diminishing their ability to maintain early-frame context in long video sequences.
- Self-element (diagonal) conflict in bidirectional SSMs: A naïve sum of forward and backward SSM passes produces excessive diagonal terms, reducing discriminative power for temporally proximal features.
- Limited global awareness in anchor-free detection heads: Isolated processing at different scales or locations hinders detection of broad temporal structures, especially in long or multi-instance actions.
- High parameter and computational cost for end-to-end adaptation: Full fine-tuning of large video backbones incurs substantial memory and parameter overhead.
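The temporal-context decay problem can be made concrete with a toy scalar causal SSM (an illustrative sketch, not MambaTAD's actual selective-scan kernel): the contribution of an early frame to a late output shrinks geometrically with distance.

```python
import numpy as np

# Toy causal SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, with |a| < 1.
# The contribution of frame x_1 to y_T scales as a**(T-1), so early-frame
# context vanishes geometrically -- the "temporal-context decay" above.
a, b, c = 0.95, 1.0, 1.0
T = 200
x = np.zeros(T)
x[0] = 1.0  # impulse at the very first frame
h = 0.0
y = np.empty(T)
for t in range(T):
    h = a * h + b * x[t]
    y[t] = c * h

# y[0] = 1.0, while y[-1] = 0.95**199, roughly 3.7e-5: the early-frame
# signal is effectively forgotten by the end of the sequence.
```

A backward pass sees the sequence in reverse, which is why bidirectional scanning restores access to distant past context for late frames (and vice versa).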
2. MambaTAD Architecture and Core Modules
MambaTAD is structured as a one-stage, anchor-free, end-to-end TAD system, comprising three major components:
- Diagonal-Masked Bidirectional State-Space (DMBSS) Module: Enables long-range, bidirectional modeling with context-preserving architectural design.
- Global Feature Fusion Head: Aggregates multi-granularity features across temporal pyramid scales, delivering unified global context to detection heads.
- State-Space Temporal Adapter (SSTA): A parameter- and compute-efficient plug-in that adapts frozen video backbones to the TAD task using SSM principles.
This end-to-end architecture allows integration with various frozen backbones, such as I3D, InternVideo-6B, and VideoMAE, supporting both feature-based and end-to-end training regimes.
3. Diagonal-Masked Bidirectional State-Space (DMBSS) Design
The DMBSS module innovates upon classic SSM operation for the TAD context:
- Bidirectional Scanning: Given input $X$, DMBSS computes two SSM passes: a forward scan ($\overrightarrow{\mathrm{SSM}}$, causal) and a backward scan ($\overleftarrow{\mathrm{SSM}}$, via time-reversal), summing their outputs. Formally, $Y = \overrightarrow{\mathrm{SSM}}(X) + \mathcal{R}\big(\overleftarrow{\mathrm{SSM}}(\mathcal{R}(X))\big)$, where $\mathcal{R}$ represents the time-reversal operator.
- Diagonal Masking: The diagonal of the backward recurrent matrix is explicitly zeroed, preventing duplication of self-activation and thus improving discriminability. Writing each scan as a matrix operator, $Y = M^{\mathrm{fwd}}X + \big(M^{\mathrm{bwd}} \odot (\mathbf{1}\mathbf{1}^{\top} - I)\big)X$, so that each position's self-term enters the output exactly once.
- Dual-Branch Design: DMBSS can operate either with shared weights (parameter-sharing) or as independent forward and backward branches (dual-branch, DB), further mitigating context decay.
- Residual Block Structure: The block sequence includes LayerNorm, linear expansion and splitting, SSM scans (forward and back), combination, projection, and residual addition.
The DMBSS module preserves the standard Mamba SSM recurrence, $h_t = \bar{A}h_{t-1} + \bar{B}x_t$, $y_t = C h_t$. This mechanism supports linear time and space complexity ($\mathcal{O}(T)$ FLOPs and $\mathcal{O}(T)$ memory), offering a practical tradeoff between context range and efficiency.
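The self-element conflict and its masking fix can be illustrated with a toy scalar-state SSM written as a matrix operator (a hypothetical simplification; the real DMBSS uses Mamba's selective scan with learned, input-dependent parameters):

```python
import numpy as np

def causal_ssm_matrix(a, T):
    """Lower-triangular propagation matrix M with M[t, s] = a**(t-s) for s <= t,
    so y = M @ x realizes the scalar recurrence h_t = a*h_{t-1} + x_t."""
    idx = np.arange(T)
    return np.where(idx[:, None] >= idx[None, :],
                    a ** (idx[:, None] - idx[None, :]), 0.0)

T, a = 6, 0.9
x = np.random.default_rng(0).standard_normal(T)

M_fwd = causal_ssm_matrix(a, T)    # forward scan: position t attends to s <= t
M_bwd = causal_ssm_matrix(a, T).T  # time-reversed scan: t attends to s >= t

# Naive bidirectional sum: the diagonal (self-term) is counted twice,
# since both scans include a**0 * x_t at position t.
naive = (M_fwd + M_bwd) @ x
# DMBSS-style fix: zero the backward operator's diagonal, so the combined
# operator M_fwd + M_bwd - I keeps exactly one self-term per position.
masked = (M_fwd + M_bwd - np.eye(T)) @ x
```

The masked output differs from the naive sum by exactly one copy of the input at each position, which is the duplicated self-activation the diagonal mask removes.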
4. State-Space Temporal Adapter (SSTA) and Global Feature Fusion
State-Space Temporal Adapter (SSTA)
SSTA enables parameter-efficient adapter-based tuning with linear complexity. After each backbone layer (transformer or CNN), SSTA executes:
- Down-projection: $z_t = W_{\mathrm{down}} x_t$
- Nonlinearity: $\tilde{z}_t = \sigma(z_t)$
- Temporal modeling and fusion: $z' = \mathrm{DMBSS}(\tilde{z})$
- Up-projection and residual: $x'_t = x_t + W_{\mathrm{up}} z'_t$
Here $W_{\mathrm{down}} \in \mathbb{R}^{d \times D}$ and $W_{\mathrm{up}} \in \mathbb{R}^{D \times d}$, with $d \ll D$. SSTA reuses DMBSS logic, adding only a small bottleneck's worth of parameters ($\mathcal{O}(Dd)$) per block.
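A minimal sketch of the SSTA bottleneck pattern, with a causal cumulative sum standing in for the DMBSS scan and ReLU as a placeholder nonlinearity (both are illustrative assumptions, not the published design):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, d = 16, 256, 32   # sequence length, backbone width, bottleneck (d << D)

W_down = rng.standard_normal((D, d)) * 0.02
W_up = np.zeros((d, D))  # zero-init up-projection: adapter starts as identity

def ssta(x, temporal_mix):
    """Bottleneck adapter: down-project, nonlinearity, temporal model,
    up-project, residual. Adds only 2*D*d parameters per block."""
    z = np.maximum(x @ W_down, 0.0)   # down-projection + placeholder ReLU
    z = temporal_mix(z)               # stand-in for the DMBSS temporal scan
    return x + z @ W_up               # residual keeps the frozen backbone path

x = rng.standard_normal((T, D))
out = ssta(x, lambda z: np.cumsum(z, axis=0))  # causal cumsum as toy SSM
```

Zero-initializing `W_up` is a common adapter trick (an assumption here): at the start of training the adapter is an exact identity, so the frozen backbone's features pass through unchanged.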
Global Feature Fusion Head
After temporal pyramid generation across scales $\ell = 0, \dots, L-1$ (level $\ell$ having length $T/2^{\ell}$), features are concatenated along the temporal axis, $F = [F_0; F_1; \dots; F_{L-1}]$, yielding length $\sum_{\ell=0}^{L-1} T/2^{\ell} < 2T$.
- LayerNorm and DMBSS are applied: $F' = \mathrm{DMBSS}(\mathrm{LN}(F))$.
- The output divides into classification and regression heads (1×1 convolution), jointly leveraging all temporal resolutions.
This design allows detection heads to leverage multi-scale context for both boundary refinement and class prediction.
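The fusion step amounts to flattening all pyramid levels into one sequence before the shared DMBSS pass, so every scale can exchange context with every other (illustrative shapes only; level count and widths are assumptions):

```python
import numpy as np

T, D, L = 64, 8, 4
rng = np.random.default_rng(1)

# Pyramid levels of length T / 2**l, as produced by strided downsampling.
pyramid = [rng.standard_normal((T // 2**l, D)) for l in range(L)]

# Concatenate all scales into one sequence so a single bidirectional SSM
# pass can aggregate global, multi-resolution context at once.
fused = np.concatenate(pyramid, axis=0)
total = sum(T // 2**l for l in range(L))  # sum_l T/2^l < 2T
```

Because the concatenated length stays below $2T$, running DMBSS over the fused sequence costs less than twice a single full-resolution pass, keeping the head linear in $T$.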
5. Computational and Empirical Performance
MambaTAD exhibits low parameter and compute requirements relative to prior TAD frameworks:
| Method | Backbone | #Params | FLOPs | TH14 Avg mAP (%) | ANet Avg mAP (%) |
|---|---|---|---|---|---|
| ActionFormer | I3D | 29.2 M | 45.2 G | 66.8 | 36.6 |
| TriDet | I3D | 17.3 M | 43.9 G | 69.3 | 36.6 |
| InternVideo2 | InternVideo6B | 34.2 M | 63.6 G | 72.0 | 41.2 |
| MambaTAD | I3D | 10.4 M | 17.8 G | 69.9 | 40.2 |
| MambaTAD | InternVideo6B | 12.2 M | 19.7 G | 73.9 | 42.8 |
With VideoMAE-Large, MambaTAD achieves 74.3% on THUMOS14 using 46.7 M parameters and 1.47 T FLOPs, outperforming AdaTAD’s 73.5% with 67.2 M parameters and 1.53 T FLOPs.
Benchmark Results
- THUMOS14 (non-E2E, InternVideo6B): MambaTAD mAP@0.3/0.5/0.7 = 87.5/78.3/52.9; Avg mAP = 73.9.
- ActivityNet-1.3 (non-E2E, InternVideo6B): MambaTAD mAP@0.5/0.75/0.95 = 63.1/44.2/11.0; Avg mAP = 42.8.
- MultiTHUMOS (I3D): MambaTAD 35.9% vs 35.6% for ADSFormer.
- HACS-Segment: MambaTAD sets a new state of the art at 44.9% mAP (prior best: VideoMambaSuite, 44.5%).
- FineAction: state-of-the-art 29.4% vs the prior best 29.0%.
Ablation studies indicate that both DMBSS and the global fusion head are critical to full system performance. Best results arise with diagonal masking, dual-branch, and bidirectional parameter-sharing enabled.
6. Experimental Setup and Training Protocols
- Datasets: THUMOS14, ActivityNet-1.3, MultiTHUMOS, HACS-Segment, FineAction.
- Metrics: Mean Average Precision (mAP) at various IoU thresholds.
- Backbones: I3D, R(2+1)D, InternVideo-6B, VideoMAE-{S, B, L, H, G}.
- Optimization: AdamW; learning rates adjusted per dataset; batch sizes tuned per available GPU. In end-to-end mode, only SSTA blocks are trained.
- Qualitative Observations: MambaTAD successfully captures slow-motion replays, maintains robustness under occlusion, delivers tight action-boundary regression, and resolves long and multi-instance actions that challenge competing methods.
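The mAP metrics above rest on temporal IoU between predicted and ground-truth intervals; a minimal implementation of the standard definition:

```python
def temporal_iou(pred, gt):
    """IoU between two temporal intervals (start, end), e.g. in seconds.
    A prediction matches a ground-truth instance at threshold tau
    (0.3/0.5/0.7 on THUMOS14) when IoU >= tau."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # overlap 4s / union 8s = 0.5
```

Average mAP then averages AP over the dataset's IoU threshold grid, which is why looser thresholds (0.3) and tighter ones (0.7 or 0.95) both appear in the benchmark tables.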
7. Limitations and Future Directions
MambaTAD retains several open challenges:
- Extra-long action modeling: Actions exceeding 18 seconds remain problematic; adaptively scaling state-space parameters may ameliorate such cases.
- Adaptive masking and attention: Dynamic masking schemes or integration of lightweight causal attention may further enhance SSM-based modeling.
- Multi-modal extension: Incorporation of optical flow, audio, or other modalities remains unexplored in the framework as published.
Potential improvement avenues include per-instance parameter adaptation and hybrid temporal modeling mechanisms, as well as generalization to additional data types and modalities.
MambaTAD advances temporal action detection by resolving fundamental limitations in state-space-based long-range modeling, efficiently fusing pyramid features, and affording practical end-to-end fine-tuning using a compact adapter scheme. The approach establishes new performance baselines on five challenging TAD datasets, frequently with lower parameter and compute budgets than prior methods (Lu et al., 22 Nov 2025).