Dual Branch VideoMamba: Efficient Violence Detection
- Dual Branch VideoMamba is a state-space model–based deep neural architecture that employs dual branches to separately capture spatial structures and temporal dependencies.
- The design integrates a cropping module with parallel SSM branches and a learnable Gated Class Token Fusion that enhances cross-branch communication for optimized feature fusion.
- Empirical evaluations demonstrate state-of-the-art violence detection accuracy with significantly reduced computational complexity compared to CNN and Transformer models.
Dual Branch VideoMamba is a state-space model–based deep neural architecture tailored for video sequence understanding, with a particular focus on efficient and accurate violence detection in real-world surveillance settings. The core innovation is a dual-branch processing and fusion mechanism: one branch is designed to prioritize spatial context, while the other is optimized for temporal dependencies. Both branches leverage the Mamba Selective State-Space Model (SSM), enabling linear-time sequence modeling and scalable training, and are fused through a learnable Gated Class Token Fusion (GCTF) layer at every depth. This approach achieves state-of-the-art results on a consolidated violence detection benchmark with significantly reduced computational complexity relative to CNN- and Transformer-based baselines (Senadeera et al., 23 May 2025).
1. Architectural Principles
The Dual Branch VideoMamba architecture comprises four principal stages: an initial Cropping Module, parallel VideoMamba SSM branches, recurrent GCTF modules at every encoding layer, and a final classification head.
- Cropping Module: The input, a raw video tensor $V \in \mathbb{R}^{T \times 3 \times H \times W}$, undergoes frame-wise human detection using YOLOv8. The tightest bounding box enclosing detected persons is computed, and the spatial crop is extracted; if no detections are present, the full frame is retained.
- Branch-1 (Spatial-First Scanning): Tokenization is implemented via a 3D convolution whose stride matches its kernel size, generating non-overlapping spatiotemporal patch embeddings $x_p \in \mathbb{R}^{N \times d}$. A learnable class token $x_{\text{cls}}^{(1)}$ is prepended. Tokens are organized spatially within each frame and concatenated over time, enabling the branch to emphasize spatial structures (object shapes, static configurations).
- Branch-2 (Temporal-First Scanning): Patch embeddings and a class token $x_{\text{cls}}^{(2)}$ are generated similarly. However, tokens are ordered by frame index, so spatial locations are stacked temporally. This facilitates modeling of motion, action progression, and long-range dependencies.
- Gated Class Token Fusion (GCTF): At each encoding depth, fusion is performed between the class tokens of both branches using a signal-dependent gate. The fused token is injected as an auxiliary feature into Branch-2’s next layer, enabling continuous cross-branch communication and context alignment.
- Final Fusion and Prediction: The two terminal class tokens, $x_{\text{cls}}^{(1)}$ and $x_{\text{cls}}^{(2)}$, are concatenated and fed to a linear head with softmax activation to yield the binary violence classification. A minimal sketch of the two scan orderings appears below.
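To make the two scan orderings concrete, the following is a minimal PyTorch sketch (tensor sizes and variable names are illustrative, not from the paper) of how the same patch-embedding tensor is serialized spatial-first for Branch-1 and temporal-first for Branch-2:

```python
import torch

# Patch embeddings for one clip: (batch B, frames T, patches per frame N, dim d).
B, T, N, d = 2, 8, 196, 576   # illustrative sizes
patches = torch.randn(B, T, N, d)

# Branch-1, spatial-first: all patches of frame 0, then frame 1, ... so that
# spatially adjacent tokens remain adjacent, emphasizing per-frame structure.
spatial_first = patches.reshape(B, T * N, d)

# Branch-2, temporal-first: patch location 0 across all frames, then location 1,
# ... so that each spatial site's trajectory over time is contiguous.
temporal_first = patches.transpose(1, 2).reshape(B, N * T, d)
```

Each branch then prepends its class token and processes the resulting sequence with its SSM stack.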
2. Mathematical Foundation
All feature extraction in both branches is underpinned by the VideoMamba SSM, specifically its Selective Scan (S6) operator. The continuous-time dynamical system

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

is discretized via zero-order hold with step size $\Delta$,

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B,$$

yielding the iterative updates

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

where $h_t$ is the hidden state, $y_t$ is the output, and $x_t$ is the input token. The S6 blocks dynamically generate the parameters $(\Delta, B, C)$ from the input at each step, increasing adaptability.
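For illustration, a minimal sequential reference implementation of this recurrence is sketched below (the shapes and the Euler-style simplification of $\bar{B}$ follow common Mamba practice and are assumptions here; the production S6 kernel uses a parallel, hardware-aware scan):

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Sequential reference for h_t = A_bar h_{t-1} + B_bar x_t, y_t = C_t h_t.
    Shapes (illustrative): x (batch, L, d); A (d, n) diagonal state matrix;
    B, C (batch, L, n) input-dependent projections; delta (batch, L, d) steps."""
    # Zero-order hold for A; first-order (Euler) simplification for B_bar,
    # as in common Mamba implementations.
    A_bar = torch.exp(delta.unsqueeze(-1) * A)         # (batch, L, d, n)
    B_bar = delta.unsqueeze(-1) * B.unsqueeze(2)       # (batch, L, d, n)

    bsz, L, d = x.shape
    h = x.new_zeros(bsz, d, A.shape[-1])               # hidden state h_t
    ys = []
    for t in range(L):
        h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))  # y_t = C_t h_t
    return torch.stack(ys, dim=1)                      # (batch, L, d)
```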
The GCTF at layer $\ell$ operates as follows,

$$g^{(\ell)} = \sigma\big(w_g \odot (x_{\text{cls}}^{(1,\ell)} + x_{\text{cls}}^{(2,\ell)})\big), \qquad x_{\text{cls}}^{(f,\ell)} = g^{(\ell)} \odot x_{\text{cls}}^{(1,\ell)} + \big(1 - g^{(\ell)}\big) \odot x_{\text{cls}}^{(2,\ell)},$$

where $w_g$ is a learnable vector, $\sigma$ is the sigmoid function, and $\odot$ indicates element-wise multiplication. The output is continuously refined across depths by reusing the fused class token in Branch-2.
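A minimal sketch of one plausible reading of this fusion is given below (the class name and exact gating parameterization are illustrative; the paper's precise formulation may differ):

```python
import torch
import torch.nn as nn

class GatedClassTokenFusion(nn.Module):
    """Signal-dependent, per-channel gated fusion of the two branch class tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_g = nn.Parameter(torch.zeros(dim))  # learnable gating vector w_g

    def forward(self, cls_spatial, cls_temporal):
        # cls_*: (batch, dim) class tokens from Branch-1 / Branch-2 at depth l.
        g = torch.sigmoid(self.w_g * (cls_spatial + cls_temporal))  # gate in (0, 1)
        fused = g * cls_spatial + (1.0 - g) * cls_temporal
        return fused  # re-injected as an auxiliary feature into Branch-2's next layer
```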
3. Consolidated Violence Detection Benchmark
To ensure generalization and unbiased evaluation, three leading datasets (RWF-2000, RLVS, and VioPeru) are merged. Particular steps are taken to prevent data leakage via cross-split embedding similarity analysis and manual duplicate removal, resulting in 1,712 violent and 1,712 non-violent training clips, and 427 violent and 428 non-violent testing clips, each labeled with a single binary class. Each video clip is standardized to five seconds in length (Senadeera et al., 23 May 2025).
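A minimal sketch of such a cross-split similarity check is shown below (it assumes clip-level embeddings from any pretrained video encoder; the function name and threshold are illustrative):

```python
import torch
import torch.nn.functional as F

def flag_cross_split_duplicates(train_emb, test_emb, threshold=0.95):
    """Flag test clips whose embedding is near-identical to some training clip.
    train_emb: (N_train, d), test_emb: (N_test, d). Flagged clips become
    candidates for manual review and removal to prevent train/test leakage."""
    sims = F.normalize(test_emb, dim=1) @ F.normalize(train_emb, dim=1).T
    max_sim, nearest_train = sims.max(dim=1)        # best match per test clip
    suspects = (max_sim > threshold).nonzero(as_tuple=True)[0]
    return suspects, nearest_train[suspects]
```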
4. Empirical Performance and Comparative Efficiency
Dual Branch VideoMamba sets new state-of-the-art results under the consolidated benchmark. The following table summarizes its performance and efficiency relative to alternative architectures:
| Model | Params (M) | FLOPs (G) | Top-1 Acc (%) | F1-Viol (%) | F1-Non-Viol (%) |
|---|---|---|---|---|---|
| SlowFast | 60 | 234 | 76.37 | 74.56 | 77.95 |
| VideoSwin-B | 88 | 281.6 | 80.47 | 79.15 | 81.63 |
| Uniformer-V2 | 354 | 6108 | 92.51 | 92.87 | 92.12 |
| VideoMamba | 74 | 806 | 94.39 | 94.29 | 94.48 |
| CUE-Net | 354 | 5826 | 95.91 | 95.95 | 95.86 |
| Dual Branch VB | 154.3 | 1612 | 96.37 | 96.33 | 96.42 |
Dual Branch VideoMamba achieves the highest accuracy and the most balanced F1 scores, with less than half the parameters and roughly a quarter of the FLOPs of prior SOTA models such as CUE-Net and Uniformer-V2 (Senadeera et al., 23 May 2025).
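Concretely, taking CUE-Net as the reference point, the reductions from the table work out to

$$\frac{154.3\ \text{M}}{354\ \text{M}} \approx 0.44, \qquad \frac{1612\ \text{G}}{5826\ \text{G}} \approx 0.28,$$

i.e., roughly 56% fewer parameters and 72% fewer FLOPs at higher accuracy.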
5. Implementation Specifics
The implementation uses PyTorch. Both branches run in parallel, initialized from VideoMamba-Middle weights pretrained on Kinetics-400, each with 32 layers and hidden dimension $d = 576$. The optimizer is AdamW with a cosine-annealing learning rate schedule and a five-epoch warmup over 55 epochs total. Cross-entropy loss on binary labels is used; spatial input augmentation consists solely of YOLOv8 cropping and frame resizing to a fixed input resolution. Ablations confirm gains of +0.98% from the cropping module, ~0.7% from continuous (full-depth) GCTF versus single-depth fusion, and optimal results with balanced input frame counts between the branches (Senadeera et al., 23 May 2025).
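A minimal sketch of the described optimization recipe follows (the base learning rate and weight decay are assumptions, since the exact values are not recoverable here):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(576, 2)   # stand-in for the dual-branch network
epochs, warmup_epochs = 55, 5     # totals from the text
base_lr = 1e-4                    # assumed; the paper's exact value is not given here

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)  # decay assumed
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),  # warmup
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),         # cosine decay
    ],
    milestones=[warmup_epochs],
)
criterion = torch.nn.CrossEntropyLoss()  # binary violence / non-violence labels
# scheduler.step() is invoked once per epoch after the training pass.
```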
6. Comparative Context and Impact
The use of state-space models in video understanding has been motivated by the limitations of traditional self-attention in computational scaling, especially for long-term temporal dependencies (Park et al., 2024). Linear-time selective SSMs enable tractable training and inference on longer sequences. Dual-branch or dual-path mechanisms have also been explored in variants such as VSRM and MambaSeg, where parallel SSMs or hybrid SSM/Transformer blocks facilitate effective spatio-temporal and multimodal fusion (Tran et al., 28 Jun 2025, Gu et al., 30 Dec 2025). However, the Dual Branch VideoMamba’s particular innovation lies in its gated, depth-wise class token fusion and the systematic separation of spatial and temporal scan ordering, which jointly maximize representational synergy across branches.
The model’s significant gains in both efficiency and accuracy, together with its robust performance in real-time surveillance scenarios, suggest that Mamba SSMs are poised to replace quadratic-time Transformer backbones in a variety of video analytics applications (Senadeera et al., 23 May 2025). The dual-branch framework is extensible to tasks beyond violence detection, including video segmentation, anomaly detection, and multimodal event analysis, contingent on suitable scan orderings and fusion schemes.