Diagonal-Masked Bidirectional State-Space (DMBSS)
- DMBSS is a bidirectional state-space module that applies a diagonal-masking scheme to suppress self-connections and sharpen inter-frame dependencies.
- The architecture integrates dual SSM scans with forward and backward streams to mitigate temporal context decay, yielding measurable performance gains on TAD benchmarks.
- Its linear computational complexity and dual-branch design make it well suited to modeling long-range dependencies in untrimmed video sequences.
The Diagonal-Masked Bidirectional State-Space (DMBSS) module is an architectural component designed to address long-range temporal context modeling and conflict resolution in sequence processing, as exemplified by its integration into the MambaTAD framework for Temporal Action Detection (TAD). DMBSS leverages bidirectional structured state-space models (SSMs) with a diagonal-masking scheme to prevent self-correlation inflation and preserve sharp inter-token relationships, achieving linear computational complexity and improved empirical performance across TAD benchmarks (Lu et al., 22 Nov 2025).
1. Motivation and Problem Formulation
Structured State-Space Models (SSMs) such as S4 and Mamba are capable of causal sequence modeling using lower-triangular "convolution" matrices that restrict each output $y_t$ to inputs $x_{1:t}$, i.e., only inputs up to time $t$. In TAD, two principal issues with such models are identified: (1) temporal context decay due to deep recursion and strictly causal filtering, leading to inadequate long-range context for determining action boundaries, and (2) self-element conflicts when fusing forward and backward passes, which result in overemphasized diagonal (self-attention) entries and weakened inter-frame dependencies. DMBSS was introduced to resolve these shortcomings by (a) restoring full bidirectional global context, and (b) suppressing diagonal contributions in the backward scan, thus sharpening the model's sensitivity to temporal boundaries and long-span dependencies (Lu et al., 22 Nov 2025).
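To see the self-element conflict concretely, consider a length-3 sequence (notation is illustrative and anticipates Section 3: $M_{\mathrm{fw}}$ is the causal forward mixing matrix, $J M_{\mathrm{bw}} J$ its time-reversed backward counterpart). Naively summing the two scans gives every frame two self-contributions but each cross-frame pair only one:

$$
M_{\mathrm{fw}} = \begin{pmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{pmatrix},
\quad
J M_{\mathrm{bw}} J = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ 0 & b_{22} & b_{23} \\ 0 & 0 & b_{33} \end{pmatrix},
\quad
M_{\mathrm{fw}} + J M_{\mathrm{bw}} J = \begin{pmatrix} a_{11}{+}b_{11} & b_{12} & b_{13} \\ a_{21} & a_{22}{+}b_{22} & b_{23} \\ a_{31} & a_{32} & a_{33}{+}b_{33} \end{pmatrix}.
$$

Zeroing the diagonal of the backward term removes the doubled self-entries while leaving all cross-frame dependencies intact.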
2. Architectural Design and Dataflow
Within each application in MambaTAD, DMBSS operates on a feature tensor $X \in \mathbb{R}^{B \times S \times C}$ (batch size $B$, sequence length $S$, channels $C$). The core dataflow per DMBSS block is as follows:
- Normalization and Expansion: Apply LayerNorm and a linear projection to expand from $C$ to $4C$ channels.
- Dual Branches: The resulting $4C$-channel tensor is split into two independent $2C$-channel branches for parameter diversity. Each branch further splits into a 'forward' ($x_{fw}$) and a 'backward' ($x_{bw}$) $C$-channel stream, with $x_{bw}$ temporally reversed.
- State Matrix Parameterization: A shared learnable state-transition parameter $A$ per branch is chunked into $A_{fw}$ and $A_{bw}$. The diagonal of $A_{bw}$ is masked to zero.
- Bidirectional SSM Processing: The forward and backward streams are processed by "vanilla Mamba" selective-scan SSMs, omitting the final linear projection for modularity. The backward output is flipped back to the original time order after processing.
- Identity Pathway and Gating: A third linear "identity" pathway operates in parallel. The outputs of the SSMs and identity are fused by elementwise product gating and concatenation.
- Final Integration: A linear layer restores the output dimension to $C$ channels, and a residual connection with the input is added.
- Parallelization: The entire block is duplicated into two non-parameter-sharing branches for improved robustness against temporal context decay (a shape-level sketch of this dataflow is given below).
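A minimal PyTorch shape trace of the expansion and splitting steps; the layer names and sizes here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

B, S, C = 2, 2304, 256             # illustrative batch, sequence length, channel sizes
x_in = torch.randn(B, S, C)

norm = nn.LayerNorm(C)
expand = nn.Linear(C, 4 * C)       # C -> 4C expansion

z = expand(norm(x_in))             # (B, S, 4C)
x_a, x_b = z.chunk(2, dim=-1)      # two independent 2C-channel branches
x_fw, x_bw = x_a.chunk(2, dim=-1)  # forward / backward C-channel streams
x_bw = torch.flip(x_bw, dims=[1])  # temporally reverse the backward stream

print(z.shape, x_a.shape, x_fw.shape, x_bw.shape)
# torch.Size([2, 2304, 1024]) torch.Size([2, 2304, 512]) torch.Size([2, 2304, 256]) torch.Size([2, 2304, 256])
```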
3. Mathematical Formalism
3.1. Discrete State-Space Model
The continuous SSM is defined as
$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$
Discretization via zero-order hold with step $\Delta$ yields
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,$$
with iteration
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
The corresponding convolution kernel is
$$\bar{K} = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{S-1}\bar{B}\bigr), \qquad y = x * \bar{K}.$$
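As a quick numerical sanity check of this recurrence/convolution equivalence, the following NumPy sketch uses a scalar state and made-up parameter values (all numbers are illustrative):

```python
import numpy as np

# Toy 1-D SSM with a scalar state; parameter values are illustrative only.
S = 6
A_bar, B_bar, C_out = 0.9, 0.5, 1.2        # discretized parameters
x = np.random.randn(S)

# Recurrent form: h_t = A_bar * h_{t-1} + B_bar * x_t ; y_t = C * h_t
h, y_rec = 0.0, np.zeros(S)
for t in range(S):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C_out * h

# Convolutional form: y_t = sum_k K[k] * x[t-k] with K[k] = C * A_bar^k * B_bar
K = C_out * (A_bar ** np.arange(S)) * B_bar
y_conv = np.array([np.sum(K[:t + 1][::-1] * x[:t + 1]) for t in range(S)])

assert np.allclose(y_rec, y_conv)          # both views give the same causal output
```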
3.2. Bidirectional Scan and Diagonal Mask
Let $J$ be the $S \times S$ temporal reversal (anti-identity) matrix, and write the causal scans as lower-triangular convolution matrices $M_{\mathrm{fw}}$ and $M_{\mathrm{bw}}$. The forward and backward SSM outputs are
$$y_{\mathrm{fw}} = M_{\mathrm{fw}}\,x, \qquad y_{\mathrm{bw}} = J\,M_{\mathrm{bw}}\,J\,x.$$
Naive summation yields the combined matrix $M = M_{\mathrm{fw}} + J M_{\mathrm{bw}} J$ with diagonal
$$M_{tt} = (M_{\mathrm{fw}})_{tt} + (J M_{\mathrm{bw}} J)_{tt},$$
so every frame's self-connection receives contributions from both scans. The DMBSS introduces a mask $\mathcal{M}$ with $\mathcal{M}_{ij} = 1 - \delta_{ij}$ (zeros on the diagonal, ones elsewhere), applied to the backward branch,
$$\tilde{M}_{\mathrm{bw}} = \mathcal{M} \odot \bigl(J M_{\mathrm{bw}} J\bigr),$$
enforcing $(\tilde{M}_{\mathrm{bw}})_{tt} = 0$ and nullifying backward self-connections; in practice this is realized by masking the diagonal of the backward state parameter $A_{\mathrm{bw}}$ (cf. Sections 2 and 4).
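The same masking idea can be demonstrated at the matrix level in a few lines of NumPy; the toy causal kernels below mirror the formalism above rather than the exact parameterization used in the paper:

```python
import numpy as np

S = 5
k_fw = 0.8 ** np.arange(S)          # toy causal kernel (C * A^k * B) for the forward scan
k_bw = 0.7 ** np.arange(S)          # toy causal kernel for the backward scan

# Lower-triangular Toeplitz mixing matrices built from the causal kernels.
M_fw = np.array([[k_fw[i - j] if i >= j else 0.0 for j in range(S)] for i in range(S)])
M_bw = np.array([[k_bw[i - j] if i >= j else 0.0 for j in range(S)] for i in range(S)])

J = np.flipud(np.eye(S))            # temporal reversal (anti-identity) matrix
M_bwd = J @ M_bw @ J                # upper-triangular backward mixing matrix

naive  = M_fw + M_bwd                        # diagonal receives two contributions
masked = M_fw + (1.0 - np.eye(S)) * M_bwd    # DMBSS: zero backward self-connections

print(np.diag(naive))    # doubled self-terms (2.0 everywhere in this toy example)
print(np.diag(masked))   # only the forward self-term remains (1.0)
```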
3.3. Gating and Fusion
For each branch, let $Z_{\mathrm{fw}}$ denote the forward SSM output, $Z_{\mathrm{bw}}$ the backward SSM output (restored to original time order), and $X_{\mathrm{id}}$ the identity path. The gated fusion is
$$U = Z_{\mathrm{fw}} \odot X_{\mathrm{id}}, \qquad V = Z_{\mathrm{bw}} \odot X_{\mathrm{id}}, \qquad O = \mathrm{Concat}(U, V).$$
A linear projection of $O$ back to $C$ channels and a residual addition with the block input finalize the output.
4. Pseudocode Implementation
A simplified pseudocode for a single dual-branch DMBSS block:
```
Z = Linear1(LN(X_in))                 # ℝ^{B×S×4C}
[X_a, X_b] = Chunk(Z, 2)              # ℝ^{B×S×2C} each
[x_fw, x_bw] = Chunk(X_a, 2)
x_bw = Flip(x_bw, time)
A = -exp(Param_A)                     # ℝ^{C×D}
[A_fw, A_bw] = Chunk(A, 2)
A_bw = A_bw ⊙ (1 - I_C)               # Remove diagonal
Z_fw = SSM(x_fw; A_fw)                # ℝ^{B×S×C}
Z_bw = Flip(SSM(x_bw; A_bw), time)
X_id = Linear2(X_in)                  # ℝ^{B×S×C}
U = Z_fw ⊙ X_id
V = Z_bw ⊙ X_id
Out = Linear3(Concat(U, V)) + X_in    # residual
return Out
```
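For concreteness, here is a minimal runnable PyTorch sketch of the block under stated assumptions: a plain diagonal-SSM recurrence stands in for Mamba's selective scan, the layer and parameter names are illustrative, and the diagonal-mask placement follows the pseudocode above rather than the official implementation:

```python
import torch
import torch.nn as nn


def simple_ssm_scan(x, A, B, C):
    """Naive diagonal-SSM recurrence: h_t = exp(A) * h_{t-1} + B * x_t, y_t = sum_state(C * h_t).

    x: (batch, seq, channels); A, B, C: (channels, state) parameter matrices.
    A deliberately simple stand-in for Mamba's selective scan.
    """
    batch, seq, ch = x.shape
    A_bar = torch.exp(A)                          # per-channel, per-state decay in (0, 1)
    h = x.new_zeros(batch, ch, A.shape[-1])
    ys = []
    for t in range(seq):
        h = A_bar * h + B * x[:, t, :, None]      # broadcast the input over the state dimension
        ys.append((h * C).sum(-1))                # read out: contract the state dimension
    return torch.stack(ys, dim=1)                 # (batch, seq, channels)


class DMBSSBranch(nn.Module):
    """One of the two non-parameter-sharing DMBSS branches (illustrative)."""

    def __init__(self, channels, state_dim=16):
        super().__init__()
        # Shared state-transition parameter, later chunked into forward/backward halves.
        self.log_A = nn.Parameter(torch.zeros(2 * channels, state_dim))
        self.B = nn.Parameter(torch.randn(channels, state_dim) * 0.1)
        self.C = nn.Parameter(torch.randn(channels, state_dim) * 0.1)

    def forward(self, x_fw, x_bw_flipped):
        ch = x_fw.shape[-1]
        A = -torch.exp(self.log_A)                # negative values keep the recurrence stable
        A_fw, A_bw = A[:ch], A[ch:]
        # Diagonal mask on the backward state parameter, as in the pseudocode above.
        mask = 1.0 - torch.eye(ch, A_bw.shape[-1], device=A_bw.device)
        A_bw = A_bw * mask
        z_fw = simple_ssm_scan(x_fw, A_fw, self.B, self.C)
        z_bw = simple_ssm_scan(x_bw_flipped, A_bw, self.B, self.C)
        return z_fw, torch.flip(z_bw, dims=[1])   # restore original time order


class DMBSS(nn.Module):
    """Diagonal-Masked Bidirectional State-Space block (simplified sketch)."""

    def __init__(self, channels, state_dim=16):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.expand = nn.Linear(channels, 4 * channels)        # C -> 4C expansion
        self.identity = nn.Linear(channels, channels)          # identity/gating pathway
        self.branches = nn.ModuleList(
            [DMBSSBranch(channels, state_dim) for _ in range(2)]
        )
        self.out = nn.Linear(4 * channels, channels)           # fuse back to C channels

    def forward(self, x_in):
        z = self.expand(self.norm(x_in))                       # (B, S, 4C)
        x_id = self.identity(x_in)                             # (B, S, C)
        outs = []
        for branch, xb in zip(self.branches, z.chunk(2, dim=-1)):
            x_fw, x_bw = xb.chunk(2, dim=-1)                   # (B, S, C) each
            z_fw, z_bw = branch(x_fw, torch.flip(x_bw, dims=[1]))
            outs += [z_fw * x_id, z_bw * x_id]                 # elementwise product gating
        return self.out(torch.cat(outs, dim=-1)) + x_in        # residual connection


x = torch.randn(2, 128, 64)
print(DMBSS(64)(x).shape)                                      # torch.Size([2, 128, 64])
```

The Python loop in `simple_ssm_scan` is for exposition only; a real deployment would use a parallel scan kernel, which is what gives the module its practical linear-time behavior.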
5. Computational Complexity Analysis
Each SSM scan executes in $O(S \cdot D \cdot C)$ time per sequence (sequence length $S$, SSM state dimension $D$, channels $C$), preserving linear scaling in temporal length. DMBSS requires two scans per branch and two branches, yielding an aggregate complexity of approximately $4 \cdot O(S \cdot D \cdot C)$, still linear in $S$. All other operations (LayerNorm, linear projections, elementwise fusion) are at most $O(S \cdot C^2)$. Therefore, the end-to-end complexity is $O(S)$ for fixed channel and state widths. Memory overhead is modest, with the principal auxiliary state being $O(D \cdot C)$ per branch, significantly reduced compared to attention's $O(S^2)$ scaling.
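A rough, purely illustrative comparison of token-mixing cost; the sequence length, channel width, and state size below are placeholder values, not figures from the paper:

```python
# Back-of-the-envelope operation counts for one token-mixing step (illustrative numbers only).
S, C, D = 2304, 256, 16          # assumed sequence length, channels, SSM state size

ssm_ops = 4 * S * D * C          # four scans (2 directions x 2 branches), O(S*D*C)
attn_ops = S * S * C             # one dense self-attention mixing step, O(S^2*C)

print(f"SSM scans : {ssm_ops:>13,} multiply-adds")
print(f"Attention : {attn_ops:>13,} multiply-adds")
print(f"ratio     : {attn_ops / ssm_ops:.1f}x")   # ~36x at these settings
```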
6. Integration within MambaTAD Architecture
DMBSS is used throughout the MambaTAD pipeline:
- Feature Extraction: Backbone produces per-frame feature tensors.
- State-Space Temporal Adapter (SSTA): Lightweight DMBSS-based adapters at each backbone layer enhance context with minimal parameter count.
- Projection Pyramid: Multiple DMBSS blocks plus max-pooling stack to form a temporal feature pyramid, capturing multi-scale action signals.
- Global Feature Fusion Head: Features from all pyramid levels are concatenated and processed by a final DMBSS, serving as input for classification and regression heads.
At every stage, DMBSS replaces conventional attention or CNN modules, delivering both bidirectional context and linear complexity suitable for long untrimmed videos in TAD.
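As a hedged sketch of how such a projection pyramid could be assembled from a DMBSS-style block (reusing the illustrative `DMBSS` class from Section 4); the number of levels and the stride-2 max-pooling are assumptions for illustration, not the exact MambaTAD configuration:

```python
import torch
import torch.nn as nn


class TemporalPyramid(nn.Module):
    """Illustrative projection pyramid: a temporal block (e.g., DMBSS) plus max-pooling per level."""

    def __init__(self, make_block, channels, num_levels=4):
        super().__init__()
        self.blocks = nn.ModuleList([make_block(channels) for _ in range(num_levels)])
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):                                      # x: (B, S, C)
        feats = []
        for block in self.blocks:
            x = block(x)                                       # bidirectional context at this scale
            feats.append(x)
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve the temporal resolution
        return feats                                           # multi-scale features for the fusion head


# Usage, e.g. with the DMBSS sketch from Section 4:
# pyramid = TemporalPyramid(DMBSS, channels=256)
# feats = pyramid(torch.randn(2, 512, 256))
```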
7. Empirical Evaluation and Comparative Insights
Major empirical observations from MambaTAD integration:
- Switching from vanilla Mamba to DMBSS yields a +0.9% average mAP boost on THUMOS14.
- The diagonal mask on the backward SSM results in an extra ~0.5% mAP versus unmasked bidirectional Mamba.
- Dual-branch design (non-shared parameters) is more effective than a single shared-parameter variant, indicating improved mitigation of context decay.
- Full end-to-end deployment (SSTA+DMBSS) secures a +1.9% mAP increment over an ActionFormer-based baseline.
- Comparative studies show DMBSS outperforming other SSM architectures (Mamba2, Hydra, DBM/CausalTAD) in terms of mAP, specifically through effective handling of temporal context decay and diagonal conflicts.
- High robustness is maintained for long action segments (Coverage > 8%, Length > 18 s), where alternative models typically degrade (Lu et al., 22 Nov 2025).
DMBSS emerges as a foundational bidirectional SSM block for long-range temporal modeling, facilitating linear-time global context propagation, refined fusion of forward and backward dependencies, and generalization to extended action instances in video analysis.