Diagonal-Masked Bidirectional State-Space (DMBSS)
- DMBSS is a bidirectional state-space module that applies a diagonal-masking scheme to suppress self-connections and sharpen inter-frame dependencies.
- The architecture integrates dual SSM scans with forward and backward streams to mitigate temporal context decay, yielding measurable performance gains on TAD benchmarks.
- Its linear computational complexity and dual-branch design make it well suited to modeling long-range dependencies in untrimmed video sequences.
The Diagonal-Masked Bidirectional State-Space (DMBSS) module is an architectural component designed to address long-range temporal context modeling and conflict resolution in sequence processing, as exemplified by its integration into the MambaTAD framework for Temporal Action Detection (TAD). DMBSS leverages bidirectional structured state-space models (SSMs) with a diagonal-masking scheme to prevent self-correlation inflation and preserve sharp inter-token relationships, achieving linear computational complexity and improved empirical performance across TAD benchmarks (Lu et al., 22 Nov 2025).
1. Motivation and Problem Formulation
Structured State-Space Models (SSMs) such as S4 and Mamba are capable of causal sequence modeling using lower-triangular "convolution" matrices that restrict each output $y_t$ to inputs $x_{1:t}$, i.e., only inputs up to time $t$. In TAD, two principal issues with such models are identified: (1) temporal context decay due to deep recursion and strictly causal filtering, leading to inadequate long-range context for determining action boundaries, and (2) self-element conflicts when fusing forward and backward passes, which result in overemphasized diagonal (self-attention) entries and weakened inter-frame dependencies. DMBSS was introduced to resolve these shortcomings by (a) restoring full bidirectional global context, and (b) suppressing diagonal contributions in the backward scan, thus sharpening the model's sensitivity to temporal boundaries and long-span dependencies (Lu et al., 22 Nov 2025).
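To see the self-element conflict concretely, consider a length-3 sequence (notation is illustrative and anticipates Section 3: $M_{\mathrm{fw}}$ is the causal forward mixing matrix, $J M_{\mathrm{bw}} J$ its time-reversed backward counterpart). Naively summing the two scans gives every frame two self-contributions but each cross-frame pair only one:

$$
M_{\mathrm{fw}} = \begin{pmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{pmatrix},
\quad
J M_{\mathrm{bw}} J = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ 0 & b_{22} & b_{23} \\ 0 & 0 & b_{33} \end{pmatrix},
\quad
M_{\mathrm{fw}} + J M_{\mathrm{bw}} J = \begin{pmatrix} a_{11}{+}b_{11} & b_{12} & b_{13} \\ a_{21} & a_{22}{+}b_{22} & b_{23} \\ a_{31} & a_{32} & a_{33}{+}b_{33} \end{pmatrix}.
$$

Zeroing the diagonal of the backward term removes the doubled self-entries while leaving all cross-frame dependencies intact.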
2. Architectural Design and Dataflow
Within each application in MambaTAD, DMBSS operates on a feature tensor $X \in \mathbb{R}^{B \times S \times C}$ (batch size $B$, sequence length $S$, channels $C$). The core dataflow per DMBSS block is as follows:
- Normalization and Expansion: Apply LayerNorm and a linear projection to expand from $C$ to $4C$ channels.
- Dual Branches: The resulting $4C$-channel tensor is split into two independent $2C$-channel branches for parameter diversity. Each branch further splits into a 'forward' ($x_{fw}$) and a 'backward' ($x_{bw}$) $C$-channel stream, with $x_{bw}$ temporally reversed.
- State Matrix Parameterization: A shared learnable state-transition parameter $A$ per branch is chunked into $A_{fw}$ and $A_{bw}$. The diagonal of $A_{bw}$ is masked to zero.
- Bidirectional SSM Processing: The forward and backward streams are processed by "vanilla Mamba" selective-scan SSMs, omitting the final linear projection for modularity. The backward output is flipped back to the original time order after processing.
- Identity Pathway and Gating: A third linear "identity" pathway operates in parallel. The outputs of the SSMs and identity are fused by elementwise product gating and concatenation.
- Final Integration: A linear layer restores the output dimension to $C$ channels, and a residual connection with the input is added.
- Parallelization: The entire block is duplicated into two non-parameter-sharing branches for improved robustness against temporal context decay (a shape-level sketch of this dataflow is given below).
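A minimal PyTorch shape trace of the expansion and splitting steps; the layer names and sizes here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

B, S, C = 2, 2304, 256             # illustrative batch, sequence length, channel sizes
x_in = torch.randn(B, S, C)

norm = nn.LayerNorm(C)
expand = nn.Linear(C, 4 * C)       # C -> 4C expansion

z = expand(norm(x_in))             # (B, S, 4C)
x_a, x_b = z.chunk(2, dim=-1)      # two independent 2C-channel branches
x_fw, x_bw = x_a.chunk(2, dim=-1)  # forward / backward C-channel streams
x_bw = torch.flip(x_bw, dims=[1])  # temporally reverse the backward stream

print(z.shape, x_a.shape, x_fw.shape, x_bw.shape)
# torch.Size([2, 2304, 1024]) torch.Size([2, 2304, 512]) torch.Size([2, 2304, 256]) torch.Size([2, 2304, 256])
```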
3. Mathematical Formalism
3.1. Discrete State-Space Model
The continuous SSM is defined as
$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$
Discretization via zero-order hold with step $\Delta$ yields
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,$$
with iteration
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$
The corresponding convolution kernel is
$$\bar{K} = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{S-1}\bar{B}\bigr), \qquad y = x * \bar{K}.$$
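As a quick numerical sanity check of this recurrence/convolution equivalence, the following NumPy sketch uses a scalar state and made-up parameter values (all numbers are illustrative):

```python
import numpy as np

# Toy 1-D SSM with a scalar state; parameter values are illustrative only.
S = 6
A_bar, B_bar, C_out = 0.9, 0.5, 1.2        # discretized parameters
x = np.random.randn(S)

# Recurrent form: h_t = A_bar * h_{t-1} + B_bar * x_t ; y_t = C * h_t
h, y_rec = 0.0, np.zeros(S)
for t in range(S):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C_out * h

# Convolutional form: y_t = sum_k K[k] * x[t-k] with K[k] = C * A_bar^k * B_bar
K = C_out * (A_bar ** np.arange(S)) * B_bar
y_conv = np.array([np.sum(K[:t + 1][::-1] * x[:t + 1]) for t in range(S)])

assert np.allclose(y_rec, y_conv)          # both views give the same causal output
```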
3.2. Bidirectional Scan and Diagonal Mask
Let $J$ be the $S \times S$ temporal reversal (anti-identity) matrix, and write the causal scans as lower-triangular convolution matrices $M_{\mathrm{fw}}$ and $M_{\mathrm{bw}}$. The forward and backward SSM outputs are
$$y_{\mathrm{fw}} = M_{\mathrm{fw}}\,x, \qquad y_{\mathrm{bw}} = J\,M_{\mathrm{bw}}\,J\,x.$$
Naive summation yields the combined matrix $M = M_{\mathrm{fw}} + J M_{\mathrm{bw}} J$ with diagonal
$$M_{tt} = (M_{\mathrm{fw}})_{tt} + (J M_{\mathrm{bw}} J)_{tt},$$
so every frame's self-connection receives contributions from both scans. The DMBSS introduces a mask $\mathcal{M}$ with $\mathcal{M}_{ij} = 1 - \delta_{ij}$ (zeros on the diagonal, ones elsewhere), applied to the backward branch,
$$\tilde{M}_{\mathrm{bw}} = \mathcal{M} \odot \bigl(J M_{\mathrm{bw}} J\bigr),$$
enforcing $(\tilde{M}_{\mathrm{bw}})_{tt} = 0$ and nullifying backward self-connections; in practice this is realized by masking the diagonal of the backward state parameter $A_{\mathrm{bw}}$ (cf. Sections 2 and 4).
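The same masking idea can be demonstrated at the matrix level in a few lines of NumPy; the toy causal kernels below mirror the formalism above rather than the exact parameterization used in the paper:

```python
import numpy as np

S = 5
k_fw = 0.8 ** np.arange(S)          # toy causal kernel (C * A^k * B) for the forward scan
k_bw = 0.7 ** np.arange(S)          # toy causal kernel for the backward scan

# Lower-triangular Toeplitz mixing matrices built from the causal kernels.
M_fw = np.array([[k_fw[i - j] if i >= j else 0.0 for j in range(S)] for i in range(S)])
M_bw = np.array([[k_bw[i - j] if i >= j else 0.0 for j in range(S)] for i in range(S)])

J = np.flipud(np.eye(S))            # temporal reversal (anti-identity) matrix
M_bwd = J @ M_bw @ J                # upper-triangular backward mixing matrix

naive  = M_fw + M_bwd                        # diagonal receives two contributions
masked = M_fw + (1.0 - np.eye(S)) * M_bwd    # DMBSS: zero backward self-connections

print(np.diag(naive))    # doubled self-terms (2.0 everywhere in this toy example)
print(np.diag(masked))   # only the forward self-term remains (1.0)
```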
3.3. Gating and Fusion
For each branch, let $Z_{\mathrm{fw}}$ denote the forward SSM output, $Z_{\mathrm{bw}}$ the backward SSM output (restored to original time order), and $X_{\mathrm{id}}$ the identity path. The gated fusion is
$$U = Z_{\mathrm{fw}} \odot X_{\mathrm{id}}, \qquad V = Z_{\mathrm{bw}} \odot X_{\mathrm{id}}, \qquad O = \mathrm{Concat}(U, V).$$
A linear projection of $O$ back to $C$ channels and a residual addition with the block input finalize the output.
4. Pseudocode Implementation
A simplified pseudocode for a single dual-branch DMBSS block:
```
Z = Linear1(LN(X_in))                 # ℝ^{B×S×4C}
[X_a, X_b] = Chunk(Z, 2)              # ℝ^{B×S×2C} each
[x_fw, x_bw] = Chunk(X_a, 2)
x_bw = Flip(x_bw, time)
A = -exp(Param_A)                     # ℝ^{C×D}
[A_fw, A_bw] = Chunk(A, 2)
A_bw = A_bw ⊙ (1 - I_C)               # Remove diagonal
Z_fw = SSM(x_fw; A_fw)                # ℝ^{B×S×C}
Z_bw = Flip(SSM(x_bw; A_bw), time)
X_id = Linear2(X_in)                  # ℝ^{B×S×C}
U = Z_fw ⊙ X_id
V = Z_bw ⊙ X_id
Out = Linear3(Concat(U, V)) + X_in    # residual
return Out
```
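For concreteness, here is a minimal runnable PyTorch sketch of the block under stated assumptions: a plain diagonal-SSM recurrence stands in for Mamba's selective scan, the layer and parameter names are illustrative, and the diagonal-mask placement follows the pseudocode above rather than the official implementation:

```python
import torch
import torch.nn as nn


def simple_ssm_scan(x, A, B, C):
    """Naive diagonal-SSM recurrence: h_t = exp(A) * h_{t-1} + B * x_t, y_t = sum_state(C * h_t).

    x: (batch, seq, channels); A, B, C: (channels, state) parameter matrices.
    A deliberately simple stand-in for Mamba's selective scan.
    """
    batch, seq, ch = x.shape
    A_bar = torch.exp(A)                          # per-channel, per-state decay in (0, 1)
    h = x.new_zeros(batch, ch, A.shape[-1])
    ys = []
    for t in range(seq):
        h = A_bar * h + B * x[:, t, :, None]      # broadcast the input over the state dimension
        ys.append((h * C).sum(-1))                # read out: contract the state dimension
    return torch.stack(ys, dim=1)                 # (batch, seq, channels)


class DMBSSBranch(nn.Module):
    """One of the two non-parameter-sharing DMBSS branches (illustrative)."""

    def __init__(self, channels, state_dim=16):
        super().__init__()
        # Shared state-transition parameter, later chunked into forward/backward halves.
        self.log_A = nn.Parameter(torch.zeros(2 * channels, state_dim))
        self.B = nn.Parameter(torch.randn(channels, state_dim) * 0.1)
        self.C = nn.Parameter(torch.randn(channels, state_dim) * 0.1)

    def forward(self, x_fw, x_bw_flipped):
        ch = x_fw.shape[-1]
        A = -torch.exp(self.log_A)                # negative values keep the recurrence stable
        A_fw, A_bw = A[:ch], A[ch:]
        # Diagonal mask on the backward state parameter, as in the pseudocode above.
        mask = 1.0 - torch.eye(ch, A_bw.shape[-1], device=A_bw.device)
        A_bw = A_bw * mask
        z_fw = simple_ssm_scan(x_fw, A_fw, self.B, self.C)
        z_bw = simple_ssm_scan(x_bw_flipped, A_bw, self.B, self.C)
        return z_fw, torch.flip(z_bw, dims=[1])   # restore original time order


class DMBSS(nn.Module):
    """Diagonal-Masked Bidirectional State-Space block (simplified sketch)."""

    def __init__(self, channels, state_dim=16):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.expand = nn.Linear(channels, 4 * channels)        # C -> 4C expansion
        self.identity = nn.Linear(channels, channels)          # identity/gating pathway
        self.branches = nn.ModuleList(
            [DMBSSBranch(channels, state_dim) for _ in range(2)]
        )
        self.out = nn.Linear(4 * channels, channels)           # fuse back to C channels

    def forward(self, x_in):
        z = self.expand(self.norm(x_in))                       # (B, S, 4C)
        x_id = self.identity(x_in)                             # (B, S, C)
        outs = []
        for branch, xb in zip(self.branches, z.chunk(2, dim=-1)):
            x_fw, x_bw = xb.chunk(2, dim=-1)                   # (B, S, C) each
            z_fw, z_bw = branch(x_fw, torch.flip(x_bw, dims=[1]))
            outs += [z_fw * x_id, z_bw * x_id]                 # elementwise product gating
        return self.out(torch.cat(outs, dim=-1)) + x_in        # residual connection


x = torch.randn(2, 128, 64)
print(DMBSS(64)(x).shape)                                      # torch.Size([2, 128, 64])
```

The Python loop in `simple_ssm_scan` is for exposition only; a real deployment would use a parallel scan kernel, which is what gives the module its practical linear-time behavior.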
5. Computational Complexity Analysis
Each SSM scan executes in $O(S \cdot D \cdot C)$ time per sequence (sequence length $S$, SSM state dimension $D$, channels $C$), preserving linear scaling in temporal length. DMBSS requires two scans per branch and two branches, yielding an aggregate complexity of approximately $4 \cdot O(S \cdot D \cdot C)$, still linear in $S$. All other operations (LayerNorm, linear projections, elementwise fusion) are at most $O(S \cdot C^2)$. Therefore, the end-to-end complexity is $O(S)$ for fixed channel and state widths. Memory overhead is modest, with the principal auxiliary state being $O(D \cdot C)$ per branch, significantly reduced compared to attention's $O(S^2)$ scaling.
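A rough, purely illustrative comparison of token-mixing cost; the sequence length, channel width, and state size below are placeholder values, not figures from the paper:

```python
# Back-of-the-envelope operation counts for one token-mixing step (illustrative numbers only).
S, C, D = 2304, 256, 16          # assumed sequence length, channels, SSM state size

ssm_ops = 4 * S * D * C          # four scans (2 directions x 2 branches), O(S*D*C)
attn_ops = S * S * C             # one dense self-attention mixing step, O(S^2*C)

print(f"SSM scans : {ssm_ops:>13,} multiply-adds")
print(f"Attention : {attn_ops:>13,} multiply-adds")
print(f"ratio     : {attn_ops / ssm_ops:.1f}x")   # ~36x at these settings
```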
6. Integration within MambaTAD Architecture
DMBSS is used throughout the MambaTAD pipeline:
- Feature Extraction: Backbone produces per-frame feature tensors.
- State-Space Temporal Adapter (SSTA): Lightweight DMBSS-based adapters at each backbone layer enhance context with minimal parameter count.
- Projection Pyramid: Multiple DMBSS blocks plus max-pooling stack to form a temporal feature pyramid, capturing multi-scale action signals.
- Global Feature Fusion Head: Features from all pyramid levels are concatenated and processed by a final DMBSS, serving as input for classification and regression heads.
At every stage, DMBSS replaces conventional attention or CNN modules, delivering both bidirectional context and linear complexity suitable for long untrimmed videos in TAD.
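As a hedged sketch of how such a projection pyramid could be assembled from a DMBSS-style block (reusing the illustrative `DMBSS` class from Section 4); the number of levels and the stride-2 max-pooling are assumptions for illustration, not the exact MambaTAD configuration:

```python
import torch
import torch.nn as nn


class TemporalPyramid(nn.Module):
    """Illustrative projection pyramid: a temporal block (e.g., DMBSS) plus max-pooling per level."""

    def __init__(self, make_block, channels, num_levels=4):
        super().__init__()
        self.blocks = nn.ModuleList([make_block(channels) for _ in range(num_levels)])
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):                                      # x: (B, S, C)
        feats = []
        for block in self.blocks:
            x = block(x)                                       # bidirectional context at this scale
            feats.append(x)
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve the temporal resolution
        return feats                                           # multi-scale features for the fusion head


# Usage, e.g. with the DMBSS sketch from Section 4:
# pyramid = TemporalPyramid(DMBSS, channels=256)
# feats = pyramid(torch.randn(2, 512, 256))
```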
7. Empirical Evaluation and Comparative Insights
Major empirical observations from MambaTAD integration:
- Switching from vanilla Mamba to DMBSS yields a +0.9% average mAP boost on THUMOS14.
- The diagonal mask on the backward SSM results in an extra ~0.5% mAP versus unmasked bidirectional Mamba.
- Dual-branch design (non-shared parameters) is more effective than a single shared-parameter variant, indicating improved mitigation of context decay.
- Full end-to-end deployment (SSTA+DMBSS) secures a +1.9% mAP increment over an ActionFormer-based baseline.
- Comparative studies show DMBSS outperforming other SSM architectures (Mamba2, Hydra, DBM/CausalTAD) in terms of mAP, specifically through effective handling of temporal context decay and diagonal conflicts.
- High robustness is maintained for long action segments (Coverage > 8%, Length > 18 s), where alternative models typically degrade (Lu et al., 22 Nov 2025).
DMBSS emerges as a foundational bidirectional SSM block for long-range temporal modeling, facilitating linear-time global context propagation, refined fusion of forward and backward dependencies, and generalization to extended action instances in video analysis.