Diagonal-Masked Bidirectional State-Space (DMBSS)

Updated 26 November 2025
  • DMBSS is a bidirectional state-space module that applies a diagonal-masking scheme to suppress self-connections and sharpen inter-frame dependencies.
  • The architecture integrates dual SSM scans with forward and backward streams to mitigate temporal context decay, yielding measurable performance gains on TAD benchmarks.
  • Its linear computational complexity and dual-branch design make it well suited to modeling long-range dependencies in untrimmed video sequences.

The Diagonal-Masked Bidirectional State-Space (DMBSS) module is an architectural component designed to address long-range temporal context modeling and conflict resolution in sequence processing, as exemplified by its integration into the MambaTAD framework for Temporal Action Detection (TAD). DMBSS leverages bidirectional structured state-space models (SSMs) with a diagonal-masking scheme to prevent self-correlation inflation and preserve sharp inter-token relationships, achieving linear computational complexity and improved empirical performance across TAD benchmarks (Lu et al., 22 Nov 2025).

1. Motivation and Problem Formulation

Structured State-Space Models (SSMs) such as S4 and Mamba are capable of causal sequence modeling using lower-triangular "convolution" matrices that restrict each output $y_t$ to access only inputs up to $x_t$. In TAD, two principal issues with such models are identified: (1) temporal context decay due to deep recursion and strictly causal filtering, leading to inadequate long-range context for determining action boundaries, and (2) self-element conflicts when fusing forward and backward passes, which result in overemphasized diagonal (self-attention) entries and weakened inter-frame dependencies. DMBSS was introduced to resolve these shortcomings by (a) restoring full bidirectional global context, and (b) suppressing diagonal contributions in the backward scan, thus sharpening the model’s sensitivity to temporal boundaries and long-span dependencies (Lu et al., 22 Nov 2025).

2. Architectural Design and Dataflow

Within each application in MambaTAD, DMBSS operates on a feature tensor $X \in \mathbb{R}^{B \times S \times C}$ (batch $B$, sequence length $S$, channels $C$). The core dataflow per DMBSS block is as follows:

  • Normalization and Expansion: Apply LayerNorm and a linear projection to expand $X$ from $C$ to $4C$ channels.
  • Dual Branches: The resulting $4C$-channel tensor is split into two independent $2C$-channel branches for parameter diversity. Each branch further splits into 'forward' ($x_{\text{fw}}$) and 'backward' ($x_{\text{bw}}$) $C$-channel streams, with $x_{\text{bw}}$ temporally reversed.
  • State Matrix Parameterization: A shared learnable state-transition parameter $A \in \mathbb{R}^{C \times D}$ per branch is chunked into $A_{\text{fw}}$ and $A_{\text{bw}}$. The diagonal of $A_{\text{bw}}$ is masked to zero.
  • Bidirectional SSM Processing: Forward and backward streams are processed by "vanilla Mamba" selective-scan SSMs, omitting the final linear projection for modularity. The backward output is reversed back to the original time order after processing.
  • Identity Pathway and Gating: A third linear "identity" pathway operates in parallel. The outputs of the two SSMs and the identity path are fused by elementwise-product gating and concatenation.
  • Final Integration: A linear layer restores the output dimension to $C$ channels, and a residual connection with the input $X$ is added.
  • Parallelization: The entire block is duplicated into two non-parameter-sharing branches for improved robustness against temporal context decay.

3. Mathematical Formalism

3.1. Discrete State-Space Model

The continuous SSM is defined as:

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t)$$

Discretization via zero-order hold with step $\Delta$ yields:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

with iteration:

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t$$

The corresponding convolution kernel is:

$$K = [C\bar{B},\, C\bar{A}\bar{B},\, \ldots,\, C\bar{A}^{S-1}\bar{B}] \in \mathbb{R}^{S}, \qquad y = x * K$$
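
As a concrete illustration, the following is a minimal NumPy sketch of the zero-order-hold discretization and the causal recurrence above, assuming a diagonal state matrix and illustrative dimensions (all names and values here are hypothetical, not taken from the paper):

import numpy as np

def zoh_discretize(A, B, delta):
    # Zero-order-hold discretization of a diagonal SSM.
    # A, B: (D,) vectors (diagonal state matrix and input matrix); delta: scalar step.
    A_bar = np.exp(delta * A)
    B_bar = (np.exp(delta * A) - 1.0) / A * B   # elementwise (Delta A)^{-1}(exp(Delta A) - I) Delta B
    return A_bar, B_bar

def ssm_scan(x, A_bar, B_bar, C):
    # Causal recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t
    h = np.zeros(A_bar.shape[0])
    y = np.zeros(len(x))
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
A = -np.abs(rng.normal(size=4))                 # stable (negative real) diagonal entries
B, C = rng.normal(size=4), rng.normal(size=4)
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(rng.normal(size=8), A_bar, B_bar, C)   # S = 8 output samples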

3.2. Bidirectional Scan and Diagonal Mask

Let $J_S$ be the temporal reversal matrix. The forward and backward SSM outputs are:

$$Y_{\text{fw}} = M_{\text{fw}} X, \qquad Y_{\text{bw}} = J_S \left( M_{\text{bw}} (J_S X) \right)$$

Naive summation yields a combined matrix $M'$ with diagonal

$$M'_{ii} = (M_{\text{fw}})_{ii} + (M_{\text{bw}})_{ii}$$

The DMBSS module introduces a mask

$$\text{Mask}_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

applied to $A_{\text{bw}}$, enforcing $A_{\text{bw}}^{(\text{masked})} = \text{Mask} \odot A_{\text{bw}}$ and nullifying backward self-connections.
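
A minimal sketch of the masking step, shown here on a hypothetical square parameter matrix for readability (in the actual module the mask is applied to the backward state parameter $A_{\text{bw}}$):

import numpy as np

def diagonal_mask(n):
    # Mask_{ij} = 0 if i == j, else 1  (i.e., 1 - I)
    return np.ones((n, n)) - np.eye(n)

rng = np.random.default_rng(1)
A_bw = rng.normal(size=(6, 6))                  # hypothetical backward state parameter
A_bw_masked = diagonal_mask(6) * A_bw           # elementwise product zeroes self-connections
assert np.allclose(np.diag(A_bw_masked), 0.0)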

3.3. Gating and Fusion

For each branch, let $X_1$ denote the forward SSM output, $X_2$ the backward SSM output, and $X_3$ the identity-path output:

$$U = X_1 \odot X_3, \qquad V = X_2 \odot X_3, \qquad \text{out} = \text{Concat}(U, V)$$

A linear projection and residual addition finalize the output.
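
A short sketch of this gating and fusion step on hypothetical tensors of shape (B, S, C):

import numpy as np

B, S, C = 2, 16, 32
rng = np.random.default_rng(2)
X1 = rng.normal(size=(B, S, C))                 # forward SSM output
X2 = rng.normal(size=(B, S, C))                 # backward SSM output (restored to original order)
X3 = rng.normal(size=(B, S, C))                 # identity-path output

U = X1 * X3                                     # gate the forward stream
V = X2 * X3                                     # gate the backward stream
out = np.concatenate([U, V], axis=-1)           # (B, S, 2C); a linear layer then maps back to C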

4. Pseudocode Implementation

A simplified pseudocode for a single dual-branch DMBSS block:

Z = Linear1(LN(X_in))        # ℝ^{B×S×4C}
[X_a, X_b] = Chunk(Z, 2)     # ℝ^{B×S×2C} each

[x_fw, x_bw] = Chunk(X_a, 2)
x_bw = Flip(x_bw, time)

A = exp(Param_A)            # ℝ^{C×D}
[A_fw, A_bw] = Chunk(A, 2)
A_bw = A_bw ⊙ (1 - I_C)      # Remove diagonal (mask self-connections)

Z_fw = SSM(x_fw; A_fw)       # ℝ^{B×S×C}
Z_bw = Flip(SSM(x_bw; A_bw), time)

X_id = Linear2(X_in)         # ℝ^{B×S×C}
U = Z_fw ⊙ X_id              # gate forward stream with identity path
V = Z_bw ⊙ X_id              # gate backward stream with identity path
Out = Linear3(Concat(U, V)) + X_in   # residual

return Out
Both branches execute this logic independently; their results are concatenated and projected to yield the final module output.
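
For readers who prefer runnable code, the following PyTorch-style sketch mirrors the dataflow of one branch. It is an illustration under simplifying assumptions: a plain per-channel linear recurrence stands in for the vanilla Mamba selective scan, the diagonal suppression is emulated on the backward scan output rather than by masking $A_{\text{bw}}$, and all module and parameter names are hypothetical.

import torch
import torch.nn as nn

class SimpleScan(nn.Module):
    # Stand-in for the Mamba selective scan: h_t = a * h_{t-1} + x_t, y_t = h_t (per channel).
    def __init__(self, channels):
        super().__init__()
        self.logit_a = nn.Parameter(torch.zeros(channels))

    def forward(self, x, zero_self=False):                 # x: (B, S, C)
        a = torch.sigmoid(self.logit_a)                    # keep the recurrence stable in (0, 1)
        h = x.new_zeros(x.shape[0], x.shape[2])
        ys = []
        for t in range(x.shape[1]):
            h = a * h + x[:, t]
            # Optionally drop the self (diagonal) contribution, emulating the masked backward scan
            ys.append(h - x[:, t] if zero_self else h)
        return torch.stack(ys, dim=1)

class DMBSSBranchSketch(nn.Module):
    # One branch of the dual-branch block (illustrative; not the paper's exact implementation).
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.expand = nn.Linear(channels, 2 * channels)    # forward/backward streams
        self.identity = nn.Linear(channels, channels)      # identity/gating path
        self.scan_fw = SimpleScan(channels)
        self.scan_bw = SimpleScan(channels)
        self.project = nn.Linear(2 * channels, channels)

    def forward(self, x):                                  # x: (B, S, C)
        z = self.expand(self.norm(x))
        x_fw, x_bw = z.chunk(2, dim=-1)
        x_bw = torch.flip(x_bw, dims=[1])                  # reverse time for the backward scan
        y_fw = self.scan_fw(x_fw)
        y_bw = torch.flip(self.scan_bw(x_bw, zero_self=True), dims=[1])
        x_id = self.identity(x)
        fused = torch.cat([y_fw * x_id, y_bw * x_id], dim=-1)
        return self.project(fused) + x                     # residual connection

x = torch.randn(2, 64, 128)
branch_a, branch_b = DMBSSBranchSketch(128), DMBSSBranchSketch(128)
out = branch_a(x) + branch_b(x)   # the paper concatenates and projects the two branch outputs; summed here for brevity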

5. Computational Complexity Analysis

Each SSM scan executes in $O(S \cdot D)$ time per sequence, preserving linear scaling in temporal length. DMBSS requires two scans per branch and two branches, yielding an aggregate complexity of approximately $4 \cdot O(S \cdot D)$. All other operations (LayerNorm, linear projections, elementwise fusion) are $O(S \cdot C)$. Therefore, the end-to-end complexity is $O(S \cdot \max(C, D))$. Memory overhead is modest, with the principal auxiliary state being $A \in \mathbb{R}^{C \times D}$ per branch, significantly reduced compared to attention’s $O(S^2)$ scaling.

6. Integration within MambaTAD Architecture

DMBSS is used throughout the MambaTAD pipeline:

  • Feature Extraction: Backbone produces per-frame feature tensors.
  • State-Space Temporal Adapter (SSTA): Lightweight DMBSS-based adapters at each backbone layer enhance context with minimal parameter count.
  • Projection Pyramid: Multiple DMBSS blocks plus max-pooling stack to form a temporal feature pyramid, capturing multi-scale action signals (see the sketch after this section).
  • Global Feature Fusion Head: Features from all pyramid levels are concatenated and processed by a final DMBSS, serving as input for classification and regression heads.

At every stage, DMBSS replaces conventional attention or CNN modules, delivering both bidirectional context and linear complexity suitable for long untrimmed videos in TAD.
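
To make the pyramid construction concrete, here is a minimal, hypothetical sketch of stacking temporal blocks with max-pooling between levels; the block factory, channel width, and downsampling factor of 2 are illustrative assumptions rather than the paper's exact configuration:

import torch
import torch.nn as nn

class ProjectionPyramidSketch(nn.Module):
    # Illustrative temporal pyramid: a temporal block, then max-pooling, at each level.
    def __init__(self, make_block, channels, num_levels=4):
        super().__init__()
        self.blocks = nn.ModuleList([make_block(channels) for _ in range(num_levels)])
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):                                     # x: (B, S, C)
        features = []
        for block in self.blocks:
            x = block(x)                                      # a DMBSS block would go here
            features.append(x)                                # keep this scale for the fusion head
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # halve the temporal length
        return features

# Placeholder block; in MambaTAD each level would use a DMBSS block instead
pyramid = ProjectionPyramidSketch(lambda c: nn.Linear(c, c), channels=128)
levels = pyramid(torch.randn(2, 256, 128))                    # list of (B, S / 2^k, C) tensors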

7. Empirical Evaluation and Comparative Insights

Major empirical observations from MambaTAD integration:

  • Switching from vanilla Mamba to DMBSS yields a +0.9% average mAP boost on THUMOS14.
  • The diagonal mask on the backward SSM results in an extra ~0.5% mAP versus unmasked bidirectional Mamba.
  • Dual-branch design (non-shared parameters) is more effective than a single shared-parameter variant, indicating improved mitigation of context decay.
  • Full end-to-end deployment (SSTA+DMBSS) secures a +1.9% mAP increment over an ActionFormer-based baseline.
  • Comparative studies show DMBSS outperforming other SSM architectures (Mamba2, Hydra, DBM/CausalTAD) in terms of mAP, specifically through effective handling of temporal context decay and diagonal conflicts.
  • High robustness is maintained for long action segments (Coverage > 8%, Length > 18 s), where alternative models typically degrade (Lu et al., 22 Nov 2025).

DMBSS emerges as a foundational bidirectional SSM block for long-range temporal modeling, facilitating linear-time global context propagation, refined fusion of forward and backward dependencies, and generalization to extended action instances in video analysis.
