
Bidirectional Selective SSM Layers

Updated 26 April 2026
  • Bidirectional selective SSM layers are neural sequence modeling blocks that fuse input-adaptive state transitions with forward and backward processing to capture context from both past and future positions.
  • They maintain linear O(L) complexity while replacing quadratic self-attention, making them effective for diverse applications including vision, speech, and graph analysis.
  • Empirical studies demonstrate significant gains in both accuracy and efficiency on tasks such as 3D pose estimation, point cloud analysis, and speech recognition.

Bidirectional Selective State Space Model (SSM) layers constitute a recent class of neural sequence modeling blocks that achieve efficient, context-rich feature extraction by integrating linear-complexity state space recurrences with bidirectional (forward and reverse) processing, and, crucially, by making state transitions input-selective and/or spatially structured. These layers generalize unidirectional SSMs, such as Mamba, to bidirectional, context-aggregating modules capable of replacing self-attention in domains including vision, language, point clouds, speech, and graphs, while maintaining strict O(L) sequence scaling and achieving state-of-the-art results across tasks. The following sections present a rigorous overview of mathematical definitions, architectural instantiations, bidirectionality mechanisms, complexity profiles, and empirical impacts as seen in recent literature.

1. Mathematical Foundation of Selective SSMs

At their core, selective SSM layers discretize continuous-time linear state-space models:

$$\dot h(t) = A h(t) + B u(t), \qquad y(t) = C h(t) + D u(t)$$

where $h(t) \in \mathbb{C}^N$ is the hidden state, $u(t)$ is the input, and $A, B, C, D$ are learned or parameterized matrices. Discretizing with a step size $\Delta$ yields

$$h_k = \overline{A}\, h_{k-1} + \overline{B}\, u_{k-1}, \qquad y_k = C h_k + D u_k$$

with $\overline{A} = e^{A \Delta}$ and $\overline{B} = \int_0^\Delta e^{A \tau} B \, d\tau$. The defining innovation of Mamba and its successors is selectivity: the SSM parameters ($\Delta_k$, $\overline{B}_k$, $C_k$) become functions of each token embedding, typically realized via learned projections or small neural networks. Thus, each time step can adapt memory updates and outputs to local content (Huang et al., 2024; Masuyama et al., 2024; Qu et al., 11 Nov 2025; Jiang et al., 2024; Behrouz et al., 2024).
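To make the recurrence concrete, here is a minimal NumPy sketch of one selective scan under the zero-order-hold discretization above. The function and parameter names (selective_ssm_scan, W_B, W_C, W_dt) and the restriction to a diagonal A are illustrative assumptions, not any paper's exact implementation; the sketch also uses the common $u_k$ (rather than $u_{k-1}$) indexing, which differs only by an index shift.

```python
import numpy as np

def selective_ssm_scan(u, A, W_B, W_C, W_dt, D):
    """Sequential selective SSM scan (illustrative, diagonal A).
    u: (L, d) inputs; A: (N,) negative reals (diagonal state matrix);
    W_B, W_C: (d, N) projections producing per-token B_k, C_k;
    W_dt: (d,) projection producing a per-token step size Delta_k;
    D: (d,) skip connection."""
    L, d = u.shape
    h = np.zeros((d, A.shape[0]))              # one state vector per channel
    y = np.zeros_like(u)
    for k in range(L):
        B_k = u[k] @ W_B                       # input-dependent B (selectivity)
        C_k = u[k] @ W_C                       # input-dependent C (selectivity)
        dt = np.log1p(np.exp(u[k] @ W_dt))     # softplus -> positive Delta_k
        A_bar = np.exp(dt * A)                 # ZOH: A_bar = exp(A * Delta)
        B_bar = (A_bar - 1.0) / A * B_k        # ZOH integral (diagonal A)
        h = A_bar * h + np.outer(u[k], B_bar)  # h_k = A_bar h_{k-1} + B_bar u_k
        y[k] = h @ C_k + D * u[k]              # y_k = C_k h_k + D u_k
    return y
```

In practice the scan is evaluated with a parallel associative scan or a fused kernel; the explicit loop here only exposes the recurrence.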

2. Achieving Bidirectionality

Bidirectionality is realized by applying the SSM scan in both the canonical and the reverse order over the token sequence. For a sequence $u_{1:L}$:

$$y^{\rightarrow} = \mathrm{SSM}(u_1, \dots, u_L), \qquad y^{\leftarrow} = \mathrm{flip}\big(\mathrm{SSM}(u_L, \dots, u_1)\big)$$

with shared SSM parameters for both passes. Outputs are fused, typically by elementwise addition, concatenation, or residual summation:

$$y_k = y^{\rightarrow}_k + y^{\leftarrow}_k$$

This gives each position access to both past and future context, substantially improving long-range modeling (e.g., for 3D pose, speech, point cloud geometry, graph motifs). In some cases, such as MADEON for ASR, only a prefix segment (e.g., speech tokens) receives bidirectional SSM processing (Masuyama et al., 2024).
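As a generic illustration (not any specific paper's block), bidirectionality needs only the scan sketched in Section 1 applied twice, with the backward output flipped back into canonical order before fusion:

```python
def bidirectional_ssm(u, params):
    """Additive bidirectional fusion. `params` packs the arguments of
    selective_ssm_scan (shared between the two directions, as described)."""
    y_fwd = selective_ssm_scan(u, *params)        # canonical (left-to-right) scan
    y_bwd = selective_ssm_scan(u[::-1], *params)  # scan over the reversed sequence
    return y_fwd + y_bwd[::-1]                    # realign, then fuse elementwise
```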

Specialized fusions arise, e.g., the “chainedMamba” of CloudMamba, where the backward scan is applied to the output of the forward scan (i.e., backward on high-level features), increasing context mixing while preserving causality in each direction (Qu et al., 11 Nov 2025).
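Under the same assumed interface, the chained variant composes the scans instead of running them in parallel; this sketch follows the description above, not CloudMamba's actual code:

```python
def chained_bidirectional_ssm(u, params):
    """Backward scan applied to the forward scan's output features."""
    y_fwd = selective_ssm_scan(u, *params)            # forward pass on raw inputs
    y_bwd = selective_ssm_scan(y_fwd[::-1], *params)  # backward pass on its output
    return y_bwd[::-1]                                # restore canonical order
```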

3. Structured and Selective Gating Mechanisms

Selectivity is achieved by making the SSM's state transitions (and optionally the state update gates) input-dependent. For each token $u_k$, selection masks or parameters such as $\Delta_k$, $\overline{B}_k$, and $C_k$ modulate the recurrence:

$$h_k = \overline{A}_k h_{k-1} + \overline{B}_k u_k$$

as used in, e.g., HSIDMamba (Liu et al., 2024). Analogously, pointwise gates (sigmoid-activated) modulate the contributions of each directional output (forward/backward) or of individual state components, allowing the layer to suppress or amplify information flow depending on spatial, spectral, or semantic context (Jiang et al., 2024; Chen et al., 2024). In bidirectional networks for graphs and speech, gating is often coupled with nonlinearities (e.g., SiLU or softplus) and small MLPs.
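A minimal sketch of such a pointwise gate, here blending the two directional outputs (the gate projection W_g, b_g is a hypothetical name; real designs may instead gate individual state components):

```python
def gated_direction_fusion(u, y_fwd, y_bwd, W_g, b_g):
    """Sigmoid gate in [0, 1], computed from the input, blends the forward
    and backward outputs per position and per channel."""
    g = 1.0 / (1.0 + np.exp(-(u @ W_g + b_g)))   # (L, d) input-dependent gate
    return g * y_fwd + (1.0 - g) * y_bwd
```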

In event-based eye tracking (MambaPupil), the time-varying selection mechanism is integrated into input-adaptive SSM parameter construction, leading to context-dependent gating and significantly improved stability for abrupt or ambiguous events (Wang et al., 2024).

4. Architectural Design: Spatial, Temporal, and Domain-Specific Structure

Modern bidirectional selective SSM blocks often integrate both global and local context, for instance:

  • Global-local splits: In PoseMamba (Huang et al., 2024), bidirectional SSMs model joint tokens in both global (skeleton-wide) and local (limb-centric, geometrically reordered) scan orders per frame. The local scan uses skeleton-driven reordering (e.g., spine → left arm → …) to enforce anatomical adjacency, leading to enhanced limb-wise correlation modeling.
  • Dual-path and multi-axis fusion: CloudMamba creates three serialized sequences, one per axis (X/Y/Z), via sorting, runs a chained bidirectional SSM on each, and merges the features, capturing rich 3D geometry without order confusion, even in unordered point clouds (Qu et al., 11 Nov 2025); see the sketch after this list.
  • Temporal-spatial stacking: Dual-path Mamba for speech separation alternates bidirectional SSMs over intra-chunk (local) and inter-chunk (global) axes, drastically improving separation performance under linear cost (Jiang et al., 2024).
  • Unique domain conditioning: MADEON reverses and processes only speech tokens bidirectionally within the decoder, leaving text prefix strictly causal (Masuyama et al., 2024). In HSIDMamba, multiple scanning directions (including diagonal and corner-to-corner) are used per spectral block, extending bidirectional context to eight orientations in denoising (Liu et al., 2024).
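The multi-axis serialization mentioned for CloudMamba can be sketched as follows. Sorting along each axis and running a chained bidirectional scan per serialization are stated in the text; the scatter-back-and-average merge is an assumption for illustration:

```python
def multi_axis_bidirectional(points, feats, params):
    """points: (L, 3) xyz coordinates; feats: (L, d) per-point features.
    One serialization per axis, chained bidirectional scan on each,
    merged by averaging in the original point order."""
    merged = np.zeros_like(feats)
    for axis in range(3):                        # X, Y, Z orderings
        order = np.argsort(points[:, axis])      # serialize unordered points
        y = chained_bidirectional_ssm(feats[order], params)
        merged += y[np.argsort(order)]           # scatter back to input order
    return merged / 3.0
```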

5. Complexity Analysis

A primary advantage of bidirectional selective SSMs is the preservation of strict $O(L \cdot N)$ time and space complexity per sequence (where $L$ is the sequence length and $N$ the hidden state size), even after introducing bidirectionality. This is in stark contrast to attention-based layers, which scale as $O(L^2)$ in both compute and memory. The extra cost of bidirectionality is a constant factor: the parallel or sequential forward/backward SSM scans. More elaborate designs (e.g., chained bidirectional or multi-path expansions) remain $O(L)$, though with larger constant factors (Huang et al., 2024; Qu et al., 11 Nov 2025; Jiang et al., 2024; Chen et al., 2024).
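A quick back-of-envelope check of the scaling claim, with illustrative sizes (rough multiply-accumulate counts, not measured costs):

```python
L, N, d = 8192, 16, 512                  # sequence length, state size, width
ssm  = 2 * L * N * d                     # forward + backward scans: O(L*N*d)
attn = L * L * d                         # the QK^T term alone: O(L^2 * d)
print(f"bidirectional SSM ~{ssm:.1e} MACs vs. attention ~{attn:.1e} MACs")
# At L = 8192 the attention term is already ~256x larger (L / 2N),
# and the gap grows linearly with L.
```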

6. Empirical Findings and Ablations

Across multiple datasets and tasks, bidirectional selective SSMs deliver substantial accuracy and efficiency improvements.

  • 3D Pose Estimation (PoseMamba): Adding the local branch to the global bidirectional SSM reduces MPJPE by a further 0.6 mm, for a total improvement of ≈1.2 mm over the unidirectional SSM (Huang et al., 2024).
  • Point Cloud Analysis: ChainedMamba (bidirectional chained forward/backward) yields a +0.96% OA gain (93.65% vs. 92.69%) over parallel bidirectional scanning, at no extra complexity (Qu et al., 11 Nov 2025). In PointABM, bidirectional SSM layers increase accuracy by 1–1.6 pp across benchmarks (Chen et al., 2024).
  • Speech and Speech Separation: MADEON achieves a ~0.5% WER improvement from bidirectional context (LibriSpeech 100h), and on larger corpora matches or surpasses Transformer decoders with lower GPU memory usage (Masuyama et al., 2024). Dual-path Mamba outperforms attention- and RNN-based models at a fraction of their cost (Jiang et al., 2024).
  • Event-based Eye Tracking: Combining Bi-GRU and LTV-SSM (bidirectional SSM) reduces error from 2.77 px (ConvLSTM) to 2.35 px, with improved localization probability (Wang et al., 2024).
  • Hyperspectral Denoising: Introduction of bidirectional scanning raises PSNR by 0.9–2.6 dB depending on the configuration (Liu et al., 2024).
  • Graphs: The bidirectional selective SSM encoder is identified as the critical ingredient—ablations removing bidirectionality drop accuracy by 4–5 points on multiple datasets (Behrouz et al., 2024).
Model/Paper | Task/Domain | Empirical Gain of Bidirectional Layer
PoseMamba (Huang et al., 2024) | 3D pose estimation | ≈1.2 mm lower MPJPE (vs. unidirectional)
CloudMamba (Qu et al., 11 Nov 2025) | Point cloud classification | +0.96% OA (chained vs. parallel)
MADEON (Masuyama et al., 2024) | ASR (LibriSpeech) | ≈0.5% lower WER (vs. unidirectional)
PointABM (Chen et al., 2024) | Point cloud analysis | +1–1.6 pp accuracy (vs. unidirectional)
MambaPupil (Wang et al., 2024) | Event eye tracking | 2.35 px error vs. 2.77 px (prior best)
HSIDMamba (Liu et al., 2024) | Hyperspectral denoising | +0.9–2.6 dB PSNR
Graph Mamba (Behrouz et al., 2024) | Graph node classification | −4–5 accuracy pts (if bidirectionality removed)

7. Domain-Specific Adaptations and Extensions

  • Spatial and Anatomical Ordering: Reordering input tokens to follow structural priors (skeleton chains, geometric sorted axes, graph motifs) further enhances local context extraction. This is critical for domains with underlying spatial, topological, or anatomical structure (Huang et al., 2024, Qu et al., 11 Nov 2025, Behrouz et al., 2024).
  • Grouped Parameterization: Grouped selective SSMs (GS6) tie parameters across axes/groups to reduce overfitting (notably in point clouds) (Qu et al., 11 Nov 2025).
  • Selective Domain Application: Speech-specific adaptations (e.g., applying bidirectionality only to audio tokens) avoid causality violations in autoregressive decoding (Masuyama et al., 2024); a sketch follows this list.
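A crude sketch of prefix-restricted bidirectionality in the spirit of the MADEON description, reusing the earlier sketches; the n_prefix boundary and the absence of state handoff across it are simplifying assumptions:

```python
def prefix_bidirectional(u, n_prefix, params):
    """Bidirectional scan over the first n_prefix tokens (e.g., speech),
    strictly causal forward scan over the remainder (e.g., text)."""
    y_pre = bidirectional_ssm(u[:n_prefix], params)    # non-causal prefix context
    y_suf = selective_ssm_scan(u[n_prefix:], *params)  # causal autoregressive part
    return np.concatenate([y_pre, y_suf], axis=0)
```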
