ExtBiMamba: Flexible Bidirectional SSMs
- ExtBiMamba is a family of advanced bidirectional state-space models that integrates forward and backward passes with cross-task and multi-scale interactions.
- It employs continuous-time SSM kernels for linear-time sequence modeling and adapts to diverse tasks such as language modeling, speaker diarization, and dense vision prediction.
- The design features flexible, hardware-friendly quantization supporting 1-bit to multi-bit precision, significantly reducing energy consumption and boosting inference throughput.
ExtBiMamba is a family of advanced, flexible state space models that generalize and extend the Bi-Mamba and Mamba architectures by enabling bidirectional, multi-task, and multi-bit extensions. The core design leverages continuous-time state-space model (SSM) kernels for linear-time sequence modeling and augments them with bidirectional information flow, cross-task/multi-scale interactions, and flexible quantization. These properties enable ExtBiMamba to scale efficiently to long sequences or high-dimensional inputs, with direct applications across language modeling, speaker diarization, and multi-task dense prediction in vision and speech domains (Tang et al., 2024, Liao et al., 27 Jan 2026, Cao et al., 28 Aug 2025).
1. Foundational Principles and Core Architecture
ExtBiMamba originates from the selective SSM block of Mamba. For a sequence of inputs $x_1, \dots, x_L$, the SSM block computes hidden states $h_t$ and outputs $y_t$ via:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t + D x_t,$$
where $\bar{A}$ and $\bar{B}$ are discretized SSM parameters, and $C$, $D$ are learned output-mixing matrices. This structure allows global convolutional context with only linear time and memory cost in sequence length, scaling as $O(L)$.
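The linear-time claim can be illustrated with a minimal sketch of the recurrence, assuming a diagonal state matrix, scalar inputs, and fixed parameters (in Mamba's selective block some parameters depend on the input; that is omitted here). The function name `ssm_scan` and its argument names are illustrative, not taken from any ExtBiMamba code.

```python
from typing import List


def ssm_scan(xs: List[float], a_bar: List[float], b_bar: List[float],
             c: List[float], d: float) -> List[float]:
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and y_t = c . h_t + d * x_t.

    Assumption: diagonal state matrix (a_bar holds its diagonal) and
    scalar inputs, so each state dimension updates independently.
    """
    h = [0.0] * len(a_bar)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:  # one pass over the sequence: time grows linearly with len(xs)
        h = [a * h_i + b * x for a, b, h_i in zip(a_bar, b_bar, h)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)) + d * x)
    return ys


# An impulse at t=0 decays geometrically through a one-dimensional state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a_bar=[0.5], b_bar=[1.0], c=[1.0], d=0.0)
```

The loop touches each input once and keeps only the current state, which is where the linear time and constant per-step state memory come from.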
To extend unidirectional SSMs, ExtBiMamba incorporates a second, backward SSM pass, mirroring the recurrence from sequence end-to-beginning, with its own independent parameters. Both forward and backward outputs are fused at each timestep using a learnable gating mechanism. This approach confers the ability to capture both past and future context, addressing the unidirectionality constraints of original Mamba and yielding richer sequence representations.
For multi-task settings, ExtBiMamba adapts its block structure to process per-task input streams and fuse their representations via advanced bidirectional scanning and cross-task feature refinement. In quantized variants, it also supports a spectrum of precision regimes, from aggressive 1-bit to multi-bit hybrid compositions (Tang et al., 2024, Cao et al., 28 Aug 2025).
2. Bidirectional and Cross-Task Recurrence Mechanisms
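The bidirectional extension is mathematically formulated as follows for each timestep $t$:
- Forward pass (past $\rightarrow$ present):
$$h_t^{f} = \bar{A}^{f} h_{t-1}^{f} + \bar{B}^{f} x_t, \qquad y_t^{f} = C^{f} h_t^{f} + D^{f} x_t$$
- Backward pass (future $\rightarrow$ present):
$$h_t^{b} = \bar{A}^{b} h_{t+1}^{b} + \bar{B}^{b} x_t, \qquad y_t^{b} = C^{b} h_t^{b} + D^{b} x_t$$
- Fusion using a sigmoid gate:
$$g_t = \sigma\!\left(W_g \, [\, y_t^{f} ; y_t^{b} \,] + b_g\right), \qquad y_t = g_t \odot y_t^{f} + (1 - g_t) \odot y_t^{b}$$
where $\bar{A}^{f}, \bar{B}^{f}, C^{f}, D^{f}$ and $\bar{A}^{b}, \bar{B}^{b}, C^{b}, D^{b}$ are trainable SSM parameters for each direction, and $W_g, b_g$ parameterize the fusion MLP.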
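A minimal sketch of the two passes and the gated fusion follows, under several simplifying assumptions: scalar states and features, fixed (non-selective) parameters, and a single linear layer followed by a sigmoid standing in for the fusion MLP. The names `scan` and `bidirectional_ssm` are hypothetical.

```python
import math
from typing import List, Tuple


def scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar recurrence h_t = a * h_{t-1} + b * x_t with output y_t = c * h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_ssm(xs: List[float],
                      fwd: Tuple[float, float, float],
                      bwd: Tuple[float, float, float],
                      gate: Tuple[float, float, float]) -> List[float]:
    """Fuse a forward and a backward scan with a per-timestep sigmoid gate.

    fwd and bwd are independent (a, b, c) parameter triples; gate is
    (w_f, w_b, bias), an assumed one-layer stand-in for the fusion MLP.
    """
    y_f = scan(xs, *fwd)
    # Backward pass: scan the reversed sequence, then restore time order.
    y_b = scan(xs[::-1], *bwd)[::-1]
    w_f, w_b, bias = gate
    fused = []
    for f, b in zip(y_f, y_b):
        g = 1.0 / (1.0 + math.exp(-(w_f * f + w_b * b + bias)))
        fused.append(g * f + (1.0 - g) * b)  # convex mix of both directions
    return fused


# With a zero gate the sigmoid outputs 0.5, so the result is a plain average.
fused = bidirectional_ssm([1.0, 0.0, 0.0], (0.5, 1.0, 1.0), (0.5, 1.0, 1.0), (0.0, 0.0, 0.0))
```

Running the backward pass as a forward scan over the reversed input keeps a single scan routine, at the cost of two reversals.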
For multi-task applications, bidirectional interaction is generalized. In BIM (an ExtBiMamba implementation), task-specific features undergo both task-first and position-first bidirectional BI-Scan passes. Each pass is modeled as a sequence processed by SSM blocks, coordinating dependencies across tasks or spatial positions efficiently at $O(TN)$ cost, where $T$ is the number of tasks and $N$ is the number of locations (Cao et al., 28 Aug 2025).
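The two serializations can be sketched as follows. The interpretation is an assumption: here "task-first" means the task index varies fastest (all tasks at one position are visited before moving to the next position), and "position-first" means the position index varies fastest. The function names are hypothetical, and the SSM blocks that would consume these sequences are left out.

```python
from typing import Dict, List, Sequence, TypeVar

X = TypeVar("X")


def task_first(features: Sequence[Sequence[X]]) -> List[X]:
    """Serialize features[task][position] with the task index varying fastest."""
    num_tasks, num_positions = len(features), len(features[0])
    return [features[t][n] for n in range(num_positions) for t in range(num_tasks)]


def position_first(features: Sequence[Sequence[X]]) -> List[X]:
    """Serialize features[task][position] with the position index varying fastest."""
    return [x for task_row in features for x in task_row]


def bi_scan_orders(features: Sequence[Sequence[X]]) -> Dict[str, List[X]]:
    """Return both serializations in both directions (four sequences of length T*N)."""
    tf, pf = task_first(features), position_first(features)
    return {
        "task_first_fwd": tf,
        "task_first_bwd": tf[::-1],
        "position_first_fwd": pf,
        "position_first_bwd": pf[::-1],
    }


# Two tasks (segmentation, depth) over three positions.
orders = bi_scan_orders([["s0", "s1", "s2"], ["d0", "d1", "d2"]])
```

Every sequence has length equal to the number of tasks times the number of positions, so a linear-time scan over each one costs time proportional to that product.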
3. Flexible Quantization and Hybrid Precision
A central design axis in ExtBiMamba is its support for "flexible-bit" quantization, motivated by the Bi-Mamba 1-bit SSM framework. All linear layers are binarized as:
$$W_b = \alpha \odot \operatorname{sign}(W) + \beta,$$
where the learnable scale $\alpha$ and shift $\beta$ vectors preserve magnitude information. A straight-through estimator gradient is used to facilitate backpropagation through the sign operation.
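The forward side of such a binarization can be sketched as below, assuming one scale and one shift value per output column and the convention that zero maps to +1; both choices are assumptions made for illustration. The backward pass is not shown: a straight-through estimator would simply pass the incoming gradient through the sign operation unchanged, which requires an autograd framework to demonstrate.

```python
from typing import List


def sign(value: float) -> float:
    """Sign with the assumed convention that zero maps to +1."""
    return 1.0 if value >= 0.0 else -1.0


def binarize(weights: List[List[float]], scale: List[float],
             shift: List[float]) -> List[List[float]]:
    """Return scale[j] * sign(weights[i][j]) + shift[j] for every entry.

    Only the sign of each weight is kept (1 bit); the per-column scale and
    shift vectors carry the magnitude information in full precision.
    """
    return [[scale[j] * sign(w) + shift[j] for j, w in enumerate(row)]
            for row in weights]


full_precision = [[0.3, -1.2], [-0.7, 0.4]]
binarized = binarize(full_precision, scale=[0.5, 2.0], shift=[0.0, 0.25])
```

Each column of the result takes at most two distinct values, which is what makes the 1-bit storage of the sign matrix sufficient.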
ExtBiMamba extends this further via:
- Multi-bit hybridization: The majority of weights remain 1-bit, while select blocks use 2–3 bits to boost dynamic range where sensitivity is highest, e.g., in output projections or gating modules. This configuration balances model expressivity with compute and memory efficiency.
- Layerwise bit allocation: Early layers, responsible for extracting diverse features, may use higher precision; deeper layers revert to 1-bit, leveraging results demonstrating only modest perplexity increases under such partial binarization.
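The memory effect of a layerwise allocation can be estimated with simple arithmetic. The sketch below assumes a hypothetical model with equally sized layers and counts only weight bits, ignoring the full-precision scale and shift vectors; the layer sizes and the helper names are illustrative and not taken from the cited work.

```python
from typing import List


def allocate_bits(num_layers: int, high_precision_layers: int,
                  high_bits: int = 2, low_bits: int = 1) -> List[int]:
    """Give the first few layers high_bits and every later layer low_bits."""
    return [high_bits if i < high_precision_layers else low_bits
            for i in range(num_layers)]


def weight_memory_bytes(params_per_layer: List[int], bits_per_layer: List[int]) -> float:
    """Total weight storage in bytes for a per-layer bit allocation."""
    total_bits = sum(p * b for p, b in zip(params_per_layer, bits_per_layer))
    return total_bits / 8


# Hypothetical 4-layer model with 1000 weights per layer.
params = [1000, 1000, 1000, 1000]
hybrid_bits = allocate_bits(num_layers=4, high_precision_layers=1, high_bits=2)
hybrid_bytes = weight_memory_bytes(params, hybrid_bits)   # first layer at 2 bits
fp16_bytes = weight_memory_bytes(params, [16] * 4)        # 16-bit reference
```

In this toy setting, raising one of four layers from 1 bit to 2 bits increases weight storage by a quarter relative to the all-1-bit model, while staying far below the 16-bit footprint.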
Energy, memory, and throughput are substantially improved with this quantization: 1-bit matrix multiplies consume markedly less energy than 16-bit floating-point ones, and inference throughput on long contexts improves over Transformer baselines (Tang et al., 2024).
4. Advanced Scanning and Multi-Scale Interaction Modules
In multi-task vision contexts, ExtBiMamba introduces two scan-based modules:
- BI-Scan (Bidirectional Interaction Scan) alternates between task-first and position-first serializations across all branches, applying SSMs in both orderings and directions. The process includes:
- Serializing features by predetermined patterns
- Running Mamba-based SSMs linearly over concatenated sequences (length $T \cdot N$ for $T$ tasks and $N$ positions)
- Aggregating forward and backward outputs
- Fusing via gating masks for integration into each task-specific branch
- MS-Scan (Multi-Scale Scan) extracts features at several spatial granularities. Input channels are split into groups; each group is windowed at a different patch size and processed by four-way 2D SSM scans, and the results are then recombined, supporting adaptive multi-scale context integration.
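The windowing and four-way scanning in MS-Scan can be sketched on a toy grid. The sketch assumes non-overlapping square windows whose size divides the grid dimensions, and it assumes the four directions are row-major, reversed row-major, column-major, and reversed column-major; the source does not spell these out. The channel split, the SSM itself, and the recombination step are omitted, and the function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

X = TypeVar("X")


def windows(grid: Sequence[Sequence[X]], size: int) -> List[List[List[X]]]:
    """Split an H x W grid into non-overlapping size x size windows.

    Assumption: H and W are both divisible by size.
    """
    height, width = len(grid), len(grid[0])
    return [[list(row[j:j + size]) for row in grid[i:i + size]]
            for i in range(0, height, size)
            for j in range(0, width, size)]


def four_way_orders(window: Sequence[Sequence[X]]) -> List[List[X]]:
    """Return the four assumed scan orders for one window."""
    row_major = [x for row in window for x in row]
    col_major = [window[i][j]
                 for j in range(len(window[0]))
                 for i in range(len(window))]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


# A 4 x 4 grid of position indices, scanned at two window sizes.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
fine = [four_way_orders(w) for w in windows(grid, 2)]    # four 2 x 2 windows
coarse = [four_way_orders(w) for w in windows(grid, 4)]  # one 4 x 4 window
```

Small windows keep each scan local, while the largest window covers the whole grid, which is one way to read the "several spatial granularities" described above.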
Integration of BI-Scan and MS-Scan within the Mamba Feature Refinement (MFR) block delivers both intra-task scene structure modeling and fine-grained, scalable cross-task interactions. This design yields only linear complexity in $T \cdot N$ (the number of tasks $T$ times the number of positions $N$), compared to quadratic costs in naive cross-attention.
5. Empirical Performance and Applications
ExtBiMamba models have demonstrated consistently strong empirical results:
- Language modeling: Bi-Mamba at 780M scale is evaluated on perplexity against its FP16 counterpart and vastly outperforms traditional post-training binarization baselines (e.g., GPTQ-3bit and Bi-LLM) (Tang et al., 2024).
- Speaker diarization: In ConBiMamba, ExtBiMamba achieves state-of-the-art performance on datasets such as AISHELL-4 (DER 9.8% vs. 10.5% for Mamba-diarization) and VoxConverse (8.6% vs. 9.3%) (Liao et al., 27 Jan 2026).
- Multi-task dense prediction (vision): On NYUD-V2, BIM improves over the baseline in semantic segmentation (mIoU), depth estimation (RMSE), and boundary detection (odsF) (Cao et al., 28 Aug 2025). On PASCAL-Context, comparable improvements accrue across all primary and auxiliary tasks.
The low memory and compute cost, combined with flexible quantization, make ExtBiMamba suitable for both resource-constrained edge inference and efficient large-scale training.
6. Computational Efficiency and Hardware Considerations
ExtBiMamba's critical efficiency properties derive from:
- Linear time and memory scaling: Both in sequence length (SSMs) and number of per-task branches (BI-Scan), in contrast to the $O(L^2)$ scaling in sequence length $L$ for Transformers and the $O(T^2)$ scaling in the number of tasks $T$ for pairwise attention.
- Dense bit-packing: 1–3 bit weights enable dense storage in on-chip SRAM and fast bit-serial processing; recurrence updates in SSMs can be fused directly with binarized GEMMs.
- Specialized hardware mapping: The architecture is explicitly amenable to bit-serial ASIC acceleration, where bitwise-XOR and popcount operations replace high-precision multiplies, and scale-and-shift (via the learned scale and shift vectors $\alpha$ and $\beta$) is applied post hoc by lightweight ALUs.
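The XOR-and-popcount replacement for multiplies can be sketched in software. The example assumes both operands are vectors of +1/-1 values packed into integers (bit set means +1), which is a simplification: it says nothing about how activations are actually represented in the cited models. For two such vectors of length n, the dot product equals n minus twice the number of differing bits. The function names are hypothetical.

```python
from typing import List


def pack(signs: List[int]) -> int:
    """Pack +1/-1 values into an int; bit i is set when signs[i] is +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def popcount(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two packed +1/-1 vectors: n - 2 * (differing bits)."""
    return n - 2 * popcount(a_bits ^ b_bits)


def scaled_dot(x_bits: int, w_bits: int, n: int, scale: float, shift: float) -> float:
    """Dot product of x with (scale * sign_weights + shift), applied post hoc.

    The shift contributes shift * sum(x), and sum(x) = 2 * popcount(x) - n.
    """
    return scale * binary_dot(x_bits, w_bits, n) + shift * (2 * popcount(x_bits) - n)


x = [1, -1, 1, 1]
w = [1, 1, -1, 1]
result = scaled_dot(pack(x), pack(w), n=4, scale=0.5, shift=0.25)
reference = sum(xi * (0.5 * wi + 0.25) for xi, wi in zip(x, w))
```

The integer part of the computation uses only XOR and bit counting; the two floating-point multiplies by `scale` and `shift` happen once per output, which matches the idea of a lightweight post hoc scale-and-shift stage.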
A consequential implication is that inference remains fast and scalable even for very long contexts or high-resolution spatial grids, unlocking real-time or low-latency applications.
7. Limitations and Prospective Extensions
Current ExtBiMamba models rely on predetermined scanning patterns and fixed backward passes, which may limit the capture of arbitrary spatial or structural couplings. Forward and backward SSM computation doubles certain resource costs; future work may explore fused or adaptive bidirectional kernels. The increased hyperparameter space in multi-bit and multi-scan settings introduces additional tuning challenges.
Proposed extensions include:
- Learnable scan patterns: Reducing manual architectural choices by data-adaptive serialization (Cao et al., 28 Aug 2025).
- Mixture-of-experts or attention-based task selection: Dynamic compute allocation per task or spatial region.
- Deeper hardware–algorithm co-design: To further optimize bit-serial state updates and maximize inference throughput on emerging ASICs.
A plausible implication is continued accuracy gains with modest increases in energy and memory by incrementally reintroducing higher-bit modules along critical paths, as supported by empirical trade-offs observed in partially quantized models (Tang et al., 2024).
Summary Table of ExtBiMamba Features
| Aspect | Core Mechanism | Complexity |
|---|---|---|
| Sequence modeling | Bidirectional SSM + gating | $O(L)$ in sequence length $L$ |
| Cross-task fusion | BI-Scan (task/position-first, bidir) | $O(TN)$ for $T$ tasks and $N$ positions |
| Multi-scale context | MS-Scan (windowed SSM at various scales) | $O(N)$ in positions $N$ |
| Quantization | Fully 1-bit; hybrid 2–3 bit in key blocks | 1–3 (bits) |
| Hardware suitability | Bit-serial compute, SRAM packing | N/A |
ExtBiMamba thus presents a versatile, computation- and memory-efficient sequence model, capable of state-of-the-art performance in tasks spanning language, speech, and vision, underpinned by holistic modeling of context, bit-flexibility, and hardware awareness (Tang et al., 2024, Liao et al., 27 Jan 2026, Cao et al., 28 Aug 2025).