LBMamba: Efficient Vision SSM Architectures
- LBMamba is a suite of SSM-based architectures that replace quadratic self-attention with linear-time scans to enhance throughput and reduce memory usage.
- It includes innovations like locally bi-directional scans and low-rank compressions, as seen in variants like LBVim, LightM-UNet, and MambaLiteSR.
- These architectures achieve state-of-the-art results in vision, medical imaging, and neural operator learning while enabling efficient real-time edge deployment.
LBMamba refers to multiple independent developments that leverage the Mamba State Space Model (SSM) paradigm to achieve high-throughput, memory-efficient sequence modeling in vision, imaging, and scientific operator learning. Across these advances, LBMamba architectures consistently exploit SSMs’ linear complexity to replace or augment self-attention, with architectural innovations targeting efficiency, bidirectional context integration, or lightweight deployment. The LBMamba term encompasses: (1) Locally Bi-directional Mamba for efficient bi-contextual sequence modeling in vision backbones (Zhang et al., 19 Jun 2025); (2) LightM-UNet, a “pure-Mamba” SSM-based segmentation network for lightweight medical imaging (Liao et al., 2024); (3) MambaLiteSR, a low-rank, knowledge-distilled Vision-Mamba super-resolution (SR) network for edge devices (Aalishah et al., 19 Feb 2025); and related contributions in SR, depth estimation, and neural-operator learning. This article systematically reviews core algorithmic concepts, formal SSM principles, representative architectures, quantitative efficiency/accuracy results, and the implications of LBMamba variants for real-world deployment.
1. State Space Models and the Mamba Principle
At the core of LBMamba variants lies the Mamba SSM, an architecture that replaces traditional quadratic-complexity self-attention with a linear-time selective scan (Zhang et al., 19 Jun 2025). The general SSM is defined by the recurrence $h'(t) = A\,h(t) + B\,x(t)$, $y(t) = C\,h(t)$, or, in discretized Mamba form, $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$, $y_t = C_t h_t$, with selective input-dependent adaptation of $\Delta_t$, $B_t$, $C_t$. Vision Mamba adapts this mechanism for high-dimensional visual data by flattening spatial dimensions and applying the SSM across sequence-like structures derived from images, patches, or even higher-dimensional tensor spaces. The architectural reduction from attention to SSM scan reduces time and space complexity from $O(L^2)$ to $O(L)$ for $L$ sequence elements, making large-scale sequence modeling tractable even on hardware with modest memory.
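The discretized recurrence above runs as a single linear-time pass over the sequence. A minimal NumPy sketch of such a selective scan (names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def selective_scan(x, A_bar, B_bar, C):
    """Linear-time SSM recurrence: h_t = A_bar_t * h_{t-1} + B_bar_t * x_t,
    y_t = C_t . h_t.  One pass over the sequence -> O(L) time and memory.
    Shapes here are simplified: x (L,), A_bar (L,), B_bar and C (L, d)."""
    L, d = B_bar.shape
    h = np.zeros(d)
    ys = np.empty(L)
    for t in range(L):
        h = A_bar[t] * h + B_bar[t] * x[t]   # input-dependent state update
        ys[t] = C[t] @ h                      # readout
    return ys
```

Because each step touches only the current token and a fixed-size hidden state, memory stays constant in the sequence length, which is the property all LBMamba variants build on.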
2. Locally Bi-directional Mamba (LBMamba) for Vision Backbones
The original Mamba SSM is unidirectional: each state only aggregates information from its past, leaving future context inaccessible. Early bidirectional adaptations—such as running a second global reverse scan—restore full receptive fields but double compute and memory. LBMamba introduces a locally bi-directional mechanism that resolves this inefficiency by performing a lightweight local backward scan within each thread’s register-resident memory (Zhang et al., 19 Jun 2025). This design partitions the sequence into blocks of size $P$, with each thread executing both the parallel forward scan and a register-resident local backward scan $h^{\leftarrow}_i = \bar{A}_i h^{\leftarrow}_{i+1} + \bar{B}_i x_i$, computed entirely within each length-$P$ block with the backward state reset at block boundaries. This eliminates off-chip memory transfers for the reverse scan and enables substantial gains in both throughput and memory, confirmed by kernel-level microbenchmarks showing only a 2% runtime overhead for a 27% increase in arithmetic instructions. Empirically, eliminating the global reverse pass enables up to an 83% throughput boost and a 20% reduction in GPU memory in tiny-model ImageNet classification (Zhang et al., 19 Jun 2025).
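The split between a global forward scan and a block-local backward scan can be illustrated with a scalar toy model. This is a conceptual sketch only (a single decay coefficient `a` stands in for the full input-dependent parameters, and the fusion of the two directions is a plain sum):

```python
import numpy as np

def local_bidirectional_scan(x, a, block=4):
    """Toy locally bi-directional scan: a global forward pass
    h_t = a*h_{t-1} + x_t, plus a backward pass restricted to blocks of
    size `block` (state reset at every block boundary).  Because the
    backward pass never crosses a block edge, the real kernel can keep
    it in registers instead of re-reading the sequence from off-chip
    memory."""
    L = len(x)
    fwd = np.empty(L)
    bwd = np.empty(L)
    h = 0.0
    for t in range(L):                       # global forward scan
        h = a * h + x[t]
        fwd[t] = h
    for start in range(0, L, block):         # block-local backward scan
        h = 0.0
        for t in range(min(start + block, L) - 1, start - 1, -1):
            h = a * h + x[t]
            bwd[t] = h
    return fwd + bwd                         # simple fusion of both directions
```

Each token thus sees its full past plus a bounded window of its future, at roughly the cost of a single unidirectional scan.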
The LBVim vision backbone, constructed by stacking LBMamba blocks with alternating scan directions (achieved via sequence reversal every layer), attains a global receptive field across all tokens over consecutive blocks. Evaluations on ImageNet, ADE20K, COCO, and whole-slide pathological imaging confirm that LBVim achieves higher or comparable accuracy (e.g., +1.6 pp top-1 accuracy, +2.7 pp mIoU) at equivalent or higher throughput compared to baseline Vim/Mamba models. In pathology MIL (multiple instance learning) pipelines, integration of LBMamba yields improvements of up to 3.06% AUC, 3.39% F1, and 1.67% accuracy (Zhang et al., 19 Jun 2025).
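The alternating-direction stacking trick is simple to sketch: reversing the token order between layers makes the limited backward context of one layer cover the opposite direction in the next. The layer callables below are hypothetical stand-ins for LBMamba blocks:

```python
def stack_with_alternating_directions(x, layers):
    """Sketch of LBVim-style stacking: every layer is followed by a
    sequence reversal, so a scan whose backward reach is only local in
    one layer sweeps the opposite global direction in the next; two
    consecutive layers give every token a global receptive field."""
    for layer in layers:
        x = layer(x)
        x = x[::-1]          # reverse token order between layers
    return x
```

The reversal is free at the tensor level (a view or index flip), so the bidirectional receptive field costs no extra scan passes.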
3. Pure-Mamba Lightweight Segmentation and Super-Resolution
In medical image segmentation, LightM-UNet (also denoted LBMamba) replaces all CNN/Transformer blocks with SSM-based Residual Vision Mamba (RVM) Layers, yielding a “pure Mamba” architecture (Liao et al., 2024). The RVM Layer applies a VSSM to each flattened spatial position, augmented by residual and normalization pathways to form a normalized skip $X_{\text{out}} = \mathrm{VSSM}(\mathrm{LN}(X)) + s \cdot X$, where $s$ is a learnable scale factor. Experimental comparisons demonstrate drastic gains in compactness: on the LiTS 3D dataset, LightM-UNet achieves 47.4× fewer parameters and 15.8× less computation than nnU-Net, with a +3.35% improvement in average mIoU. On 2D Montgomery–Shenzhen, it achieves accuracy parity with 116× fewer parameters. Ablations confirm that SSM-based modules outperform convolution and self-attention in this lightweight regime, while the skip terms yield additional nontrivial accuracy gains without increasing FLOPs.
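The normalize-mix-skip pattern of the RVM Layer can be sketched as follows; `vssm` and the learnable `scale` are stand-ins, and the layer's projection/convolution details are omitted:

```python
import numpy as np

def rvm_layer(x, vssm, scale=1.0):
    """Sketch of a Residual Vision Mamba (RVM) layer: layer-normalize
    the flattened tokens, apply the VSSM mixer, and add a scaled
    residual skip.  `vssm` is any callable token mixer; `scale` models
    the learnable residual weight."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True) + 1e-6
    normed = (x - mu) / sigma               # layer norm (affine params omitted)
    return vssm(normed) + scale * x         # scaled residual skip
```

The skip term costs no FLOPs beyond the addition, consistent with the ablation result that it improves accuracy "for free" at this model scale.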
For edge super-resolution, MambaLiteSR (LBMamba) introduces low-rank factorization in Vision-Mamba state-space modules, with weight matrices $W \in \mathbb{R}^{m \times n}$ compressed to rank-$r$ factors $W \approx U V^{\top}$ ($U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$). Knowledge distillation is used to train a lightweight student model under guided supervision from a larger teacher, balancing teacher outputs and ground truth via a weighted L1 objective. Shrinking the embedding dimension and the low-rank parameter $r$ leads to negligible degradation in PSNR (e.g., 28.88 vs 28.81 dB across compression settings). On NVIDIA Jetson Orin Nano, MambaLiteSR achieves 71 FPS at under 3 W, with up to 58% lower power draw compared to CNN-based edge SR, and competitive PSNR/SSIM (Aalishah et al., 19 Feb 2025).
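Both ingredients are compact enough to sketch: truncated SVD gives the rank-$r$ factors, and the distillation objective mixes teacher and ground-truth L1 terms. The `alpha` weighting below is a hypothetical stand-in for the paper's balancing coefficient:

```python
import numpy as np

def low_rank_factors(W, r):
    """Compress a weight matrix to rank r via truncated SVD:
    W ~= U @ V with U (m, r) and V (r, n).  Parameter count drops
    from m*n to r*(m + n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]

def distill_l1(student, teacher, target, alpha=0.5):
    """Weighted L1 distillation objective: alpha weights the teacher
    term against the ground-truth term (alpha is illustrative)."""
    return (alpha * np.abs(student - teacher).mean()
            + (1 - alpha) * np.abs(student - target).mean())
```

For a layer with $m = n = 256$ and $r = 8$, the factorization replaces 65,536 weights with 4,096, which is where the sub-megabyte student models come from.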
4. LBMamba Variants and Extensions in Depth and Operator Learning
LBMamba-informed architectural patterns appear in additional modalities:
- Light Field Super-Resolution: Mamba-based Light Field SR (MLFSR) applies bidirectional Mamba scans within well-chosen spatial, angular, and EPI (epipolar plane) subspaces, efficiently recovering spatial-angular correlations without resorting to quadratic attention (Gao et al., 2024). MLFSR achieves Transformer-comparable PSNR/SSIM with 2–4× faster inference, low memory use, and end-to-end processing of high-res 4D LFs.
- Monocular Depth Estimation: LMDepth integrates a pyramid spatial pooling module for multi-scale global context and a linear-complexity Mamba block in the decoder (Long et al., 2 May 2025). On NYUDv2 and KITTI, LMDepth attains state-of-the-art accuracy with just 2.9 M parameters and <1.1 GFLOPs, yielding real-time inference (>120 FPS) even under INT8 quantization.
- Neural Operator Learning: The Latent Mamba Operator (LaMO) frames multidimensional mappings (e.g., PDE solution operators) in a latent space, applying (multi)directional SSM scans over compact latent-token representations for linear scalability, avoiding over-smoothing, and achieving a 32.3% error reduction relative to transformers and other neural operators across PDE benchmarks (Tiwari et al., 25 May 2025).
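The latent-operator pattern shared by these variants reduces to encode, scan, decode; `encode`, `scan`, and `decode` below are hypothetical stand-ins for the learned components:

```python
import numpy as np

def latent_operator(u, encode, scan, decode):
    """Sketch of the latent-operator pattern: project a discretized
    input field onto a small set of latent tokens, run a linear-time
    SSM scan there, and decode back to the solution grid.  With
    n_tokens << grid size, the scan cost is negligible compared to
    attention over the full grid."""
    z = encode(u)        # latent tokens, far fewer than grid points
    z = scan(z)          # O(n_tokens) sequence mixing
    return decode(z)     # back to the output field
```

Because the scan operates on the latent tokens rather than the raw discretization, resolution can grow without the mixer's cost growing with it.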
5. Quantitative Benchmarks and Empirical Trade-offs
Prototypical performance metrics for LBMamba-based systems are detailed below.
| Model | Params | GFLOPs/fwd | Accuracy (main) | Throughput | Notable Trade-off |
|---|---|---|---|---|---|
| LBVim-Ti (Zhang et al., 19 Jun 2025) | 6M | 1.4 | 73.7% top-1 (ImageNet) | 1621 img/s | +82% throughput vs. Vim-Ti |
| LightM-UNet (Liao et al., 2024) | 1.87M (3D), 1.09M (2D) | 457.6, 267.2 | 77.48% mIoU (LiTS) | real-time edge | >100× smaller than nnU-Net |
| MambaLiteSR (Aalishah et al., 19 Feb 2025) | 315k | 4.5 | 28.28 dB (Set5) | 71 FPS | 58% less power vs. eSR |
| LMDepth (Long et al., 2 May 2025) | 2.9M | 1.08 | 0.908 δ₁ (KITTI) | 120+ FPS | Real-time INT8 edge deploy |
| MLFSR (Gao et al., 2024) | 1.36M | — | 35.22 dB (EPFL, 2× SR) | 27.8 ms | 2–4× faster than Transformer |
This spectrum of results demonstrates that LBMamba architectures reach comparable or better accuracy than attention-centric or convolutional counterparts, while deploying in real time on edge hardware and maintaining energy efficiency. Additionally, memory overhead typically remains in the 1–4 GB range for neural operators and vision models at practical batch/image sizes.
6. Design Considerations, Ablations, and Limitations
LBMamba’s key efficiency arises from careful exploitation of SSMs’ linear complexity and data locality. Locally bi-directional kernels are implemented register-resident to avoid off-chip communication. Layer normalization, channel-mixing (MLP) heads, and skip connections are critical for training stability and for maintaining representational power under extreme model compression. Ablations across LightM-UNet and LBVim confirm that removing the locally bi-directional mechanics or the sequence-reversing step in stacked architectures causes significant drops in accuracy or receptive field, while saving only minimal compute. In super-resolution/regression tasks, even aggressive compression (e.g., via very low-rank factors) produces only a marginal (~0.07 dB) reduction in PSNR, indicating robustness to aggressive parameter pruning. Current limitations include capturing global interactions at depth or at very long range, and partial loss of fine token-pair detail due to SSM hidden-state compression (notably in MLFSR); Transformer-to-Mamba distillation partially mitigates these issues.
7. Implications and Future Directions
LBMamba advances have established SSM-based architectures as practical and scalable replacements for attention-based models in diverse imaging pipelines, providing a consistent means for high-throughput, low-power, and small-footprint deployment. Future work is identified in adaptive or learned local windowing for bidirectional SSMs, cross-modal fusions (e.g., with video/text), and extension to ultra-long sequence domains (audio, genomics). Additionally, applying kernel-integral operator frameworks and multidirectional SSM scans presents opportunities for further efficiency gains in operator learning and scientific ML. Across applications, LBMamba and its variants demonstrate new possibilities for deploying advanced sequence modeling on edge and embedded platforms without sacrificing state-of-the-art performance.
Key references: (Zhang et al., 19 Jun 2025, Liao et al., 2024, Aalishah et al., 19 Feb 2025, Gao et al., 2024, Long et al., 2 May 2025, Tiwari et al., 25 May 2025).