Mamba YOLO: SSM-Enhanced Object Detection
- Mamba YOLO is a family of object detection architectures that integrate structured state space models into the YOLO pipeline, enhancing global context and real-time performance.
- It leverages Mamba modules in the backbone, neck, and detection head to achieve efficient O(N) scaling with improved accuracy across varied applications.
- Empirical evaluations show superior trade-offs in accuracy and speed on benchmarks such as COCO, UAV imagery, medical histopathology, and underwater detection.
Mamba YOLO denotes a family of object detection architectures that integrate structured State Space Models (SSMs)—specifically, Mamba modules or Vision Mamba blocks—into the YOLO ("You Only Look Once") detection pipeline. This hybrid approach aims to combine YOLO's real-time end-to-end object detection strengths with the Mamba model’s ability to capture long-range dependencies via efficient, linear-complexity global context modeling. Mamba YOLO models have demonstrated strong empirical gains across standard vision detection tasks, multi-modal UAV imagery, open-vocabulary object detection, medical histopathology, underwater scenes, and facial expression recognition, often delivering superior accuracy-complexity trade-offs compared to both pure CNN and Transformer-based baselines (Wang et al., 2024, Wang et al., 2024, Cao et al., 24 Nov 2025, Li et al., 1 Jul 2025, Malekmohammadi et al., 2024, Ma et al., 2024, Liao et al., 26 Feb 2026, Lei et al., 4 Jun 2025, Badiezadeh et al., 2024).
1. Core Architectural Principles
The Mamba YOLO framework adheres to the canonical detector pipeline: backbone → neck (feature aggregator) → detection head. The central distinguishing element is the inclusion of SSM-based modules—typically, "selective scan" blocks based on (discretized) linear ODEs—at various locations:
- Backbone: Feature extraction incorporates SSMs (e.g., ODSSBlocks, SS2D, Vision Mamba blocks) to imbue the network with global context, overcoming the limited receptive field of pure convolution and avoiding the quadratic scaling of self-attention (Wang et al., 2024, Lei et al., 4 Jun 2025).
- Neck: Mamba-based fusion modules are deployed in place of conventional FPN/PAFPN blocks, e.g., MambaFusion-PAN, DGC-MFM, Fusion Mamba, or HFAN (Wang et al., 2024, Cao et al., 24 Nov 2025, Li et al., 1 Jul 2025). These necks provide feature aggregation across spatial scales or modalities, with SSM-guided recurrent fusion.
- Detection Head: Most versions retain a decoupled YOLO-style head, though some (e.g., open-vocabulary variants) add extra contrastive or multimodal heads (Wang et al., 2024).
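The backbone → neck → head composition can be sketched abstractly. The following is a placeholder pipeline in NumPy; all stage names (`ssm_backbone`, `mamba_neck`, `yolo_head`) are illustrative stand-ins, not the papers' actual modules, and only the multi-scale data flow is meant to be faithful:

```python
import numpy as np

def ssm_backbone(image):
    """Stand-in for an SSM-augmented backbone: returns stride-8/16/32 feature maps."""
    return [image[:, ::s, ::s] for s in (8, 16, 32)]

def mamba_neck(features):
    """Stand-in for SSM-guided multi-scale fusion: here, just per-map centering."""
    return [f - f.mean() for f in features]

def yolo_head(features):
    """Stand-in for a decoupled head: one prediction map per scale."""
    return [f.sum(axis=0) for f in features]  # collapse the channel axis

# A 3-channel 640x640 input yields the familiar 80/40/20 YOLO grid sizes.
preds = yolo_head(mamba_neck(ssm_backbone(np.ones((3, 640, 640)))))
print([p.shape for p in preds])  # [(80, 80), (40, 40), (20, 20)]
```

The point of the sketch is structural: SSM modules can replace the convolutional stages in any of the three slots without changing the overall detector contract.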
Key SSM instantiations follow the continuous-to-discrete formalism. The continuous linear SSM $h'(t) = A\,h(t) + B\,x(t)$, $y(t) = C\,h(t) + D\,x(t)$ is discretized with a zero-order hold over step size $\Delta$, giving $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$, which yields the recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t + D x_t$, where the module parameters $A$, $B$, $C$, $D$ (and the step $\Delta$) may be learned or dynamically conditioned on the input.
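The zero-order-hold discretization and the resulting linear-time recurrence can be sketched numerically. This is a minimal NumPy illustration assuming a diagonal state matrix (as Mamba uses) and scalar inputs; it is not any paper's actual kernel:

```python
import numpy as np

def zoh_discretize(a_diag, B, delta):
    """Zero-order-hold discretization for a diagonal state matrix.
    a_diag: (d,) diagonal of A; B: (d,) input matrix; delta: scalar step."""
    a_bar = np.exp(delta * a_diag)                        # A_bar = exp(dA)
    b_bar = (a_bar - 1.0) / (delta * a_diag) * delta * B  # B_bar = (dA)^-1 (exp(dA)-I) dB
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, C, D, x):
    """Linear recurrence h_t = A_bar h_{t-1} + B_bar x_t; y_t = C h_t + D x_t."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:                       # O(N) in the sequence length
        h = a_bar * h + b_bar * x_t
        ys.append(float(C @ h) + D * x_t)
    return np.array(ys)

# Toy run: 4-dim state, length-8 scalar input, stable (negative) eigenvalues.
rng = np.random.default_rng(0)
a_bar, b_bar = zoh_discretize(-np.abs(rng.standard_normal(4)),
                              rng.standard_normal(4), delta=0.1)
y = ssm_scan(a_bar, b_bar, rng.standard_normal(4), D=0.5, x=np.ones(8))
print(y.shape)  # (8,)
```

Negative entries of the continuous-time diagonal keep $\bar{A}$ in $(0, 1)$, so the recurrence is stable; in the selective variants, $\Delta$, $B$, and $C$ become functions of the input rather than fixed parameters.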
2. Model Variants and Their Specific Components
Several Mamba YOLO variants have been proposed, each targeting differing detection scenarios or modalities:
| Variant | Backbone SSM | Neck/Fusion | Benchmark SOTA | Key Innovation |
|---|---|---|---|---|
| Mamba YOLO (Wang et al., 2024) | ODSSBlock+SS2D | PAFPN+ODSSBlock | COCO | Pure SSM backbone + RG Block |
| Mamba-YOLO-World (Wang et al., 2024) | YOLOv8+CLIP | MambaFusion-PAN (PGSS/SGSS) | COCO/LVIS | Global modality fusion |
| FER-YOLO-Mamba (Ma et al., 2024) | CSPDarknet+VSS | FPN+FER-YOLO-VSS dual branch | RAF-DB, SFEW | Local/Global dual-branch |
| SPMamba-YOLO (Liao et al., 26 Feb 2026) | ODSSBlock | PAFPN+SPPELAN+PSA+Mamba SSM head | URPC2022 | Multiscale context, global |
| MambaRefine-YOLO (Cao et al., 24 Nov 2025) | Dual-Stream CNN+Mamba | DGC-MFM, HFAN | DroneVehicle | Dual-gated RGB/IR fusion |
| UAVD-Mamba (Li et al., 1 Jul 2025) | DTMB (deformable tokens) | Fusion Mamba, DNM (YOLOv11) | DroneVehicle | Deformable tokens, FFM |
| MambaNeXt-YOLO (Lei et al., 4 Jun 2025) | MambaNeXt Block | MAFPN | Pascal VOC | CNN/SSM hybrid block |
For medical histopathology (e.g., prostate cancer grading), both YOLOv8 variants and Vision Mamba models have been applied and compared (Malekmohammadi et al., 2024, Badiezadeh et al., 2024).
3. SSM Block Mechanics: Theory and Implementation
At the block level, Mamba-based SSMs operate as follows:
- State evolution: hidden states are recursively updated via parameterized discretizations of linear SSMs; input-dependent gating (as in Mamba's selective mechanism, which conditions $\Delta$, $B$, and $C$ on the input) or parallelization over 2D spatial slices is used.
- Feature injection: Selective scan unfolds input tensors in multiple spatial directions (rows, columns, diagonals), applies a 1D SSM along each, and merges the directional outputs, enabling global context at linear cost in the token count (Wang et al., 2024, Ma et al., 2024).
- Hybridization: Weaknesses of SSM-only composition (e.g., poor local detail or channel mixing) are alleviated by additional channel-split or residual gated (RG) blocks, local convolutions, or learned attention/gating (Wang et al., 2024, Lei et al., 4 Jun 2025, Ma et al., 2024).
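The multi-directional unfold-scan-merge pattern can be sketched as follows. Here `toy_scan` is a fixed-decay stand-in for a learned 1D SSM, and the two traversal orders (plus their reverses) are illustrative rather than the exact SS2D implementation:

```python
import numpy as np

def directional_scans(feat):
    """Unfold an (H, W) map into four 1D traversal orders."""
    row = feat.reshape(-1)      # row-major, left-to-right
    col = feat.T.reshape(-1)    # column-major, top-to-bottom
    return [row, col, row[::-1], col[::-1]]

def toy_scan(x, decay=0.9):
    """Fixed-decay linear recurrence standing in for a learned 1D SSM."""
    h, out = 0.0, np.empty_like(x)
    for i, v in enumerate(x):
        h = decay * h + v
        out[i] = h
    return out

def ss2d_merge(feat):
    """Scan along each direction, map results back to (H, W), and average."""
    H, W = feat.shape
    row, col, row_r, col_r = [toy_scan(s) for s in directional_scans(feat)]
    merged = (row.reshape(H, W)
              + col.reshape(W, H).T          # undo the column-major unfold
              + row_r[::-1].reshape(H, W)    # undo the reversal, then unfold
              + col_r[::-1].reshape(W, H).T)
    return merged / 4.0

out = ss2d_merge(np.ones((4, 4)))
print(out.shape)  # (4, 4)
```

Because each directional pass is a single linear recurrence over N = H·W tokens, the merged output carries context from every position at O(N) cost per direction.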
In multi-modal and open-vocabulary settings, SSM fusion modules condition the recurrence parameters on summaries from alternate modalities (e.g., THS/IHS in Mamba-YOLO-World; RGB/IR complementary gates in MambaRefine-YOLO) (Wang et al., 2024, Cao et al., 24 Nov 2025).
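A minimal sketch of such modality-conditioned gating is given below; the scalar gates and weights are hypothetical simplifications for illustration, not the papers' actual DGC-MFM or THS/IHS modules:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_gated_fusion(rgb, ir, w_rgb=1.0, w_ir=1.0):
    """Hypothetical dual-gated fusion: each modality's gate is conditioned on a
    channel summary of the *other* modality (loosely inspired by dual-gated
    RGB/IR fusion; real modules condition full SSM recurrences, not scalars)."""
    g_rgb = sigmoid(w_rgb * ir.mean())   # gate on RGB, driven by the IR summary
    g_ir = sigmoid(w_ir * rgb.mean())    # gate on IR, driven by the RGB summary
    return g_rgb * rgb + g_ir * ir

# Toy (C, H, W) tensors: all-ones RGB features, all-zeros IR features.
fused = dual_gated_fusion(np.ones((2, 4, 4)), np.zeros((2, 4, 4)))
print(fused.shape)  # (2, 4, 4)
```

The design intent is that an uninformative modality (e.g., dark RGB at night) drives its counterpart's gate toward a neutral value, letting the stronger modality dominate the fused features.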
4. Empirical Performance and Comparisons
Mamba YOLO variants achieve state-of-the-art or near-state-of-the-art results on diverse detection tasks, with superior accuracy-efficiency trade-offs:
- Object Detection (COCO 640×640):
- Mamba YOLO-T: 44.5 AP / 5.8M params / 1.5ms, outperforming YOLOv8-n by +7.2 AP.
- Mamba YOLO-B: 49.1 AP / 19.1M params / 2.2ms, +2–3 AP over similar-FLOP YOLO baselines.
- Mamba YOLO-L: 52.1 AP / 57.6M params / 4.3ms (Wang et al., 2024).
- Open-Vocabulary Detection: Mamba-YOLO-World-S: 27.7 AP (zero-shot LVIS), exceeding YOLO-World-S by 1.5 AP at fixed FLOPs (Wang et al., 2024).
- UAV Small Object / Multimodal: MambaRefine-YOLO: 83.2% [email protected] on DroneVehicle, +7.9% vs. RGB YOLO11 (Cao et al., 24 Nov 2025). UAVD-Mamba: 83.0% [email protected], outperforms OAFA by +3.6 (Li et al., 1 Jul 2025).
- Underwater Detection: SPMamba-YOLO: [email protected] = 0.825 on URPC2022, +4.9% over YOLOv8n baseline (Liao et al., 26 Feb 2026).
- Medical Histopathology: Vision Mamba: 85.3% F1 and 85.1% accuracy on Gleason2019, surpassing YOLOv8x (82.9/83.7%) (Malekmohammadi et al., 2024). H-vmunet achieves 0.92 Dice vs. 0.83 for YOLOv8m on segmentation (Badiezadeh et al., 2024).
- Facial Expression Recognition: FER-YOLO-Mamba: mAP of 80.31% on RAF-DB, +1.91% over YOLOvX, while running in real time (Ma et al., 2024).
5. Complexity, Scaling, and Computational Efficiency
A central motivation for SSM integration is to avoid the quadratic cost of self-attention while achieving global receptive fields. Core findings include:
- SSM/Selective Scan: linear time and memory in the sequence/image token count per scan direction (Wang et al., 2024, Lei et al., 4 Jun 2025, Ma et al., 2024). Linear scaling enables high-resolution inputs with practical resource use.
- Model size vs. speed: Tiny and base models offer strong AP-latency profiles suitable for edge deployment (e.g., Mamba YOLO-T at 1.5 ms, 5.8M), and are quantization-friendly (Wang et al., 2024, Lei et al., 4 Jun 2025).
- Runtime: Mamba YOLO inference times are competitive with, or outperform, Transformer-based and CNN-only models on both server- and edge-class devices (e.g., Jetson Orin NX: 31.9 FPS for MambaNeXt-YOLO at 66.6 mAP) (Lei et al., 4 Jun 2025).
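The asymptotic advantage can be made concrete with a back-of-the-envelope operation count; the patch size, state dimension, and direction count below are illustrative assumptions, not any paper's exact configuration:

```python
# Rough operation-count comparison for global context at a given resolution:
# self-attention scales with N^2 token pairs, a selective scan with N tokens
# per direction (times the per-token state dimension).

def token_count(h, w, patch=8):
    return (h // patch) * (w // patch)

def attention_ops(n):
    return n * n                          # pairwise token interactions

def ssm_ops(n, state_dim=16, directions=4):
    return n * state_dim * directions     # linear in sequence length

n = token_count(640, 640)                 # 80 * 80 = 6400 tokens
print(attention_ops(n) // ssm_ops(n))     # → 100
```

Under these toy numbers the scan is ~100× cheaper at 640×640, and the gap widens linearly as resolution grows, which is what makes high-resolution inputs practical.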
6. Domain-Specific Deployment and Limitations
Mamba YOLO adoption is largely domain-agnostic, with deployments spanning rapid-triage clinical workflows, UAV surveillance, underwater perception, facial analysis, and open-vocabulary image understanding:
- Strengths: Robust precision/recall trade-off from SSM global context; real-time throughput maintained; flexible adaptation to multi-modal and fine-grained detection (Wang et al., 2024, Cao et al., 24 Nov 2025, Li et al., 1 Jul 2025, Malekmohammadi et al., 2024, Ma et al., 2024).
- Weaknesses: SSM modules increase the memory footprint over pure convolution (notably, some medical models run ~2.5× slower per patch); local detail may require supplementary RG blocks or local convolutions; and SSM parametrizations may need optimization for latency-critical or memory-constrained deployments (Malekmohammadi et al., 2024, Wang et al., 2024, Kolarijani et al., 3 May 2025).
- Hybrid models: Architectures combining YOLO (for fast first-pass) and Vision Mamba (for confirmatory grading) are proposed for clinical settings (Malekmohammadi et al., 2024).
7. Research Directions and Practical Considerations
Ongoing and suggested research at the intersection of SSMs and YOLO includes:
- Student-teacher or distillation schemes: SSM-based students distilled from larger Mamba models to match speed and compress deployment footprints (Malekmohammadi et al., 2024, Liao et al., 26 Feb 2026).
- Multimodal/knowledge fusion: Encoding clinical or contextual metadata alongside image sequences using parallel MLP or SSM heads (Cao et al., 24 Nov 2025, Wang et al., 2024).
- Further SSM enhancements: Incorporation of deformable tokens, multi-branch fusion, and dynamic attention for increased robustness in complex environments (Li et al., 1 Jul 2025, Cao et al., 24 Nov 2025).
- Model compression and acceleration: Quantization, pruning, or CUDA/C++ kernel fusion for production environments, particularly in digital pathology and real-time vision systems (Wang et al., 2024, Malekmohammadi et al., 2024).
References
- (Wang et al., 2024) Mamba YOLO: A Simple Baseline for Object Detection with State Space Model
- (Wang et al., 2024) Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
- (Cao et al., 24 Nov 2025) MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
- (Li et al., 1 Jul 2025) UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection
- (Malekmohammadi et al., 2024) Classification of Gleason Grading in Prostate Cancer Histopathology Images Using Deep Learning Techniques: YOLO, Vision Transformers, and Vision Mamba
- (Ma et al., 2024) FER-YOLO-Mamba: Facial Expression Detection and Classification Based on Selective State Space
- (Liao et al., 26 Feb 2026) SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling
- (Lei et al., 4 Jun 2025) MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection
- (Badiezadeh et al., 2024) Segmentation Strategies in Deep Learning for Prostate Cancer Diagnosis: A Comparative Study of Mamba, SAM, and YOLO