Mamba YOLO: SSM-Enhanced Object Detection

Updated 6 March 2026
  • Mamba YOLO is a family of object detection architectures that integrate structured state space models into the YOLO pipeline, enhancing global context and real-time performance.
  • It leverages Mamba modules in the backbone, neck, and detection head to achieve efficient O(N) scaling with improved accuracy across varied applications.
  • Empirical evaluations show superior trade-offs in accuracy and speed on benchmarks such as COCO, UAV imagery, medical histopathology, and underwater detection.

Mamba YOLO denotes a family of object detection architectures that integrate structured State Space Models (SSMs)—specifically, Mamba modules or Vision Mamba blocks—into the YOLO ("You Only Look Once") detection pipeline. This hybrid approach aims to combine YOLO's real-time end-to-end object detection strengths with the Mamba model’s ability to capture long-range dependencies via efficient, linear-complexity global context modeling. Mamba YOLO models have demonstrated strong empirical gains across standard vision detection tasks, multi-modal UAV imagery, open-vocabulary object detection, medical histopathology, underwater scenes, and facial expression recognition, often delivering superior accuracy-complexity trade-offs compared to both pure CNN and Transformer-based baselines (Wang et al., 2024, Wang et al., 2024, Cao et al., 24 Nov 2025, Li et al., 1 Jul 2025, Malekmohammadi et al., 2024, Ma et al., 2024, Liao et al., 26 Feb 2026, Lei et al., 4 Jun 2025, Badiezadeh et al., 2024).

1. Core Architectural Principles

The Mamba YOLO framework adheres to the canonical detector pipeline: backbone → neck (feature aggregator) → detection head. The central distinguishing element is the inclusion of SSM-based modules—typically, "selective scan" blocks based on (discretized) linear ODEs—at various locations:

  • Backbone: Feature extraction incorporates SSMs (e.g., ODSSBlocks, SS2D, Vision Mamba blocks) to imbue the network with global context, overcoming the limited receptive field of pure convolution and avoiding the quadratic scaling of self-attention (Wang et al., 2024, Lei et al., 4 Jun 2025).
  • Neck: Mamba-based fusion modules are deployed in place of conventional FPN/PAFPN blocks, e.g., MambaFusion-PAN, DGC-MFM, Fusion Mamba, or HFAN (Wang et al., 2024, Cao et al., 24 Nov 2025, Li et al., 1 Jul 2025). These necks provide feature aggregation across spatial scales or modalities, with SSM-guided recurrent fusion.
  • Detection Head: Most versions retain a decoupled YOLO-style head, though some (e.g., open-vocabulary variants) add extra contrastive or multimodal heads (Wang et al., 2024).
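The three-stage layout above can be sketched schematically. All component bodies below are illustrative stand-ins (simple slicing and pooling), not any paper's implementation; only the backbone → neck → head composition mirrors the pipeline described:

```python
import numpy as np

def backbone(img):
    # Stand-in for an SSM-enhanced backbone: multi-scale
    # feature maps at strides 8, 16, and 32.
    return [img[::s, ::s] for s in (8, 16, 32)]

def neck(feats):
    # Stand-in for SSM-guided cross-scale fusion: here just a
    # global average pool per scale.
    return [f.mean(axis=(0, 1)) for f in feats]

def head(fused):
    # Decoupled YOLO-style head sketch: per-scale class logits
    # and (hypothetical) box deltas.
    return [{"cls": v, "box": np.tanh(v[:4])} for v in fused]

preds = head(neck(backbone(np.random.rand(640, 640, 16))))
```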

Key SSM instantiations follow the continuous-to-discrete formalism

$$\dot h(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$$

with zero-order hold (ZOH) discretization

$$h_k = \overline{A}\, h_{k-1} + \overline{B}\, x_k, \qquad y_k = C h_k + D x_k$$

where the module parameters A, B, C, D may be learned or dynamically conditioned on the input.
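Under the common Mamba assumption of a diagonal A, the ZOH discretization and recurrence above can be sketched as follows (all parameter values are illustrative, not taken from any specific Mamba YOLO implementation):

```python
import numpy as np

def zoh_discretize(a, b, dt):
    """Zero-order hold for a diagonal SSM: A_bar = exp(dt*a) and
    B_bar = (A_bar - 1)/a * b, computed elementwise since A is diagonal."""
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_scan(x, a, b, c, d, dt=0.1):
    """Run h_k = A_bar h_{k-1} + B_bar x_k ; y_k = C h_k + D x_k
    over a 1D input sequence x."""
    a_bar, b_bar = zoh_discretize(a, b, dt)
    h = np.zeros_like(a)
    ys = []
    for x_k in x:                      # sequential scan over the sequence
        h = a_bar * h + b_bar * x_k    # state update
        ys.append(c @ h + d * x_k)     # readout
    return np.array(ys)
```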

2. Model Variants and Their Specific Components

Several Mamba YOLO variants have been proposed, each targeting differing detection scenarios or modalities:

| Variant | Backbone SSM | Neck/Fusion | Benchmark SOTA | Key Innovation |
|---|---|---|---|---|
| Mamba YOLO (Wang et al., 2024) | ODSSBlock + SS2D | PAFPN + ODSSBlock | COCO | Pure SSM backbone + RG Block |
| Mamba-YOLO-World (Wang et al., 2024) | YOLOv8 + CLIP | MambaFusion-PAN (PGSS/SGSS) | COCO/LVIS | O(N) global modality fusion |
| FER-YOLO-Mamba (Ma et al., 2024) | CSPDarknet + VSS | FPN + FER-YOLO-VSS dual branch | RAF-DB, SFEW | Local/global dual branch |
| SPMamba-YOLO (Liao et al., 26 Feb 2026) | ODSSBlock | PAFPN + SPPELAN + PSA + Mamba SSM head | URPC2022 | Multiscale global context |
| MambaRefine-YOLO (Cao et al., 24 Nov 2025) | Dual-stream CNN + Mamba | DGC-MFM, HFAN | DroneVehicle | Dual-gated RGB/IR fusion |
| UAVD-Mamba (Li et al., 1 Jul 2025) | DTMB (deformable tokens) | Fusion Mamba, DNM (YOLOv11) | DroneVehicle | Deformable tokens, FFM |
| MambaNeXt-YOLO (Lei et al., 4 Jun 2025) | MambaNeXt Block | MAFPN | Pascal VOC | CNN/SSM hybrid block |

For medical histopathology (e.g., prostate cancer grading), both YOLOv8 variants and Vision Mamba models have been applied and compared (Malekmohammadi et al., 2024, Badiezadeh et al., 2024).

3. SSM Block Mechanics: Theory and Implementation

At the block level, Mamba-based SSMs operate as follows:

  • State evolution: hidden states h_t are recursively updated via parameterized discretizations of linear SSMs; input-dependent gating (e.g., the step size Δ) or parallelization over spatial 2D slices is used.
  • Feature injection: Selective scan unfolds input tensors in multiple spatial directions (rows, columns, diagonals), applies a 1D SSM to each, and merges the directional outputs, enabling global context at O(N) cost (Wang et al., 2024, Ma et al., 2024).
  • Hybridization: Weaknesses of SSM-only composition (e.g., poor local detail or channel mixing) are alleviated by additional channel-split or residual gated (RG) blocks, local convolutions, or learned attention/gating (Wang et al., 2024, Lei et al., 4 Jun 2025, Ma et al., 2024).
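The multi-directional unfolding can be illustrated with a toy example. A plain exponential-decay recurrence stands in for a full learned selective-scan SSM; the unfold/scan/merge structure is the point:

```python
import numpy as np

def scan_1d(seq, decay=0.9):
    """Toy causal recurrence h_k = decay*h_{k-1} + x_k,
    a stand-in for a learned 1D SSM scan."""
    out = np.empty_like(np.asarray(seq))
    h = np.zeros(seq.shape[1:])
    for k in range(len(seq)):
        h = decay * h + seq[k]
        out[k] = h
    return out

def selective_scan_2d(x):
    """Unfold an (H, W, C) map along rows and columns, forward and
    backward, scan each route, then merge the four outputs."""
    H, W, C = x.shape
    row = x.reshape(H * W, C)                      # row-major route
    col = x.transpose(1, 0, 2).reshape(H * W, C)   # column-major route
    y_row_f = scan_1d(row).reshape(H, W, C)
    y_row_b = scan_1d(row[::-1])[::-1].reshape(H, W, C)
    y_col_f = scan_1d(col).reshape(W, H, C).transpose(1, 0, 2)
    y_col_b = scan_1d(col[::-1])[::-1].reshape(W, H, C).transpose(1, 0, 2)
    return (y_row_f + y_row_b + y_col_f + y_col_b) / 4.0
```

Each position's output then aggregates information from the whole map at linear cost, since every route visits each token exactly once.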

In multi-modal and open-vocabulary settings, SSM fusion modules condition the recurrence parameters on summaries from alternate modalities (e.g., THS/IHS in Mamba-YOLO-World; RGB/IR complementary gates in MambaRefine-YOLO) (Wang et al., 2024, Cao et al., 24 Nov 2025).
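A minimal sketch of the gating idea, loosely inspired by the dual-gated RGB/IR fusion; `w_gate` is a hypothetical learned projection and the whole function is an assumption-level illustration, not the published module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(f_rgb, f_ir, w_gate):
    """Per-channel gate g in (0, 1), computed from both modality streams,
    weights the two features as a convex combination."""
    g = sigmoid(np.concatenate([f_rgb, f_ir], axis=-1) @ w_gate)
    return g * f_rgb + (1.0 - g) * f_ir
```

Because the output is a convex combination, each fused value lies between the two modality features, with the balance learned per channel and per location.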

4. Empirical Performance and Comparisons

Mamba YOLO variants achieve state-of-the-art or near SOTA results on diverse detection tasks, with superior accuracy-efficiency trade-offs:

  • Object Detection (COCO 640×640):
    • Mamba YOLO-T: 44.5 AP / 5.8M params / 1.5 ms, outperforming YOLOv8-n by +7.2 AP.
    • Mamba YOLO-B: 49.1 AP / 19.1M params / 2.2 ms, +2–3 AP over similar-FLOP YOLO baselines.
    • Mamba YOLO-L: 52.1 AP / 57.6M params / 4.3 ms (Wang et al., 2024).
  • Open-Vocabulary Detection: Mamba-YOLO-World-S: 27.7 AP (zero-shot LVIS), exceeding YOLO-World-S by 1.5 AP at fixed FLOPs (Wang et al., 2024).
  • UAV Small Object / Multimodal: MambaRefine-YOLO: 83.2% mAP@0.5 on DroneVehicle, +7.9% vs. RGB YOLO11 (Cao et al., 24 Nov 2025). UAVD-Mamba: 83.0% mAP@0.5, outperforming OAFA by +3.6 points (Li et al., 1 Jul 2025).
  • Underwater Detection: SPMamba-YOLO: mAP@0.5 = 0.825 on URPC2022, +4.9% over the YOLOv8n baseline (Liao et al., 26 Feb 2026).
  • Medical Histopathology: Vision Mamba: 85.3% F1 and 85.1% accuracy on Gleason2019, surpassing YOLOv8x (82.9/83.7%) (Malekmohammadi et al., 2024). H-vmunet achieves 0.92 Dice vs. 0.83 for YOLOv8m on segmentation (Badiezadeh et al., 2024).
  • Facial Expression Recognition: FER-YOLO-Mamba: mAP 80.31% RAF-DB, +1.91% over YOLOvX, in real-time (Ma et al., 2024).

5. Complexity, Scaling, and Computational Efficiency

A central motivation for SSM integration is to avoid the quadratic cost of self-attention while achieving global receptive fields. Core findings include:

  • SSM/Selective Scan: O(N) time and memory in sequence/image size N per scan direction (Wang et al., 2024, Lei et al., 4 Jun 2025, Ma et al., 2024). Linear scaling enables high-resolution input (e.g., 1024×1024) with practical resource use.
  • Model size vs. speed: Tiny and base models offer strong AP-latency profiles suitable for edge deployment (e.g., Mamba YOLO-T at 1.5 ms, 5.8M), and are quantization-friendly (Wang et al., 2024, Lei et al., 4 Jun 2025).
  • Runtime: Mamba YOLO inference times are competitive with, or outperform, Transformer-based and CNN-only models on both server- and edge-class devices (e.g., Jetson Orin NX: 31.9 FPS for MambaNeXt-YOLO at 66.6 mAP) (Lei et al., 4 Jun 2025).
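The scaling gap can be made concrete with rough operation counts (the constants `directions`, `state_dim`, and `head_dim` are illustrative, not profiled from any implementation):

```python
def scan_ops(n, directions=4, state_dim=16):
    """Selective scan: each of n tokens is visited once per direction,
    updating a small hidden state -> cost linear in n."""
    return directions * n * state_dim

def attention_ops(n, head_dim=16):
    """Full self-attention forms an n x n similarity matrix
    -> cost quadratic in n."""
    return n * n * head_dim
```

Quadrupling the token count (e.g., doubling input resolution on each side) quadruples the scan cost but multiplies the attention cost by 16, which is why SSM blocks remain practical at high resolutions.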

6. Domain-Specific Deployment and Limitations

Mamba YOLO adoption is domain-agnostic, with applications in rapidly triaged clinical workflows, UAV surveillance, underwater perception, facial analysis, and open-vocabulary image understanding.

7. Research Directions and Practical Considerations

Ongoing and suggested research continues at the intersection of SSMs and YOLO detection architectures.

References
