
Spiking-YOLO: Energy-Efficient SNN Detection

Updated 23 February 2026
  • The paper introduces Spiking-YOLO models that use channel-wise normalization and signed IF neurons to convert Tiny YOLO into energy-efficient SNNs, reaching 97.8% of Tiny YOLO's mAP with $>2000\times$ energy savings.
  • Advances such as surrogate-gradient descent, full-spike residual blocks, and spike-driven decoding enable detection in 4-6 time steps, improving mAP and reducing latency compared to conversion-based methods.
  • Variants like SU-YOLO and SpikeYOLO demonstrate the approach’s versatility for challenging domains, including underwater and low-light conditions, while maintaining high detection accuracy.

Spiking-YOLO refers to a family of object detection models that adapt the YOLO (You Only Look Once) paradigm to spiking neural networks (SNNs). These models leverage the event-driven and sparse computation properties of SNNs for high energy efficiency while targeting performance competitive with conventional artificial neural networks (ANNs) in regression-centric detection tasks. The evolution of Spiking-YOLO encompasses both direct ANN-to-SNN conversion methods and, more recently, directly trained end-to-end SNN architectures optimized for object detection on both frame-based and neuromorphic datasets.

1. Core Principles and Early Innovations

The initial Spiking-YOLO model established two key architectural mechanisms required to enable deep SNNs to perform continuous regression for object detection—tasks historically dominated by ANNs:

  • Channel-Wise Normalization: To minimize vanishing or saturating spike rates across channels, weights and biases are normalized per channel using high quantiles of pre-conversion activations. This ensures robust information flow in deep SNN hierarchies for spatially precise detection.
  • Signed Integrate-and-Fire (IF) Neuron with Imbalanced Thresholds: This model implements leaky-ReLU activation by assigning each IF neuron dual thresholds ($V_{th}^+$, $V_{th}^-$) linked to positive and negative spikes, with $V_{th}^- = -V_{th}^+ / \alpha$, faithfully mapping the leakage slope parameter $\alpha$ into hardware-amenable threshold comparisons (Kim et al., 2019).
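A minimal NumPy sketch of such a signed IF neuron, assuming illustrative parameter names (`v_th_pos`, `alpha`) and reset-by-subtraction dynamics rather than the paper's exact implementation:

```python
import numpy as np

def signed_if_neuron(inputs, v_th_pos=1.0, alpha=10.0):
    """Sketch of a signed IF neuron with imbalanced thresholds.

    A positive spike fires when the membrane potential reaches v_th_pos;
    a negative spike fires when it drops to v_th_neg = -v_th_pos / alpha.
    Because the negative threshold is alpha times closer to zero, negative
    spikes carry 1/alpha of the magnitude, approximating leaky-ReLU's
    negative slope.  Parameter names are assumptions, not the paper's code.
    """
    v_th_neg = -v_th_pos / alpha          # imbalanced negative threshold
    v_mem = 0.0
    spikes = []
    for x in inputs:                      # one input current per time step
        v_mem += x
        if v_mem >= v_th_pos:
            spikes.append(1)
            v_mem -= v_th_pos             # reset by subtraction
        elif v_mem <= v_th_neg:
            spikes.append(-1)
            v_mem -= v_th_neg
        else:
            spikes.append(0)
    return spikes

# Positive drive accumulates to a +1 spike; weak negative drive crosses the
# closer negative threshold after only two steps.
pos = signed_if_neuron([0.6, 0.6, 0.6])   # [0, 1, 0]
neg = signed_if_neuron([-0.06, -0.06])    # [0, -1]
```

Decoding would rescale negative spike rates by $1/\alpha$ to recover the leaky-ReLU slope.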

This approach enabled a direct conversion of pre-trained Tiny YOLO networks into SNNs, with outputs read from membrane potential accumulators to achieve high regression precision (e.g., 97.8% of Tiny YOLO mAP on VOC, and $>2000\times$ energy savings on neuromorphic hardware (Kim et al., 2019)).
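The channel-wise normalization described above can be sketched as follows. The 99.9th-percentile choice and the function signature are assumptions for illustration; a full conversion pipeline would additionally rescale each layer's weights by the previous layer's normalization factor, which is omitted here for brevity.

```python
import numpy as np

def channel_wise_normalize(weights, biases, activations, q=99.9):
    """Sketch of channel-wise (per-output-channel) normalization.

    For each output channel c, the scale lambda_c is a high quantile
    (here the q-th percentile, an assumed choice) of that channel's
    pre-conversion activations.  Normalizing per channel keeps firing
    rates usable in every channel, instead of letting one channel with
    large activations suppress spike rates across the whole layer.
    """
    # activations: (num_samples, num_channels); weights: (num_channels, fan_in)
    lam = np.percentile(activations, q, axis=0)   # (num_channels,)
    lam = np.maximum(lam, 1e-8)                   # guard against dead channels
    w_norm = weights / lam[:, None]
    b_norm = biases / lam
    return w_norm, b_norm, lam

# Toy layer: 4 output channels with very different activation scales.
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(1000, 4))) * np.array([1.0, 10.0, 0.1, 5.0])
w, b, lam = channel_wise_normalize(rng.normal(size=(4, 8)),
                                   rng.normal(size=4), acts)
```

The channel with the largest activations receives the largest divisor, so no single channel saturates while others starve.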

2. Advances in Training Methodologies and Architectural Design

Subsequent research introduced directly trained SNN architectures and improved surrogate-gradient techniques, overcoming the limitations of conversion-based methods, which require thousands of time steps for convergence and are constrained to shallow backbones.

  • Surrogate-Gradient Descent: Models such as EMS-YOLO employ end-to-end training with backpropagation-through-time (BPTT) and differentiable surrogate spike functions. This enables deep spiking architectures (e.g., ResNet-like SNNs) to be trained for detection with only 4 time steps, offering 6-fold energy savings compared to ANNs at near-ANN accuracy (Su et al., 2023).
  • Full-Spike Residual Blocks: Residual modules are adapted to the spike domain. EMS-YOLO introduces “EMS-Blocks”—residual connections composed entirely of convolutional and LIF layers, avoiding analog shortcuts and ensuring block dynamical isometry for stable deep network training.
  • Spike-Driven Decoding Strategies: The Current Mean Decoding (CMD) technique enables precise real-valued prediction of bounding boxes and objectness from populations of spikes, overcoming limitations of rate-based decoding (Luo et al., 2023). CMD outputs the averaged synaptic current over all time steps, compatible with regression targets.
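The surrogate-gradient idea can be sketched as a pair of forward/backward functions: the forward pass keeps the non-differentiable hard threshold, while the backward pass substitutes a smooth derivative so BPTT can flow through spiking layers. The sigmoid-derivative surrogate and steepness `k` below are common choices, not necessarily the exact form used by EMS-YOLO.

```python
import numpy as np

def spike_forward(v_mem, v_th=1.0):
    """Forward pass: hard threshold (Heaviside step), non-differentiable."""
    return (v_mem >= v_th).astype(float)

def spike_backward(v_mem, v_th=1.0, k=5.0):
    """Backward pass: surrogate gradient replacing the Heaviside derivative.

    Uses the derivative of a sigmoid with steepness k centred at the
    threshold (one common surrogate), so error signals are largest for
    membrane potentials near v_th and vanish far from it.
    """
    s = 1.0 / (1.0 + np.exp(-k * (v_mem - v_th)))
    return k * s * (1.0 - s)

v = np.array([0.2, 0.99, 1.01, 2.0])
spikes = spike_forward(v)    # [0., 0., 1., 1.]
grads = spike_backward(v)    # peaked near the threshold, small at 0.2 and 2.0
```

In a real framework this pair would be registered as a custom autograd function, with `spike_forward` used in the forward graph and `spike_backward` supplying the gradient.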

These strategies permit SNNs to achieve 61.87% mAP on PASCAL VOC in only 6 time steps (SNN-YOLOv3), compared to conversion-based Spiking-YOLO's 51.8% at $T = 3500$: an improvement of roughly 10 percentage points coupled with a two-orders-of-magnitude reduction in energy (Luo et al., 2023).

3. Network Structures, Latency Reduction Techniques, and ANN-SNN Alignment

Addressing the high latency and feature-collapse issues in early SNN detectors, newer models incorporate:

  • Timestep Compression: Methods such as SUHD compress information from hundreds of time steps into a handful of “super-timesteps” (e.g., a compression factor $f_c = 16$ yielding $T_c = 4$), maintaining detection accuracy while drastically cutting inference time (Qu et al., 2023).
  • Spike-Time-Dependent Integrated (STDI) Coding: SUHD leverages time-varying thresholds $\theta(t) = \tau(t) \cdot v_{thr}$ to encode more information per spike, broadening the representational capacity and minimizing quantization error (Qu et al., 2023).
  • ANN-SNN Alignment via Quantization: Low-latency variants introduce Quant-ReLU activations during ANN pre-training to align post-activation values to the target SNN’s firing rate resolution ($1/T$), thus eliminating much of the quantization error upon conversion (Qiu et al., 2023). Additional “residual fix” strategies halve membrane-potential bias due to remainder accumulation, expediting convergence at lower timesteps.
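A minimal sketch of a Quant-ReLU-style activation for ANN pre-training, assuming uniform rounding to the $k/T$ firing-rate grid (parameter names are illustrative):

```python
import numpy as np

def quant_relu(x, T=8, v_max=1.0):
    """Sketch of a Quant-ReLU-style activation.

    Clips to [0, v_max] and rounds to multiples of v_max / T, so the ANN's
    post-activation values already lie on the grid of firing rates an SNN
    can express in T time steps (k spikes out of T).  Pre-training with
    this activation removes most of the quantization error that would
    otherwise appear at conversion time.
    """
    step = v_max / T
    return np.clip(np.round(x / step), 0, T) * step

x = np.array([-0.3, 0.13, 0.5, 0.93, 1.7])
y = quant_relu(x, T=8)   # [0., 0.125, 0.5, 0.875, 1.]
```

With $T = 8$ the activation can only take the nine values $0, 1/8, \ldots, 1$, exactly matching the resolution of an 8-step rate code.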

The table below summarizes latency and accuracy trade-offs for leading methods:

Model                                mAP@0.5 (VOC)   Time Steps   Energy Savings
Spiking-YOLO (Kim et al., 2019)      51.8%           3500         $>2000\times$ vs. ANN
EMS-YOLO (Su et al., 2023)           50.1%           4            $5.8\times$
SNN-YOLOv3 (Luo et al., 2023)        61.9%           6            $158\times$
SUHD (Qu et al., 2023)               75.3%           4            $>200\times$
Low-Latency SNN (Qiu et al., 2023)   54.2%           300          Not specified

4. Model Variants and Application Domains

Several distinct Spiking-YOLO variants have been tailored for application-specific domains and measurement settings:

  • SU-YOLO: Designed for underwater detection in complex optical conditions, SU-YOLO combines spike-based denoising, time-step-separated batch normalization (SeBN), and CSPNet-inspired SU-Blocks to combat spike degradation. It achieves 78.8% mAP@0.5 on URPC2019 at only 2.98 mJ per image, outperforming both ANN and SNN baselines of comparable parameter count (Li et al., 31 Mar 2025).
  • SpikeYOLO (2024): Introduces integer-valued (I-LIF) spiking neurons, training with integer activations and inferring via “virtual timesteps.” This yields 66.2% mAP@50 and 48.9% mAP@50:95 on COCO (large variant, $T \times D = 1 \times 4$), surpassing prior SNNs and matching or exceeding ANN performance with 3–5$\times$ energy savings (Luo et al., 2024).
  • SCOD: Interleaves conventional and spiking convolutions for robust detection under low-illumination (dark object) conditions, achieving 66.01% mAP on VOC (Ali et al., 2023).
  • QCFS Tiny YOLO/SNN Tiny YOLO: Establishes the QCFS activation for guaranteed zero expected conversion error, allowing effective deployment on resource-constrained edge devices and offering a practical stepping stone to full SNN inference without loss in mAP (Ambati et al., 2023).
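The I-LIF training/inference duality above can be illustrated with a toy sketch: train with an integer spike count per step, then expand each integer into binary spikes over $D$ virtual timesteps at inference so the network stays spike-driven. The decomposition shown is an assumption for illustration; SpikeYOLO's exact scheduling may differ.

```python
import numpy as np

def i_lif_train_activation(current, D=4):
    """Training-time activation: emit an integer spike count in [0, D].

    Integer activations keep gradients informative during training while
    staying consistent with what D binary timesteps can represent.
    """
    return int(np.clip(np.round(current), 0, D))

def expand_to_virtual_timesteps(count, D=4):
    """Inference-time expansion: one integer count -> D binary virtual
    timesteps carrying the same total spike mass (illustrative layout)."""
    return [1] * count + [0] * (D - count)

c = i_lif_train_activation(2.7)        # -> 3
spikes = expand_to_virtual_timesteps(c)  # -> [1, 1, 1, 0]
assert sum(spikes) == c                  # expansion preserves the count
```

The key property is that the binary expansion is exactly equivalent to the integer activation in total spike mass, so inference remains multiplication-free without retraining.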

5. Challenges, Open Problems, and Directions for Future Research

Despite considerable advances, several limitations and open research challenges remain in Spiking-YOLO and SNN object detection:

  • Latency-Accuracy Trade-Offs: Conversion-based SNNs still require far more inference steps ($T = 10^2$ to $10^4$) than directly trained models; continually compressing $T$ without loss in precision is an active research topic (Qu et al., 2023, Qiu et al., 2023).
  • Decoding Regression Outputs: Accurate real-valued decoding from sparse spike trains remains challenging. CMD and STDI are recent advances but may still be sub-optimal for complex detection heads or anchor-free architectures.
  • Extension to Deep, Modern Backbones: While SUHD and SU-YOLO enable lossless SNN conversion for YOLOv5/YOLOv7-scale architectures, significant engineering remains to generalize such approaches to even deeper, transformer-based, or multi-modal backbones (Qu et al., 2023, Li et al., 31 Mar 2025).
  • Hardware Realization: The sparse-event regime of SNNs is still not fully exploited on general-purpose hardware. Dedicated neuromorphic platforms (e.g., TrueNorth, Loihi, SpiNNaker) remain underutilized in the detection domain due to engineering and support challenges.
  • Robustness and Uncertainty: The event-driven, temporally sparse nature of SNNs offers potential for robust performance under noise and low SNR conditions, as shown for dark or underwater deployments. However, systematic study of uncertainty estimation and out-of-distribution detection remains an open frontier (Ali et al., 2023, Li et al., 31 Mar 2025).

6. Significance and Impact

Spiking-YOLO models have demonstrated that dense, real-time object detection is feasible with SNNs at near-ANN performance and with orders-of-magnitude lower energy consumption. This renders these models particularly relevant for edge AI, autonomous robotics, industrial safety, and resource-constrained environments (e.g., underwater, low-light, or mobile platforms). The introduction of integer/coded spiking neurons, precise ANN-SNN alignment, and fully spiking residual/neck designs pushes the state-of-the-art in energy-efficient deep learning and opens avenues for further cross-pollination between neuromorphic engineering and modern computer vision.

There is now clear evidence that with innovations in spike encoding, normalization, training, and decoding, SNN-based detectors can match or even exceed their ANN counterparts on benchmarks, often with $100\times$ to $1000\times$ better energy efficiency, heralding practical deployments in mobile perception and real-time sensory edge processing (Li et al., 31 Mar 2025, Luo et al., 2024, Qu et al., 2023, Su et al., 2023).
