SpikeYOLO: Spiking Object Detection

Updated 11 March 2026

SpikeYOLO is a family of spiking neural network architectures adapted from YOLO detectors, emphasizing ultra-low latency and energy efficiency.
It employs burst-firing, integer-valued neurons, and quantization strategies to mitigate conversion errors and maintain high detection accuracy.
SpikeYOLO achieves near-ANN performance with significant energy savings, enabling deployment on resource-constrained and event-driven platforms.

SpikeYOLO architectures constitute a family of spiking neural network (SNN) object detectors that adapt the YOLO line of artificial neural network (ANN) architectures for ultra-low-latency, energy-efficient, and high-accuracy object detection. These models bring YOLO variants (Tiny-YOLO, YOLOv5, YOLOv8, YOLOv9) into the spiking domain by substituting convolutional and activation operations with spike-driven counterparts and by applying quantization strategies that minimize conversion loss. Over several generations, the SpikeYOLO line has addressed three key SNN-specific obstacles (high conversion-induced quantization error, severe spike-rate degradation in deep layers, and poor timestep-to-precision scaling) using mechanisms such as burst-firing, integer-valued surrogate neuron models, CSP-style residual structures, and per-timestep normalization. SpikeYOLO architectures thus enable deployment of SNN-based object detection on resource-constrained or event-driven platforms, with accuracy close to that of the ANN baselines and large reported reductions in inference timesteps and energy consumption (Qu et al., 2023, Luo et al., 2024, Li et al., 2025, Kim et al., 2019, Qiu et al., 2023).

1. Architectural Evolution and Fundamental Design Challenges

Early attempts at SNN-based object detection, exemplified by "Spiking-YOLO" (Kim et al., 2019), established the feasibility of converting shallow YOLO (Tiny-YOLO) architectures to SNNs via channel-wise normalization and a novel signed-neuron thresholding scheme that addresses the leaky-ReLU encoding challenge. However, latency was high: thousands of simulation timesteps were required to maintain detection accuracy, owing to quantization mismatch, spike loss in deep networks, and suboptimal handling of pooling and residual connections.

Subsequent works, including "Low Latency Spiking Neural Network for Object Detection" (Qiu et al., 2023), improved on this by introducing explicit activation quantization and a residual-fix strategy. These changes reduced the quantization residue, enabled effective training with shorter timestep windows, and improved the fidelity of the ANN-to-SNN mapping.

More recent architectures, notably "Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection" (SUHD, a.k.a. SpikeYOLO in its YOLOv5 form) (Qu et al., 2023) and "Integer-Valued Training and Spike-Driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection" (Luo et al., 2024), leverage architectural simplification (removing dense residuals and complex modules), burst-firing, integer-valued neurons, and custom coding schemes to achieve near-lossless conversion and low-latency high-accuracy inference.

2. Canonical SpikeYOLO Architectures: Topologies and Building Blocks

SpikeYOLO architectures instantiate a set of systematic replacements and additions to bring YOLO-style ANN object detectors into the spiking domain. The following table contrasts representative approaches:

SpikeYOLO Variant	Base ANN	Key SNN Adaptations
Spiking-YOLO (Kim et al., 2019)	Tiny-YOLO	Channel-wise normalization, signed neuron (imbalanced threshold)
SpikeYOLO (Qiu et al., 2023)	Tiny-YOLOv3	BN-fusion, ReLU→QuantReLU, Conv-replace-Pool, residual fix
SpikeYOLO/SUHD (Qu et al., 2023)	YOLOv5s	IF neuron everywhere, burst-firing/step compression, STDI coding, full FPN/PAN
Meta-Block SpikeYOLO (Luo et al., 2024)	YOLOv8	Network simplification, meta SNN blocks (token/channel mixers), integer-valued IF neuron, virtual sub-steps
SU-YOLO (Li et al., 2025)	YOLOv9	Spiking CSP residuals, separated BN, lightweight denoising, infinite-threshold heads

The meta-architecture in all cases preserves the canonical backbone/neck/head split of YOLO models, but removes layers or modules that provoke spike-rate drop (e.g., dense cascaded CSPs, max-pooling, complex residual trees). In SUHD and meta-block SpikeYOLO, each Conv+BN+ReLU stack is replaced by a spike-convolution module driving non-leaky IF (or integer IF/LIF) neurons, and all residual connections are exactly mapped such that spike trains from main and shortcut branches are merged additively at the membrane level before thresholding.

Detection heads are implemented as 1×1 spike-convs followed by IF neurons, with output decoded according to the same bounding-box and class/logit transformations as the originating YOLO model.
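As a rough illustration of such a decoding step, the sketch below assumes a YOLOv3-style convention (sigmoid for centre offsets and objectness, exponential scaling of anchor sizes, softmax over class logits). The function name `decode_cell`, the argument layout, and the assumption that the head read-out is already a real-valued vector (for example a time-averaged membrane potential) are illustrative choices, not details taken from the cited papers.

```python
import math


def sigmoid(x):
    """Logistic function used for centre offsets and objectness."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_cell(raw, cell_xy, anchor_wh, stride):
    """Decode one grid cell / anchor prediction (hypothetical helper).

    raw       : [t_x, t_y, t_w, t_h, t_obj, class logits...], assumed to be
                the real-valued read-out of the spiking head.
    cell_xy   : integer grid coordinates of the cell.
    anchor_wh : anchor width and height in pixels.
    stride    : pixels per grid cell.
    """
    t_x, t_y, t_w, t_h, t_obj = raw[:5]
    cx = (cell_xy[0] + sigmoid(t_x)) * stride
    cy = (cell_xy[1] + sigmoid(t_y)) * stride
    w = anchor_wh[0] * math.exp(t_w)
    h = anchor_wh[1] * math.exp(t_h)
    return {
        "box": (cx, cy, w, h),
        "objectness": sigmoid(t_obj),
        "class_probs": softmax(raw[5:]),
    }


prediction = decode_cell([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
                         cell_xy=(3, 2), anchor_wh=(10.0, 20.0), stride=8)
print(prediction)
```

Thresholding on objectness and non-maximum suppression would follow this step exactly as in the ANN detector.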

3. Neuron and Coding Models for Spiking-Driven Object Detection

Central to SpikeYOLO's performance are the neuron models and coding strategies that facilitate accurate information transfer and low quantization error under severe temporal compression.

Integrate-and-Fire (IF) and Leaky IF (LIF) Neurons: Standard IF/LIF dynamics comprise membrane update, spike-thresholding, and reset:

$V^l_{\text{mem}}(t) = V^l_{\text{mem}}(t-1) + z^l(t) - s^l(t-1) V_{\mathrm{thr}}$

with $z^l(t)=\sum_i w_i^{l-1} s_i^{l-1}(t)+b^l$, $s^l(t)=H(V^l_{\text{mem}}(t)-V_{\mathrm{thr}})$.
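The update rule above can be traced with a minimal single-neuron sketch. The function name `simulate_if`, the zero initial state, and the convention $H(0)=1$ (a spike fires when the membrane potential equals the threshold) are assumptions made for illustration.

```python
def simulate_if(inputs, v_thr=1.0):
    """Simulate one non-leaky IF neuron with reset by subtraction.

    inputs : per-step input currents z(t), i.e. the weighted sum of
             presynaptic spikes plus bias.
    Returns the list of binary output spikes s(t).
    """
    v_mem = 0.0      # membrane potential before the first step (assumed 0)
    prev_spike = 0   # s(t-1); no spike before the first step
    spikes = []
    for z in inputs:
        # V_mem(t) = V_mem(t-1) + z(t) - s(t-1) * V_thr
        v_mem = v_mem + z - prev_spike * v_thr
        # s(t) = H(V_mem(t) - V_thr), taking H(0) = 1
        prev_spike = 1 if v_mem >= v_thr else 0
        spikes.append(prev_spike)
    return spikes


spikes = simulate_if([0.5] * 8)
print(spikes, sum(spikes) / len(spikes))
```

With a constant input of half the threshold, the neuron fires on every second step, so the firing rate (0.5) encodes the input magnitude; this rate coding is what makes short time windows coarse and motivates the compression schemes discussed next.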

Burst-Firing and Timestep Compression (Qu et al., 2023): To avoid the need for a large number of timesteps $T$, SUHD allows neurons to emit multiple spikes (a "burst") per step, effectively compressing $f_c$ original steps into one, so that the compressed window is $T_c=T/f_c$.
Spike-Time-Dependent Integrated (STDI) Coding (Qu et al., 2023): Rather than a static threshold, STDI coding uses a time-varying threshold $V_{\mathrm{thr}}(t) = \tau(t)v_{\mathrm{thr}}$ with $\tau(t)=T-t+1$, so each spike at time $t$ carries weight $\tau(t)$. This expands the representational capacity per spike, preserving accuracy even at T=1–4.
Integer-Valued LIF with Virtual Steps (Luo et al., 2024): In meta-block SpikeYOLO, the integer-valued neuron output ("I-LIF") reduces rounding error during training. At inference, each integer-valued output is expanded into $D$ virtual binary substeps, restoring spike-driven sparsity.
Signed/Imbalanced Thresholds (Kim et al., 2019): For leaky-ReLU activations, dual thresholds enable positive and negative spike emission, reproducing the leaky-ReLU slope for negative inputs without multipliers.
Activation Quantization and Residual Fix (Qiu et al., 2023): Activation outputs are quantized using a discrete step ladder ($\text{QuantReLU}_T$), aligning the ANN activation range with the SNN firing-rate granularity. Initializing the membrane potential to 0.5 offsets the residual quantization error, further reducing prediction drift.
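Two of the mechanisms above, burst firing with timestep compression and the expansion of integer outputs into virtual substeps, can be sketched in a few lines. This is an illustrative toy under stated assumptions rather than the published implementations: the function names are hypothetical, the reset is applied within the same step, the burst is capped at $f_c$ spikes, and the integer output is assumed to be clipped to $[0, D]$ with its spikes placed in the first substeps.

```python
def if_spike_count(z, steps, v_thr=1.0):
    """Single-spike IF neuron: at most one spike per timestep."""
    v_mem, count = 0.0, 0
    for _ in range(steps):
        v_mem += z
        if v_mem >= v_thr:
            v_mem -= v_thr
            count += 1
    return count


def burst_spike_count(z, steps, f_c, v_thr=1.0):
    """Burst-firing IF neuron run for T_c = steps // f_c compressed steps.

    Each compressed step integrates the input of f_c original steps and
    may emit up to f_c spikes (assumed cap).
    """
    v_mem, count = 0.0, 0
    for _ in range(steps // f_c):
        v_mem += z * f_c
        burst = max(0, min(int(v_mem // v_thr), f_c))
        v_mem -= burst * v_thr
        count += burst
    return count


def expand_integer_output(value, d):
    """Expand an integer activation (clipped to [0, d]) into d binary substeps."""
    value = max(0, min(int(value), d))
    return [1] * value + [0] * (d - value)


# Same constant input, T = 8 original steps versus T_c = 2 compressed steps.
print(if_spike_count(0.375, steps=8))            # spikes over 8 steps
print(burst_spike_count(0.375, steps=8, f_c=4))  # spikes over 2 steps
print(expand_integer_output(3, d=4))             # binary virtual substeps
```

For this constant input, both neurons emit the same total number of spikes, which is the sense in which bursts compress the time window; the integer expansion likewise preserves the spike count while keeping every transmitted value binary.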

4. SNN-Specific Structural and Normalization Strategies

SpikeYOLO's success critically depends on harmonizing the architectural features of ANNs with SNN constraints:

Structural Substitutions: Pooling layers (max or average) are consistently replaced with strided convolutions to attenuate spike loss induced by data downsampling (Qiu et al., 2023). Upsampling is handled by spike-based transposed convolution.
Residual/Shortcut Connections: CSPNet and lightweight CSP-inspired residuals (SU-Block1/2 in SU-YOLO) mitigate spike degradation. Only part of the spike stream traverses the deeper branch, which maintains moderate firing rates in deep layers (Li et al., 2025).
Separated Batch Normalization (SeBN): In SU-YOLO (Li et al., 2025), batch normalization is implemented per timestep and per channel, preserving temporal distribution properties and enabling parameter fusion into weights and biases before spike-only deployment.
Denoising Preprocessing: Underwater SNNs utilize lightweight, integer-only spatial denoising on the first spike map to remove isolated spike errors at negligible cost (Li et al., 2025).
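The parameter fusion mentioned for batch normalization can be illustrated with the standard BN-folding identity, applied once per timestep. The sketch below works on a single scalar weight and bias for readability; the helper names and the idea of keeping one fused (weight, bias) pair per timestep are assumptions about how separated statistics might be folded, not a description of the SU-YOLO code.

```python
import math


def fuse_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta into (w', b')."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def fuse_per_timestep(weight, bias, bn_params):
    """Return one fused (weight, bias) pair per timestep (hypothetical helper).

    bn_params : list of dicts with keys gamma, beta, mean, var, one per step.
    """
    return [fuse_bn(weight, bias, **params) for params in bn_params]


bn_params = [
    {"gamma": 2.0, "beta": 0.5, "mean": 0.1, "var": 0.25},  # timestep 1
    {"gamma": 1.0, "beta": 0.0, "mean": 0.0, "var": 1.0},   # timestep 2
]
for step, (w, b) in enumerate(fuse_per_timestep(0.8, 0.2, bn_params), start=1):
    print(f"t={step}: fused weight={w:.4f}, fused bias={b:.4f}")
```

After folding, inference needs only the fused affine parameters, so no normalization arithmetic remains in the deployed spike-only network.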

5. Output Decoding, Losses, and Training Paradigm

The output decoding in all SpikeYOLO architectures strictly follows YOLO conventions:

Detection Head Decoding: Firing rates (ratio of spikes or accumulated membrane potential) are mapped via deterministic transforms: sigmoid for $t_x, t_y$ and objectness, exponential for $t_w, t_h$, and softmax for class logits (Qu et al., 2023). Postprocessing includes thresholding by objectness and non-maximum suppression (NMS).
Loss Functions: Training uses standard YOLO loss terms—bounding-box regression (IoU, CIoU, or GIoU), binary cross-entropy for objectness, and cross-entropy for classification. In I-LIF-based SpikeYOLO, no additional loss terms are needed for integer quantization (Luo et al., 2024).
Optimization: Backpropagation through time (BPTT) is supported by surrogate gradients, typically piecewise-linear, for the non-differentiable spike function.
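A piecewise-linear surrogate of the kind mentioned above can be written as a triangular function centred on the threshold. The sketch below is a generic example: the `width` parameter, the function names, and the exact triangular shape are assumptions, since the cited works may use different surrogate forms.

```python
def spike(v_mem, v_thr=1.0):
    """Forward pass: Heaviside step on the membrane potential."""
    return 1.0 if v_mem >= v_thr else 0.0


def surrogate_grad(v_mem, v_thr=1.0, width=1.0):
    """Triangular stand-in for d(spike)/d(v_mem).

    Peaks at the threshold, falls linearly to zero at |v_mem - v_thr| = width,
    and integrates to 1 over the membrane potential.
    """
    return max(0.0, 1.0 - abs(v_mem - v_thr) / width) / width


def backward_through_spike(grad_output, v_mem, v_thr=1.0, width=1.0):
    """Chain rule step used in BPTT: replace the true (zero or undefined) derivative."""
    return grad_output * surrogate_grad(v_mem, v_thr, width)


for v in (0.2, 1.0, 1.5, 2.5):
    print(v, spike(v), surrogate_grad(v))
```

The forward pass stays binary, while the backward pass lets gradients flow only through neurons whose membrane potential lies near the threshold.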

6. Performance, Energy-Efficiency, and Application Scope

SpikeYOLO models deliver substantial reductions in inference latency (in terms of timesteps T) and energy usage over both ANN-YOLO and earlier SNN-YOLO models:

Latency and mAP: SUHD (Qu et al., 2023) achieves 75.3% mAP@0.5 on PASCAL VOC and 54.6% on MS COCO in just T=4 timesteps, with performance within 0.1–0.2% mAP of the YOLOv5 ANN baseline. The meta SNN block version (Luo et al., 2024) achieves 66.2% mAP@50 and 48.9% mAP@50:95 on MS COCO at T=1, D=4, outperforming prior SNNs by 15–18 mAP points.
Energy Efficiency: Energy models indicate a ≳200× improvement over comparable ANNs (Qu et al., 2023), and a 5.7× improvement over ANN baselines of identical architecture on neuromorphic datasets (Luo et al., 2024), because spike-driven accumulate (AC) operations replace most multiply-accumulate (MAC) operations.
Generalization and Dataset Suitability: Recent iterations generalize over static (e.g., COCO, PASCAL VOC) and neuromorphic event-driven datasets (e.g., Gen1, underwater scenes (Li et al., 2025)), with specially designed modules for noise removal or temporal normalization as needed.
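The AC-versus-MAC argument can be made concrete with a back-of-the-envelope energy model. The per-operation energies below (4.6 pJ per MAC, 0.9 pJ per AC) are figures commonly assumed for 32-bit arithmetic on 45 nm hardware in the SNN literature; they, the function names, and the example operation count, timestep count, and firing rate are assumptions for illustration and do not reproduce the numbers reported in the cited papers.

```python
E_MAC_PJ = 4.6  # assumed energy per multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # assumed energy per accumulate, in picojoules
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-9 mJ


def ann_energy_mj(flops):
    """ANN estimate: every operation is a MAC."""
    return flops * E_MAC_PJ * PJ_TO_MJ


def snn_energy_mj(flops, timesteps, firing_rate):
    """SNN estimate: only spiking inputs trigger an accumulate.

    Synaptic operations = operations of the equivalent ANN layer
    x timesteps x average firing rate.
    """
    synaptic_ops = flops * timesteps * firing_rate
    return synaptic_ops * E_AC_PJ * PJ_TO_MJ


flops = 1e9  # hypothetical network size
ann = ann_energy_mj(flops)
snn = snn_energy_mj(flops, timesteps=4, firing_rate=0.1)
print(f"ANN: {ann:.2f} mJ, SNN: {snn:.2f} mJ, ratio: {ann / snn:.1f}x")
```

Under this model the saving scales inversely with the product of timesteps and firing rate, which is why low-latency conversion and sparse firing both matter for energy. The sketch ignores layers that still need MACs, such as a first layer fed with real-valued pixels.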

7. Prospects, Limitations, and Extensions

SpikeYOLO's design trajectory demonstrates that carefully tailored SNN architectures, leveraging integer-state neurons, architectural simplification, customized normalization, and spike-efficient coding, can closely match ANN detectors on vision tasks while retaining the hardware and energy advantages of spike-based computing.

A plausible implication is that future research may increasingly integrate hardware co-design, task-optimized surrogate gradient schemes, and richer event-driven sensor data to close the tiny remaining accuracy gap and further amortize energy and compute costs in robotic and mobile vision deployments. Extensions that support native neuromorphic input (events rather than frame duplication) and those that incorporate task-driven neuron model adaptations or dynamic resource scaling are emerging as key directions.

References: