
DSEC Spike: Neuromorphic Driving Benchmark

Updated 26 November 2025
  • DSEC Spike is a neuromorphic benchmark for object detection on simulated spike-camera data with microsecond temporal resolution and ultra-high dynamic range.
  • The dataset comprises 45 driving sequences and 132,900 annotated bounding boxes, enabling precise evaluation under varied illumination and motion conditions.
  • Baseline evaluations show that the EASD method outperforms both RGB+event and spike-only techniques by up to 11.2 mAP points with latency below 10 ms per frame.

DSEC Spike is a large-scale, driving-oriented benchmark for object detection using simulated spike-camera data, offering microsecond temporal resolution and ultra-high dynamic range. It represents a critical step in establishing neuromorphic perception methodologies tailored for autonomous driving scenarios where conventional frame-based and event-based sensors face intrinsic limitations under high-speed motion and challenging illumination conditions.

1. Spike Camera Principles and Data Simulation

Spike cameras operate on a per-pixel asynchronous integrate-and-fire scheme, in which each photosensor accumulates incident light as photocurrent and emits a spike whenever the integrated membrane potential surpasses a global threshold. The model governing the membrane potential for pixel $n = (x_n, y_n)$ is

$$\frac{\mathrm{d}v_n}{\mathrm{d}t} = \lambda\,I(x_n, y_n, t), \qquad v_n(t_r^+) = 0,$$

where $I$ is the scene irradiance and $\lambda$ is the photoelectric gain. Upon reaching the threshold $\theta$, a spike is emitted and the potential is reset:

$$v_n(t) \geq \theta \;\Longrightarrow\; \begin{cases} \text{emit a spike at } t, \\ v_n(t^+) \leftarrow v_n(t) - \theta. \end{cases}$$

For simulation over fixed intervals $\Delta t$,

$$\int_{t}^{t+\Delta t} \lambda\,I(x_n, y_n, \tau)\,\mathrm{d}\tau \geq \theta \;\implies\; \mathrm{spike} = 1.$$

Key parameters used in SpikingSim include $\Delta t = 25\ \mu\mathrm{s}$, $\theta = 1.0$, $\lambda \approx 0.1$, a dynamic range of $\approx 120$ dB, and noise with Poisson and Gaussian components. Applying this simulation to the original DSEC-Detection dataset produces time-aligned streams of 0/1 spikes per pixel, maintaining the integrity of the driving scenarios and annotation labels (Liu et al., 19 Nov 2025).
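The interval formulation above maps directly onto a simple simulation loop. The following is a minimal NumPy sketch of that integrate-and-fire rule; the function name, array shapes, and noise injection are illustrative choices under the stated parameters, not the SpikingSim implementation itself.

```python
import numpy as np

def simulate_spikes(irradiance, dt=25e-6, theta=1.0, lam=0.1, noise_std=1e-3, rng=None):
    """Interval-based integrate-and-fire spike simulation (illustrative sketch).

    irradiance: array of shape [T, H, W] giving I(x, y, t) sampled once per
    interval dt. Returns a uint8 spike volume of shape [T, H, W] in {0, 1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W = irradiance.shape
    v = np.zeros((H, W), dtype=np.float64)        # membrane potential per pixel
    spikes = np.zeros((T, H, W), dtype=np.uint8)

    for t in range(T):
        # Accumulate photocurrent over one interval: v += lambda * I * dt.
        v += lam * irradiance[t] * dt
        # Illustrative additive Gaussian noise; Poisson shot noise could also be
        # applied to the irradiance itself (SpikingSim's exact noise model may differ).
        v += rng.normal(0.0, noise_std, size=(H, W))
        # Fire wherever the threshold is crossed, then subtract theta
        # (the reset rule v(t+) <- v(t) - theta from the text).
        fired = v >= theta
        spikes[t][fired] = 1
        v[fired] -= theta
    return spikes
```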

2. Dataset Structure and Annotation Protocol

DSEC-Spike contains 45 driving sequences, totalling approximately 10 hours and 132,900 annotated bounding boxes (cars, pedestrians, large vehicles). Spatial resolution is $640 \times 480$ pixels per camera, with average spike rates of $\sim 2 \times 10^8$ spikes/s per camera ($\sim 170$ Hz per pixel). Scenarios span urban (day/night, stop-and-go), suburban, and highway environments, with annotations in COCO JSON format:

  • image_id (frame timestamp)
  • file_name (reference)
  • annotations: bbox $[x, y, \mathrm{width}, \mathrm{height}]$, category_id $\in \{1, 2, 3\}$
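For concreteness, a record combining these fields might look like the following sketch; the file name, id values, box coordinates, and the category-id-to-name mapping are illustrative assumptions rather than part of the published format.

```python
import json

# Illustrative COCO-style structure; all concrete values here are made up.
example = {
    "images": [
        {"id": 1699999123456, "file_name": "sequence_00/000123"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1699999123456,           # frame timestamp
            "bbox": [312.0, 188.5, 64.0, 48.0],  # [x, y, width, height]
            "category_id": 1,                    # one of {1, 2, 3}
        },
    ],
    "categories": [  # assumed id-to-name mapping for the three classes
        {"id": 1, "name": "car"},
        {"id": 2, "name": "pedestrian"},
        {"id": 3, "name": "large_vehicle"},
    ],
}
print(json.dumps(example, indent=2))
```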

Splits follow DSEC-Detection convention:

  • Train: 28 seq (~60%, 80k boxes)
  • Val: 9 seq (~20%, 26k boxes)
  • Test: 8 seq (~20%, 26k boxes)

Boxes are provided for every keyframe, ensuring precise correlation between spike streams and object locations. This design facilitates benchmarking under high temporal fidelity, diverse traffic scenarios, and varied illumination regimes (Liu et al., 19 Nov 2025).

3. Evaluation Protocols and Baseline Results

Evaluation on DSEC-Spike employs COCO-style mAP@0.5 (mean Average Precision at IoU $\geq$ 0.5) and end-to-end latency metrics. The benchmark includes performance for event-only, RGB+event, and spike-based detection pipelines. Table 1 summarizes baseline results (mAP@0.5):

Modality     Method                     mAP@0.5
Event-only   CAFR (ECCV’24)             12.0%
RGB+Event    FlexEvent (arXiv’24)       47.4%
             RENet (ICRA’23)            29.4%
             DRFuser (EAAI’23)          28.1%
Spike-only   VTTW+YOLO (AAAI’22)        40.0%
             VTII+YOLO (AAAI’22)        41.7%
             VTTI+RT-DETR (CVPR’24)     34.6%
             EASD (Ours)                52.9%

EASD achieves the highest detection accuracy, outperforming the best RGB+event fusion baseline by +5.5 mAP points and the best spike-only baseline by +11.2 mAP points. Scenario-specific mAP for EASD is 58.3% (urban day), 49.7% (urban night), 53.1% (suburb), and 52.6% (highway). Latency is reported at <10 ms per frame on an NVIDIA A100 (Liu et al., 19 Nov 2025).
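Because the annotations follow the COCO convention, a standard pycocotools evaluation restricted to a single IoU threshold of 0.5 can compute this metric. The sketch below is one way to do so; the file paths are placeholders, and restricting iouThrs like this is an assumption about how the benchmark's mAP@0.5 is obtained rather than the authors' exact evaluation script.

```python
import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and detector outputs in COCO format.
coco_gt = COCO("annotations/test.json")
coco_dt = coco_gt.loadRes("detections/test_results.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
# Evaluate only at IoU = 0.5 to obtain mAP@0.5 rather than COCO's mAP@[.5:.95].
evaluator.params.iouThrs = np.array([0.5])
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# With a single IoU threshold, the first summarized stat corresponds to mAP@0.5.
print("mAP@0.5:", evaluator.stats[0])
```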

4. Integration and Usage Guidelines

Data storage uses HDF5 for the spike stream (a $[T/\Delta t,\ H,\ W]$ array of uint8) and COCO-style JSON for annotations. Recommended implementation practices include the following; a minimal loading sketch follows the list:

  • Pre-caching spike volumes in 25 ms windows for I/O efficiency.
  • Employing the provided PyTorch Dataset class and optional time-range sampler for dynamic batching.
  • Fine-tuning detection backbones (YOLO, DETR) by integrating dual-branch EASD modules to exploit distinctive spike dynamics.
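The sketch below illustrates a windowed HDF5 spike loader in the spirit of these recommendations. It is not the benchmark's provided Dataset class: the dataset key ("spikes"), window length, and directory layout are assumptions.

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class SpikeWindowDataset(Dataset):
    """Serve fixed-length spike windows from an HDF5 spike stream.

    Assumes each sequence file stores a [T, H, W] uint8 dataset under the key
    "spikes", where each step is one 25-us interval; a 25-ms window therefore
    spans 1000 steps. Both the key and the window length are illustrative.
    """

    def __init__(self, h5_path, window_steps=1000):
        self.h5_path = h5_path
        self.window_steps = window_steps
        with h5py.File(h5_path, "r") as f:
            self.num_steps = f["spikes"].shape[0]
        self.num_windows = self.num_steps // window_steps
        self._file = None  # opened lazily, once per DataLoader worker

    def __len__(self):
        return self.num_windows

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        start = idx * self.window_steps
        window = self._file["spikes"][start:start + self.window_steps]  # [T, H, W]
        # Return a float tensor; downstream detectors may bin or re-encode it.
        return torch.from_numpy(np.asarray(window, dtype=np.float32))

# Usage (path is a placeholder):
# ds = SpikeWindowDataset("dsec_spike/train/sequence_00.h5")
# loader = torch.utils.data.DataLoader(ds, batch_size=4, num_workers=4)
```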

Licensing is CC BY-NC-SA 4.0 (data) and Apache 2.0 (code), accessible at https://github.com/PKU-NeuromorphicLab/DSEC-Spike (Liu et al., 19 Nov 2025).

5. Comparisons to Event Cameras and Neuromorphic Architectures

Spike cameras differ from event cameras in that their asynchronous integrate-and-fire operation retains static scene information and provides ultra-high dynamic range, while event cameras only react to pixel-wise intensity changes ($\Delta I$). DSEC-Spike’s design accounts for the limitations of event sensors by retaining background semantics and enhancing robustness to both motion and illumination extremes.
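To make the contrast concrete, the following is a minimal sketch of the textbook event-camera generation rule, where an event fires only when the log-intensity change since the last event exceeds a contrast threshold; the threshold value and discretisation are illustrative and do not describe any specific sensor. A static pixel produces no output here, unlike the integrate-and-fire model in Section 1.

```python
import numpy as np

def generate_events(intensity, contrast_threshold=0.2, eps=1e-6):
    """Textbook event-camera model: per-pixel events on log-intensity change.

    intensity: [T, H, W] array of frame intensities.
    Returns a [T, H, W] int8 array with +1 / -1 events and 0 elsewhere.
    """
    log_i = np.log(intensity + eps)
    ref = log_i[0].copy()                      # last log-intensity that triggered an event
    events = np.zeros_like(log_i, dtype=np.int8)
    for t in range(1, log_i.shape[0]):
        diff = log_i[t] - ref
        pos = diff >= contrast_threshold
        neg = diff <= -contrast_threshold
        events[t][pos] = 1
        events[t][neg] = -1
        ref[pos | neg] = log_i[t][pos | neg]   # update reference where an event fired
    return events
```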

Recent neuromorphic architectures for segmentation and tracking on DSEC leverage the unique properties of spike data:

  • SLTNet applies spike-driven convolution and transformer blocks with binary mask operations, yielding superior SNN-based semantic segmentation performance (47.91% mIoU, 114 FPS) on DSEC-Semantic (Zhu et al., 17 Dec 2024).
  • SpikeMOT utilizes Spike Response Model neurons and a Siamese SNN backbone for multi-object tracking on DSEC-MOT, achieving HOTA 52.5, DetA 49.5, AssA 55.7 (Wang et al., 2023). A plausible implication is that future models may further unify the detection, segmentation, and tracking paradigms using spike-native representations for driving-oriented tasks.

6. Limitations and Forward Directions

DSEC-Spike is simulated, not sensor-derived, and may lack some real-world noise phenomena (e.g., sensor cross-talk). The dataset currently includes only three categories (cars, pedestrians, large vehicles), with planned extensions for pedestrian sub-categories and traffic signs. Weather and occlusion simulation are future goals, along with richer multi-modal fusion (e.g., RADAR, LiDAR).

Spike-only detection already demonstrates strong baseline performance, but integration with additional sensing modalities could further enhance robustness. DSEC-Spike's high temporal fidelity and dynamic range make it well suited to studies of ultra-low-latency perception, potentially informing practical autonomous driving systems under extreme conditions (Liu et al., 19 Nov 2025).

7. Significance and Impact

DSEC Spike is the inaugural spike-camera benchmark for driving perception, filling a critical gap for ultra-fast, high-dynamic-range object detection research. It offers entirely new capabilities for evaluating SNN, transformer, and hybrid neuromorphic architectures—enabling quantitative comparisons of spike-based, event-based, and image-based methods under identical driving scenes. The open-source format and comprehensive annotation protocol invite further innovation and facilitate reproducible research on neuromorphic vision for real-world autonomous driving.
