DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR

Published 4 Apr 2026 in cs.CV | (2604.03685v1)

Abstract: In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces DSERT-RoLL, a comprehensive dataset integrating stereo event, RGB, thermal cameras, 4D radar, and dual LiDAR to benchmark robust 3D detection in varied environmental conditions.
It details a confidence-gated fusion architecture that adaptively merges multimodal data, achieving AP over 90 in clear weather and sustaining performance in adverse conditions.
Results underscore sensor redundancy and calibration robustness, paving the way for reliable autonomous driving in extreme weather and lighting scenarios.

Motivation and Background

Autonomous driving demands highly robust perception systems capable of consistent operation across diverse and adverse environments. While extensive datasets exist for RGB and LiDAR modalities, current research confronts limitations in extreme conditions such as heavy rain, snow, fog, low illumination, and high dynamic range. These environments frequently degrade the reliability of conventional sensors, resulting in incomplete coverage or corrupted data, severely impacting detection performance.

Recent advances in emerging sensors—including event cameras, thermal imagers, and 4D Radar—offer unique physical signal properties that are promising for overcoming these environmental failures. However, systematic, direct comparison and fusion of these modalities with established sensors have remained largely unexplored due to the lack of a comprehensive, unified dataset and fair benchmarks.

DSERT-RoLL Dataset Design

The DSERT-RoLL dataset (“Driving with Stereo Event-RGB-Thermal cameras, 4D Radar, and Dual LiDAR”) is constructed to fill this critical gap. It incorporates:

Stereo-based Event, RGB, and Thermal Cameras: All three modalities with stereo pairs, providing robust complementary visual information.
4D mmWave Radar: Provides range, azimuth, elevation, and Doppler measurements, remaining robust through strong precipitation and severe visual degradation.
Dual LiDAR Systems: Short- and long-range LiDARs combined for high-density geometric coverage and distant object detection.
Comprehensive Environmental Coverage: Data is collected in a wide range of weather (clear, fog, light/heavy rain and snow) and lighting (normal, low-light, overexposed, HDR) scenarios for urban, suburban, and highway driving scenes.
Synchronized, High-Fidelity Annotations: High-quality 3D and 2D bounding box annotations, per-frame object tracking IDs, and ego vehicle odometry.

Through its structure, DSERT-RoLL enables direct, fair modality comparison and multi-modal fusion experiments, facilitating the analysis of individual sensor strengths, weaknesses, and complementary behaviors.

Figure 1: Complementary strengths of sensor modalities—LiDAR excels in clear conditions, 4D radar dominates in adverse weather, RGB in normal daylight, event cameras in dynamic motion or HDR, and thermal images in low-light or night.

Figure 2: Training and test splits are balanced in terms of weather, lighting, and object class distribution, ensuring benchmarking reliability.

DSERT-RoLL enables principled fusion research, and the paper proposes a strong multi-modal fusion architecture for robust 3D object detection:

Voxel-Based BEV Backbone: LiDAR and 4D Radar point clouds are voxelized and processed with a shared 3D backbone. BEV (Bird’s Eye View) features are generated and fused, exploiting geometric complementarity.
Camera-Informed Voxel Features: RGB, event, and thermal images are processed by modality-specific 2D backbones, then projected into the 3D grid using calibrated transformations via a voxel-centric, deformable sampling scheme.
Confidence-Gated Fusion: Voxel features from all modalities are adaptively reweighted based on data-driven confidence estimations, ensuring down-weighting of unreliable or noisy modality data under adverse conditions.
Two-Stage Bounding Box Refinement: Initial proposals from range sensors are refined using fused camera-augmented features, increasing robustness to modality failures.
Figure 3: Overview of the fusion pipeline: LiDAR/4D Radar features are combined for initial proposals; image features (RGB, thermal, event) are projected into 3D space and fused; camera and range data are adaptively combined for final prediction.

Sensor Calibration and Annotation

Precise cross-modal calibration is critical. The methodology employs:

Joint Intrinsic/Extrinsic Calibration: Calibration pipelines for RGB/event/thermal/LiDAR/Radar employ direct or homography-based mappings; custom targets are used for modalities with limited visible landmarks.
Multi-Modal Annotation Interfaces: Dedicated tools assist annotators in leveraging geometric and appearance data simultaneously, increasing annotation accuracy in challenging weather/lighting (see Figure 4).
Privacy Considerations: All privacy-sensitive elements—faces, license plates—are actively blurred.
Figure 5: Precise cross-modal calibration enables accurate spatial alignment of LiDAR, Radar, and all camera modalities for consistent annotation and fusion.

Benchmarking and Results

3D Object Detection

Detection performance is evaluated on a comprehensive protocol, with results demonstrating:

LiDAR Vulnerability: LiDAR achieves high AP in clear conditions (82.90+) but degrades substantially in heavy fog (down to 54.14 AP).
4D Radar Resilience: 4D Radar shows consistent geometric reliability in adverse weather, compensating for LiDAR failures.
Cameras: RGB dramatically degrades under night or HDR; event cameras excel in rapid motion/HDR; thermal excels in low-light.
Multi-Modal Synergy: Progressive fusion (R+L, 4R+L, R+4R+L, R+E+T+4R+L) yields monotonically increasing AP, with full fusion (all five: R+E+T+4R+L) reaching the highest robustness in all conditions (AP exceeding 90 in clear and >70 in adverse weather), clearly exceeding all strong baselines.

Bold result: Integrating event and thermal modalities alongside conventional sensors provides non-trivial boosts in robustness and accuracy—these gains are especially marked in low-light, fog, and heavy precipitation.

Figure 6: Qualitative comparison in BEV. The full fusion model outperforms Radar- and LiDAR-only models, maintaining object coverage and size precision in difficult settings.

2D Detection

The dataset supports multi-modal 2D detection with similarly strong findings—multi-modal architectures (e.g., GM-DETR on R+E+T) substantially outperform single-modality models under adverse lighting and weather.

Practical and Theoretical Implications

Sensor Failure Redundancy: Direct evidence that complementary sensors cover one another’s failure modes, enabling reliable safety-critical perception in real-world conditions.
Fusion Architecture Generality: The proposed confidence-based fusion model is extendable to any sensor set, supporting rapid prototyping for new platforms.
Calibration Robustness: Evaluation reveals that detection performance is robust to moderate calibration errors (e.g., 5 cm/5° perturbations cause <2 AP drop), pointing to strong practical deployability.
Future Extensions: DSERT-RoLL provides a foundation for domain adaptation, robust domain generalization, sensor selection, and cost-performance analysis for automotive ADAS and robotic platforms. It enables new research directions in self-supervised, adaptive, and redundancy-aware perception architectures.

Conclusion

DSERT-RoLL constitutes a comprehensive, rigorously benchmarked multi-modal perception dataset designed for the direct evaluation and development of robust 3D and 2D object detection systems under diverse and challenging driving conditions. Its design enables detailed ablation and fusion research, and the reported results demonstrate that the adaptive fusion of event, thermal, radar, LiDAR, and RGB systems yields superior resilience compared to prior approaches. This work will accelerate advances in robust all-weather, all-lighting scene understanding for real-world autonomous and assisted driving (2604.03685).

Markdown Report Issue