DeepSense6G: Multi-Modal 6G Benchmark
- DeepSense6G is a comprehensive multi-modal dataset that synchronizes mmWave, LiDAR, cameras, radar, and GPS data to support 6G and ISAC research.
- It offers over 1.08 million curated snapshots from diverse real-world scenarios, enhanced by rigorous calibration and structured preprocessing.
- The dataset benchmarks methods in beam prediction, blockage inference, and localization, fueling advances in autonomous driving and V2X communications.
DeepSense6G is a large-scale, real-world multi-modal sensing and communications dataset created to accelerate research at the intersection of 6G wireless, integrated sensing and communication (ISAC), and domain-adapted machine learning. The dataset offers time-synchronized measurements from multiple sensor modalities—mmWave (millimeter-wave) and sub-6 GHz radio frequency (RF), LiDAR, cameras, radar, and GPS—collected under realistic conditions and covering diverse scenarios such as vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), drone, and fixed wireless environments. It is a foundational resource for benchmarking and developing deep learning models for tasks including beam prediction, blockage inference, environmental perception, and precise localization in next-generation wireless systems (Alkhateeb et al., 2022).
1. Dataset Objectives and Composition
DeepSense6G was designed to provide a comprehensive, high-fidelity dataset capturing the statistical interdependencies between wireless channels and environmental context. The dataset enables research in ISAC and sensing-aided communication by providing simultaneous, co-located measurements from radio, LiDAR, vision, and navigation sensors. Its key objectives are:
- Supporting robust deep learning modeling across a spectrum of wireless sensing and positioning tasks.
- Enabling the investigation of cross-modal fusion schemes that leverage channel statistics, spatial geometry, and semantic object context.
- Facilitating the evaluation and cross-comparison of models under conditions that include line-of-sight (LOS), non-line-of-sight (NLOS), diverse traffic densities, lighting, and weather (Alkhateeb et al., 2022).
The full dataset comprises over 1.08 million synchronized “snapshots” from more than 40 scenarios, collected at fifteen distinct locations in the USA and Spain. Each scenario encompasses multiple hours of synchronous capture, with raw data volumes totaling over 5 TB.
2. Sensor Modalities and Technical Specifications
Every DeepSense6G deployment employs a multi-sensor suite rigidly mounted with hardware-calibrated extrinsics. Common modalities include:
- mmWave Communications: 60 GHz, 16-element phased array, 64-beam codebook sweeping a 90° field of view at 10 Hz; channel matrix sampled per sweep (see the codebook sketch after this list).
- Sub-6 GHz MIMO (in select testbeds): Up to 32×32 antenna arrays, sampled at 1–5 Hz.
- LiDAR: 3D LiDAR (e.g., Ouster OS1-32, 32×1024 beams, 120 m range, 20 Hz) and 2D LiDAR (e.g., Hokuyo UST-10LX, 270°, 40 Hz).
- Visual: Stereo or monocular cameras (e.g., StereoLabs ZED2, 1920×1080, 30 fps downsampled to 10 Hz).
- Radar: FMCW units (76–81 GHz, 10 Hz).
- GPS-RTK: 10 Hz, centimeter-level accuracy.
- IMU: (where present) 6-axis, 100 Hz.
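For orientation, the sketch below constructs a uniform 64-beam steering codebook for a 16-element linear array sweeping a 90° field of view, matching the figures quoted above. Half-wavelength element spacing and a uniform angular grid are illustrative assumptions, not specifications drawn from the DeepSense6G documentation.

```python
import numpy as np

def steering_codebook(num_elements=16, num_beams=64, fov_deg=90.0, spacing=0.5):
    """Build a uniform steering-vector codebook for a uniform linear array.

    Assumes half-wavelength spacing (spacing=0.5) and beams spread uniformly
    over the field of view; both are illustrative choices.
    """
    angles = np.linspace(-fov_deg / 2, fov_deg / 2, num_beams)   # beam directions (deg)
    n = np.arange(num_elements)                                   # antenna element indices
    phases = -2j * np.pi * spacing * np.outer(np.sin(np.deg2rad(angles)), n)
    codebook = np.exp(phases) / np.sqrt(num_elements)             # (num_beams, num_elements)
    return angles, codebook

angles, F = steering_codebook()
print(F.shape)  # (64, 16): one unit-norm beamforming vector per row
```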
Sensor data is aligned via UTC timestamps with total timebase drift below 5 ms. Spatial calibration uses standard hand–eye transforms of the form $\mathbf{x}_{\mathrm{base}} = \mathbf{R}\,\mathbf{x}_{\mathrm{sensor}} + \mathbf{t}$, where $\mathbf{x}_{\mathrm{sensor}}$ is a point in sensor coordinates, $\mathbf{x}_{\mathrm{base}}$ is the same point in the scenario base frame, and $\mathbf{R} \in SO(3)$, $\mathbf{t} \in \mathbb{R}^{3}$ are the calibrated extrinsic rotation and translation.
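A minimal sketch of applying such an extrinsic calibration to a point cloud, assuming the rotation and translation are available as a 3×3 matrix and a 3-vector (variable names are illustrative):

```python
import numpy as np

def to_base_frame(points_sensor: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map an (N, 3) point cloud from sensor coordinates to the scenario base frame.

    R is the 3x3 extrinsic rotation and t the translation: x_base = R x_sensor + t.
    """
    return points_sensor @ R.T + t

# Example with identity calibration (placeholder values)
points = np.random.rand(1024, 3)
R, t = np.eye(3), np.zeros(3)
points_base = to_base_frame(points, R, t)
```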
3. Scenario Definitions and Data Collection Workflow
DeepSense6G employs modular testbeds—combinations of sensor units configured for specific deployment scenarios. Each scenario is defined by:
- Objective (beam prediction, blockage, object detection, etc.)
- Unit placement and FoV planning
- Real-world context (urban traffic, suburban roads, indoor corridors, RIS backscatter, etc.) (Alkhateeb et al., 2022)
Data collection involves continuous, synchronous logging across all sensors, with post-hoc alignment and verification using custom GUIs. Per-sample filtering ensures that only groups in which all sensors have valid readings (e.g., within FoV, GPS lock) are admitted to the dataset. Metadata files (JSON, CSV) provide a searchable index of scenario context, ground-truth positions, and sample quality tags.
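The published metadata schema varies by scenario; the following sketch only illustrates the kind of per-sample validity filtering described above, with hypothetical field names (`gps_lock`, `in_fov`, and the required sensor keys):

```python
import json

REQUIRED_KEYS = ("image", "lidar", "mmwave", "gps")  # illustrative sensor fields

def load_valid_groups(metadata_path: str) -> list[dict]:
    """Keep only synchronized sample groups where every sensor has a valid reading.

    Field names (gps_lock, in_fov, REQUIRED_KEYS) are illustrative placeholders,
    not the published DeepSense6G schema.
    """
    with open(metadata_path) as f:
        groups = json.load(f)

    return [
        g for g in groups
        if all(k in g and g[k] for k in REQUIRED_KEYS)
        and g.get("gps_lock", True)
        and g.get("in_fov", True)
    ]
```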
A summary of sensor usage and data curation in one representative scenario ("Scenario 36", used in ENWAR) is as follows (Nazar et al., 8 Oct 2024):

| Modality | Description |
|---|---|
| GPS | WGS84 lat/long; 4 receivers @ Unit 1; 1 @ Unit 2 |
| LiDAR | 360° point cloud; SFA3D → object positions |
| Camera | Front + rear RGB images; InstructBLIP → captions |

| Curation statistic | Value |
|---|---|
| Total scenes | 180 (manually curated) |
| Test scenes | 30 |
| Train/validation scenes | 150 (shared; no explicit train/validation subdivision) |
| Modality combinations evaluated | GPS; LiDAR; Camera; GPS+LiDAR; GPS+Camera; LiDAR+Camera; GPS+LiDAR+Camera |
4. Processing, Annotation, and Data Splits
Raw sensor data is processed via unified pipelines:
- Timestamp resampling to a common 10 Hz grid, unless otherwise specified (see the resampling sketch after this list).
- Noise and outlier filtering (e.g., removing LiDAR sweeps with fewer than 1,000 points).
- Structured preprocessing: Standard detectors transform low-level sensor data into semantic tokens (e.g., SFA3D on LiDAR; InstructBLIP on camera images).
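A minimal sketch of the resampling step referenced in the list above: each sensor stream is aligned to a common 10 Hz grid by nearest-neighbour matching. The tolerance value and the assumption of sorted timestamps are illustrative choices, not documented pipeline parameters.

```python
import numpy as np

def resample_to_grid(timestamps: np.ndarray, rate_hz: float = 10.0, tol_s: float = 0.05):
    """Return, for each tick of a uniform grid, the index of the nearest sample.

    Assumes `timestamps` is sorted ascending (seconds). Ticks with no sample
    within `tol_s` seconds are marked -1 and should be dropped.
    """
    grid = np.arange(timestamps.min(), timestamps.max(), 1.0 / rate_hz)
    idx = np.clip(np.searchsorted(timestamps, grid), 1, len(timestamps) - 1)
    # pick the closer of the two neighbouring samples
    left_closer = (grid - timestamps[idx - 1]) < (timestamps[idx] - grid)
    nearest = np.where(left_closer, idx - 1, idx)
    nearest[np.abs(timestamps[nearest] - grid) > tol_s] = -1
    return grid, nearest
```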
In scenario-specific applications, such as the ENWAR framework (Nazar et al., 8 Oct 2024), manual inspection and correction are performed to build ground-truth textual templates, identifying missing object types and annotating blockage events.
Dataset splits generally follow a 70/15/15 (train/validation/test) allocation per scenario (Alkhateeb et al., 2022). In ENWAR, "Scenario 36" employs 180 curated snapshots, with 30 test scenes (~17%) and 150 for training/knowledge-base construction. No explicit train/validation subdivision is applied, and each scene is a single synchronous capture without temporal sequence data (Nazar et al., 8 Oct 2024).
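Under these conventions, a per-scenario 70/15/15 split can be sketched as below; contiguous chronological blocks are used (an assumption consistent with the guidance against splitting time-correlated groups), and the helper name is hypothetical.

```python
def chronological_split(sample_ids: list, train: float = 0.70, val: float = 0.15):
    """Split samples 70/15/15 into contiguous chronological blocks.

    Contiguous blocks (rather than random shuffling) keep time-correlated
    samples within a single split, in line with the dataset guidelines.
    """
    n = len(sample_ids)
    n_train, n_val = int(n * train), int(n * val)
    return (sample_ids[:n_train],
            sample_ids[n_train:n_train + n_val],
            sample_ids[n_train + n_val:])
```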
5. Applications and Benchmarking
DeepSense6G enables a wide range of ISAC and vision-wireless ML research:
- Beam Prediction: Integrates visual and positional context to select the optimal mmWave beam, i.e., learning a predictor whose output $\hat{b} = \arg\max_{b \in \{1,\dots,B\}} f_\theta(b \mid \text{sensor inputs})$ matches $b^{\star}$, the true beam index in the $B$-beam codebook, as often as possible (top-1/top-$k$ accuracy) (Alkhateeb et al., 2022). A minimal evaluation sketch follows this list.
- Blockage Inference: Uses past radar and RF measurements to anticipate channel blockages as a binary sequence classification.
- Position Estimation: Multimodal learning frameworks fuse mmWave, LiDAR, and GPS for precise user equipment (UE) location recovery.
- Object Detection/Scene Understanding: Standard methods operate on LiDAR and RGB modalities to detect vehicles, pedestrians, cyclists, etc.
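As noted in the beam-prediction item above, evaluation typically reduces to top-k accuracy over the beam codebook. A minimal sketch, assuming model logits of shape (N, 64) and ground-truth beam indices from any data loader:

```python
import numpy as np

def top_k_beam_accuracy(logits: np.ndarray, true_beams: np.ndarray, k: int = 3) -> float:
    """Fraction of samples whose true beam index is among the k highest-scoring beams.

    logits: (N, B) scores over a B-beam codebook; true_beams: (N,) ground-truth indices.
    """
    top_k = np.argsort(-logits, axis=1)[:, :k]
    hits = (top_k == true_beams[:, None]).any(axis=1)
    return float(hits.mean())

# Example: random scores over a 64-beam codebook (placeholder data)
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 64))
true_beams = rng.integers(0, 64, size=100)
print(top_k_beam_accuracy(logits, true_beams, k=3))
```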
In the ENWAR study, seven modality fusion schemes, spanning all GPS, LiDAR, and camera permutations, are input to a retrieval-augmented multi-modal LLM. Each modality is transformed into structured text; chunks are embedded via gte-large-en-v1.5 and indexed with FAISS. The system retrieves semantically relevant context for inference, demonstrating richer spatial and scene understanding than non-domain-adapted LLMs. Key ENWAR performance metrics reach up to 70% relevancy, 55% context recall, 80% correctness, and 86% faithfulness (Nazar et al., 8 Oct 2024).
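The retrieval step can be sketched as follows, assuming the sentence-transformers wrapper for gte-large-en-v1.5 and a flat FAISS inner-product index; the textualized chunks are invented placeholders, not actual ENWAR templates, and chunking, prompting, and the LLM call itself are omitted:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Textualized sensor context (illustrative strings, not actual ENWAR templates)
chunks = [
    "Unit 1 GPS: lat 33.4242, lon -111.9281; vehicle heading north-east.",
    "LiDAR: two vehicles detected, nearest at 12 m ahead-left.",
    "Front camera: truck partially occluding the roadside base station.",
]

# gte-large-en-v1.5 requires trust_remote_code with the sentence-transformers wrapper
model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
embeddings = model.encode(chunks, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)

query = model.encode(["Is the link to the base station blocked?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, k=2)
retrieved = [chunks[i] for i in ids[0]]          # context passed to the multi-modal LLM
```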
6. Dataset Structure, Access, and Licensing
DeepSense6G is organized as scenario-based directories, each containing:
- Metadata (scenario context, sensor file paths)
- Synchronized data “groups” (image, LiDAR point cloud, radar, mmWave vector, GPS)
- Python utilities for data loading, projection, and visualization (a minimal loading sketch follows this list)
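A minimal loading sketch consistent with this layout; the directory name, CSV name, and column names are hypothetical placeholders, and the official loading utilities should be preferred where available:

```python
from pathlib import Path

import numpy as np
import pandas as pd

scenario_dir = Path("scenario36")                    # hypothetical scenario folder
meta = pd.read_csv(scenario_dir / "scenario36.csv")  # one row per synchronized group

for _, row in meta.iterrows():
    # Column names below are placeholders; check the scenario's own metadata header.
    image_path = scenario_dir / row["unit1_rgb"]
    lidar = np.load(scenario_dir / row["unit1_lidar"])       # (N, 3) or (N, 4) point cloud
    power_vec = np.loadtxt(scenario_dir / row["unit1_pwr"])  # 64-entry mmWave beam power vector
    gps = (row["unit2_lat"], row["unit2_lon"])
    # ... feed the synchronized group into a fusion model
    break
```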
Recommended experimental guidelines are provided: combine scenarios for robust training, avoid artificially splitting time-correlated data groups, and conduct cross-site zero-shot evaluations to assess domain generalization (Alkhateeb et al., 2022).
Licensing is academic (CC BY-NC-SA-4.0-inspired); commercial use requires written permission. Proper citation of foundational works is mandated in downstream publications.
7. Significance, Limitations, and Research Impact
DeepSense6G is the first dataset of its scale to unify wireless, sensing, and environmental data for the development of robust, generalizable deep learning models in 6G and beyond. It supports reproducible benchmarking for ISAC, enables evaluation of environment-aware LLMs such as ENWAR on real-world, multi-modal captures, and provides a rich substrate for fusion strategies critical to autonomous driving, V2X networks, and RIS-enabled deployments.
However, certain scenario-specific releases (e.g., Scenario 36 in ENWAR) lack full disclosure of granular hardware details (sampling rates, camera parameters); such information must be sourced from the primary dataset documentation. Each manually curated scene in ENWAR is a single timestamp rather than a continuous sequence, suggesting some limitations for temporal modeling. A plausible implication is that while the dataset enables detailed spatial and semantic perception research, sequence-learning or forecasting tasks may require augmentation with full multi-frame or trajectory data from the core DeepSense6G release (Nazar et al., 8 Oct 2024, Alkhateeb et al., 2022).