Virtual KITTI 2 Dataset
- Virtual KITTI 2 is a synthetic, photo-realistic dataset replicating KITTI sequences with comprehensive ground-truth for various computer vision tasks.
- It employs advanced rendering in Unity 2018 HDRP to simulate realistic lighting, weather, and camera scenarios that mirror real-world conditions.
- The dataset provides multiple modalities—including RGB, depth, segmentation, and optical/scene flow—to facilitate benchmarking for tracking, stereo, depth, and segmentation algorithms.
Virtual KITTI 2 (VKITTI2) is a photo-realistic, synthetic dataset designed as a proxy for the KITTI Tracking Benchmark. Developed in Unity 2018.4 LTS with High Definition Render Pipeline (HDRP) and advanced post-processing, VKITTI2 bridges the visual gap between simulated and real-world driving environments. It provides complete ground-truth for a diverse set of computer vision tasks, including object detection, tracking, stereo/multiview estimation, optical flow, scene flow, depth estimation, and semantic/instance segmentation, all within a consistent geometric and photometric framework (Cabon et al., 2020).
1. Dataset Structure and Scene Representation
VKITTI2 comprises five fully synthetic “clone” sequences, each replicating the camera trajectory, vehicle placement, and scene layout of KITTI tracking sequences {0001, 0002, 0006, 0018, 0020}. Two virtual cameras are included: camera 0 (identical to the original Virtual KITTI mono camera) and camera 1, offset by 0.532725 m to the right to create a stereo baseline. This exact replication at the geometric and photometric level ensures comparability to real-world benchmarks while maintaining full sensor and annotation control.
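Because the two cameras share intrinsics and differ only by this horizontal baseline, metric depth can be recovered from disparity in the usual way. Below is a minimal sketch; the focal length is a caller-supplied placeholder to be read from the sequence's calib.txt, and the baseline is the 0.532725 m offset stated above.

```python
import numpy as np

BASELINE_M = 0.532725  # camera_1 offset to the right of camera_0 (meters)

def depth_from_disparity(disparity_px: np.ndarray, focal_px: float) -> np.ndarray:
    """Convert a disparity map (pixels) to metric depth via Z = f * B / d.

    focal_px should be read from the sequence's calib.txt; any value used
    here is a placeholder. Non-positive disparities map to +inf.
    """
    depth = np.full_like(disparity_px, np.inf, dtype=np.float64)
    valid = disparity_px > 0
    depth[valid] = focal_px * BASELINE_M / disparity_px[valid]
    return depth
```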
Scene variants are generated for each sequence, encompassing camera rotations (±15°, ±30° about the vertical axis) and diverse weather/lighting conditions (morning, sunset, overcast, fog, rain). The rendering pipeline features physics-based lighting, HDR reflections, volumetric fog, custom anti-aliasing, particle systems for precipitation, and post-processing (Unity’s Post-Processing Stack v2), significantly improving photo-realism.
2. Data Modalities and Annotations
VKITTI2 outputs, per camera and per frame, the following modalities in aligned 1242×375 resolution:
- RGB images (8-bit PNG)
- Depth maps (32-bit floating point, linear depth in meters)
- Class segmentation masks (8-bit indexed)
- Instance segmentation masks (16-bit instance IDs)
- Forward and backward optical flow (32-bit floating point)
- Forward and backward scene flow (three 32-bit floating-point channels)
Ancillary annotation files include text representations of camera intrinsics and extrinsics (calib.txt), object 2D bounding boxes and IDs (bboxes.txt), and complete 6-DoF vehicle pose matrices (poses.txt).
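A minimal loading sketch under the encodings described above (8-bit RGB PNGs, floating-point depth in meters). The folder layout follows Section 5, but the exact directory names and depth encoding should be verified against the release, so paths and scaling here are illustrative assumptions.

```python
from pathlib import Path

import numpy as np
from PIL import Image

def load_frame(seq_dir: Path, camera: int, frame: int):
    """Load the RGB image and depth map for one frame of one camera.

    Assumes the layout sketched in Section 5
    (sceneXX/camera_<id>/<modality>/<6-digit frame>.png); adjust to the
    actual release if it differs.
    """
    name = f"{frame:06d}.png"
    cam = seq_dir / f"camera_{camera}"
    rgb = np.asarray(Image.open(cam / "rgb" / name).convert("RGB"))
    # The text describes depth as linear depth in meters; if the release
    # stores it as an integer PNG (e.g., centimeters), rescale accordingly.
    depth = np.asarray(Image.open(cam / "depth" / name)).astype(np.float32)
    return rgb, depth
```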
The intrinsic matrix follows the KITTI standard:

$$K = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

A 3D point $\mathbf{X} = (X, Y, Z)^\top$ in the camera frame is projected to pixel coordinates by $(u, v, 1)^\top \propto K\mathbf{X}$, i.e., $u = fX/Z + c_x$ and $v = fY/Z + c_y$. Scene flow vectors are provided in the camera frame, measuring the displacement of 3D points between frames.
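As a concrete instance of this projection, the sketch below maps camera-frame 3D points to pixels with a KITTI-style intrinsic matrix; f, c_x, and c_y are placeholders to be taken from the sequence's calib.txt.

```python
import numpy as np

def project_points(points_cam: np.ndarray, f: float, cx: float, cy: float) -> np.ndarray:
    """Project Nx3 camera-frame points (X, Y, Z) to Nx2 pixel coordinates.

    Implements u = f*X/Z + cx, v = f*Y/Z + cy; intrinsic values should
    come from calib.txt (they are caller-supplied here).
    """
    K = np.array([[f,   0.0, cx],
                  [0.0, f,   cy],
                  [0.0, 0.0, 1.0]])
    uvw = points_cam @ K.T           # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth Z
```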
3. Improvements over Virtual KITTI 1
VKITTI2 introduces several enhancements relative to its predecessor:
- Integration of the Unity 2018.4 HDRP, offering high-quality global illumination, physically-based shaders, and volumetric rendering
- Advanced anti-aliasing and post-processing for improved visual realism
- Introduction of a stereo camera pair, providing ground truth for stereo and disparity
- Generation of both forward and backward optical and scene flow
- Consistent geometry and camera trajectories with Virtual KITTI 1 and KITTI, but with expanded scene and sensor variants (weather, viewpoint)
These features significantly expand the utility of VKITTI2 for algorithmic evaluation across a wider range of real-world conditions.
4. Experimental Benchmarks and Results
VKITTI2 supports benchmarking of multi-object tracking (MOT), stereo matching, monocular depth and pose estimation, and semantic segmentation, using state-of-the-art methods under controlled photometric and geometric perturbations.
Multi-Object Tracking (DP-MCF)
The Faster R-CNN (ResNet-50-FPN) detector, pretrained on ImageNet→Pascal VOC→KITTI, serves as the detection backbone for the DP-MCF and MDP trackers. Evaluation uses the CLEAR MOT metrics of Bernardin & Stiefelhagen (2008): MOTA, MOTP, ID switches, fragmentations, mostly tracked (MT), and mostly lost (ML). The results indicate highly comparable performance between real KITTI (MOTA 91.5%) and the VKITTI2 clones (91.2%), severe degradation under fog (a 79.6-point MOTA drop), and moderate degradation under rain or large camera rotations. In the table below, the clone row gives absolute values; the remaining rows are deltas relative to clone.
| Condition | MOTA | MOTP | MT | ML | Precision | Recall |
|---|---|---|---|---|---|---|
| clone | 91.2% | 82.6% | 79.7% | 1.7% | 99.8% | 92.9% |
| +15° rotation | -1.1% | -0.1% | -6.8% | +1.7% | +0.0% | -0.7% |
| fog | -79.6% | +3.5% | -78.6% | +75.5% | +0.2% | -73.5% |
| rain | -16.5% | -0.8% | -33.0% | +5.7% | -0.2% | -15.2% |
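For reference, the headline MOTA score aggregates misses, false positives, and identity switches over all frames (Bernardin & Stiefelhagen, 2008); a minimal sketch:

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt_objects: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames.

    Can be negative when the total errors exceed the number of
    ground-truth objects; 1.0 is a perfect score.
    """
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects
```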
Stereo Matching (GANet)
A pre-trained GANet model (kitti2015_final) yields comparable endpoint error (EPE) and outlier rates on real KITTI and the VKITTI2 clones (EPE ≈ 1.64 px real vs. 0.99 px VKITTI2; outlier rate ≈ 0.16% vs. 0.13%), with pronounced degradation under fog or rain (EPE up to 2.42 px, outlier rate up to 0.33%).
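Both stereo metrics are straightforward to compute from estimated and ground-truth disparity maps. The sketch below assumes the common KITTI-style outlier rule (error > 3 px and > 5% of the true disparity), since the exact threshold is not restated here.

```python
import numpy as np

def stereo_metrics(d_est: np.ndarray, d_gt: np.ndarray, valid: np.ndarray):
    """Endpoint error (mean |d_est - d_gt|) and outlier rate over valid pixels.

    Outlier rule assumed here: error > 3 px AND > 5% of ground truth
    (the KITTI 2015 convention); check the paper for the exact definition.
    """
    err = np.abs(d_est - d_gt)[valid]
    gt = d_gt[valid]
    epe = err.mean()
    outlier_rate = np.mean((err > 3.0) & (err > 0.05 * gt))
    return epe, outlier_rate
```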
Monocular Depth & Pose (SfmLearner)
Using SfmLearner pretrained on KITTI, depth estimation on the VKITTI2 clones achieves near-identical results to real KITTI (absolute error: 4.18 m VKITTI2 vs. 4.13 m real; threshold accuracy $\delta < 1.25$: 0.705 vs. 0.711). Fog severely degrades performance (absolute error: 7.62 m; $\delta < 1.25$: 0.412). Pose estimation errors show negligible differences (ATE: 0.0212 m VKITTI2 vs. 0.0211 m real; RE: 0.0027 rad vs. 0.0028 rad).
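These depth numbers follow the standard monocular-depth protocol: mean absolute error in meters and the fraction of pixels whose depth ratio max(d/d*, d*/d) falls below 1.25. A minimal sketch:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray):
    """Mean absolute error (meters) and threshold accuracy delta < 1.25.

    pred and gt are depth maps in meters; only pixels with gt > 0 count.
    """
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_err = np.mean(np.abs(p - g))
    ratio = np.maximum(p / g, g / p)
    delta_125 = np.mean(ratio < 1.25)
    return abs_err, delta_125
```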
Semantic Segmentation (Adapnet++)
Adapnet++ networks trained on Cityscapes achieve 58.2% mAP on VKITTI2 RGB clones, dropping to 43.4% under fog and 46.3% under rain. Depth-only segmentation is essentially invariant to weather (clone mAP: 38.7%).
5. File Organization and Usage
The dataset is organized by sequence (“sceneXX”), each containing subfolders per camera (“camera_0”, “camera_1”) and modality (e.g., “rgb/”, “depth/”, “flow_fwd/”), as well as text files for calibration and pose. All filenames use zero-padded six-digit indices for frame ordering, with PNG for images and TXT for numerical data.
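Given that layout, enumerating frames is a matter of globbing the zero-padded indices. The folder names below mirror those quoted above and are assumptions to verify against the release.

```python
from pathlib import Path

def iter_rgb_frames(root: Path, scene: str = "scene01", camera: str = "camera_0"):
    """Yield RGB frame paths for one scene/camera in frame order.

    Relies on zero-padded six-digit filenames so lexicographic sort
    equals temporal order; the scene/camera folder names are assumed
    from Section 5 and should be checked against the actual release.
    """
    rgb_dir = root / scene / camera / "rgb"
    yield from sorted(rgb_dir.glob("*.png"))
```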
For reproducibility, this tightly coupled geometric and photometric structure, together with the comprehensive ground truth, facilitates controlled ablation and transfer experiments in machine learning and computer vision research.
6. Applications, Limitations, and Future Directions
VKITTI2 supports robust evaluation, ablation, and domain adaptation research in visual perception for autonomous vehicles, particularly in settings where exact real-world ground truth is unavailable. Applications extend to detection, tracking, dense geometry, and segmentation algorithms, especially in the context of domain generalization to adverse conditions.
Limitations include a residual domain gap—especially under fog, heavy rain, and extreme viewpoint perturbations—where algorithm performance degrades substantially relative to ideal photometric conditions. The dataset currently includes only five clone sequences, with future plans to expand scene diversity, introduce dynamic traffic elements, LiDAR simulation, and explicit sensor noise modeling. Investigation of advanced domain adaptation techniques is highlighted as a priority for further narrowing the simulation-to-reality gap.
7. Availability and Reference
VKITTI2 is provided by Naver Labs Europe and is available for academic purposes at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds. The foundational work and experimental results are documented in Cabon et al., "Virtual KITTI 2" (Cabon et al., 2020).