VAA-KITTI: 3D OoD Benchmark
- VAA-KITTI is a dataset designed for benchmarking out-of-distribution semantic occupancy prediction in 3D urban scenes using LiDAR and image data.
- It employs a three-phase, physics-guided anomaly synthesis pipeline—combining 2D inpainting, depth alignment, and voxelization—to accurately localize synthetic anomalies.
- The dataset supports evaluation with IoU, mIoU, AuROC, and AuPRC, covering both semantic occupancy quality and voxel-level OoD detection for autonomous driving research.
VAA-KITTI is a dataset designed for benchmarking out-of-distribution (OoD) semantic occupancy prediction in 3D urban scenes. It provides a physically plausible, richly annotated collection of 500 LiDAR+image frames infused with synthetic anomalies, supporting systematic evaluation of both semantic 3D occupancy and voxel-level OoD detection pertinent to autonomous driving research (Zhang et al., 26 Jun 2025, Zhang et al., 1 Apr 2026).
1. Origins and Data Foundations
VAA-KITTI is directly based on the SemanticKITTI dataset, which sources its data from KITTI odometry sequences 00–10. The sensor suite includes a Velodyne HDL-64E rotating LiDAR (64 scan lines, 10 Hz) and a front-facing stereo camera (image size 1226×370). Ground-truth poses and semantic annotations covering 19 classes plus free space are expressed in the LiDAR coordinate system (X forward, Y left, Z up).
VAA-KITTI preserves the original training/validation set allocation from SemanticKITTI, introducing no synthetic anomalies into these splits. Synthetic anomalies are restricted to a dedicated “test” partition drawn from sequences 07, 09, and 10, ensuring no data leakage into training (Zhang et al., 26 Jun 2025).
2. Synthetic Anomaly Integration Pipeline
The central innovation in VAA-KITTI is a three-phase, physics-guided anomaly synthesis pipeline orchestrated as follows:
Phase 1. 2D Anomaly Generation:
Anomalous objects are generated in the image domain by POC inpainting, a semantic-aware instance inpainting method that produces realistic out-of-distribution objects and their corresponding pixel masks. Multiple anomaly categories are possible per frame.
Phase 2. Depth Alignment & 3D Localization:
Pseudo-depth prediction (Depth-Anything V2) produces per-pixel depth maps for both the original and synthesized images. Real LiDAR points are sampled and projected into the image plane, after which a support vector regression (SVR) model maps predicted pseudo-depths to true LiDAR depths. This mapping enables the back-projection of each 2D anomaly mask pixel to a 3D point in the camera frame, thus localizing anomalies in 3D.
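The depth alignment and back-projection step can be illustrated with a minimal sketch. This is not the authors' implementation: a linear least-squares fit stands in for the SVR model to keep the example dependency-light, the intrinsic matrix `K` and all sample values are hypothetical, and the function names `fit_depth_alignment` and `backproject` are introduced here for illustration only. Back-projection uses the standard pinhole relation, in which a pixel (u, v) at depth d maps to d · K⁻¹ · [u, v, 1]ᵀ.

```python
import numpy as np

def fit_depth_alignment(pseudo_depth, lidar_depth):
    """Fit lidar_depth ~ a * pseudo_depth + b by least squares.

    Stand-in for the SVR regressor described in the text (assumption).
    """
    a, b = np.polyfit(pseudo_depth, lidar_depth, deg=1)
    return a, b

def backproject(pixels_uv, depth, K):
    """Lift pixels (N, 2) with metric depth (N,) to camera-frame points (N, 3)."""
    ones = np.ones((pixels_uv.shape[0], 1))
    homog = np.hstack([pixels_uv, ones])      # homogeneous pixel coordinates
    rays = homog @ np.linalg.inv(K).T         # K^{-1} [u, v, 1]^T for each row
    return rays * depth[:, None]              # scale each ray by its depth

# Hypothetical pinhole intrinsics (focal length 700 px, principal point 600, 180).
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])

# Hypothetical sparse correspondences: pseudo-depth at pixels hit by LiDAR points.
pseudo = np.array([0.1, 0.2, 0.4, 0.8])
lidar = 50.0 * pseudo + 2.0                   # synthetic "true" depths in metres
a, b = fit_depth_alignment(pseudo, lidar)

# Two pixels from a hypothetical anomaly mask, with their pseudo-depths.
mask_uv = np.array([[600.0, 180.0], [670.0, 180.0]])
mask_pseudo = np.array([0.2, 0.2])
mask_depth = a * mask_pseudo + b              # aligned metric depth
points_cam = backproject(mask_uv, mask_depth, K)
```

In this toy setup the pixel at the principal point lands on the optical axis, and the pixel 70 px to its right is displaced laterally in proportion to its depth.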
Phase 3. Voxelization & Occlusion Enforcement:
Anomaly 3D points are transformed into LiDAR/voxel coordinates. For each anomaly voxel, a ray is marched from the camera center toward the anomaly point. Voxels along the march are marked as occupied if they lie within the marching step, thereby enforcing physically plausible spatial occlusion patterns (Zhang et al., 26 Jun 2025).
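A minimal sketch of fixed-step ray marching through a voxel grid is given below. It shows one plausible reading of the occlusion check, not the published procedure: the helper names `march_ray` and `is_visible`, the step size, and the rule "an anomaly point is visible only if no occupied voxel lies on the ray before it" are all assumptions made for illustration.

```python
import numpy as np

def march_ray(origin, target, step, voxel_size, grid_min):
    """Return the voxel indices visited, in order, when marching origin -> target."""
    offset = target - origin
    length = float(np.linalg.norm(offset))
    if length == 0.0:
        ts = np.array([0.0])
        direction = np.zeros(3)
    else:
        direction = offset / length
        # Fixed-size steps, plus the exact endpoint so the target voxel is included.
        ts = np.append(np.arange(0.0, length, step), length)
    samples = origin + ts[:, None] * direction
    idx = np.floor((samples - grid_min) / voxel_size).astype(int)
    # Drop consecutive duplicates while preserving marching order.
    keep = np.ones(len(idx), dtype=bool)
    keep[1:] = np.any(idx[1:] != idx[:-1], axis=1)
    return idx[keep]

def is_visible(occupancy, ray_voxels):
    """True if no occupied voxel lies on the ray before the final (target) voxel."""
    nx, ny, nz = occupancy.shape
    for i, j, k in ray_voxels[:-1]:
        inside = 0 <= i < nx and 0 <= j < ny and 0 <= k < nz
        if inside and occupancy[i, j, k]:
            return False
    return True

# Toy example: a 5-voxel corridor along X with 0.2 m cells (hypothetical values).
origin = np.array([0.1, 0.1, 0.1])            # camera center
target = np.array([0.9, 0.1, 0.1])            # anomaly point
ray = march_ray(origin, target, step=0.1, voxel_size=0.2, grid_min=np.zeros(3))

empty = np.zeros((5, 1, 1), dtype=bool)
blocked = empty.copy()
blocked[2, 0, 0] = True                       # an existing obstacle on the ray

visible_in_empty = is_visible(empty, ray)
visible_when_blocked = is_visible(blocked, ray)
```

A step smaller than the voxel size keeps the march from skipping cells along axis-aligned rays; an exact traversal such as a 3D DDA would be needed to guarantee this for arbitrary directions.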
3. Taxonomy of Anomalous Objects and Dataset Statistics
VAA-KITTI features 26 distinct out-of-distribution classes grouped into five super-categories:
- Animals (e.g., deer, dog, cow)
- Traffic/accessories (e.g., cones, barrels)
- Furniture & litter (e.g., chairs, trash bags)
- Machinery/tools (e.g., construction equipment)
- Miscellaneous obstacles (e.g., large stones, pallets)
Over 30,000 images with POC-generated synthetic content were manually filtered to yield a final set of 500 test frames. The per-class instance count ranges from a few dozen to ~200, and the overall anomaly-to-background voxel ratio is approximately 0.5–1%. This selection ensures an ample range of anomaly types for robust evaluation, while balancing the rarity of OoD events inherent to real-world applications (Zhang et al., 26 Jun 2025, Zhang et al., 1 Apr 2026).
4. Voxel Grid Construction and Semantic Annotation
The canonical voxel grid covers 51.2 m forward, ±25.6 m lateral, and 6.4 m vertical, discretized as a 256×256×32 grid with cell resolution 0.2 m. LiDAR point clouds are voxelized and assigned semantic labels 0–18 for in-distribution classes; anomaly voxels (determined via Phase 3 ray-marching) are given the OoD label 19. The result is a 20-class voxel-wise annotation (19 in-distribution, 1 OoD).
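The voxelization and labeling step might look roughly like the following sketch. The grid extents follow the SemanticKITTI-style convention of a volume starting at the sensor and spanning 0.2 m cells, which is an assumption here; the sentinel value `FREE`, the function name `voxelize`, and the sample points are hypothetical. When several points fall into one voxel, this sketch simply lets a later write win, whereas a real pipeline would more likely use a majority vote.

```python
import numpy as np

VOXEL_SIZE = 0.2                              # metres, as stated in the text
GRID_MIN = np.array([0.0, -25.6, -2.0])       # assumed grid origin (LiDAR frame)
GRID_SHAPE = (256, 256, 32)                   # assumed grid dimensions
OOD_LABEL = 19                                # OoD class index from the text
FREE = 255                                    # hypothetical sentinel for empty voxels

def voxelize(points, labels):
    """Scatter labelled points (N, 3) / (N,) into dense occupancy and semantic grids."""
    idx = np.floor((points - GRID_MIN) / VOXEL_SIZE).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(GRID_SHAPE)), axis=1)
    idx, labels = idx[inside], labels[inside]  # discard points outside the volume
    semantics = np.full(GRID_SHAPE, FREE, dtype=np.uint8)
    semantics[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    occupancy = semantics != FREE
    return occupancy, semantics

# Two hypothetical LiDAR points: one inside the volume, one beyond 51.2 m.
points = np.array([[10.1, 0.05, 0.1],
                   [60.0, 0.00, 0.1]])
labels = np.array([3, 7])
occupancy, semantics = voxelize(points, labels)

# Overlay a hypothetical anomaly voxel with the OoD label.
anomaly_mask = np.zeros(GRID_SHAPE, dtype=bool)
anomaly_mask[60, 128, 10] = True
semantics[anomaly_mask] = OOD_LABEL
occupancy |= anomaly_mask
```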
Data is distributed as .npz files. Each .npz contains three arrays defined on the voxel grid: occupancy (bool), semantics (uint8: 0–18 in-distribution, 19 = OoD), and anomaly_mask (bool). Python scripts are provided for loading, voxelization, label lookup, and BEV/3D visualization (Zhang et al., 26 Jun 2025).
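As a hedged illustration of working with such files, the sketch below writes a tiny synthetic .npz with the three array names listed above and reads it back. The file name `demo_frame.npz`, the helper names, and the toy grid size are hypothetical, and the official loading scripts may differ. The ratio helper reflects one reading of "anomaly-to-background voxel ratio": anomaly voxels divided by occupied non-anomaly voxels.

```python
import os
import tempfile

import numpy as np

OOD_LABEL = 19  # OoD class index from the text

def load_frame(path):
    """Load the three arrays named in the text from one .npz file."""
    with np.load(path) as data:
        occupancy = data["occupancy"].astype(bool)
        semantics = data["semantics"].astype(np.uint8)
        anomaly_mask = data["anomaly_mask"].astype(bool)
    return occupancy, semantics, anomaly_mask

def anomaly_to_background_ratio(occupancy, anomaly_mask):
    """Anomaly voxels divided by occupied non-anomaly voxels (assumed definition)."""
    anomaly = np.count_nonzero(anomaly_mask)
    background = np.count_nonzero(occupancy & ~anomaly_mask)
    return anomaly / background if background else float("nan")

# Build a tiny synthetic frame (toy 4x4x2 grid, not the real grid size).
occ = np.zeros((4, 4, 2), dtype=bool)
occ[:2] = True                                # 16 occupied voxels
sem = np.zeros((4, 4, 2), dtype=np.uint8)
mask = np.zeros((4, 4, 2), dtype=bool)
mask[0, 0, 0] = True                          # one anomaly voxel
sem[mask] = OOD_LABEL

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_frame.npz")
    np.savez_compressed(path, occupancy=occ, semantics=sem, anomaly_mask=mask)
    occupancy, semantics, anomaly_mask = load_frame(path)

ratio = anomaly_to_background_ratio(occupancy, anomaly_mask)
labels_consistent = bool(np.all((semantics == OOD_LABEL) == anomaly_mask))
```

Checking that the OoD label and the anomaly mask agree, as in the last line, is a cheap sanity test before running an evaluation.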
5. Dataset Splits, Size, and Use
VAA-KITTI strictly separates normal data (train/val) from OoD-containing test data. The split is as follows:
| Split | Frames (approx.) | OoD frames | Normal frames |
|---|---|---|---|
| Train | ≈45,406 | 0 | ≈45,406 |
| Validation | ≈4,071 | 0 | ≈4,071 |
| Test | 500 | 500 | 0 |
Test frames originate from KITTI odometry sequences 07, 09, and 10. No synthetic anomalies are present in the training or validation sets, ensuring evaluation of OoD detection and semantic occupancy in genuine novelty scenarios (Zhang et al., 26 Jun 2025, Zhang et al., 1 Apr 2026).
6. Evaluation Protocols and Metrics
VAA-KITTI supports both 3D occupancy/semantic completion and OoD detection evaluation:
- Occupancy/Semantic Completion:
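Intersection-over-Union (IoU) and mean IoU (mIoU), computed as in Song et al., CVPR 2017:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}$$
Here $TP$, $FP$, and $FN$ are the voxel counts of true positives, false positives, and false negatives, and $C$ is the number of semantic classes.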
- OoD Detection Metrics:
- True Positive Rate (TPR) and False Positive Rate (FPR)
- Area under the ROC curve (AuROC)
- Regionally tolerant Area under the Precision–Recall Curve ($\mathrm{AuPRC}_r$), computed after morphological dilation of the anomaly mask by radii $r \in \{0.8, 1.0, 1.2\}$ m (4, 5, 6 voxels at 0.2 m resolution) to reduce the impact of small spatial misalignments.
- Metrics are reported within 1.2 m of each anomaly, focusing sensitivity on the region adjacent to anomalous objects (Zhang et al., 26 Jun 2025, Zhang et al., 1 Apr 2026).
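The metrics listed above can be sketched in a few lines of NumPy. This is an illustrative approximation rather than the benchmark's evaluation code: the function names are hypothetical, the dilation uses a cubic window (a stand-in for whatever structuring element the benchmark uses), and the tolerant precision–recall score assumes that any voxel within the dilation radius of a ground-truth anomaly counts as a positive.

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN); NaN for classes absent from both."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom else np.nan)
    return np.array(ious)

def dilate(mask, radius):
    """Binary dilation with a cubic window, applied separably along each axis."""
    out = mask.copy()
    for axis in range(mask.ndim):
        pad = [(radius, radius) if a == axis else (0, 0) for a in range(mask.ndim)]
        padded = np.pad(out, pad)             # constant padding with False
        acc = np.zeros_like(out)
        n = out.shape[axis]
        for shift in range(2 * radius + 1):
            acc |= np.take(padded, np.arange(shift, shift + n), axis=axis)
        out = acc
    return out

def average_precision(scores, positives):
    """Area under the precision-recall curve via the average-precision sum."""
    order = np.argsort(-scores, kind="stable")
    hits = positives[order].astype(float)
    if hits.sum() == 0:
        return float("nan")
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision * hits) / hits.sum())

# Toy semantic example with two classes.
pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
ious = iou_per_class(pred, gt, num_classes=2)
miou = float(np.nanmean(ious))

# Toy OoD example: the detector fires one voxel away from the true anomaly.
gt_mask = np.zeros((8, 8, 4), dtype=bool)
gt_mask[4, 4, 2] = True
scores = np.zeros((8, 8, 4))
scores[5, 4, 2] = 1.0
tolerant_mask = dilate(gt_mask, radius=1)
ap_strict = average_precision(scores.ravel(), gt_mask.ravel())
ap_tolerant = average_precision(scores.ravel(), tolerant_mask.ravel())
```

In the toy OoD case the strict score is penalized for a one-voxel offset, while the dilated mask credits the near miss, which is the motivation given for the regional tolerance.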
Example quantitative results:
On VAA-KITTI, methods such as ProOOD demonstrate mIoU improvements of +0.45–0.76 points (overall) and +0.56–0.63 points for tail classes. For OoD detection, the regionally tolerant AuPRC increases by +17–19 pp, reaching up to 27.9% under one setting, and can surpass 62% under another. These results are reported as statistically consistent across the 500 test frames (Zhang et al., 1 Apr 2026).
7. Impact, Use Cases, and Recommendations
VAA-KITTI serves as a benchmark for urban 3D semantic occupancy models that must remain robust to distribution shift and rare object anomalies. Its design addresses the inability of prior datasets to provide dense, precisely localized OoD ground truth at scale.
A key utility is evaluation of models that mitigate class-imbalance and anomaly absorption, exemplified by prototype-guided and hybrid detection models. ProOOD leverages VAA-KITTI for ablation and benchmarking, revealing that prototype-guided tail-class enrichment and local/global feature matching deliver substantial improvements in OoD detection and class calibration. For safety-critical deployments, it is recommended to integrate such lightweight plug-ins on top of existing occupancy architectures, utilize robust monocular depth estimation as needed, and monitor expected calibration error per class/geometry to ensure actionable OoD detection (Zhang et al., 1 Apr 2026).
VAA-KITTI is publicly available, complete with raw data, semantic ground-truth, anomaly masks, and utility scripts, supporting ongoing research in 3D perception for autonomous vehicles (Zhang et al., 26 Jun 2025, Zhang et al., 1 Apr 2026).