Depth-in-the-Wild Dataset Overview
- Depth-in-the-wild datasets are collections that capture realistic depth cues in unconstrained environments using both synthetic photorealism and real-world annotations.
- These datasets integrate metric, relative, and amodal depth modalities, enabling comprehensive evaluations through methodologies like game-engine renders, crowdsourced annotations, and multi-view reconstructions.
- Advances in these datasets address challenges such as the synthetic-to-real domain gap and sparse annotations, thereby improving generalizable depth estimation across varied imaging conditions.
Depth-in-the-wild datasets provide supervisory signals for monocular or multi-view depth estimation under unconstrained, realistic imaging conditions. They systematically address the limitations of conventional RGB-D corpora, which are typically confined to restricted lighting, scene types, or modalities, by capturing or synthesizing depth ground-truth across diverse environments and acquisition pipelines. Recent advances encompass synthetic photorealistic game renders, large-scale web image annotation protocols, algorithmic composition for amodal (occlusion-aware) depth, and real omnidirectional/360° capture with multi-view or structure-from-motion depth.
1. Dataset Taxonomy and Core Modalities
The term "depth-in-the-wild" spans distinct modalities and task-specific operationalizations:
- Metric Depth: Every pixel is assigned an absolute distance value, typically in meters or normalized sequence units. This is standard for most synthetic datasets, some multiview reconstructions, and gaming-engine buffers.
- Relative (Ordinal) Depth: Only depth orderings (e.g., point i closer than point j) are annotated, often via crowdsourcing. This enables scaling over unconstrained real images where metric sensors are infeasible.
- Amodal Depth: Supervisory signals cover both visible and occluded (hidden) regions, requiring the model to hallucinate plausible geometry for occluded surfaces.
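These three supervision forms translate into quite different training records. The following is a minimal sketch; the field names are chosen purely for illustration and do not come from any dataset's release:

```python
# Illustrative record types for the three supervision modalities; field names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MetricDepthSample:            # e.g., NYU Depth v2, KITTI, GfD
    rgb: np.ndarray                 # H x W x 3 image
    depth_m: np.ndarray             # H x W absolute depth in meters (0 = invalid)

@dataclass
class RelativeDepthSample:          # e.g., DIW: one ordinal constraint per image
    rgb: np.ndarray
    point_a: tuple[int, int]        # (row, col) of the first query point
    point_b: tuple[int, int]        # (row, col) of the second query point
    ordinal: int                    # +1 if A is farther than B, -1 if closer, 0 if judged equal

@dataclass
class AmodalDepthSample:            # e.g., ADIW: relative depth behind occluders
    rgb: np.ndarray                 # composite image containing a synthetic occlusion
    visible_depth: np.ndarray       # H x W scale/shift-ambiguous depth of visible surfaces
    amodal_mask: np.ndarray         # H x W bool mask of the full (visible + hidden) object extent
    amodal_depth: np.ndarray        # H x W depth covering the occluded region as well
```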
A representative summary of major datasets:
| Dataset | Modality | Domain | Scale |
|---|---|---|---|
| NYU Depth v2 | RGB-D (metric) | Indoor | 1.4 K |
| KITTI, Make3D | RGB-D (metric) | Driving/outdoor | 10–50 K |
| DIW [Chen et al. 2016] | RGB + relative | Mixed | ≈495 K |
| GfD [Playing for Depth] | Synthetic RGB-D | GTA V | 200 K |
| ADIW | Real RGB + masks | “In the wild” | 564 K |
| 360° in the Wild | RGB-D (pseudo) | 360° capture | 25 K |
These datasets, each with distinct annotation protocols and ground-truth representations, enable benchmarking and training for a range of depth learning paradigms (Haji-Esmaeili et al., 2018, Chen et al., 2016, Li et al., 3 Dec 2024, Park et al., 27 Jun 2024).
2. Synthetic Photorealistic Depth: The GfD Dataset
The Grand Theft Auto V-derived dataset, often referred to as "GfD," operationalizes synthetic photorealism as a scalable platform for depth annotation (Haji-Esmaeili et al., 2018). GTA V's rendering engine exposes an open world of roughly 100 km² with diverse interiors, lighting, and weather permutations. The data-collection protocol intercepts the in-game camera at full fidelity (up to 8K), capturing at the native game-loop rate (~60 fps), and then subsamples frames to maximize scene and environmental diversity.
Critical properties:
- Scene/Environment Diversity: Urban streets (≈40%), interiors (20%), rural/highways (15%), tunnels/underground (10%), special locations (15%), each under varied lighting/weather.
- Depth Buffer Extraction: DirectX-driver injection preserves per-pixel, noise-free, metric depth in world units, bypassing imaging artifacts like motion blur or DOF.
- Post-processing: Depth frames undergo histogram equalization, log transformation, and z-standardization to serve both metric and relative tasks, and to standardize across atmospheric and visual filters.
- Data Splits: 160,000/20,000/20,000 for training/validation/test, sampled uniformly over environments and temporal/weather cycles.
Despite photorealistic rendering, a nontrivial domain gap persists: real sensors introduce lens distortion, noise, and tonemapping artifacts that the renders lack. Mitigations such as RGB standardization and randomized data augmentation reduce overfitting to video-game-specific artifacts (see the sketch below).
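As a concrete illustration of the post-processing and standardization steps above, the following is a hedged sketch assuming a log transform with z-standardization of the depth buffer and simple photometric jitter on the rendered RGB; the original pipeline's exact constants and ordering may differ:

```python
# Sketch of GfD-style depth post-processing and RGB augmentation (assumed details).
import numpy as np

def standardize_depth(depth_buffer: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map a metric depth buffer (world units) to a zero-mean, unit-variance log-depth target."""
    log_d = np.log(depth_buffer + eps)
    return (log_d - log_d.mean()) / (log_d.std() + eps)

def augment_rgb(rgb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomized photometric jitter to discourage overfitting to renderer-specific appearance."""
    img = rgb.astype(np.float32) / 255.0
    img = img * rng.uniform(0.8, 1.2)          # contrast-like gain
    img = img + rng.uniform(-0.05, 0.05)       # brightness shift
    return np.clip(img, 0.0, 1.0)
```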
3. Large-Scale Human-Annotated Relative Depth: DIW
The "Depth in the Wild" (DIW) dataset (Chen et al., 2016) comprises 495,000 RGB images sampled from Flickr to maximize contextual and visual diversity, annotated with a single relative depth pair per image. Annotation is performed via Amazon Mechanical Turk, employing dual-worker redundancy for each pair and gold-standard verifications for quality control, yielding <1% annotation noise.
Annotation protocol:
- Pair Types: Queries are split between unconstrained pairs (two random locations) and symmetric pairs (mirrored about the vertical center on the same scanline), the latter breaking simple image-location priors; a sampling sketch follows this list.
- Relative-Only Supervision: The dataset does not provide absolute or even dense ordinal maps per image; rather, it amasses massive diversity via breadth of unique pairwise constraints.
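The pair-sampling protocol can be sketched as follows (an illustrative re-implementation, not the authors' released tooling):

```python
# DIW-style query-pair sampling: unconstrained random pairs vs. symmetric pairs mirrored
# about the vertical center on the same scanline.
import random

def sample_pair(height: int, width: int, symmetric: bool) -> tuple[tuple[int, int], tuple[int, int]]:
    if symmetric:
        row = random.randrange(height)                 # both points share a scanline
        col = random.randrange(width // 2)             # pick a column in the left half
        return (row, col), (row, width - 1 - col)      # mirror about the vertical center
    # Unconstrained: two independent random locations.
    a = (random.randrange(height), random.randrange(width))
    b = (random.randrange(height), random.randrange(width))
    return a, b
```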
Learning paradigm:
- Ordinal Ranking Loss: for a point pair $(i, j)$ with ordinal label $\ell \in \{+1, -1, 0\}$ (here $+1$ means $i$ is annotated as farther than $j$, $-1$ as closer, $0$ as indistinguishable) and predicted depths $z_i, z_j$:

$$\mathcal{L}(i, j, \ell) = \begin{cases} \log\!\bigl(1 + \exp(-\ell\,(z_i - z_j))\bigr), & \ell \neq 0, \\ (z_i - z_j)^2, & \ell = 0. \end{cases}$$
There is no explicit metric-scale supervision; combining DIW with smaller metric RGB-D sources (e.g., NYU Depth) can calibrate the model's depth scale. A PyTorch-style sketch of the ranking loss follows.
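The sketch below is an independent re-implementation for illustration, not the released training code; the label convention matches the one defined above:

```python
# Pairwise ordinal ranking loss: softplus ranking term for ordered pairs, squared difference for ties.
import torch
import torch.nn.functional as F

def ordinal_ranking_loss(z_a: torch.Tensor, z_b: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """z_a, z_b: predicted depths at the two query points, shape (N,).
    labels: (N,) in {-1, 0, +1}; +1 means point A is annotated as farther than point B."""
    diff = z_a - z_b
    unequal = labels != 0
    # softplus(-l * (z_a - z_b)) == log(1 + exp(-l * (z_a - z_b)))
    rank_term = F.softplus(-labels[unequal] * diff[unequal])
    tie_term = diff[~unequal] ** 2
    return torch.cat([rank_term, tie_term]).mean()
```

In practice, `z_a` and `z_b` are gathered from the predicted dense depth map at the annotated pixel locations, so a single pairwise constraint still back-propagates through the full network.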
Performance:
- On DIW, models trained on combined metric+relative supervision achieve WHDR (Weighted Human Disagreement Rate) ≃ 14.4%, outperforming pure RGB-D trained models (Chen et al., 2016).
4. Real-World Amodal Depth from Compositing Pipelines: ADIW
The Amodal Depth in the Wild (ADIW) dataset (Li et al., 3 Dec 2024) systematizes occlusion-aware depth estimation—assigning depth values, up to a scale, to both visible and occluded segments of objects in unconstrained natural photographs.
Construction protocol:
- Mask Extraction: Leverages SA-1B segmentation [Kirillov et al. 2023] filtered by Pix2Gestalt to isolate whole-object masks.
- Image Compositing: Random foreground objects are composited onto random backgrounds, producing synthetic occlusions.
- Relative Depth via Pre-trained Models: Depth Anything V2 is run on both the background (unoccluded) image and the occluded composite, producing normalized depth maps $D_{bg}$ and $D_{occ}$.
- Scale-and-Shift Alignment: Solve for a scale $s$ and shift $t$ aligning $D_{bg}$ to $D_{occ}$ over mutually visible (un-occluded) pixels, $\min_{s,t} \sum_{p \in \text{vis}} \bigl(s\,D_{bg}(p) + t - D_{occ}(p)\bigr)^2$, then construct the ground-truth amodal depth $\hat{D}_a = s\,D_{bg} + t$ within the amodal mask (see the sketch after this list).
- Annotations: Each sample comprises $I$ (composite RGB), $D_v$ (visible depth), $M_a$ (amodal mask), and $D_a$ (amodal depth).
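A sketch of the alignment and ground-truth construction under the notation assumed above (the released ADIW tooling may differ, e.g., in masking details):

```python
# Least-squares scale-and-shift alignment of the background depth to the occluded-composite
# depth over visible pixels, followed by amodal ground-truth construction.
import numpy as np

def align_scale_shift(d_bg: np.ndarray, d_occ: np.ndarray, visible: np.ndarray) -> tuple[float, float]:
    """Solve min_{s,t} sum_{p in visible} (s * d_bg[p] + t - d_occ[p])**2 in closed form."""
    x, y = d_bg[visible], d_occ[visible]
    A = np.stack([x, np.ones_like(x)], axis=1)          # design matrix [d_bg, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)

def amodal_ground_truth(d_bg, d_occ, visible, amodal_mask):
    """Aligned background depth inside the amodal mask, occluded-image depth elsewhere."""
    s, t = align_scale_shift(d_bg, d_occ, visible)
    return np.where(amodal_mask, s * d_bg + t, d_occ)
```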
Statistics:
- 564,000 training/validation samples (one occlusion event per image), grouped by visible ratio into easy, medium, and hard occlusion categories. There is no held-out test split, but a 4,000-sample validation set supports cross-dataset evaluation.
Distinctive properties:
- First large-scale, real-image, occlusion-depth corpus with relative annotation.
- Emphasizes occluded-region geometry, not metric accuracy.
5. 360° Omnidirectional Depth in Unconstrained Scenes
The "360° in the Wild" dataset (Park et al., 27 Jun 2024) provides dense pseudo-ground-truth for 360° panoramic scenes captured worldwide, offering 25,000 panoramas from 273 video sequences scraped from YouTube. Approximately 11,300 frames have reliable multi-view structure-from-motion (SfM) + multi-view stereo (MVS) depth maps, backed by OpenSfM and COLMAP pipelines. Sequence splits (indoor, outdoor, mannequin) are performed at the sequence level for non-overlapping train/val/test sets.
Core pipeline:
- Panorama and Pose Registration: Camera extrinsics are inferred for each panorama. Images are mapped to six cubemap faces and fed into COLMAP for dense MVS, then reprojected to equirectangular format.
- Depth Normalization: Depth is normalized per sequence, since absolute scale is ambiguous under SfM (see the sketch after this list).
- Moving Object Masks: Binary masks flag dynamic or intrusively positioned elements, enabling training or evaluation under motion exclusion.
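A hedged sketch of per-sequence normalization for scale-ambiguous SfM/MVS depth, assuming a robust per-sequence statistic such as the median of valid depths (the dataset's exact normalization may differ):

```python
# Per-sequence depth normalization for scale-ambiguous structure-from-motion reconstructions.
import numpy as np

def normalize_sequence_depths(depth_maps: list[np.ndarray]) -> list[np.ndarray]:
    valid = np.concatenate([d[d > 0] for d in depth_maps])    # 0 marks pixels without MVS depth
    scale = np.median(valid)                                   # one scale factor per sequence
    return [np.where(d > 0, d / scale, 0.0) for d in depth_maps]
```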
Utility:
- Supports single-view omnidirectional depth, view synthesis (with NeRF++ spherical extensions), and self-supervised learning in highly diverse real-world contexts.
6. Evaluation Metrics and Benchmarking Protocols
Depth-in-the-wild datasets are typically evaluated with error metrics sensitive to both absolute and relative performance. Established metrics include:
- Absolute relative error: $\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{|d_i - d_i^*|}{d_i^*}$, where $d_i$ is the predicted and $d_i^*$ the ground-truth depth over $N$ valid pixels.
- Root-mean-square error (RMSE): $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (d_i - d_i^*)^2}$
- RMSE (log): $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (\log d_i - \log d_i^*)^2}$
- Scale-invariant RMSE (Eigen et al.): $\sqrt{\frac{1}{N}\sum_{i} e_i^2 - \frac{1}{N^2}\bigl(\sum_{i} e_i\bigr)^2}$ with $e_i = \log d_i - \log d_i^*$
- Squared relative error: $\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{(d_i - d_i^*)^2}{d_i^*}$
- Thresholded accuracy: for thresholds $\mathrm{thr} \in \{1.25, 1.25^2, 1.25^3\}$, the percentage of pixels with $\delta = \max\!\bigl(\tfrac{d_i}{d_i^*}, \tfrac{d_i^*}{d_i}\bigr) < \mathrm{thr}$.
For ordinal/relative datasets, Weighted Human Disagreement Rate (WHDR) or Weighted Kinect Disagreement Rate (WKDR) is used, measuring point-pair ordering agreement.
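For reference, a sketch of these metrics plus WHDR follows (an independent re-implementation; official benchmark scripts may differ in validity masking, depth caps, and weighting):

```python
# Standard dense-depth metrics and the ordinal WHDR metric.
from typing import Optional
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    mask = gt > 0                                   # evaluate only where ground truth is valid
    p, g = pred[mask], gt[mask]                     # assumes strictly positive predictions
    abs_rel = np.mean(np.abs(p - g) / g)
    sq_rel = np.mean((p - g) ** 2 / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    e = np.log(p) - np.log(g)                       # log errors for the scale-invariant term
    si_rmse_log = np.sqrt(np.mean(e ** 2) - np.mean(e) ** 2)
    ratio = np.maximum(p / g, g / p)
    deltas = {f"delta<{1.25 ** k:.4g}": float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "si_rmse_log": si_rmse_log, **deltas}

def whdr(pred_order: np.ndarray, human_order: np.ndarray,
         weights: Optional[np.ndarray] = None) -> float:
    """Weighted Human Disagreement Rate over ordinal point-pair labels in {-1, 0, +1}."""
    w = np.ones_like(pred_order, dtype=float) if weights is None else weights
    return float(np.sum(w * (pred_order != human_order)) / np.sum(w))
```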
7. Impact, Limitations, and Future Directions
Depth-in-the-wild datasets have advanced monocular depth architectures, particularly in generalization to unconstrained scenes. Models trained solely on GfD, for example, outperform real-RGB-D trained models on NYU v2 and exhibit strong qualitative performance on outdoor datasets (DIW, KITTI, Make3D), with substantial reduction in "texture-copy" artifacts and improved occlusion reasoning (Haji-Esmaeili et al., 2018).
However, challenges remain:
- Synthetic-to-real gap: Game-engine renders lack true sensor artifacts (noise, optical blur), and domain gap persists despite data augmentation. A plausible implication is that injecting calibrated sensor models or multi-game engines could further close this gap.
- Annotation Sparsity (Relative Depth): Single pairwise orderings per image, as in DIW, offer breadth but not local geometric detail. Dense annotations or compositional pipelines (as in ADIW) address this partially.
- Metric Scale Ambiguity: SfM/MVS-based pseudo-ground-truth (e.g., 360° in the Wild) yields scale-ambiguous depths; downstream usage must normalize or recalibrate metric outputs.
- Amodal Limitations: Mask errors and shape detail losses affect occlusion-depth annotation. Joint prediction of mask, depth, and inpainting, possibly augmented with human-annotated masks, is recommended (Li et al., 3 Dec 2024).
Areas for expansion include multi-game synthetic depth for stylistic/biome diversity, omnidirectional capture for novel-view synthesis, and joint annotation of normals/semantics for multi-task and self-supervised architectures. The release of large, diverse, and well-annotated depth-in-the-wild corpora is expected to remain central to robust, generalizable depth perception and understanding.