
Depth-in-the-Wild Dataset Overview

Updated 10 December 2025
  • Depth-in-the-wild datasets are collections that capture realistic depth cues in unconstrained environments using both synthetic photorealism and real-world annotations.
  • These datasets integrate metric, relative, and amodal depth modalities, enabling comprehensive evaluations through methodologies like game-engine renders, crowdsourced annotations, and multi-view reconstructions.
  • Advances in these datasets address challenges such as the synthetic-to-real domain gap and sparse annotations, thereby improving generalizable depth estimation across varied imaging conditions.

Depth-in-the-wild datasets provide supervisory signals for monocular or multi-view depth estimation under unconstrained, realistic imaging conditions. They systematically address the limitations of conventional RGB-D corpora, which are typically confined to restricted lighting, scene types, or modalities, by capturing or synthesizing depth ground-truth across diverse environments and acquisition pipelines. Recent advances encompass synthetic photorealistic game renders, large-scale web image annotation protocols, algorithmic composition for amodal (occlusion-aware) depth, and real omnidirectional/360° capture with multi-view or structure-from-motion depth.

1. Dataset Taxonomy and Core Modalities

The term "depth-in-the-wild" spans distinct modalities and task-specific operationalizations (a minimal sketch of these supervision forms follows the list below):

  • Metric Depth: Every pixel is assigned an absolute distance value, typically in meters or normalized sequence units. This is standard for most synthetic datasets, some multiview reconstructions, and gaming-engine buffers.
  • Relative (Ordinal) Depth: Only depth orderings (e.g., point i is closer than point j) are annotated, often via crowdsourcing. This enables annotation at scale over unconstrained real images where metric sensors are infeasible.
  • Amodal Depth: Supervisory signals cover both visible and occluded (hidden) regions, requiring the model to hallucinate plausible geometry for occluded surfaces.
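These modalities imply different ground-truth containers. The following is a minimal, hypothetical sketch in Python (the dataclass names and fields are illustrative assumptions, not any dataset's actual schema) of how the three supervision forms might be represented:

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class MetricDepthSample:
    rgb: np.ndarray           # H x W x 3 image
    depth_m: np.ndarray       # H x W absolute depth in meters (or sequence units)

@dataclass
class RelativeDepthSample:
    rgb: np.ndarray
    point_i: Tuple[int, int]  # (row, col) of the first query point
    point_j: Tuple[int, int]  # (row, col) of the second query point
    label: int                # +1 (i closer), -1 (j closer), 0 (roughly equal)

@dataclass
class AmodalDepthSample:
    rgb: np.ndarray
    visible_depth: np.ndarray  # depth over visible pixels
    amodal_mask: np.ndarray    # whole-object mask, including the occluded part
    amodal_depth: np.ndarray   # depth for both visible and occluded object pixels
```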

A representative summary of major datasets:

Dataset | Modality | Domain | Scale
NYU Depth v2 | RGB-D (metric) | Indoor | 1.4 K
KITTI, Make3D | RGB-D (metric) | Driving/outdoor | 10–50 K
DIW (Chen et al., 2016) | RGB + relative depth pairs | Mixed web images | ≈495 K
GfD ("Playing for Depth") | Synthetic RGB-D (metric) | GTA V | 200 K
ADIW | Real RGB + amodal masks | "In the wild" | 564 K
360° in the Wild | RGB-D (pseudo ground truth) | 360° capture | 25 K

These datasets, each with distinct annotation protocols and ground-truth representations, enable benchmarking and training for a range of depth learning paradigms (Haji-Esmaeili et al., 2018, Chen et al., 2016, Li et al., 3 Dec 2024, Park et al., 27 Jun 2024).

2. Synthetic Photorealistic Depth: The GfD Dataset

The Grand Theft Auto V-derived dataset, often referred to as "GfD," operationalizes synthetic photorealism as a scalable platform for depth annotation (Haji-Esmaeili et al., 2018). The game engine renders an open world of approximately 100 km² with diverse interiors, lighting, and weather permutations. The data-collection protocol intercepts the in-game camera at full fidelity (up to 8K), capturing at the native game-loop rate (~60 fps) before subsampling frames to maximize scene and environmental diversity.

Critical properties:

  • Scene/Environment Diversity: Urban streets (≈40%), interiors (20%), rural/highways (15%), tunnels/underground (10%), special locations (15%), each under varied lighting/weather.
  • Depth Buffer Extraction: DirectX-driver injection preserves per-pixel, noise-free, metric depth in world units, bypassing imaging artifacts like motion blur or DOF.
  • Post-processing: Depth frames undergo histogram equalization, log transformation, and z-standardization to serve both metric and relative tasks and to standardize across atmospheric and visual filters (a sketch follows this list).
  • Data Splits: 160,000/20,000/20,000 for training/validation/test, sampled uniformly over environments and temporal/weather cycles.
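The exact post-processing pipeline is not spelled out here; the following NumPy sketch, assuming depth buffers are available as floating-point arrays, illustrates the log transformation, z-standardization, and histogram equalization described above (function names and the epsilon constant are assumptions):

```python
import numpy as np

def preprocess_depth(depth: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Log-transform and z-standardize a metric depth buffer (per frame)."""
    d = np.log(depth + eps)                  # compress the large dynamic range of metric depth
    return (d - d.mean()) / (d.std() + eps)  # z-standardize so frames share a common scale

def equalize_depth_histogram(depth: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Histogram-equalize a depth map into [0, 1] for relative-depth targets."""
    hist, bin_edges = np.histogram(depth.ravel(), bins=n_bins)
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]                           # normalized cumulative distribution
    return np.interp(depth.ravel(), bin_edges[:-1], cdf).reshape(depth.shape)
```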

Despite photorealistic rendering, a nontrivial domain gap persists: real sensors introduce lens distortion, noise, and tonemapping artifacts that the game renderer does not reproduce. Mitigations such as RGB standardization and randomized data augmentation reduce overfitting to video-game-specific artifacts.

3. Large-Scale Human-Annotated Relative Depth: DIW

The "Depth in the Wild" (DIW) dataset (Chen et al., 2016) comprises 495,000 RGB images sampled from Flickr to maximize contextual and visual diversity, annotated with a single relative depth pair per image. Annotation is performed via Amazon Mechanical Turk, employing dual-worker redundancy for each pair and gold-standard verifications for quality control, yielding <1% annotation noise.

Annotation protocol:

  • Pair Types: Split between unconstrained (two random locations) and symmetric (mirrored about the vertical center line, on the same scanline) point pairs to break image-geometry priors; a sampling sketch follows this list.
  • Relative-Only Supervision: The dataset provides neither absolute depth nor dense ordinal maps per image; instead, it accumulates diversity through the breadth of unique pairwise constraints across images.
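As a rough illustration of the two pair types (the function below is a hypothetical sketch, not the authors' published sampling code), a symmetric pair shares a scanline and is mirrored about the vertical center line, while an unconstrained pair uses two independent random locations:

```python
import random

def sample_point_pair(width: int, height: int, symmetric: bool):
    """Return two (row, col) query points for one image."""
    if symmetric:
        y = random.randrange(height)
        x = random.randrange(width // 2)
        return (y, x), (y, width - 1 - x)    # mirrored about the image center, same scanline
    p = (random.randrange(height), random.randrange(width))
    q = (random.randrange(height), random.randrange(width))
    return p, q                              # unconstrained random locations
```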

Learning paradigm:

  • Ordinal Ranking Loss:

For a point pair $(i, j)$ with label $y_{ij} \in \{+1, -1, 0\}$ and predicted depths $z_i, z_j$:

$$
L_{\text{rel}} = \sum_{(i, j)} \begin{cases} \log\!\left(1 + e^{-(z_i - z_j)}\right) & \text{if } y_{ij} = +1 \\ \log\!\left(1 + e^{z_i - z_j}\right) & \text{if } y_{ij} = -1 \\ (z_i - z_j)^2 & \text{if } y_{ij} = 0 \end{cases}
$$

There is no explicit metric-scale supervision; combining DIW with smaller metric RGB-D sources (e.g., NYU Depth) can calibrate the model's depth scale. A minimal implementation of the ranking loss is sketched below.
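A minimal PyTorch sketch of the piecewise ranking loss above (the tensor shapes and the use of softplus as a numerically stable log(1 + e^x) are implementation choices, not prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def ordinal_ranking_loss(z_i: torch.Tensor, z_j: torch.Tensor,
                         y_ij: torch.Tensor) -> torch.Tensor:
    """Piecewise ranking loss over point pairs.

    z_i, z_j: predicted depths at the two points of each pair, shape (B,)
    y_ij:     relative-depth labels in {+1, -1, 0}, shape (B,)
    """
    diff = z_i - z_j
    loss_pos = F.softplus(-diff)   # log(1 + exp(-(z_i - z_j)))  for y_ij = +1
    loss_neg = F.softplus(diff)    # log(1 + exp(z_i - z_j))     for y_ij = -1
    loss_eq  = diff.pow(2)         # (z_i - z_j)^2               for y_ij = 0
    loss = torch.where(y_ij == 1, loss_pos,
                       torch.where(y_ij == -1, loss_neg, loss_eq))
    return loss.mean()
```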

Performance:

  • On DIW, models trained on combined metric+relative supervision achieve WHDR (Weighted Human Disagreement Rate) ≃ 14.4%, outperforming pure RGB-D trained models (Chen et al., 2016).

4. Real-World Amodal Depth from Compositing Pipelines: ADIW

The Amodal Depth in the Wild (ADIW) dataset (Li et al., 3 Dec 2024) systematizes occlusion-aware depth estimation—assigning depth values, up to a scale, to both visible and occluded segments of objects in unconstrained natural photographs.

Construction protocol:

  • Mask Extraction: Leverages SA-1B segmentation [Kirillov et al. 2023] filtered by Pix2Gestalt to isolate whole-object masks.
  • Image Compositing: Random foreground objects are composited onto random backgrounds, producing synthetic occlusions.
  • Relative Depth via Pre-trained Models: Depth Anything V2 is run on both background and occluded images, producing normalized depths $D_b, D_o$.
  • Scale-and-Shift Alignment: Solve for the scale $s^*$ and shift $t^*$ that align the background and occluded-image depths over the visible region $M_{\text{vis}}$,

$$
(s^*, t^*) = \arg\min_{s,t} \sum_{i \in M_{\text{vis}}} \left( s\, d^b_i + t - d^o_i \right)^2,
$$

then construct the ground-truth amodal depth $D_{\text{gt}} = s^* D_b + t^*$ (a least-squares sketch follows this list).

  • Annotations: Each image is assigned $I_o$ (composite RGB), $D_o$ (visible depth), $M_a$ (amodal mask), and $D_{\text{gt}}$ (amodal depth).
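The scale-and-shift alignment reduces to an ordinary least-squares fit over visible pixels. A minimal NumPy sketch (array names and the boolean-mask convention are assumptions):

```python
import numpy as np

def align_scale_shift(d_b: np.ndarray, d_o: np.ndarray, visible_mask: np.ndarray):
    """Fit s*, t* aligning background depth d_b to occluded-image depth d_o over
    the visible region, then form the pseudo ground truth s* * d_b + t*."""
    x = d_b[visible_mask].ravel()
    y = d_o[visible_mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)      # design matrix [d_b, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)  # minimize ||s*x + t - y||^2
    return s, t, s * d_b + t
```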

Statistics:

  • 564,000 train/val samples (one occlusion event per image), grouped by visible ratio into easy, medium, and hard occlusion categories. There is no held-out test split, but a 4,000-sample validation set supports cross-dataset evaluation.

Distinctive properties:

  • First large-scale, real-image, occlusion-depth corpus with relative annotation.
  • Emphasizes occluded-region geometry, not metric accuracy.

5. 360° Omnidirectional Depth in Unconstrained Scenes

The "360° in the Wild" dataset (Park et al., 27 Jun 2024) provides dense pseudo-ground-truth for 360° panoramic scenes captured worldwide, offering 25,000 panoramas from 273 video sequences scraped from YouTube. Approximately 11,300 frames have reliable multi-view structure-from-motion (SfM) + multi-view stereo (MVS) depth maps, backed by OpenSfM and COLMAP pipelines. Sequence splits (indoor, outdoor, mannequin) are performed at the sequence level for non-overlapping train/val/test sets.

Core pipeline:

  • Panorama and Pose Registration: Camera extrinsics are inferred for each panorama. Images are mapped to six cubemap faces and fed into COLMAP for dense MVS, then reprojected to equirectangular format.
  • Depth Normalization: Because SfM reconstruction leaves the absolute scale ambiguous, depth is normalized per sequence (a sketch follows this list).
  • Moving Object Masks: Binary masks flag dynamic or intrusively positioned elements, enabling training or evaluation under motion exclusion.
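A minimal sketch of per-sequence normalization under scale ambiguity (the median-based scaling and the zero-as-invalid convention below are assumptions; the dataset's exact normalization may differ):

```python
import numpy as np

def normalize_sequence_depths(depth_maps):
    """Divide every frame in a sequence by one shared statistic so depths are
    consistent within the sequence but defined only up to scale across sequences."""
    valid = np.concatenate([d[d > 0].ravel() for d in depth_maps])
    scale = np.median(valid)                           # shared per-sequence scale factor
    return [np.where(d > 0, d / scale, 0.0) for d in depth_maps]
```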

Utility:

  • Supports single-view omnidirectional depth, view synthesis (with NeRF++ spherical extensions), and self-supervised learning in highly diverse real-world contexts.

6. Evaluation Metrics and Benchmarking Protocols

Depth-in-the-wild datasets are typically evaluated with error metrics sensitive to both absolute and relative performance. Established metrics include the following (a NumPy sketch of these metrics appears at the end of this section):

  • Absolute relative error:

$$\epsilon_{\text{abs}} = \frac{1}{N} \sum_{i} \frac{|y_i - \hat{y}_i|}{y_i}$$

  • Root-mean-square error (RMSE):

$$\epsilon_{\text{RMSE}} = \sqrt{ \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2 }$$

  • RMSE (log):

$$\epsilon_{\log} = \sqrt{ \frac{1}{N} \sum_i \left( \log y_i - \log \hat{y}_i \right)^2 }$$

  • Scale-invariant RMSE (Eigen et al.):

$$E_{\text{si}}(y, \hat{y}) = \sqrt{ \frac{1}{N} \sum_i \left( \log y_i - \log \hat{y}_i \right)^2 - \frac{1}{N^2} \left[ \sum_i \left( \log y_i - \log \hat{y}_i \right) \right]^2 }$$

  • Squared relative error:

$$\epsilon_{\text{sqrel}} = \frac{1}{N} \sum_i \frac{(y_i - \hat{y}_i)^2}{y_i}$$

  • Thresholded accuracy: For $\delta = \max(y_i/\hat{y}_i,\, \hat{y}_i/y_i)$, the percentage of pixels with $\delta < 1.25^k$ for $k = 1, 2, 3$.

For ordinal/relative datasets, the Weighted Human Disagreement Rate (WHDR) or the Weighted Kinect Disagreement Rate (WKDR) is used, measuring agreement between predicted and annotated point-pair orderings.
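The following NumPy sketch computes the metrics above and a simple WHDR variant (the ordering convention, where a positive z_i - z_j difference encodes the +1 label as in the ranking loss of Section 3, and the tolerance tau are assumptions):

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    """Standard depth-error metrics computed over valid (gt > 0) pixels."""
    m = gt > 0
    y, yhat = gt[m], pred[m]
    log_diff = np.log(y) - np.log(yhat)
    delta = np.maximum(y / yhat, yhat / y)
    return {
        "abs_rel":  np.mean(np.abs(y - yhat) / y),
        "sq_rel":   np.mean((y - yhat) ** 2 / y),
        "rmse":     np.sqrt(np.mean((y - yhat) ** 2)),
        "rmse_log": np.sqrt(np.mean(log_diff ** 2)),
        "si_rmse":  np.sqrt(np.mean(log_diff ** 2) - np.mean(log_diff) ** 2),  # scale-invariant
        **{f"delta<1.25^{k}": np.mean(delta < 1.25 ** k) for k in (1, 2, 3)},
    }

def whdr(labels: np.ndarray, z_i: np.ndarray, z_j: np.ndarray,
         tau: float = 0.0, weights: np.ndarray = None) -> float:
    """Weighted Human Disagreement Rate: weighted fraction of point pairs whose
    predicted ordering disagrees with the annotated label (+1, -1, or 0)."""
    diff = z_i - z_j
    pred = np.where(diff > tau, 1, np.where(diff < -tau, -1, 0))
    disagree = (pred != labels).astype(float)
    w = np.ones_like(disagree) if weights is None else weights
    return float(np.sum(w * disagree) / np.sum(w))
```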

7. Impact, Limitations, and Future Directions

Depth-in-the-wild datasets have advanced monocular depth architectures, particularly in generalization to unconstrained scenes. Models trained solely on GfD, for example, outperform models trained on real RGB-D data when evaluated on NYU v2 and exhibit strong qualitative performance on DIW, KITTI, and Make3D, with a substantial reduction in "texture-copy" artifacts and improved occlusion reasoning (Haji-Esmaeili et al., 2018).

However, challenges remain:

  • Synthetic-to-real gap: Game-engine renders lack true sensor artifacts (noise, optical blur), and a domain gap persists despite data augmentation. A plausible implication is that injecting calibrated sensor models or rendering from multiple game engines could further close this gap.
  • Annotation Sparsity (Relative Depth): Single pairwise orderings per image, as in DIW, offer breadth but not local geometric detail. Dense annotations or compositional pipelines (as in ADIW) address this partially.
  • Metric Scale Ambiguity: SfM/MVS-based pseudo-ground-truth (e.g., 360° in the Wild) yields scale-ambiguous depths; downstream usage must normalize or recalibrate metric outputs.
  • Amodal Limitations: Mask errors and shape detail losses affect occlusion-depth annotation. Joint prediction of mask, depth, and inpainting, possibly augmented with human-annotated masks, is recommended (Li et al., 3 Dec 2024).

Areas for expansion include multi-game synthetic depth for stylistic/biome diversity, omnidirectional capture for novel-view synthesis, and joint annotation of normals/semantics for multi-task and self-supervised architectures. The release of large, diverse, and well-annotated depth-in-the-wild corpora is expected to remain central to robust, generalizable depth perception and understanding.
