Depth-in-the-Wild Dataset Overview
- Depth-in-the-wild datasets are collections that capture realistic depth cues in unconstrained environments using both synthetic photorealism and real-world annotations.
- These datasets integrate metric, relative, and amodal depth modalities, enabling comprehensive evaluations through methodologies like game-engine renders, crowdsourced annotations, and multi-view reconstructions.
- Advances in these datasets address challenges such as the synthetic-to-real domain gap and sparse annotations, thereby improving generalizable depth estimation across varied imaging conditions.
Depth-in-the-wild datasets provide supervisory signals for monocular or multi-view depth estimation under unconstrained, realistic imaging conditions. They systematically address the limitations of conventional RGB-D corpora, which are typically confined to restricted lighting, scene types, or modalities, by capturing or synthesizing depth ground-truth across diverse environments and acquisition pipelines. Recent advances encompass synthetic photorealistic game renders, large-scale web image annotation protocols, algorithmic composition for amodal (occlusion-aware) depth, and real omnidirectional/360° capture with multi-view or structure-from-motion depth.
1. Dataset Taxonomy and Core Modalities
The term "depth-in-the-wild" spans distinct modalities and task-specific operationalizations:
- Metric Depth: Every pixel is assigned an absolute distance value, typically in meters or normalized sequence units. This is standard for most synthetic datasets, some multiview reconstructions, and gaming-engine buffers.
- Relative (Ordinal) Depth: Only depth orderings (e.g., point i closer than point j) are annotated, often via crowdsourcing. This enables scaling over unconstrained real images where metric sensors are infeasible.
- Amodal Depth: Supervisory signals cover both visible and occluded (hidden) regions, requiring the model to hallucinate plausible geometry for occluded surfaces.
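These three supervision forms translate into quite different training records. The following is a minimal sketch; the field names are chosen purely for illustration and do not come from any dataset's release:

```python
# Illustrative record types for the three supervision modalities; field names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MetricDepthSample:            # e.g., NYU Depth v2, KITTI, GfD
    rgb: np.ndarray                 # H x W x 3 image
    depth_m: np.ndarray             # H x W absolute depth in meters (0 = invalid)

@dataclass
class RelativeDepthSample:          # e.g., DIW: one ordinal constraint per image
    rgb: np.ndarray
    point_a: tuple[int, int]        # (row, col) of the first query point
    point_b: tuple[int, int]        # (row, col) of the second query point
    ordinal: int                    # +1 if A is farther than B, -1 if closer, 0 if judged equal

@dataclass
class AmodalDepthSample:            # e.g., ADIW: relative depth behind occluders
    rgb: np.ndarray                 # composite image containing a synthetic occlusion
    visible_depth: np.ndarray       # H x W scale/shift-ambiguous depth of visible surfaces
    amodal_mask: np.ndarray         # H x W bool mask of the full (visible + hidden) object extent
    amodal_depth: np.ndarray        # H x W depth covering the occluded region as well
```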
A representative summary of major datasets:
| Dataset | Modality | Domain | Scale |
|---|---|---|---|
| NYU Depth v2 | RGB-D (metric) | Indoor | 1.4 K |
| KITTI, Make3D | RGB-D (metric) | Driving/outdoor | 10–50 K |
| DIW [Chen et al. 2016] | RGB + relative | Mixed | ≈495 K |
| GfD [Playing for Depth] | Synthetic RGB-D | GTA V | 200 K |
| ADIW | Real RGB + masks | “In the wild” | 564 K |
| 360° in the Wild | RGB-D (pseudo) | 360° capture | 25 K |
These datasets, each with distinct annotation protocols and ground-truth representations, enable benchmarking and training for a range of depth learning paradigms (Haji-Esmaeili et al., 2018, Chen et al., 2016, Li et al., 3 Dec 2024, Park et al., 27 Jun 2024).
2. Synthetic Photorealistic Depth: The GfD Dataset
The Grand Theft Auto V-derived dataset, often referred to as "GfD," operationalizes synthetic photorealism as a scalable platform for depth annotation (Haji-Esmaeili et al., 2018). GTA V's rendering engine exposes an open world of roughly 100 km² with diverse interiors, lighting, and weather permutations. The data-collection protocol intercepts the in-game camera at full fidelity (up to 8K), capturing at the native game-loop rate (~60 fps), and then subsamples frames to maximize scene and environmental diversity.
Critical properties:
- Scene/Environment Diversity: Urban streets (≈40%), interiors (20%), rural/highways (15%), tunnels/underground (10%), special locations (15%), each under varied lighting/weather.
- Depth Buffer Extraction: DirectX-driver injection preserves per-pixel, noise-free, metric depth in world units, bypassing imaging artifacts like motion blur or DOF.
- Post-processing: Depth frames undergo histogram equalization, log transformation, and z-standardization to serve both metric and relative tasks, and to standardize across atmospheric and visual filters.
- Data Splits: 160,000/20,000/20,000 for training/validation/test, sampled uniformly over environments and temporal/weather cycles.
Despite photorealistic rendering, a nontrivial domain gap persists: real sensors introduce lens distortion, noise, and tonemapping artifacts that the renders lack. Mitigations such as RGB standardization and randomized data augmentation reduce overfitting to video-game-specific artifacts (see the sketch below).
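As a concrete illustration of the post-processing and standardization steps above, the following is a hedged sketch assuming a log transform with z-standardization of the depth buffer and simple photometric jitter on the rendered RGB; the original pipeline's exact constants and ordering may differ:

```python
# Sketch of GfD-style depth post-processing and RGB augmentation (assumed details).
import numpy as np

def standardize_depth(depth_buffer: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map a metric depth buffer (world units) to a zero-mean, unit-variance log-depth target."""
    log_d = np.log(depth_buffer + eps)
    return (log_d - log_d.mean()) / (log_d.std() + eps)

def augment_rgb(rgb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomized photometric jitter to discourage overfitting to renderer-specific appearance."""
    img = rgb.astype(np.float32) / 255.0
    img = img * rng.uniform(0.8, 1.2)          # contrast-like gain
    img = img + rng.uniform(-0.05, 0.05)       # brightness shift
    return np.clip(img, 0.0, 1.0)
```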
3. Large-Scale Human-Annotated Relative Depth: DIW
The "Depth in the Wild" (DIW) dataset (Chen et al., 2016) comprises 495,000 RGB images sampled from Flickr to maximize contextual and visual diversity, annotated with a single relative depth pair per image. Annotation is performed via Amazon Mechanical Turk, employing dual-worker redundancy for each pair and gold-standard verifications for quality control, yielding <1% annotation noise.
Annotation protocol:
- Pair Types: Queries are split between unconstrained pairs (two random locations) and symmetric pairs (mirrored about the vertical center on the same scanline), the latter breaking simple image-location priors; a sampling sketch follows this list.
- Relative-Only Supervision: The dataset does not provide absolute or even dense ordinal maps per image; rather, it amasses massive diversity via breadth of unique pairwise constraints.
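The pair-sampling protocol can be sketched as follows (an illustrative re-implementation, not the authors' released tooling):

```python
# DIW-style query-pair sampling: unconstrained random pairs vs. symmetric pairs mirrored
# about the vertical center on the same scanline.
import random

def sample_pair(height: int, width: int, symmetric: bool) -> tuple[tuple[int, int], tuple[int, int]]:
    if symmetric:
        row = random.randrange(height)                 # both points share a scanline
        col = random.randrange(width // 2)             # pick a column in the left half
        return (row, col), (row, width - 1 - col)      # mirror about the vertical center
    # Unconstrained: two independent random locations.
    a = (random.randrange(height), random.randrange(width))
    b = (random.randrange(height), random.randrange(width))
    return a, b
```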
Learning paradigm:
- Ordinal Ranking Loss: for a point pair $(i, j)$ with ordinal label $\ell \in \{+1, -1, 0\}$ (here $+1$ means $i$ is annotated as farther than $j$, $-1$ as closer, $0$ as indistinguishable) and predicted depths $z_i, z_j$:

$$\mathcal{L}(i, j, \ell) = \begin{cases} \log\!\bigl(1 + \exp(-\ell\,(z_i - z_j))\bigr), & \ell \neq 0, \\ (z_i - z_j)^2, & \ell = 0. \end{cases}$$
There is no explicit metric-scale supervision; combining DIW with smaller metric RGB-D sources (e.g., NYU Depth) can calibrate the model's depth scale. A PyTorch-style sketch of the ranking loss follows.
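The sketch below is an independent re-implementation for illustration, not the released training code; the label convention matches the one defined above:

```python
# Pairwise ordinal ranking loss: softplus ranking term for ordered pairs, squared difference for ties.
import torch
import torch.nn.functional as F

def ordinal_ranking_loss(z_a: torch.Tensor, z_b: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """z_a, z_b: predicted depths at the two query points, shape (N,).
    labels: (N,) in {-1, 0, +1}; +1 means point A is annotated as farther than point B."""
    diff = z_a - z_b
    unequal = labels != 0
    # softplus(-l * (z_a - z_b)) == log(1 + exp(-l * (z_a - z_b)))
    rank_term = F.softplus(-labels[unequal] * diff[unequal])
    tie_term = diff[~unequal] ** 2
    return torch.cat([rank_term, tie_term]).mean()
```

In practice, `z_a` and `z_b` are gathered from the predicted dense depth map at the annotated pixel locations, so a single pairwise constraint still back-propagates through the full network.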
Performance:
- On DIW, models trained on combined metric+relative supervision achieve WHDR (Weighted Human Disagreement Rate) ≃ 14.4%, outperforming pure RGB-D trained models (Chen et al., 2016).
4. Real-World Amodal Depth from Compositing Pipelines: ADIW
The Amodal Depth in the Wild (ADIW) dataset (Li et al., 3 Dec 2024) systematizes occlusion-aware depth estimation—assigning depth values, up to a scale, to both visible and occluded segments of objects in unconstrained natural photographs.
Construction protocol:
- Mask Extraction: Leverages SA-1B segmentation [Kirillov et al. 2023] filtered by Pix2Gestalt to isolate whole-object masks.
- Image Compositing: Random foreground objects are composited onto random backgrounds, producing synthetic occlusions.
- Relative Depth via Pre-trained Models: Depth Anything V2 is run on both the background (unoccluded) image and the occluded composite, producing normalized depth maps $D_{bg}$ and $D_{occ}$.
- Scale-and-Shift Alignment: Solve for a scale $s$ and shift $t$ aligning $D_{bg}$ to $D_{occ}$ over mutually visible (un-occluded) pixels, $\min_{s,t} \sum_{p \in \text{vis}} \bigl(s\,D_{bg}(p) + t - D_{occ}(p)\bigr)^2$, then construct the ground-truth amodal depth $\hat{D}_a = s\,D_{bg} + t$ within the amodal mask (see the sketch after this list).
- Annotations: Each sample comprises $I$ (composite RGB), $D_v$ (visible depth), $M_a$ (amodal mask), and $D_a$ (amodal depth).
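A sketch of the alignment and ground-truth construction under the notation assumed above (the released ADIW tooling may differ, e.g., in masking details):

```python
# Least-squares scale-and-shift alignment of the background depth to the occluded-composite
# depth over visible pixels, followed by amodal ground-truth construction.
import numpy as np

def align_scale_shift(d_bg: np.ndarray, d_occ: np.ndarray, visible: np.ndarray) -> tuple[float, float]:
    """Solve min_{s,t} sum_{p in visible} (s * d_bg[p] + t - d_occ[p])**2 in closed form."""
    x, y = d_bg[visible], d_occ[visible]
    A = np.stack([x, np.ones_like(x)], axis=1)          # design matrix [d_bg, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)

def amodal_ground_truth(d_bg, d_occ, visible, amodal_mask):
    """Aligned background depth inside the amodal mask, occluded-image depth elsewhere."""
    s, t = align_scale_shift(d_bg, d_occ, visible)
    return np.where(amodal_mask, s * d_bg + t, d_occ)
```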
Statistics:
- 564,000 training/validation samples (one occlusion event per image), grouped by visible ratio into easy, medium, and hard occlusion categories. There is no held-out test split, but a 4,000-sample validation set supports cross-dataset evaluation.
Distinctive properties:
- First large-scale, real-image, occlusion-depth corpus with relative annotation.
- Emphasizes occluded-region geometry, not metric accuracy.
5. 360° Omnidirectional Depth in Unconstrained Scenes
The "360° in the Wild" dataset (Park et al., 27 Jun 2024) provides dense pseudo-ground-truth for 360° panoramic scenes captured worldwide, offering 25,000 panoramas from 273 video sequences scraped from YouTube. Approximately 11,300 frames have reliable multi-view structure-from-motion (SfM) + multi-view stereo (MVS) depth maps, backed by OpenSfM and COLMAP pipelines. Sequence splits (indoor, outdoor, mannequin) are performed at the sequence level for non-overlapping train/val/test sets.
Core pipeline:
- Panorama and Pose Registration: Camera extrinsics are inferred for each panorama. Images are mapped to six cubemap faces and fed into COLMAP for dense MVS, then reprojected to equirectangular format.
- Depth Normalization: Depth is normalized per sequence, since absolute scale is ambiguous under SfM (see the sketch after this list).
- Moving Object Masks: Binary masks flag dynamic or intrusively positioned elements, enabling training or evaluation under motion exclusion.
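A hedged sketch of per-sequence normalization for scale-ambiguous SfM/MVS depth, assuming a robust per-sequence statistic such as the median of valid depths (the dataset's exact normalization may differ):

```python
# Per-sequence depth normalization for scale-ambiguous structure-from-motion reconstructions.
import numpy as np

def normalize_sequence_depths(depth_maps: list[np.ndarray]) -> list[np.ndarray]:
    valid = np.concatenate([d[d > 0] for d in depth_maps])    # 0 marks pixels without MVS depth
    scale = np.median(valid)                                   # one scale factor per sequence
    return [np.where(d > 0, d / scale, 0.0) for d in depth_maps]
```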
Utility:
- Supports single-view omnidirectional depth, view synthesis (with NeRF++ spherical extensions), and self-supervised learning in highly diverse real-world contexts.
6. Evaluation Metrics and Benchmarking Protocols
Depth-in-the-wild datasets are typically evaluated with error metrics sensitive to both absolute and relative performance. Established metrics include:
- Absolute relative error: $\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{|d_i - d_i^*|}{d_i^*}$, where $d_i$ is the predicted and $d_i^*$ the ground-truth depth over $N$ valid pixels.
- Root-mean-square error (RMSE): $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (d_i - d_i^*)^2}$
- RMSE (log): $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (\log d_i - \log d_i^*)^2}$
- Scale-invariant RMSE (Eigen et al.): $\sqrt{\frac{1}{N}\sum_{i} e_i^2 - \frac{1}{N^2}\bigl(\sum_{i} e_i\bigr)^2}$ with $e_i = \log d_i - \log d_i^*$
- Squared relative error: $\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{(d_i - d_i^*)^2}{d_i^*}$
- Thresholded accuracy: for thresholds $\mathrm{thr} \in \{1.25, 1.25^2, 1.25^3\}$, the percentage of pixels with $\delta = \max\!\bigl(\tfrac{d_i}{d_i^*}, \tfrac{d_i^*}{d_i}\bigr) < \mathrm{thr}$.
For ordinal/relative datasets, Weighted Human Disagreement Rate (WHDR) or Weighted Kinect Disagreement Rate (WKDR) is used, measuring point-pair ordering agreement.
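For reference, a sketch of these metrics plus WHDR follows (an independent re-implementation; official benchmark scripts may differ in validity masking, depth caps, and weighting):

```python
# Standard dense-depth metrics and the ordinal WHDR metric.
from typing import Optional
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    mask = gt > 0                                   # evaluate only where ground truth is valid
    p, g = pred[mask], gt[mask]                     # assumes strictly positive predictions
    abs_rel = np.mean(np.abs(p - g) / g)
    sq_rel = np.mean((p - g) ** 2 / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    e = np.log(p) - np.log(g)                       # log errors for the scale-invariant term
    si_rmse_log = np.sqrt(np.mean(e ** 2) - np.mean(e) ** 2)
    ratio = np.maximum(p / g, g / p)
    deltas = {f"delta<{1.25 ** k:.4g}": float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "si_rmse_log": si_rmse_log, **deltas}

def whdr(pred_order: np.ndarray, human_order: np.ndarray,
         weights: Optional[np.ndarray] = None) -> float:
    """Weighted Human Disagreement Rate over ordinal point-pair labels in {-1, 0, +1}."""
    w = np.ones_like(pred_order, dtype=float) if weights is None else weights
    return float(np.sum(w * (pred_order != human_order)) / np.sum(w))
```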
7. Impact, Limitations, and Future Directions
Depth-in-the-wild datasets have advanced monocular depth architectures, particularly in generalization to unconstrained scenes. Models trained solely on GfD, for example, outperform real-RGB-D trained models on NYU v2 and exhibit strong qualitative performance on outdoor datasets (DIW, KITTI, Make3D), with substantial reduction in "texture-copy" artifacts and improved occlusion reasoning (Haji-Esmaeili et al., 2018).
However, challenges remain:
- Synthetic-to-real gap: Game-engine renders lack true sensor artifacts (noise, optical blur), and domain gap persists despite data augmentation. A plausible implication is that injecting calibrated sensor models or multi-game engines could further close this gap.
- Annotation Sparsity (Relative Depth): Single pairwise orderings per image, as in DIW, offer breadth but not local geometric detail. Dense annotations or compositional pipelines (as in ADIW) address this partially.
- Metric Scale Ambiguity: SfM/MVS-based pseudo-ground-truth (e.g., 360° in the Wild) yields scale-ambiguous depths; downstream usage must normalize or recalibrate metric outputs.
- Amodal Limitations: Mask errors and shape detail losses affect occlusion-depth annotation. Joint prediction of mask, depth, and inpainting, possibly augmented with human-annotated masks, is recommended (Li et al., 3 Dec 2024).
Areas for expansion include multi-game synthetic depth for stylistic/biome diversity, omnidirectional capture for novel-view synthesis, and joint annotation of normals/semantics for multi-task and self-supervised architectures. The release of large, diverse, and well-annotated depth-in-the-wild corpora is expected to remain central to robust, generalizable depth perception and understanding.