Depth-Ray Prediction Targets
- Depth-ray prediction targets are supervisory signals that model depth as a distribution along camera rays, capturing uncertainty and multi-modal scene geometry.
- They are applied in methods like NeRF, multi-view stereo, and transformer-based models to enhance depth estimation and improve reconstruction fidelity in ambiguous regions.
- Ray-based approaches employ losses such as the Earth Mover's Distance and multi-task SDF objectives, lowering RMSE and boosting overall depth and point-cloud accuracy.
A depth-ray prediction target is a supervision signal or objective for learning depth from images that is formulated not as a scalar, per-pixel “depth” but in terms of the behavior or statistics of continuous or discrete distances along camera rays. Unlike traditional pixelwise depth targets, depth–ray prediction targets capture geometric uncertainty, ambiguity of surface location, or the multi-modal nature of scene geometry, and often dovetail with physical or graphical models such as volume rendering or ray marching. Recent advances have leveraged depth–ray prediction targets in neural field rendering, multi-view stereo, monocular estimation, and even acoustic localization.
1. Formulation and Motivation
Traditional depth supervision often treats each pixel independently, regressing the metric distance from the camera center to scene surfaces through pixelwise losses, or penalizing ordering errors for ordinal (relative) depth. However, in physically grounded vision systems (such as differentiable volume rendering, implicit field learning, or multi-view consistency), the generative process is inherently defined along camera rays rather than over image planes. Formulating supervision along rays acknowledges both the geometric uncertainty arising from occlusions and untextured regions and the volumetric, probabilistic interpretation of ray traversal.
Depth–ray prediction targets thus propose to supervise the distribution or structure of depths along a ray. These targets may encode uncertainty, surface ambiguity, or probabilistic termination. For example, in neural radiance fields (NeRF), the model naturally produces a distribution over possible ray termination depths via learned densities, motivating supervision that respects this full distribution rather than enforcing a single point estimate.
2. Methods and Key Examples
a) Discrete Ray Termination Distribution and Distributional Losses
In depth-guided NeRF training (Rau et al., 19 Mar 2024), the prediction target is formulated as the empirical distribution over ray termination depths:
- For each pixel's ray $\mathbf{r}$, NeRF computes densities $\sigma_i$ at sampled depths $t_i$, yielding volume-rendering weights $w_i = T_i\left(1 - e^{-\sigma_i \delta_i}\right)$ with transmittance $T_i = \exp\left(-\sum_{j<i} \sigma_j \delta_j\right)$, and consequently a discrete depth pmf $p_i = w_i / \sum_j w_j$.
- Instead of forcing the mean rendered depth $\bar{d}(\mathbf{r}) = \sum_i p_i t_i$ to match a prior via an $\ell_2$ loss, the termination pmf is compared to a degenerate (delta-function) prior $\delta_{\hat{d}(\mathbf{r})}$, typically estimated from a monocular depth predictor.
- The supervision is imposed via the Earth Mover's Distance (EMD) between the NeRF-produced sample distribution and the prior, a much softer and more distribution-aware criterion:

$$\mathcal{L}_{\mathrm{EMD}}(\mathbf{r}) = \mathrm{EMD}\left(\{(t_i, w_i)\}_{i=1}^{N},\ \delta_{\hat{d}(\mathbf{r})}\right),$$

with $\{(t_i, w_i)\}_{i=1}^{N}$ the empirical NeRF samples and $\hat{d}(\mathbf{r})$ the prior depth for ray $\mathbf{r}$. The practical instantiation employs a differentiable Sinkhorn approximation.
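To make the termination-pmf target concrete, here is a minimal PyTorch sketch (function names are illustrative; the weight formula follows the standard NeRF volume-rendering derivation rather than the paper's code). For a delta-function prior, the 1D optimal-transport cost collapses to a pmf-weighted absolute deviation, which is shown here in place of the Sinkhorn approximation used in practice:

```python
import torch

def ray_termination_pmf(sigmas, deltas):
    """Discrete depth pmf from NeRF densities along one ray.

    sigmas: (N,) densities at sampled depths; deltas: (N,) bin widths.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)  # per-sample opacity
    # T_i: probability the ray survives past sample i (cumulative transmittance)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = trans * alphas                    # volume-rendering weights w_i
    return weights / (weights.sum() + 1e-10)

def emd_to_delta_prior(ts, pmf, depth_prior):
    """EMD between a discrete depth pmf and a Dirac prior at depth_prior.

    For a 1D pmf vs. a delta function, the optimal transport cost reduces
    to the pmf-weighted absolute deviation from the prior depth.
    """
    return (pmf * (ts - depth_prior).abs()).sum()
```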
b) Per-Ray 1D Implicit Fields and Zero Crossings
RayMVSNet and RayMVSNet++ (Xi et al., 2022, Shi et al., 2023) recast multi-view depth prediction as learning a 1D implicit field $f(t)$, a signed distance function (SDF), along each camera ray $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$. The prediction target is the zero crossing $t^{*}$: $f(t^{*}) = 0$ (a minimal extraction sketch follows the list below).
- At training, the network is supervised not just on the regressed zero crossing but on the whole predicted SDF sampled at multiple points, using multi-task losses for SDF regression, zero-crossing location, and bracketing consistency.
- The approach generalizes to fusing local context via attention or gating from neighboring rays and supports high-quality and efficient reconstruction even in low-texture regions.
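As referenced above, a minimal NumPy sketch of extracting the zero-crossing target from a sampled 1D SDF; the sign convention (positive outside the surface) and the linear interpolation are assumptions, not RayMVSNet's exact implementation:

```python
import numpy as np

def zero_crossing_depth(ts, sdf_vals):
    """Surface depth as the first sign change of a sampled 1D SDF.

    ts: (N,) sorted sample depths along the ray; sdf_vals: (N,) predicted
    signed distances. Returns the linearly interpolated crossing depth,
    or None if the ray never crosses the surface.
    """
    signs = np.sign(sdf_vals)
    crossings = np.where(signs[:-1] * signs[1:] < 0)[0]
    if len(crossings) == 0:
        return None
    i = crossings[0]
    t0, t1 = ts[i], ts[i + 1]
    f0, f1 = sdf_vals[i], sdf_vals[i + 1]
    # Interpolate the depth at which the SDF hits zero between the brackets.
    return t0 + (t1 - t0) * f0 / (f0 - f1)
```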
c) Affine-Invariant and Ordinal Ray-Based Depth Targets
DiverseDepth (Yin et al., 2020) explores depth invariance by proposing affine-invariant prediction: the target is not the metric depth along a ray but depth up to an unknown affine transformation $d \mapsto a\,d + b$ of the ground truth, thereby preserving the rays' correct orientation while disregarding absolute scale and offset.
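A minimal sketch of an affine-invariant error, assuming a closed-form least-squares alignment of scale and shift (DiverseDepth's published normalization differs in detail):

```python
import numpy as np

def affine_invariant_error(pred, gt):
    """Mean absolute depth error after removing unknown scale and shift.

    Solves min_{a,b} ||a * pred + b - gt||^2 in closed form, so the target
    constrains predicted depth only up to an affine transformation.
    """
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # (N, 2) design matrix
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return np.mean(np.abs(a * pred + b - gt))
```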
d) Confidence Modulation and Uncertainty-Aware Targets
In depth-guided NeRF (Rau et al., 19 Mar 2024), the supervision signal is modulated per-ray by the uncertainty estimated from a frozen monocular diffusion-based depth predictor. Uncertainty upweights the photometric loss and downweights the (less reliable) geometric loss, enhancing robustness in poorly constrained regions.
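A minimal sketch of such per-ray modulation, assuming an uncertainty score normalized to [0, 1]; the exact weighting used in the cited work may differ:

```python
def modulated_ray_loss(photo_loss, geo_loss, uncertainty):
    """Blend photometric and geometric losses per ray by prior uncertainty.

    Where the monocular depth prior is unreliable (uncertainty near 1),
    lean on the photometric term; where it is confident, trust geometry.
    """
    return uncertainty * photo_loss + (1.0 - uncertainty) * geo_loss
```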
e) Transformer-Based Per-Ray Prediction
RaySt3R (Duisterhof et al., 5 Jun 2025) extends the ray-based prediction target to 3D object completion via transformer architectures. Each query ray in a novel viewpoint is associated with depth, object mask, and confidence, and the predicted depths of rays from multiple views are fused into a 3D shape without relying on precomputed volumetric grids.
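A hypothetical PyTorch sketch of the per-ray decoding heads (class name, feature dimension, and output parameterization are illustrative assumptions; the cross-attention over input-view features is omitted):

```python
import torch.nn as nn

class RayQueryHead(nn.Module):
    """Per-ray prediction heads in the spirit of RaySt3R (illustrative)."""

    def __init__(self, dim=256):
        super().__init__()
        self.depth = nn.Linear(dim, 1)
        self.mask = nn.Linear(dim, 1)
        self.conf = nn.Linear(dim, 1)

    def forward(self, ray_tokens):                        # (B, num_rays, dim)
        depth = self.depth(ray_tokens).squeeze(-1).exp()  # positive depth
        mask_logit = self.mask(ray_tokens).squeeze(-1)    # object mask logit
        conf_logit = self.conf(ray_tokens).squeeze(-1)    # confidence logit
        return depth, mask_logit, conf_logit
```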
3. Loss Landscapes and Optimization
The departure from pointwise depth regression to distribution- or zero-crossing-based targets necessitates novel loss functions:
- Distributional Losses: The Earth Mover's Distance (EMD), implemented with Sinkhorn regularization, provides a convex, softly enforced criterion that accommodates noisy, uncertain, or ambiguous depth priors, outperforming direct regression on rendered depths. Replacing an MSE or $\ell_1$ loss on depth with EMD yields a 24–31% reduction in RMSE on key benchmarks (Rau et al., 19 Mar 2024).
- Multi-task Losses: SDF regression, zero-crossing regression, and zero-bracketing losses all help maintain local field consistency, significantly boosting both depth and point-cloud accuracy (Xi et al., 2022); see the sketch after this list.
- Uncertainty Weighting: Focal-style or confidence-based weighting schemes balance photometric and ray-based losses, mitigating overfitting to uncertain or unreliable priors (Rau et al., 19 Mar 2024, Duisterhof et al., 5 Jun 2025).
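As referenced above, a sketch of how the three multi-task terms might be combined along one ray; the weights and the sign-consistency formulation are illustrative, not the published objective:

```python
import torch
import torch.nn.functional as F

def ray_sdf_multitask_loss(sdf_pred, sdf_gt, t_pred, t_gt, w=(1.0, 1.0, 0.1)):
    """Illustrative RayMVSNet-style multi-task objective for one ray."""
    l_sdf = F.l1_loss(sdf_pred, sdf_gt)   # match the 1D field at all samples
    l_zero = F.l1_loss(t_pred, t_gt)      # match the zero-crossing depth
    # Penalize samples whose predicted sign disagrees with the ground truth,
    # keeping the zero crossing bracketed by the correct samples.
    l_sign = F.relu(-sdf_pred * torch.sign(sdf_gt)).mean()
    return w[0] * l_sdf + w[1] * l_zero + w[2] * l_sign
```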
4. Empirical Performance and Evaluation
Empirical results consistently demonstrate that depth–ray prediction targets enable large improvements in geometric fidelity:
- In depth-guided NeRF training (Rau et al., 19 Mar 2024), EMD-based supervision delivers AbsRel = 0.070 and RMSE = 0.221 on ScanNet, improving prior depth-supervised NeRFs by 11–66% on all tested depth metrics, while preserving photometric rendering performance (PSNR ≈ 21.7).
- Similar ray-based strategies in RayMVSNet++ (Shi et al., 2023) achieve state-of-the-art on DTU (overall 0.328 mm) and ScanNet (AbsRel = 0.058), outperforming traditional cost-volume and ordinal losses.
- RaySt3R (Duisterhof et al., 5 Jun 2025) delivers up to a 44% reduction in 3D chamfer distance compared with volumetric and multi-view completion baselines, demonstrating the efficiency and geometric consistency of direct per-ray depth aggregation.
| Approach | Loss Target | Key Depth Metric (AbsRel/RMSE) | 3D Quality Metric (F-score/CD) |
|---|---|---|---|
| NeRF+EMD (Rau et al., 19 Mar 2024) | EMD over ray termination PMF | AbsRel=0.070, RMSE=0.221 | — |
| RayMVSNet++ (Shi et al., 2023) | Zero-crossing + SDF field | AbsRel=0.058 | F-score=58.47% (T&T); 0.328 mm (DTU) |
| RaySt3R (Duisterhof et al., 5 Jun 2025) | Per-ray depth+mask+conf. | — | CD=3.56 mm, F1@10 mm=0.930 (YCB) |
Performance gains are strongest in ambiguous or uncertain regions, especially when uncertainty modulation is applied.
5. Theoretical Context and Interpretational Perspectives
Depth–ray prediction targets leverage the volumetric and path-aware nature of physical scene formation. Their theoretical justification rests upon:
- The physical process of light transport and attenuation along rays.
- The inherent ambiguity in ray termination due to occlusions, transparency, or lack of texture.
- The geometric structure and continuity imparted by modeling SDFs or distributions, as opposed to independent pixelwise depths.
A plausible implication is that these targets are generically superior wherever the geometry is ambiguous under photometric supervision alone, as in the “shape-from-silhouette” or glass/occluder cases noted in NeRF systems (Rau et al., 19 Mar 2024).
6. Extensions and Applications Beyond Vision
Depth–ray prediction targets are not exclusive to vision. In underwater acoustic localization (Huang et al., 2023), ray tracing expresses the mapping from launch angle to travel time (and horizontal distance) as a monotonic function, and target depth is inferred by inverting this function over rays. Estimation is made robust by iteratively bracketing the admissible launch angles (and hence the effective ray termination depths) and matching predicted to measured travel times via physically meaningful propagation models.
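Because the launch-angle-to-travel-time map is monotonic, the inversion reduces to simple bracketing; a minimal sketch, assuming an increasing map f on [lo, hi] with the target value in range:

```python
def invert_monotonic(f, target, lo, hi, tol=1e-6):
    """Invert a monotonic ray-tracing map (e.g. launch angle -> travel time)
    by bisection, mirroring bracketing-based acoustic depth estimation."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid            # solution lies in the upper half
        else:
            hi = mid            # solution lies in the lower half
    return 0.5 * (lo + hi)
```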
This generality underscores the importance of ray-based targets anytime path-dependent measurements are involved, such as in acoustic, radar, or seismic imaging.
7. Outlook and Significance
By moving beyond the scalar or per-pixel view of depth, depth–ray prediction targets enable fundamentally richer, more physically grounded, and uncertainty-aware depth learning across rendering, reconstruction, and localization tasks. Their adoption is catalyzed by advances in differentiable rendering, transformer-based fusion, robust loss functions (EMD, multi-task SDF), and automated uncertainty estimation. Empirical evidence confirms their superiority on both classical and challenging benchmarks, notably in scenes with ambiguous geometry or in regimes with noisy or out-of-domain priors (Rau et al., 19 Mar 2024, Duisterhof et al., 5 Jun 2025, Shi et al., 2023).
This evolving paradigm is likely to underpin future generalization across tasks that require precise, robust, and context-aware depth reasoning, extending from computer vision to diverse sensing modalities.