Virtual Normal Loss (VNL) for Depth Estimation
- Virtual Normal Loss (VNL) is a high-order geometric regularization method that enforces global third-order constraints on monocular depth estimation by aligning normals computed from triplets of 3D points.
- It mathematically formulates virtual normals using the cross-product of point differences, ensuring robustness against sensor noise and affine transformations.
- Empirical results on benchmarks like NYU and KITTI demonstrate VNL's improved metric accuracy and enhanced surface normal estimation compared to conventional pixel-wise losses.
Virtual Normal Loss (VNL) is a high-order geometric regularization technique for monocular depth prediction that enforces consistency between predicted and ground-truth virtual normals—normals of planes defined by randomly sampled triplets of 3D points lifted from a depth map. Unlike conventional pixel-wise or local geometric losses, VNL imposes global, third-order geometric constraints on scene structure, resulting in increased metric depth accuracy, superior 3D shape recovery, and robust surface normal estimation. VNL has demonstrated state-of-the-art performance on established benchmarks and has proven effective in both metric and affine-invariant depth estimation regimes (Yin et al., 2019, Yin et al., 2021).
1. Geometric Motivation and Theoretical Foundations
High-order relations among distant points in a scene encode crucial 3D structure, such as co-planarity and global surface layout, which cannot be captured by strictly local losses. Standard per-pixel (L1/L2) or local smoothness losses promote only first-order or, at best, pairwise consistency and are highly sensitive to sensor noise. VNL is founded on the observation that enforcing the agreement of global constraints—specifically the normals of planes defined by distant, non-colinear triplets of 3D points—can drive more faithful scene geometry recovery. This loss leverages the invariance of normal direction to global affine transformations, providing robustness to scale and shift ambiguities in single-image depth estimation (Yin et al., 2019, Yin et al., 2021).
2. Definition and Mathematical Formulation
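Given an image with known camera intrinsics (focal lengths $f_x, f_y$ and principal point $(u_0, v_0)$) and a predicted depth map $D$, each pixel $(u_i, v_i)$ with depth $d_i$ is reconstructed to a 3D point $P_i = (x_i, y_i, z_i)$ via the pinhole camera model:

$$x_i = \frac{(u_i - u_0)\, d_i}{f_x}, \qquad y_i = \frac{(v_i - v_0)\, d_i}{f_y}, \qquad z_i = d_i.$$

Randomly sampled triplets $(P_A, P_B, P_C)$ are constrained to be both non-colinear (the angles between triplet edges are bounded below and above by angular thresholds) and long-range (minimum pairwise distances set relative to the scene scale). For each triplet, the unit virtual normal is computed:

$$n = \frac{\overrightarrow{P_A P_B} \times \overrightarrow{P_A P_C}}{\left\lVert \overrightarrow{P_A P_B} \times \overrightarrow{P_A P_C} \right\rVert}.$$

VNL is defined as the average L1 distance between the predicted and ground-truth virtual normals:

$$\mathcal{L}_{\mathrm{VN}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert n_i^{\mathrm{pred}} - n_i^{\mathrm{gt}} \right\rVert_1,$$

where $N$ is the number of sampled triplets. This construction enforces third-order geometric constraints over extensive spatial regions (Yin et al., 2019, Yin et al., 2021).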
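The definitions in this section can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names (`backproject`, `virtual_normal`, `virtual_normal_loss`), the toy intrinsics, and the three example pixels are all assumptions chosen for illustration, and a practical version would be vectorized on the GPU.

```python
import math

def backproject(u, v, d, fx, fy, u0, v0):
    """Lift pixel (u, v) with depth d to a 3D point using the pinhole model."""
    return ((u - u0) * d / fx, (v - v0) * d / fy, d)

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def virtual_normal(pa, pb, pc, eps=1e-12):
    """Unit normal of the plane through three non-colinear 3D points."""
    n = cross(sub(pb, pa), sub(pc, pa))
    length = math.sqrt(sum(c * c for c in n))
    if length < eps:
        raise ValueError("degenerate (colinear) triplet")
    return tuple(c / length for c in n)

def virtual_normal_loss(pred_pts, gt_pts, triplets):
    """Mean L1 distance between predicted and ground-truth virtual normals."""
    total = 0.0
    for a, b, c in triplets:
        n_pred = virtual_normal(pred_pts[a], pred_pts[b], pred_pts[c])
        n_gt = virtual_normal(gt_pts[a], gt_pts[b], gt_pts[c])
        total += sum(abs(p - g) for p, g in zip(n_pred, n_gt))
    return total / len(triplets)

# Toy example with hypothetical intrinsics and three pixels.
fx = fy = 500.0
u0, v0 = 320.0, 240.0
pixels = [(100, 80), (500, 120), (300, 400)]
gt_depth = [2.0, 2.0, 2.0]    # fronto-parallel plane
pred_depth = [2.0, 2.5, 1.8]  # tilted prediction

gt_pts = [backproject(u, v, d, fx, fy, u0, v0)
          for (u, v), d in zip(pixels, gt_depth)]
pred_pts = [backproject(u, v, d, fx, fy, u0, v0)
            for (u, v), d in zip(pixels, pred_depth)]

loss_same = virtual_normal_loss(gt_pts, gt_pts, [(0, 1, 2)])
loss_diff = virtual_normal_loss(pred_pts, gt_pts, [(0, 1, 2)])
```

In this toy case the ground-truth points lie on a plane of constant depth, so their virtual normal points along the optical axis; the loss is zero when the prediction matches and positive for the tilted prediction.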
3. Sampling Procedure and Loss Integration
Triplets are uniformly sampled per image during each training iteration, and a sufficiently large number of triplets $N$ is needed for effective coverage of the scene. Empirical results show rapid improvement as $N$ grows, with diminishing returns at higher $N$. Each triplet must pass non-colinearity (minimum/maximum edge angle) and distance (minimum 3D or image-space separation) tests to avoid degeneracy and to ensure that the normals are informative about large-scale geometry.
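A rejection-sampling version of this procedure might look like the sketch below. The helper names (`is_valid_triplet`, `sample_triplets`) and the threshold values passed in the example (30 and 120 degrees, a minimum edge length of 1.0) are hypothetical placeholders for illustration, not the settings used in the cited work.

```python
import math
import random

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def norm(a):
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)

def angle_deg(a, b):
    """Angle between two non-zero vectors, in degrees."""
    cos = (a[0] * b[0] + a[1] * b[1] + a[2] * b[2]) / (norm(a) * norm(b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def is_valid_triplet(pa, pb, pc, min_angle_deg, max_angle_deg, min_dist):
    """Reject triplets that are too close together or nearly colinear."""
    edges = (sub(pb, pa), sub(pc, pb), sub(pa, pc))
    # Distance test; also guarantees non-zero edges for the angle test.
    if any(norm(e) <= min_dist for e in edges):
        return False
    # Interior angle at each of the three vertices.
    angles = (angle_deg(sub(pb, pa), sub(pc, pa)),
              angle_deg(sub(pa, pb), sub(pc, pb)),
              angle_deg(sub(pa, pc), sub(pb, pc)))
    return all(min_angle_deg <= a <= max_angle_deg for a in angles)

def sample_triplets(points, n_triplets, min_angle_deg, max_angle_deg,
                    min_dist, rng, max_tries=100000):
    """Draw index triplets uniformly and keep those that pass both tests."""
    triplets = []
    tries = 0
    while len(triplets) < n_triplets and tries < max_tries:
        tries += 1
        a, b, c = rng.sample(range(len(points)), 3)
        if is_valid_triplet(points[a], points[b], points[c],
                            min_angle_deg, max_angle_deg, min_dist):
            triplets.append((a, b, c))
    return triplets

# Toy point cloud: a 5x5 grid on a gently tilted plane.
points = [(float(x), float(y), 2.0 + 0.1 * x)
          for x in range(5) for y in range(5)]
rng = random.Random(0)
triplets = sample_triplets(points, 10, 30.0, 120.0, 1.0, rng)
```

The `max_tries` cap is a design choice for this sketch: it keeps the loop from running indefinitely when the thresholds are so strict that few triplets qualify.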
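VNL is used in conjunction with a standard pixel-wise depth loss: commonly a weighted cross-entropy over quantized depth bins (e.g., DORN loss) for metric depth, or a scale- and shift-invariant loss (e.g., MiDaS) for affine-invariant training. The overall training objective is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{depth}} + \lambda\, \mathcal{L}_{\mathrm{VN}},$$

where $\mathcal{L}_{\mathrm{depth}}$ is the pixel-wise depth loss, $\mathcal{L}_{\mathrm{VN}}$ is the virtual normal loss, and $\lambda$ is typically set to 5 to balance gradient magnitudes (Yin et al., 2019, Yin et al., 2021). For stability and to focus training on informative samples, online hard example mining is applied by dropping the 10–20% of triplets with the lowest error in each batch.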
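The combined objective and the hard-example mining step can be sketched as follows. The function names (`mined_vnl`, `total_loss`) and the toy error values are assumptions for illustration; the per-triplet errors stand in for the L1 distances between predicted and ground-truth virtual normals.

```python
def mined_vnl(per_triplet_errors, drop_fraction=0.15):
    """Average the triplet errors after discarding the lowest-error fraction."""
    if not per_triplet_errors:
        raise ValueError("need at least one triplet error")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    ranked = sorted(per_triplet_errors)           # ascending: easiest first
    n_drop = int(len(ranked) * drop_fraction)     # number of easy triplets to drop
    kept = ranked[n_drop:]
    return sum(kept) / len(kept)

def total_loss(pixel_loss, per_triplet_errors, lam=5.0, drop_fraction=0.15):
    """Pixel-wise depth loss plus the weighted, mined virtual normal term."""
    return pixel_loss + lam * mined_vnl(per_triplet_errors, drop_fraction)

# Toy values: ten per-triplet errors and a pixel-wise loss of 1.0.
errors = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
vnl_all = mined_vnl(errors, drop_fraction=0.0)
vnl_mined = mined_vnl(errors, drop_fraction=0.2)
loss = total_loss(1.0, errors, lam=5.0, drop_fraction=0.2)
```

Dropping the easiest triplets raises the average of the remaining errors, so the gradient signal concentrates on triplets whose normals are still badly aligned.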
4. Empirical Performance and Ablation Studies
VNL yields substantial improvements over prior pixel-wise and local-geometric losses, especially in challenging or diverse conditions. On NYU Depth-V2, integrating VNL with ResNeXt-101 yields an absolute relative error (Abs-Rel) of 0.108, improving on DORN's 0.115. On KITTI, VNL-augmented models attain a $\delta < 1.25$ threshold accuracy that exceeds the DORN baseline (0.932). Detailed ablations indicate that:
- Adding local surface normal loss yields minor improvements (e.g., Abs-Rel from 0.1427 to 0.1406);
- Global pairwise 3D L1 loss produces moderate gains (Abs-Rel to 0.1380);
- VNL achieves a marked reduction (Abs-Rel to 0.1337).
Results confirm that VNL is significantly more robust to depth noise than local normal estimation and that the point clouds reconstructed from VNL-trained depths exhibit flatter planes and crisper boundaries (Yin et al., 2019, Yin et al., 2021). Surface normals computed from the predicted depth are also more accurate on NYU, with a larger share of pixels falling within the angular error threshold than GeoNet and DORN-based approaches.
Zero-shot generalization experiments on five datasets (DIW, NYU, KITTI, ETH3D, ScanNet) demonstrate that VNL-augmented models trained on the DiverseDepth dataset with scale- and shift-invariant objectives achieve an average Abs-Rel of approximately 0.14, the best result on three of the five benchmarks, and a WHDR of 14.3% on DIW (Yin et al., 2021).
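For reference, the Abs-Rel metric quoted in this section, together with the threshold accuracy that is commonly reported alongside it, can be computed as in the sketch below. The function names and the toy depth values are illustrative assumptions; the inputs are assumed to contain only valid pixels with strictly positive depth.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred / gt, gt / pred) below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

# Toy depths in meters (valid pixels only).
gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.1, 2.0, 3.0, 7.0]

rel = abs_rel(pred, gt)
acc = delta_accuracy(pred, gt)
```

Lower Abs-Rel is better, while higher threshold accuracy is better, which is why the two comparisons against DORN in this section point in opposite numerical directions.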
5. Affine-Invariant Depth Learning and Generalization
A central property of VNL is the invariance of virtual normal direction to affine (scale and shift) transformations in the depth dimension. If $d' = s\,d + t$ with $s > 0$ for all predictions, the resulting 3D points undergo a global affine transformation, but the normal computed via the cross product remains directionally invariant: a uniform scale multiplies every edge vector by $s$, and a uniform translation of the points cancels in the point differences, so the normalized cross product is unchanged. This property allows the use of VNL for supervising networks on diverse, uncalibrated datasets with unknown metric scale, supporting robust learning of shape up to an unknown affine transformation. A final two-parameter least-squares fit for scale and shift can be applied post hoc to recover metric depth on a validation set (Yin et al., 2021).
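Both ideas in this section can be checked numerically with a short sketch. The first part applies a uniform scale and translation directly to three 3D points and compares the unit normals; the second part recovers scale and shift with a closed-form least-squares fit. The names `unit_normal` and `fit_scale_shift` and all numeric values are assumptions for illustration.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit_normal(pa, pb, pc):
    """Unit normal of the plane through three non-colinear 3D points."""
    n = cross(sub(pb, pa), sub(pc, pa))
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

def fit_scale_shift(pred, gt):
    """Least-squares (s, t) minimizing sum((s * pred_i + t - gt_i) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predictions are constant; scale is undetermined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    return s, mean_g - s * mean_p

# Part 1: uniform scale and translation of 3D points leave the normal unchanged.
pts = [(0.0, 0.0, 2.0), (1.0, 0.2, 2.5), (0.3, 1.0, 3.0)]
s, t = 3.0, (1.0, -2.0, 0.5)
moved = [tuple(s * c + o for c, o in zip(p, t)) for p in pts]
n_orig = unit_normal(*pts)
n_moved = unit_normal(*moved)

# Part 2: post hoc alignment of affine-invariant predictions to metric depth.
pred = [0.1, 0.4, 0.5, 0.9]
gt = [2.0 * p + 0.5 for p in pred]   # synthetic ground truth
scale, shift = fit_scale_shift(pred, gt)
aligned = [scale * p + shift for p in pred]
```

The closed-form fit is ordinary simple linear regression, so it needs no iterative solver; on real data it would be computed over valid pixels only.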
6. Implementation, Hyperparameters, and Practical Recommendations
Batch triplet sampling and normal computation should be vectorized and executed on the GPU. The main hyperparameters are the number of triplets per image $N$; the non-colinearity angle thresholds; the minimum pairwise 3D distance, specified either in meters or as a fraction of the maximum scene depth; and the loss weight $\lambda$ (default 5). For metric depth, a ResNeXt-101 encoder trained with SGD (polynomially decayed learning rate, momentum, and weight decay) is effective. Affine-invariant training with DiverseDepth benefits from multi-curriculum learning based on sample difficulty scoring.
VNL requires ground-truth depth maps for normal computation and, thus, is suited to metric training with such supervision or pseudo-metric data from stereo/self-supervised fusion. It assumes static, rigid scenes and may be extended by incorporating differentiable rendering for mesh-based refinement (Yin et al., 2019, Yin et al., 2021).
7. Limitations and Extensions
VNL relies on availability of ground-truth or reliably fused multi-view depth for supervision. Its global geometric constraint presumes static, rigid environments; dynamic objects or illumination changes may introduce inconsistencies. For application to video-based self-supervision, normals from multi-view or SLAM fusion can substitute for ground truth. Potential extensions include integrating VNL with mesh fitting, photometric loss models, or differentiable rendering frameworks for joint 3D reconstruction and depth refinement (Yin et al., 2021).