Affine-Invariant Depth Estimation

Updated 2 June 2026

Affine-invariant depth estimation is a framework in which depth is recovered only up to an unknown global scale and shift, so that predictions remain usable when metric scale cannot be determined from the image alone.
It integrates methods from monocular depth estimation, statistical depth functions, and calibration techniques to improve robustness and cross-domain generalization.
Practical implementations leverage affine-invariant losses and calibration steps to enhance 3D geometric reasoning in both computer vision and statistical analysis.

Affine-invariant depth estimation encompasses a family of methods and theoretical frameworks whose goal is to recover scene depth (or related "depth" quantities, e.g., statistical depth in multivariate data) only up to an unknown per-instance (typically per-image) affine transformation, namely a global scale and shift. This paradigm is foundational in both computer vision (notably monocular depth estimation and related 3D tasks) and multivariate statistics. By explicitly modeling the ambiguities inherent in depth perception from single-view imagery or high-dimensional geometry, affine-invariant depth estimation yields approaches with improved generalization, cross-domain transfer, and robustness to nuisance parameters.

1. Formal Definition and Theoretical Foundations

Affine-invariant depth estimation is defined by the invariance of its solution to a global scale ($a>0$) and shift ($b$). In monocular depth estimation (MDE), an ideal predicted map $\hat D(x)$ is related to the true metric depth $D^*(x)$ by

$\hat D(x) = a\, D^*(x) + b \hspace{1em} \forall x,$

where $x$ indexes pixels. Likewise, for disparity or inverse-depth predictions, an affine calibration is given by

$\hat d(x) = \alpha\, d^*(x) + \beta,$

with corresponding transformations in recovered 3D geometry. Such formulations arise naturally in the geometry of uncalibrated cameras, structure-from-motion, and in single-image perception, where metric scale is unresolvable without auxiliary information.
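
As a worked illustration of how these ambiguities carry over to 3D, assume a pinhole camera with intrinsic matrix $K$ and homogeneous pixel coordinates $\tilde x$ (both are assumptions made here for illustration; the text above does not fix a camera model). Back-projecting the true depth gives $X^*(x) = D^*(x)\, K^{-1} \tilde x$, and substituting the affine relation for $\hat D$ gives

$\hat X(x) = \hat D(x)\, K^{-1} \tilde x = a\, X^*(x) + b\, K^{-1} \tilde x.$

The scale $a$ rescales the point cloud uniformly, whereas the shift $b$ adds an offset along each viewing ray and therefore distorts shape unless $b = 0$. Under the same assumptions, with $d = 1/D$, an affine relation in inverse depth is not affine in depth:

$\hat D(x) = \frac{1}{\alpha\, d^*(x) + \beta} = \frac{D^*(x)}{\alpha + \beta\, D^*(x)}.$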

In statistical depth for multivariate data, affine-invariant depth functions, such as $L_p$-depth and affine-invariant integrated rank-weighted (AI-IRW) depth, guarantee that the ordering or ranking of points is invariant under affine transformations of the data. This property is crucial for applications in robust inference and anomaly detection (Dutta et al., 2016, Staerman et al., 2021).

2. Methodologies for Affine-Invariant Depth Estimation

2.1 Monocular Depth Estimation

State-of-the-art MDE networks, such as those described in "DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data" (Yin et al., 2020), are trained to recover depth up to an affine ambiguity using large-scale, diverse datasets where metric ground truth is inaccessible or unreliable. Training losses are chosen to be affine-invariant, e.g.,

Virtual Normal Loss (VNL): Compares local surface normals derived from predicted and reference depths, invariant to global scale and offset.
Scale-and-Shift-Invariant Loss (SSIL): Explicitly solves for the optimal affine alignment between predicted and reference depths via least squares.
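
The least-squares alignment behind a scale-and-shift-invariant loss can be sketched as follows. This is a minimal NumPy illustration, not the exact loss of any cited paper: the function names `align_scale_shift` and `ssi_loss`, the optional validity mask, and the choice of a mean squared residual are assumptions made for the example.

```python
import numpy as np


def align_scale_shift(pred, target, mask=None):
    """Closed-form least-squares (s, t) minimising ||s * pred + t - target||^2."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    p, y = pred[mask], target[mask]
    # Design matrix [pred, 1] so that the unknowns are (scale, shift).
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)


def ssi_loss(pred, target, mask=None):
    """Mean squared residual after the optimal per-image affine alignment."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    s, t = align_scale_shift(pred, target, mask)
    residual = s * pred[mask] + t - target[mask]
    return float(np.mean(residual ** 2))
```

Because the alignment is re-solved for every prediction, any prediction that differs from the reference only by a global scale and shift incurs (numerically) zero loss.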

Alternative approaches, such as hierarchical depth normalization (HDN), enforce multi-scale affine invariance by normalizing depth at various spatial and depth contexts before applying the loss (Zhang et al., 2022).
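
The idea of normalizing within several contexts before comparing can be sketched for the spatial case only. The sketch below assumes median and mean-absolute-deviation normalization inside each cell of a regular grid; the grid sizes, the function names, and the averaging over cells are hypothetical choices for illustration and are not taken from Zhang et al. (2022).

```python
import numpy as np


def normalize_context(d):
    """Median / mean-absolute-deviation normalization of a 1-D array of depths."""
    med = np.median(d)
    scale = np.mean(np.abs(d - med))
    return (d - med) / max(scale, 1e-8)


def hdn_spatial_loss(pred, target, grids=(1, 2, 4)):
    """Average L1 difference between normalized maps over g x g grid cells."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    h, w = pred.shape
    losses = []
    for g in grids:
        # Split rows and columns into g roughly equal blocks.
        row_blocks = np.array_split(np.arange(h), g)
        col_blocks = np.array_split(np.arange(w), g)
        for rows in row_blocks:
            for cols in col_blocks:
                p = pred[np.ix_(rows, cols)].ravel()
                t = target[np.ix_(rows, cols)].ravel()
                diff = normalize_context(p) - normalize_context(t)
                losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))
```

With `grids=(1,)` this reduces to a single global normalization; finer grids additionally penalize errors that a global alignment would hide.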

2.2 Affine-Invariant Test-Time Calibration

Given the prevalence of affine-invariant depth priors, conversion to metric depth at test time requires estimating affine parameters from additional cues:

Sensor-based Calibration: Sparse depth measurements (from IMU, LiDAR, or geometric constraints) enable robust estimation of the affine parameters $(a, b)$ via RANSAC+Huber fitting or least-squares regression, as in efficient pipelines for foundation models (e.g., Depth Anything) (Marsal et al., 2024).
Learning-based Calibration: Auxiliary modalities, such as language (CLIP-generated captions), offer uncertainty-aware envelopes for feasible affine parameters, subsequently refined via frozen vision features and separate lightweight calibration heads (Zhan et al., 4 Jan 2026).
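
Sensor-based calibration of $(a, b)$ from sparse metric measurements can be sketched as a small RANSAC loop followed by a least-squares refit on the inliers. This is an illustrative sketch under stated assumptions, not the pipeline of Marsal et al. (2024): the function names, iteration count, inlier tolerance, and the omission of the Huber refinement step are all choices made for the example.

```python
import numpy as np


def fit_affine(pred, metric):
    """Least-squares (a, b) such that a * pred + b approximates metric."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return float(a), float(b)


def ransac_affine(pred, metric, n_iters=200, inlier_tol=0.1, seed=0):
    """Robustly estimate (a, b) from sparse pairs of predicted and metric depth."""
    rng = np.random.default_rng(seed)
    pred = np.asarray(pred, dtype=np.float64)
    metric = np.asarray(metric, dtype=np.float64)
    n = pred.size
    best_inliers, best_count = None, 0
    for _ in range(n_iters):
        # Two points determine a candidate affine map.
        i, j = rng.choice(n, size=2, replace=False)
        if abs(pred[i] - pred[j]) < 1e-12:
            continue
        a = (metric[i] - metric[j]) / (pred[i] - pred[j])
        b = metric[i] - a * pred[i]
        if a <= 0:  # the affine model assumes a positive scale
            continue
        inliers = np.abs(a * pred + b - metric) < inlier_tol
        if inliers.sum() > best_count:
            best_inliers, best_count = inliers, int(inliers.sum())
    if best_inliers is None or best_count < 2:
        # Fall back to a plain least-squares fit on all points.
        return fit_affine(pred, metric), np.ones(n, dtype=bool)
    # Refit on the consensus set for a lower-variance estimate.
    return fit_affine(pred[best_inliers], metric[best_inliers]), best_inliers
```

Here `inlier_tol` is expressed in the units of the metric measurements, so it would need to be tuned to the sensor noise.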

2.3 Zero-Shot Depth Completion with Affine-Invariant Priors

Affine-invariant priors from pre-trained diffusion models (e.g., Marigold, DepthFM) can be aligned to sparse sensor measurements using test-time optimization over the model latent space, yielding dense metric depth completion without additional training (Hyoseok et al., 10 Feb 2025). This process enforces hard constraints on sparse alignments, smoothness priors, and structural loss (R-SSIM).

2.4 Geometric Computer Vision

In multi-view geometry, affine-invariant depth maps are used for robust pose estimation. Solvers are designed to account for independent affine ambiguities (scale and shift) in multi-view settings and are combined with standard epipolar geometry in hybrid pipelines that outperform scale-only RANSAC/PnP and classic keypoint-based approaches (Yu et al., 9 Jan 2025).

2.5 Multivariate Statistical Depth

Affine-invariant $L_p$-depth and AI-IRW depth generalize the center-outward ranking of multivariate data. They are defined with explicit scatter (covariance) normalization and enable robust, nonparametric analysis that is immune to the choice of coordinate system or scale (Dutta et al., 2016, Staerman et al., 2021). Estimation leverages kernel smoothing, fast Monte Carlo sampling on the sphere, and robust scatter estimation.
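
The effect of scatter normalization can be illustrated for the special case $p = 2$. The sketch below assumes the depth takes the form $1 / (1 + \mathbb{E}\,\lVert \Sigma^{-1/2}(x - X) \rVert_2)$ and uses the plain sample covariance as the scatter estimate; both are simplifying assumptions for illustration (a robust scatter estimator would normally replace the sample covariance), and the function name `l2_depth` is hypothetical.

```python
import numpy as np


def l2_depth(points, data):
    """Empirical L2-depth of each query point with covariance normalization."""
    data = np.asarray(data, dtype=np.float64)                      # (n, d)
    points = np.atleast_2d(np.asarray(points, dtype=np.float64))   # (m, d)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = points[:, None, :] - data[None, :, :]                   # (m, n, d)
    # Squared Mahalanobis distance between every query and every sample.
    sq = np.einsum("mnd,de,mne->mn", diff, cov_inv, diff)
    dist = np.sqrt(np.maximum(sq, 0.0))
    return 1.0 / (1.0 + dist.mean(axis=1))
```

Because the Mahalanobis distance is unchanged when the data and the query points undergo the same invertible affine map, the resulting depth values, and hence the center-outward ranking, are unchanged as well.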

3. Losses, Normalization, and Training Protocols

Loss or Normalization	Description	Affine Invariance Enforced
Virtual Normal Loss (VNL)	L1 loss on local plane normals	Yes (surface geometry only)
Scale-Shift Invariant Loss	L2 loss after best affine alignment	Yes (global affine)
Hierarchical Depth Norm. (HDN)	Multi-scale normalization (spatial/depth)	Yes (local and global)
HDN Loss	Multi-context loss over normalized maps	Yes
RANSAC+Huber Fitting	Robust affine parameter estimation	Yes (calibration step)
Closed-form Oracle in Inverse	Per-image least-squares fit of the affine parameters	Yes

Losses, normalization schemes, and calibration tools are chosen explicitly for their affine-invariant properties; this design is critical for zero-shot generalization across unknown cameras or scene domains.

4. Benchmarks, Quantitative Findings, and Comparative Results

Affine-invariant depth models consistently demonstrate improved cross-domain and zero-shot generalization. In "DiverseDepth" (Yin et al., 2020), affine-invariant models outperform both metric-trained and ordinal-only models on metrics such as AbsRel (after optimal affine alignment) and visual ordinal consistency (WHDR). For example, AbsRel on NYU is 11.7%; other reported numbers are KITTI 12.6%, ETH3D 22.5%, and ScanNet 10.4%.
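
Evaluating AbsRel "after optimal affine alignment" can be sketched as follows. The sketch assumes the alignment is a least-squares fit in depth space over pixels with valid (positive) ground truth; some evaluation protocols align in inverse depth instead, and the function name is hypothetical.

```python
import numpy as np


def abs_rel_after_alignment(pred, gt, mask=None):
    """AbsRel = mean(|a * pred + b - gt| / gt) with least-squares (a, b)."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    if mask is None:
        mask = gt > 0  # treat non-positive ground truth as invalid
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    p, g = pred[mask], gt[mask]
    # Solve for the per-image scale and shift before measuring the error.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    return float(np.mean(np.abs(a * p + b - g) / g))
```

The alignment uses the ground truth of the test image itself, so the resulting number measures shape accuracy rather than metric accuracy.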

Recent work in zero-shot depth completion reports up to 21% average RMSE reduction over test-time adaptation and domain-specific completion methods (Hyoseok et al., 10 Feb 2025). Calibration using sparse LiDAR improves single-image metric depth predictions on indoor and outdoor benchmarks by 40–45% over zero-shot baselines, with negligible test-time overhead (Marsal et al., 2024).

Affine correction in geometric pose estimation solvers yields significant gains, e.g., on ScanNet, median rotation error is reduced relative to the classical 5-point solver, and 5-degree AUC increases from 19.5% to 23.1% (Yu et al., 9 Jan 2025).

In statistical data analysis, affine-invariant depth functions improve area under ROC curve (AUROC) in anomaly detection benchmarks and offer robust center-outward ordering that is theoretically and empirically superior to non-invariant analogues (Staerman et al., 2021).

5. Practical Implementations and Typical Pipelines

The following workflow components and practical realities are evident across the literature:

Training Datasets: Datasets such as DiverseDepth, HDN-mix, and others aggregate diverse sources, including stereo and synthetic data, without requiring metric scale. Scene diversity is essential for robust affine-invariant generalization (Yin et al., 2020, Zhang et al., 2022).
Architecture and Loss Plug-ins: Many methods retain backbone architectures (ResNeXt, transformer encoders, U-Net decoders), embedding affine-invariant losses and normalization modules at the output or loss interface rather than the feature level.
Test-time Adaptation: Affine calibration steps are lightweight and can run separately from network inference (e.g., on the CPU in parallel with GPU inference). Reference point selection and robust calibration are critical for stability, especially under sensor or SLAM noise (Marsal et al., 2024).
Optimization in Diffusion Models: Iterative test-time optimization of the generative model's latent code incorporates hard constraints from sparse depth and preserves fine affine-invariant structure from the prior (Hyoseok et al., 10 Feb 2025).

Stage	Method/Procedure	Reference
Train	Affine-invariant loss, diverse data	(Yin et al., 2020, Zhang et al., 2022)
Test (no metric info)	Affine alignment for evaluation only	(Yin et al., 2020)
Test (with sparse metric)	Robust affine calibration (LiDAR/SfM)	(Marsal et al., 2024, Zhan et al., 4 Jan 2026)
Geometric applications	Pose solvers w/ affine correction	(Yu et al., 9 Jan 2025)

6. Limitations, Open Problems, and Future Directions

A global affine model can be insufficient when MDE errors are spatially structured and non-uniform, which suggests a need for more flexible per-region or per-plane affine corrections (Yu et al., 9 Jan 2025). Diffusion-based test-time optimization is also computationally expensive (several seconds to minutes per frame), motivating research into more efficient guidance and consistency models (Hyoseok et al., 10 Feb 2025).

Integrating camera intrinsics, temporal/video consistency, and multi-view joint alignment is a promising route to further reducing metric ambiguities. Additionally, transferring affine-invariant learned backbones to fine-tuned downstream tasks such as SLAM, AR, and robotics is a plausible future direction (Yin et al., 2020).

7. Affine-Invariant Statistical Depth: Broader Context

The foundational role of affine invariance in statistical depth theory is exhibited through $L_p$-depth (Dutta et al., 2016) and AI-IRW depth (Staerman et al., 2021). These frameworks generalize the center-outward quantile notion to multivariate data, ensuring invariance under any invertible affine transformation. Such properties underpin robust classifiers and anomaly detectors with formal guarantees (Bayes risk consistency, finite-sample concentration), bridging geometric vision intuition and statistical theory.

In summary, affine-invariant depth estimation provides a unifying abstraction over a range of problems where only relative, shape-consistent, or rank-consistent depth can be reliably inferred. This paradigm, through careful methodological design and robust calibration, has enabled advances in cross-domain perception, geometric reasoning, and statistical modeling.