
Unsupervised Depth Completion Methods

Updated 22 December 2025
  • Unsupervised depth completion is a technique to generate dense depth maps from sparse measurements and RGB images without direct supervision, leveraging geometric constraints and self-supervisory signals.
  • Modern frameworks decompose relative depth and scale while integrating topological, photometric, and attention-driven cues for enhanced accuracy and computational efficiency.
  • These methods address challenges like sparse inputs and domain shifts, enabling robust real-time applications in robotics, augmented reality, and autonomous navigation.

Unsupervised depth completion refers to methods that infer dense depth maps from sparse depth measurements and auxiliary cues—typically RGB images—without direct supervision from ground-truth dense depth. These frameworks leverage inherent structure in the input data, geometric constraints, and self-supervisory signals (e.g., photometric consistency, sparse depth anchors) to achieve scale-consistent and high-fidelity depth completion. Recent advances include decomposed learning strategies, geometric fusion, physically principled priors, and plug-in frameworks for robustness, computational efficiency, and domain generalization.

1. Problem Formulation and Fundamental Principles

The canonical unsupervised depth completion task is defined as follows: given an input RGB image $I_t \in \mathbb{R}^{H \times W \times 3}$ and a sparse depth map $D_t \in \mathbb{R}^{H \times W}$ (often with only 0.1–5% valid pixels, from LiDAR, VIO, or SLAM), predict a dense, scale-consistent depth map $\hat D \in \mathbb{R}^{H \times W}$. Crucially, dense ground-truth depth is not available at training time.

Supervision is derived from a mixture of (i) sparse depth consistency (as a metric anchor), (ii) photometric or feature-metric consistency from neighboring frames IsI_s, (iii) geometric or smoothness priors (e.g., edge-aware regularization), and (iv)—in some approaches—synthetic data or domain-specific cues.

Unsupervised completion pipelines generally solve

$$\min_\theta \mathcal{L}(\hat D, D_t, I_t, I_s, \text{poses and intrinsics}),$$

where the loss terms may include image reconstruction (warping), sparse-depth fit, and various regularizers (Yan et al., 2022, Wong et al., 2021, Wong et al., 2021).
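To make the objective concrete, the following is a minimal sketch of one plausible loss assembly (PyTorch; the weights, shapes, and the pre-warped neighbor frame `i_s_warped` are illustrative assumptions rather than any specific paper's implementation):

```python
import torch

def completion_loss(pred, sparse, i_t, i_s_warped, w_ph=1.0, w_sz=0.5):
    """Sketch of the generic unsupervised objective: photometric + sparse fit.

    pred, sparse: (B, 1, H, W) depth tensors; i_t, i_s_warped: (B, 3, H, W) images.
    """
    # Photometric term: residual against a neighboring frame warped into the
    # target view using the predicted depth, estimated pose, and intrinsics.
    l_ph = (i_t - i_s_warped).abs().mean()
    # Sparse-depth fit: metric anchor applied only at the few valid pixels.
    valid = (sparse > 0).float()
    l_sz = (valid * (pred - sparse).abs()).sum() / valid.sum().clamp(min=1)
    # Regularizers such as edge-aware smoothness (Section 3) are added here.
    return w_ph * l_ph + w_sz * l_sz
```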

2. Model Architectures and Learning Strategies

State-of-the-art unsupervised depth completion methods fall into several architectural categories, distinguished by their fusion strategy, representation splitting, and how they encode geometry:

Decomposed Scale-Consistent Networks

DesNet (Yan et al., 2022) introduces decomposed scale-consistent learning (DSCL), splitting absolute dense depth prediction into the product of a scale-agnostic "relative" depth $d \in (0,1]$ and a global scale factor $\alpha$:

$$\hat D = \alpha d$$

The relative depth $d$ is predicted by a deep decoder, while the global scale $\alpha$ is regressed from the encoded representation. This decoupling improves learning stability and flexibility; the authors prove that the decomposition can fit the sparse metric data at least as well as direct prediction.

DesNet's network consists of five sub-networks: a shared image encoder, a relative depth decoder, a scale estimator, a pose network (for multi-view constraints), and a global guidance network (see below). Supervision flows as follows: $d$ is optimized via feature/image warping, while $\alpha$ is anchored by the sparse depth.
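A minimal sketch of the decomposition idea (layer sizes are illustrative, not DesNet's actual sub-networks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedHead(nn.Module):
    """Sketch of scale-decomposed prediction from a shared feature map."""
    def __init__(self, channels):
        super().__init__()
        self.rel = nn.Conv2d(channels, 1, 3, padding=1)       # relative-depth decoder stub
        self.scale = nn.Sequential(nn.AdaptiveAvgPool2d(1),   # global pooling
                                   nn.Flatten(), nn.Linear(channels, 1))

    def forward(self, feat):                                   # feat: (B, C, H, W)
        d = torch.sigmoid(self.rel(feat)).clamp(min=1e-3)      # d approximately in (0, 1]
        alpha = F.softplus(self.scale(feat)).view(-1, 1, 1, 1) # positive global scale
        return alpha * d                                       # dense metric depth = alpha * d
```

During training, $d$ receives photometric gradients while $\alpha$ is fit to the sparse anchors, mirroring the supervision split described above.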

Geometric and Topological Priors

Approaches such as FusionNet (Wong et al., 2021) and ScaffNet construct strong topological priors using only sparse depth. ScaffNet learns to "fill in" 3D connectivity (topology) via dense mapping from sparse points, trained entirely on synthetic data to avoid photometric domain shift. The learned prior $\hat D_0$ is refined by a photometric network receiving RGB and $\hat D_0$; this network learns only a lightweight per-pixel scale and bias, which preserves the strong geometric prior while correcting outlier regions using real images.
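A minimal sketch of this refinement step, assuming the topology prior `d0` is already computed (the identity-centered residual form and layer sizes are this sketch's assumptions):

```python
import torch
import torch.nn as nn

class ScaleBiasRefiner(nn.Module):
    """Sketch: per-pixel scale and bias over a topology prior, from RGB + prior."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1))   # 2 channels: scale residual, bias

    def forward(self, rgb, d0):               # rgb: (B,3,H,W), d0: (B,1,H,W)
        s, b = self.net(torch.cat([rgb, d0], dim=1)).chunk(2, dim=1)
        # Identity-centered: when s and b are zero the prior passes through
        # untouched, which is what preserves the learned geometry.
        return (1 + s) * d0 + b
```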

Structured geometric reasoning is also present in piecewise-planar scaffold-based pipelines (Wong et al., 2019), where image-guided Delaunay triangulation and interpolation form the initial dense estimate, followed by self-supervised encoder-decoder refinement.
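The scaffolding idea can be approximated in a few lines with SciPy, whose linear interpolant is built on a Delaunay triangulation of the valid points (a sketch of the plain, not image-guided, variant):

```python
import numpy as np
from scipy.interpolate import griddata

def scaffold(sparse_depth):
    """Piecewise-planar initial estimate from an (H, W) sparse depth array
    with zeros at invalid pixels."""
    ys, xs = np.nonzero(sparse_depth)           # valid measurement coordinates
    vals = sparse_depth[ys, xs]
    H, W = sparse_depth.shape
    gy, gx = np.mgrid[0:H, 0:W]
    # Linear interpolation over the Delaunay triangulation of the points.
    dense = griddata((ys, xs), vals, (gy, gx), method="linear")
    # Linear interpolation is undefined outside the convex hull; fall back
    # to nearest-neighbor values there.
    nearest = griddata((ys, xs), vals, (gy, gx), method="nearest")
    return np.where(np.isnan(dense), nearest, dense)
```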

Transformer and Attention-Driven Fusion

CHADET (Marsim et al., 21 Jul 2025) implements an efficient cross-hierarchical attention transformer. Depthwise block encoders extract features from both color and (quasi-)dense depth, then a multi-head attention decoder refines image features based on depth cues. The cross-hierarchical attention module iteratively fuses information at multiple depths and spatial scales, achieving fast inference with a minimal parameter count while preserving completion accuracy.

Plug-in and Adaptive Strategies

Protocols like AugUndo (Wu et al., 2023) and AdaFrame (Wong et al., 2021) are designed to be compatible with existing depth completion backbones. AugUndo introduces a loss-space geometric augmentation scheme that "undoes" coordinate warps of predicted depth to enable strong augmentations without corrupting the training signals. AdaFrame adaptively weights regularization and photometric supervision at each pixel and iteration via residual-driven annealing, yielding dynamic soft visibility masks and spatially variant smoothness—leading to better occlusion handling and minimization of over-regularization.
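A minimal sketch of the undo principle for the simplest transform, a horizontal flip (AugUndo itself handles general geometric warps; `model` is a stand-in for any completion backbone):

```python
import torch

def augmented_forward(model, image, sparse, flip=True):
    """Augment the inputs, then invert the transform on the prediction so
    all losses are computed in the original, un-warped frame."""
    if flip:
        image, sparse = image.flip(-1), sparse.flip(-1)
    pred = model(image, sparse)
    if flip:
        pred = pred.flip(-1)   # undo: the loss never sees the augmentation
    return pred
```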

Continual and Domain-Agnostic Strategies

ProtoDepth (Rim et al., 17 Mar 2025) achieves continual, unsupervised depth completion under domain shift by freezing the core model and learning only small per-domain prototype sets (global and local feature adaptors). When domain identity is withheld at test time, a domain descriptor mechanism enables automatic prototype selection, thus reducing catastrophic forgetting by over 50% without compromising performance.

StarryGazer (Hong et al., 15 Dec 2025) addresses single-image, domain-agnostic completion by leveraging off-the-shelf monocular depth estimators to generate synthetic dense depth (segment-wise affine-remapped), training a refinement network to combine RGB, sparse, and relative depth in a completely unsupervised manner.

3. Guidance Mechanisms and Regularization

Handling "holes," uncertainty, and discontinuous surfaces is central to unsupervised depth completion. Representative mechanisms include:

Global Depth Guidance (GDG)

In DesNet, the GDG module propagates a dense yet coarse local reference by morphological dilation of the sparse depth. A dense-to-sparse attention scheme then injects these references into the feature map, mitigating depth holes arising from extremely sparse input (Yan et al., 2022). The attention weights are computed as

$$A = \mathrm{softmax}(QK^\top / \sqrt{C})$$

where $Q$, $K$, and $V$ are the query, key, and value projections from the main and guidance features; the attended output $AV$ fuses context from dilated regions into invalid pixels.
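A sketch of both pieces (tensor shapes and the max-pooling approximation of morphological dilation are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def dilate_sparse(depth, k=9):
    """Approximate morphological dilation of a (B,1,H,W) sparse depth map:
    valid depths are positive and invalid pixels are zero, so max-pooling
    spreads measurements into neighboring holes."""
    return F.max_pool2d(depth, kernel_size=k, stride=1, padding=k // 2)

def dense_to_sparse_attention(q, k, v):
    """A = softmax(QK^T / sqrt(C)); q from main features, k and v from the
    dilated-guidance features. Shapes: (B, N, C) with N = H*W tokens."""
    C = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    return attn @ v   # fuse guidance context into every (possibly invalid) pixel
```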

Variational and Edge-Aware Regularization

Binary anisotropic diffusion tensors (B-ADT) (Yao et al., 2020) are constructed to eliminate smoothing across occlusion boundaries, a longstanding weakness of TGV/ADT regularization. The B-ADT ensures that no smoothing penalty is imposed across detected boundaries, preserving geometric discontinuities in the reconstructed depth.

Edge-aware smoothness, a ubiquitous component, penalizes gradients in the predicted depth unless they are supported by corresponding image gradients, as formalized in

$$L_\text{smooth} = \sum_{p} e^{-|\nabla I(p)|}\, |\nabla \hat D(p)|$$

and its variants (Yan et al., 2022, Wong et al., 2021, Wong et al., 2021).
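A direct sketch of this regularizer (PyTorch; finite differences stand in for the gradients):

```python
import torch

def edge_aware_smoothness(pred, image):
    """L_smooth per the formula above. pred: (B,1,H,W); image: (B,3,H,W)."""
    # Depth gradients along width and height.
    ddx = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    ddy = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    # Image gradients, averaged over color channels.
    dix = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    diy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    # Penalize depth gradients only where the image is locally flat.
    return (torch.exp(-dix) * ddx).mean() + (torch.exp(-diy) * ddy).mean()
```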

Adaptive Annealing and Visibility

AdaFrame (Wong et al., 2021) constructs annealed soft masks for co-visibility based on residuals; residual-driven weights modulate both which regions are trusted for photometric consistency and the spatial degree of smoothness regularization.
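One plausible instantiation of such a mask (the exponential form and its annealing schedule are this sketch's assumptions, not AdaFrame's exact weighting):

```python
import torch

def soft_visibility(residual, temperature):
    """Residual-driven soft mask: pixels with large photometric residual
    (likely occluded or non-co-visible) receive low weight."""
    return torch.exp(-residual / temperature)

# Annealing `temperature` downward over training sharpens the mask from
# near-uniform weighting toward a hard visibility decision.
```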

4. Self-Supervised and Weakly-Supervised Losses

The central unsupervised loss paradigm combines several terms, each serving distinct roles:

  • Photometric Consistency: Enforces that the predicted dense depth, together with the estimated camera pose, can reconstruct neighboring frames by warping via SE(3) and perspective projection. Structural similarity (SSIM) is often combined with the L1 intensity residual.
  • Sparse Depth (Metric Anchor): The predicted dense map must match the input sparse depth at observed pixels, grounding the otherwise ambiguous scale.
  • Edge-Aware Smoothness: As above, regularizes depth locally but allows for discontinuities at strong color or semantic boundaries.
  • Topology or Geometry Priors: Methods such as FusionNet and ScaffNet include additional losses ensuring that the coarse solution (from synthetic/sparse depth) is not degraded by subsequent photometric refinement.

Table: Typical Loss Terms in Unsupervised Depth Completion (symbols as in original works):

| Name | Formula | Purpose |
|---|---|---|
| Photometric loss | $\mathcal{L}_{ph} = \sum \left[ \alpha (1 - \mathrm{SSIM})/2 + (1 - \alpha)\, \lVert I_t - I_s' \rVert_1 \right]$ | Checks multi-view consistency |
| Sparse depth loss | $\mathcal{L}_{sz} = \sum_{q \in \Omega_z} \lvert \hat D(q) - D_t(q) \rvert$ | Metric scale anchor |
| Smoothness loss | $\mathcal{L}_{sm} = \sum_x e^{-\lvert \nabla I_t(x) \rvert}\, \lvert \nabla \hat D(x) \rvert$ | Local regularity, edge-aware |
| Topology prior loss | $\mathcal{L}_{pz} = \sum_x W(x)\, \lvert \hat D(x) - \hat D_0(x) \rvert \,/\, \sum_x W(x)$ | Prevents unnecessary deformation |
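The photometric entry is typically implemented with a pooled SSIM approximation, as in the monodepth literature (a common sketch; `alpha` here is the SSIM/L1 trade-off weight, distinct from the scale factor of Section 2):

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified local SSIM using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return (num / den).clamp(0, 1)

def photometric_loss(i_t, i_s_warped, alpha=0.85):
    """L_ph = alpha * (1 - SSIM)/2 + (1 - alpha) * |I_t - I_s'|."""
    l1 = (i_t - i_s_warped).abs().mean(1, keepdim=True)
    return (alpha * (1 - ssim(i_t, i_s_warped)) / 2 + (1 - alpha) * l1).mean()
```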

Ablations establish necessity: omitting any major term (e.g., SPP pooling on sparse input, adaptive visibility) results in substantial error increases (Wong et al., 2021, Wong et al., 2021).

5. Datasets, Benchmarks, and Empirical Results

Evaluation uses standard indoor/outdoor datasets, including KITTI Depth Completion, NYUv2, SUN RGB-D, VOID (VIO-based indoor+outdoor), OpenLORIS, and domain-shifted settings (Waymo, ScanNet, ClearGrasp).

Key benchmarks:

  • KITTI Depth Completion (LiDAR, driving): DesNet achieves RMSE = 938.5 mm (a 12% reduction relative to KBNet); StarryGazer achieves MAE = 242.4 mm and RMSE = 1061.4 mm (the lowest among unsupervised strategies); ScaffNet/FusionNet/KBNet form the modern baselines (Yan et al., 2022, Hong et al., 15 Dec 2025, Wong et al., 2021).
  • NYUv2 (Indoor): DesNet RMSE = 188.3 mm; StarryGazer RMSE = 171 mm; RDFC-GAN achieves RMSE = 120 mm (also reporting strong inlier percentages at tight error bounds), often surpassing earlier supervised approaches (Yan et al., 2022, Hong et al., 15 Dec 2025, Wang et al., 2023).
  • VOID (Indoor/Outdoor, VIO-based Sparse Depth): KBNet reports MAE ≈ 39.8 mm (indoor) with a low parameter count; Struct-MDC also generalizes robustly in this setting (Wong et al., 2021, Jeon et al., 2022).

Table: Selected Unsupervised Completion Performance (Val/Test splits, lower is better)

| Method | KITTI RMSE (mm) | NYUv2 RMSE (mm) | VOID MAE (mm) |
|---|---|---|---|
| KBNet (Wong et al., 2021) | 1068.1 | 198 | 39.8 |
| RDFC-GAN (Wang et al., 2023) | | 120 | |
| Struct-MDC (Jeon et al., 2022) | | 245.6 | 111.3 |
| StarryGazer (Hong et al., 15 Dec 2025) | 1061.4 | 171 | |
| DesNet (Yan et al., 2022) | 938.5 | 188.3 | |

6. Special Topics: Domain Shift, Efficiency, and Robotic Deployment

Domain Adaptation and Continual Learning

ProtoDepth is designed for sequential deployment in shifting environments, decoupling backbone learning from domain adaptation. At each new domain, it trains only domain-specific prototype sets (scales, feature biases), eliminating forgetting. When domain identity is ambiguous, it uses a descriptor-matching mechanism to select the appropriate prototypes, reducing catastrophic forgetting by over 50% compared to rehearsal or regularization baselines (Rim et al., 17 Mar 2025).
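A hypothetical sketch of the selection step (names and the Euclidean distance are assumptions, not ProtoDepth's exact mechanism):

```python
import torch

def select_prototypes(test_descriptor, domain_descriptors, prototype_sets):
    """Pick the prototype set whose stored domain descriptor is closest
    to the descriptor computed from the test-time input."""
    dists = torch.stack([(test_descriptor - d).norm()
                         for d in domain_descriptors])
    return prototype_sets[int(dists.argmin())]
```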

StarryGazer's synthetic pairing enables domain-agnostic, label-free training, yielding a model that outperforms both prior unsupervised methods and naïvely rescaled monocular depth estimators in out-of-domain evaluation (Hong et al., 15 Dec 2025).

Computational Efficiency

Many recent methods address the trade-off between model complexity and real-time constraints. CHADET achieves 11.5 ms inference with only 1.1 million parameters, outperforming much larger prior art—including KBNet and FusionNet—without loss in completion accuracy (Marsim et al., 21 Jul 2025). Struct-MDC demonstrates full real-time operation (≈31 FPS on a high-end GPU), including SLAM, geometric triangulation, and learned refinement (Jeon et al., 2022).

Robotic and Active Learning Contexts

Data collection policy critically affects performance of exploration-deployed robotic systems. DEUX introduces active learning for unsupervised depth completion: by steering a robot toward regions of high depth-uncertainty (as measured by photometric reconstruction error), it achieves on average >18% RMSE improvement over heuristic and RL-based baselines on MP3D/HM3D (Chancán et al., 2023). AugUndo's geometry-undo pipeline enables aggressive augmentations—rotations, flips, large translations—yielding up to 25% generalization gains across six datasets, indicating the importance of augmentation-resilient loss design (Wu et al., 2023).

7. Current Limitations and Open Directions

  • Sparse Input Limits: Performance degrades under extremely sparse depth (<0.1%), especially for methods lacking strong geometry priors (Wong et al., 2021).
  • Occlusions and Non-Lambertian Surfaces: Photometric and feature-metric losses are vulnerable in regions with specular reflections or dynamic occlusions (Yan et al., 2022, Wong et al., 2021).
  • Generalization: While methods like KBNet, StarryGazer, and ProtoDepth improve adaptation, batch normalization and learned projectors may still overfit sparse depth statistics of the training domain (Hong et al., 15 Dec 2025, Rim et al., 17 Mar 2025).
  • Complex Geometries, Manhattan versus Unstructured Environments: RDFC-GAN leverages Manhattan world priors, so may not generalize well to highly non-rectilinear or outdoor scenes (Wang et al., 2023).
  • Uncertainty Quantification: Few models provide per-pixel uncertainty, hindering robust deployment in safety-critical or outlier-prone contexts—future work may merge probabilistic models with current deterministic pipelines (Wong et al., 2021, Yan et al., 2022).


Unsupervised depth completion is a rapidly advancing field at the interface of geometry, self-supervision, and domain adaptation, offering robust dense depth estimation in the absence of dense ground-truth, and is central to real-world robotics, AR, and autonomous platforms.
