
Grounded Self-Distillation in Monocular Depth Estimation

Updated 24 December 2025
  • The paper introduces a targeted self-distillation approach that mitigates 3D mirage artifacts in monocular depth estimation models via a teacher-student framework.
  • It employs dual-view streaming with LoRA-enhanced student networks, combining Hallucination Knowledge Re-editing (HKR) and Non-hallucination Knowledge Preservation (NKP) to enforce planarity within defined ROIs.
  • Experimental results on the 3D-Mirage benchmark show significant reductions in deviation and confusion composite scores, while maintaining overall depth estimation accuracy and background priors.

Grounded Self-Distillation (GSD) is a targeted parameter-efficient self-distillation approach developed to address critical geometric hallucinations in monocular depth estimation (MDE) models when presented with perceptually ambiguous yet physically planar patterns. The method enables precise suppression of structured 3D hallucinations, or “3D Mirage” artifacts, within specified illusion regions-of-interest (ROIs) while retaining the model’s generalization capabilities and preserving background depth priors. GSD was introduced by Nguyen et al. (“Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry”) as the only end-to-end framework currently available for both quantifying and mitigating the 3D Mirage failure mode in foundation monocular depth models (Nguyen et al., 17 Dec 2025).

1. 3D Mirage Failure Mode and Motivation

MDE models trained on large-scale datasets have demonstrated strong semantic prior learning, enabling robust inference under diverse conditions. However, these models are empirically vulnerable to “3D Mirage” illusions: they hallucinate nonplanar geometry when viewing strictly planar but visually deceptive patterns (e.g., forced-perspective street art, photorealistic painted illusions) under restricted context. The failure exhibits two components:

  • Hallucination/Deviation: Nonplanarity is predicted within a region that is physically flat.
  • Contextual Instability/Confusion: The depth map within the ROI changes unpredictably with reduced context, such as cropping.

These vulnerabilities go undetected by standard pixel-wise metrics and cannot be addressed by generic post hoc regularization or naive fine-tuning; the latter causes catastrophic forgetting of broader geometric priors.

2. 3D-Mirage Benchmark and Quantification Metrics

To systematically diagnose and quantify 3D Mirage, the 3D-Mirage benchmark provides a curated dataset: 468 real street-art illusions, each with precise planar ROI masks and up to four context-suppressed crops, yielding 1,872 annotated input instances (Nguyen et al., 17 Dec 2025). Evaluation requires perceptually sensitive metrics:

  • Deviation Composite Score (DCS): Captures the magnitude of spurious local nonplanarity via a dual-view Laplacian analysis. For each ROI and cropping, Laplacian-filtered depth maps are percentile-normalized and summarized using top-decile sums and robust means:

$$\mathrm{DCS}_i = d_{\mathrm{cluster}}(i) + d_{\mathrm{avg}}(i), \quad \text{with} \quad d_{\mathrm{cluster}}(i) = \sqrt{t_{\mathrm{full},i}^2 + t_{\mathrm{crop},i}^2}$$

  • Confusion Composite Score (CCS): Assesses stability to context by measuring ROI Laplacian divergences between the full and cropped views:

$$\mathrm{CCS}_i = D_{\mathrm{cluster}}(i) + D_{\mathrm{avg}}(i)$$

Low scores indicate geometric faithfulness and robustness under restricted context.
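The dual-view Laplacian analysis behind these scores can be sketched in simplified form. This is an illustrative reconstruction, not the paper's exact implementation: the 5-point Laplacian stencil, the 95th-percentile normalization, and the top-decile summary are assumptions standing in for unspecified details, and the $d_{\mathrm{avg}}$ term is omitted.

```python
# Simplified sketch of a DCS-style deviation score (assumptions noted above).
import numpy as np

def laplacian(depth: np.ndarray) -> np.ndarray:
    """Discrete 5-point Laplacian of a 2D depth map (borders left at zero)."""
    lap = np.zeros_like(depth, dtype=float)
    lap[1:-1, 1:-1] = (
        depth[:-2, 1:-1] + depth[2:, 1:-1]
        + depth[1:-1, :-2] + depth[1:-1, 2:]
        - 4.0 * depth[1:-1, 1:-1]
    )
    return lap

def roi_nonplanarity(depth: np.ndarray, roi: np.ndarray) -> float:
    """Top-decile mean of percentile-normalized |Laplacian| inside the ROI."""
    mag = np.abs(laplacian(depth))[roi]
    if mag.size == 0:
        return 0.0
    scale = np.percentile(mag, 95) + 1e-8      # percentile normalization
    mag = mag / scale
    top = np.sort(mag)[-max(1, mag.size // 10):]  # top-decile summary
    return float(top.mean())

def dcs(depth_full, depth_crop, roi_full, roi_crop) -> float:
    """d_cluster-style term: sqrt(t_full^2 + t_crop^2) over the two views."""
    t_full = roi_nonplanarity(depth_full, roi_full)
    t_crop = roi_nonplanarity(depth_crop, roi_crop)
    return float(np.hypot(t_full, t_crop))
```

Because the Laplacian of any linear (planar) depth map vanishes in the interior, a geometrically faithful prediction over a flat ROI drives the score toward zero in both views, while a hallucinated bump inflates it.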

3. Architecture and Parameterization of Grounded Self-Distillation

GSD employs a teacher-student framework with minimal parameter footprint. The teacher $T$ is a frozen, pretrained ViT-based MDE model (e.g., DepthAnything V2-Large), and the student $S$ is initialized as a copy of $T$ but augmented with Low-Rank Adaptation (LoRA) modules in all transformer MLP layers and the patch embedding projection. The LoRA adapters introduce ≈0.7% additional parameters (≈4M on standard MDE backbones).

Training is structured as dual-view streaming: one “full” stream processes the uncropped image and one “crop” stream processes a context-restricted crop. Shared weights ensure that any adaptation is localized and parameter-efficient while benefiting from the teacher’s locked priors elsewhere in the image.
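A minimal sketch of the LoRA-augmented student, assuming the standard additive low-rank update on each adapted linear layer; the rank, scaling, and initialization below are illustrative choices, not the paper's reported configuration, and a real setup would wrap the MLP and patch-embedding layers of a ViT MDE backbone such as DepthAnything V2.

```python
# Hedged sketch: frozen base weights plus a trainable low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer with an additive trainable low-rank branch."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # teacher weights stay locked
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # init: no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt a single 64->64 layer. Because B is initialized to zero,
# the student's output exactly matches the frozen base layer at step 0.
base = nn.Linear(64, 64)
student = LoRALinear(base, rank=4)
```

The zero-initialized `B` matrix means training starts from the teacher's behavior, so any adaptation is driven entirely by the GSD losses rather than by initialization noise.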

4. Loss Formulation: Planarity Enforcement and Knowledge Preservation

The total objective merges two orthogonal criteria:

  • Hallucination Knowledge Re-editing (HKR): This term utilizes the Laplacian operator to enforce $\mathcal{L}(z) = 0$ within the ROI, promoting strict planarity:

$$\mathcal{L}_{\mathrm{HKR}} = \alpha_1 \, \overline{|\mathcal{L}(z)|}_m + \alpha_2 \sum_k w_k \ell_k$$

Here, $z$ is the normalized predicted depth, $m$ indicates the illusion ROI, and $\ell_k$ penalizes distance to locally fitted planes, with $w_k$ providing softmax gating over plane proposals. The gating module ensures robust matching to the most representative local planar candidate, or reverts to the teacher if required.

  • Non-hallucination Knowledge Preservation (NKP): This term self-distills the teacher’s outputs on background and boundary rings, preventing catastrophic forgetting and “leakage” of the enforced planarity:

$$\mathcal{L}_{\mathrm{NKP}} = \alpha_3 \, \overline{|z - z_T|}_{m_{\mathrm{bg}}} + \dots$$

Multiple boundary terms ($r_f$, $r_e$, $r_g$) ensure that the enforced planarity is locally gated and does not extend past the illusion boundaries.
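The boundary-ring construction can be illustrated with plain binary dilation around the ROI mask; the number of rings and their widths below are assumptions for illustration (the definitions of $r_f$, $r_e$, $r_g$ are not given here), and the helper names are hypothetical.

```python
# Hedged sketch: nested "boundary ring" masks just outside the illusion ROI,
# on which NKP-style distillation can pin the student to the teacher so the
# planarity edit does not leak past the illusion boundary.
import numpy as np

def dilate(mask: np.ndarray, iters: int) -> np.ndarray:
    """4-connected binary dilation, repeated `iters` times."""
    out = mask.copy()
    for _ in range(iters):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out

def boundary_rings(roi: np.ndarray, widths=(2, 4, 8)):
    """Disjoint nested rings outside the ROI, e.g. for per-ring loss weights."""
    rings, inner, total = [], roi, 0
    for w in widths:
        total += w
        outer = dilate(roi, total)
        rings.append(outer & ~inner)   # ring = newly grown shell only
        inner = outer
    return rings
```

Per-ring weights could then decay with distance from the ROI, giving a soft transition from enforced planarity to the teacher's untouched background prior.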

The full loss per branch is

$$\mathcal{L} = \mathcal{L}_{\mathrm{HKR}} + \mathcal{L}_{\mathrm{NKP}} + \text{gating regularizers}$$

The total objective sums both branches, weighted so that the crop stream (where the illusion is most prominent) dominates within the enforced ROI.
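A compact sketch of the per-branch objective under these definitions. The plane-proposal gating term ($\alpha_2 \sum_k w_k \ell_k$) is omitted, and the weights here are arbitrary placeholders, not the paper's values.

```python
# Hedged sketch: HKR-style Laplacian planarity penalty inside the illusion ROI
# plus NKP-style distillation to the frozen teacher on the background mask.
import torch
import torch.nn.functional as F

LAP_KERNEL = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def gsd_loss(z_student, z_teacher, roi, bg, a1=1.0, a3=1.0):
    """z_*: (B,1,H,W) normalized depth; roi/bg: (B,1,H,W) boolean masks."""
    roi = roi.float()
    bg = bg.float()
    lap = F.conv2d(z_student, LAP_KERNEL, padding=1)
    # HKR: mean |Laplacian| over the ROI -> pushes the ROI toward a plane.
    hkr = (lap.abs() * roi).sum() / roi.sum().clamp(min=1)
    # NKP: mean |z - z_T| over the background -> pins student to teacher.
    nkp = ((z_student - z_teacher).abs() * bg).sum() / bg.sum().clamp(min=1)
    return a1 * hkr + a3 * nkp
```

In the dual-view setup, this loss would be evaluated once per stream (full and crop) and the two values combined with the crop-biased weighting described above.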

5. Experimental Results and Ablations

On the 3D-Mirage benchmark, GSD demonstrates strong suppression of hallucinated nonplanarity and contextual drift:

| Model/Variant | DCS ↓ | CCS ↓ | NYU-v2 Acc ↑ | Background R² ↑ |
|---|---|---|---|---|
| DAv2-L (teacher) | 994.6 | 1.466e-3 | 90.13% | 100% |
| Full Finetune Enc. | 28.8 | 0.091e-3 | 63.01% | 62.02% |
| No HKR | 971.1 | 1.434e-3 | 90.15% | 93.74% |
| No NKP | 46.82 | 0.200e-3 | 87.99% | 84.48% |
| Ours | 64.20 | 0.204e-3 | 89.73% | 93.89% |

Relative improvement over the teacher: –93.5% DCS, –86.1% CCS. Qualitative maps confirm suppression of hallucinated bumps/pits exclusively within the annotated ROI, with background geometry unaffected. Ablations confirm both HKR and NKP are necessary: removing planarity enforcement (no HKR) leaves hallucinations intact; omitting NKP leads to over-flattening and global accuracy loss.

6. Limitations and Applicability

GSD, as currently formulated, addresses only planar “mirage” illusions within annotated ROIs and is validated exclusively on transformer-based encoder architectures. Broader ambiguities (textures, reflectance effects, adverse weather) remain unaddressed, and the framework does not yet extend to diffusion-based or generative MDE architectures. The need for precise ROI annotation and the constraint to static/frozen semantic teachers may limit applications in online or fully unsupervised settings. Extending the methodology to more diverse ambiguity sources and real-time correction remains an open research direction (Nguyen et al., 17 Dec 2025).

7. Significance and Future Directions

Grounded Self-Distillation provides a parameter- and sample-efficient route to structurally grounding foundation MDE models on challenging geometric datasets, notably without sacrificing learned representational power elsewhere. By introducing surgical geometric grounding within explicit spatial contexts, GSD represents a shift from aggregate pixel-wise depth metrics toward spatially and contextually robust depth prediction—an emergent requirement for safe and reliable perception modules in autonomous driving, robotics, and scene understanding. Future developments may focus on automated ROI discovery, extension to architecturally grounded models that inherently disentangle priors from pure geometry, and integration with real-time online hallucination detection (Nguyen et al., 17 Dec 2025).
