- The paper introduces a novel unsupervised finetuning paradigm with perturbation-based consistency to boost robustness in monocular depth estimation.
- The model employs a spatial distance constraint that preserves semantic boundaries and structural details even under significant image corruptions.
- Empirical evaluations show that DepthAnything-AC outperforms existing models on diverse benchmarks, particularly in adverse weather and low-light conditions.
DepthAnything-AC: Advancing Monocular Depth Estimation Robustness in Adverse Conditions
DepthAnything-AC introduces a significant advancement in monocular depth estimation (MDE) by addressing the persistent challenge of robust depth prediction under complex, real-world conditions such as adverse weather, illumination changes, and sensor-induced distortions. While recent foundation MDE models have demonstrated strong zero-shot generalization in standard scenarios, their performance degrades notably in the presence of image corruptions and environmental variability. DepthAnything-AC proposes a novel unsupervised finetuning paradigm and a geometric constraint to bridge this robustness gap, all while maintaining generalization on standard benchmarks.
Methodological Contributions
DepthAnything-AC is built upon two core innovations:
- Perturbation-Based Consistency Regularization: The model leverages a small, unlabeled dataset of general scenes and applies a diverse set of perturbations, simulating lighting changes, weather effects, blur, and contrast variations, to each image. The training objective enforces consistency between the model's predictions on the original and perturbed images, formalized as an affine-invariant loss on the predicted disparities, so the model learns invariance to these corruptions without requiring ground-truth depth for the perturbed data. To prevent drift from the original model's capabilities, a knowledge distillation loss is applied, using the frozen pre-trained model as a teacher on unperturbed images (see the sketch after this list).
- Spatial Distance Constraint: Recognizing that per-pixel losses are insufficient for capturing object boundaries and structural details, especially under corruption, the authors introduce a spatial distance constraint. This constraint computes a geometric distance matrix between all patch pairs, combining positional and predicted depth differences. The model is penalized if the spatial distance relations in perturbed images deviate from those in the original image as predicted by the frozen teacher, encouraging the network to preserve semantic boundaries and object structures even when texture information is degraded (sketched after the loss summary below).
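The following is a minimal PyTorch sketch of the perturbation-consistency objective, assuming a standard scale-and-shift (median and mean-absolute-deviation) normalization of the predicted disparities. The function names and the use of the clean-image prediction as a stop-gradient target are illustrative choices, not the paper's exact implementation; the knowledge distillation term can reuse the same loss with the frozen teacher's prediction on the unperturbed image as the target.

```python
import torch


def _normalize_disparity(d: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Per-image scale-and-shift normalization: subtract the median and divide
    # by the mean absolute deviation, making the comparison affine-invariant.
    b = d.shape[0]
    flat = d.reshape(b, -1)
    median = flat.median(dim=1, keepdim=True).values
    mad = (flat - median).abs().mean(dim=1, keepdim=True)
    return ((flat - median) / (mad + eps)).reshape_as(d)


def perturbation_consistency_loss(pred_perturbed: torch.Tensor,
                                  pred_clean: torch.Tensor) -> torch.Tensor:
    # L1 distance between affine-normalized disparities of the perturbed-image
    # prediction and the clean-image prediction (used as a stop-gradient target).
    target = _normalize_disparity(pred_clean).detach()
    return (_normalize_disparity(pred_perturbed) - target).abs().mean()
```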
The overall loss is a weighted sum of the consistency, knowledge distillation, and spatial distance losses, with empirical results showing insensitivity to the precise weighting.
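Below is a hedged sketch of how such a spatial distance constraint could be computed, assuming each patch is represented by its normalized (row, column, pooled depth) coordinates and patch pairs are compared with Euclidean distances. The patch size, the min-max depth normalization, and the equal weighting of positional and depth terms are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def spatial_distance_matrix(depth: torch.Tensor, patch: int = 14) -> torch.Tensor:
    # depth: (B, 1, H, W). Pool to patch level, min-max normalize per image so
    # positional and depth terms share a comparable scale (an illustrative
    # choice), then build the (N x N) Euclidean distance matrix between all
    # patch pairs, each patch represented by (row, col, depth) coordinates.
    b = depth.shape[0]
    d = F.avg_pool2d(depth, patch)
    flat = d.flatten(1)
    d_min = flat.min(dim=1).values.view(b, 1, 1, 1)
    d_max = flat.max(dim=1).values.view(b, 1, 1, 1)
    d = (d - d_min) / (d_max - d_min + 1e-6)
    gh, gw = d.shape[-2:]
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, gh, device=depth.device),
        torch.linspace(0, 1, gw, device=depth.device),
        indexing="ij",
    )
    coords = torch.stack([ys, xs], dim=0).expand(b, -1, -1, -1)       # (B, 2, gh, gw)
    feats = torch.cat([coords, d], dim=1).flatten(2).transpose(1, 2)  # (B, N, 3)
    return torch.cdist(feats, feats)                                  # (B, N, N)


def spatial_distance_loss(student_perturbed: torch.Tensor,
                          teacher_clean: torch.Tensor,
                          patch: int = 14) -> torch.Tensor:
    # Penalize deviation of the student's patch-pair distance relations on the
    # perturbed image from the frozen teacher's relations on the clean image.
    m_student = spatial_distance_matrix(student_perturbed, patch)
    m_teacher = spatial_distance_matrix(teacher_clean.detach(), patch)
    return (m_student - m_teacher).abs().mean()
```

The total objective would then be a weighted sum along the lines of `total = l_consistency + lam_kd * l_distill + lam_sd * l_spatial`, where the placeholder weights reflect the reported insensitivity to their exact values.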
Implementation and Training
DepthAnything-AC is finetuned from DepthAnythingV2, using a ViT-S backbone and DPT decoder. The training set comprises 540K unlabeled images—less than 1% of the data used for the original DepthAnything series—demonstrating the efficiency of the proposed paradigm. Training is conducted for 20 epochs on 4 RTX 3090 GPUs, with standard AdamW optimization and moderate batch sizes. The encoder is kept frozen during finetuning, which ablation studies show is critical for maintaining robust feature representations.
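As an illustration of the frozen-encoder setup, the snippet below freezes a hypothetical `model.encoder` module and builds an AdamW optimizer over the remaining parameters; the attribute name, learning rate, and weight decay are assumptions, not values from the paper.

```python
import torch


def build_finetune_optimizer(model: torch.nn.Module,
                             lr: float = 1e-5,
                             weight_decay: float = 1e-2) -> torch.optim.AdamW:
    # Freeze the pre-trained ViT encoder so its representations are preserved
    # (model.encoder is a hypothetical attribute name for the backbone).
    for p in model.encoder.parameters():
        p.requires_grad = False
    # Optimize only the parameters that remain trainable (e.g. the DPT decoder).
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```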
Perturbations are applied probabilistically, with ablation studies indicating that a combination of all four types (lighting, weather, blur, contrast) yields the best robustness. The spatial distance constraint is implemented using Euclidean distances between patch pairs, and the loss is computed between the student’s perturbed predictions and the teacher’s unperturbed outputs.
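A simplified sketch of a probabilistic perturbation pipeline is shown below; the specific transforms (gamma shift, haze blend, box blur, contrast scaling) and their parameter ranges are crude stand-ins for the paper's lighting, weather, blur, and contrast perturbations, not a reproduction of them.

```python
import random

import torch
import torch.nn.functional as F


def perturb(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # Apply each perturbation family independently with probability p.
    # img is a (B, 3, H, W) tensor with values in [0, 1].
    out = img
    if random.random() < p:  # lighting: random gamma shift
        gamma = random.uniform(0.4, 2.5)
        out = out.clamp(0, 1) ** gamma
    if random.random() < p:  # weather: crude fog-like haze blend
        haze = random.uniform(0.2, 0.6)
        out = (1 - haze) * out + haze * torch.ones_like(out)
    if random.random() < p:  # blur: depthwise box blur as a simple proxy
        k = 5
        kernel = torch.ones(out.shape[1], 1, k, k,
                            dtype=out.dtype, device=out.device) / (k * k)
        out = F.conv2d(F.pad(out, (k // 2,) * 4, mode="reflect"),
                       kernel, groups=out.shape[1])
    if random.random() < p:  # contrast: scale around the per-image mean
        c = random.uniform(0.3, 1.0)
        mean = out.mean(dim=(-2, -1), keepdim=True)
        out = (out - mean) * c + mean
    return out.clamp(0, 1)
```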
Empirical Results
DepthAnything-AC is evaluated on a comprehensive suite of benchmarks:
- Multi-condition DA-2K: On this high-resolution, diverse benchmark with synthetic corruptions, DepthAnything-AC achieves the highest accuracy across all conditions, outperforming both foundation and robust MDE models.
- Real-World Adverse Conditions: On NuScenes-night, RobotCar-night, and DrivingStereo (rain, fog, cloud), DepthAnything-AC consistently surpasses prior models in both AbsRel and δ1 (both metrics are sketched after this list), with notable improvements in challenging nighttime and weather scenarios.
- Synthetic Corruptions (KITTI-C): The model maintains or exceeds state-of-the-art performance under darkness, snow, motion blur, and Gaussian noise.
- General Benchmarks: On KITTI, NYU-D, Sintel, ETH3D, and DIODE, DepthAnything-AC matches the performance of leading foundation models, confirming that robustness enhancements do not compromise generalization.
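For reference, the two headline metrics follow their standard formulations: AbsRel is the mean absolute relative error, and δ1 is the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. The sketch below assumes valid-pixel masking has already been applied.

```python
import torch


def abs_rel(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Mean absolute relative error: |pred - gt| / gt, averaged over valid pixels.
    return ((pred - gt).abs() / gt).mean()


def delta1(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 1.25) -> torch.Tensor:
    # Fraction of pixels where max(pred/gt, gt/pred) is below the threshold.
    ratio = torch.maximum(pred / gt, gt / pred)
    return (ratio < thresh).float().mean()
```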
Ablation studies further validate the necessity of each component, with the spatial distance constraint and perturbation-based consistency both contributing to improved robustness. The model’s feature representations are shown to be more resilient to corruption, as visualized in qualitative analyses.
Implications and Future Directions
DepthAnything-AC demonstrates that robust monocular depth estimation in open-world conditions can be achieved without extensive labeled data or domain-specific finetuning. The unsupervised consistency paradigm, combined with geometric priors, provides a scalable path for adapting foundation models to real-world deployment scenarios, including robotics, autonomous driving, and multi-modal AI systems.
The strong numerical results, particularly on modern, high-quality benchmarks, highlight the limitations of traditional datasets for evaluating foundation MDE models. The approach’s reliance on unlabeled data and synthetic perturbations suggests a practical route for continual adaptation as new environmental challenges emerge.
Future research may explore:
- Extending the spatial distance constraint to 3D geometric reasoning or multi-view consistency.
- Integrating the paradigm with generative or diffusion-based depth models for further robustness.
- Automated selection or generation of perturbation types tailored to specific deployment domains.
- Real-time adaptation and online learning in dynamic environments.
DepthAnything-AC sets a new standard for robust, generalizable monocular depth estimation, providing both methodological insights and practical tools for advancing scene understanding in adverse real-world conditions.