- The paper introduces a novel unsupervised finetuning paradigm with perturbation-based consistency to boost robustness in monocular depth estimation.
- The model employs a spatial distance constraint that preserves semantic boundaries and structural details even under significant image corruptions.
- Empirical evaluations show that DepthAnything-AC outperforms existing models on diverse benchmarks, particularly in adverse weather and low-light conditions.
DepthAnything-AC: Advancing Monocular Depth Estimation Robustness in Adverse Conditions
DepthAnything-AC introduces a significant advancement in monocular depth estimation (MDE) by addressing the persistent challenge of robust depth prediction under complex, real-world conditions such as adverse weather, illumination changes, and sensor-induced distortions. While recent foundation MDE models have demonstrated strong zero-shot generalization in standard scenarios, their performance degrades notably in the presence of image corruptions and environmental variability. DepthAnything-AC proposes a novel unsupervised finetuning paradigm and a geometric constraint to bridge this robustness gap, all while maintaining generalization on standard benchmarks.
Methodological Contributions
DepthAnything-AC is built upon two core innovations:
- Perturbation-Based Consistency Regularization: The model leverages a small, unlabeled dataset of general scenes and applies a diverse set of perturbations, simulating lighting changes, weather effects, blur, and contrast variations, to each image. The training objective enforces consistency between the model's predictions on the original and perturbed images, formalized as an affine-invariant loss on the predicted disparities, so the model learns invariance to these corruptions without requiring ground-truth depth for the perturbed data. To prevent drift from the original model's capabilities, a knowledge distillation loss is applied, using the frozen pre-trained model as a teacher on unperturbed images (see the sketch after this list).
- Spatial Distance Constraint: Recognizing that per-pixel losses are insufficient for capturing object boundaries and structural details, especially under corruption, the authors introduce a spatial distance constraint. This constraint computes a geometric distance matrix between all patch pairs, combining positional and predicted depth differences. The model is penalized if the spatial distance relations in perturbed images deviate from those in the original image as predicted by the frozen teacher, encouraging the network to preserve semantic boundaries and object structures even when texture information is degraded (sketched after the loss summary below).
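The following is a minimal PyTorch sketch of the perturbation-consistency objective, assuming a standard scale-and-shift (median and mean-absolute-deviation) normalization of the predicted disparities. The function names and the use of the clean-image prediction as a stop-gradient target are illustrative choices, not the paper's exact implementation; the knowledge distillation term can reuse the same loss with the frozen teacher's prediction on the unperturbed image as the target.

```python
import torch


def _normalize_disparity(d: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Per-image scale-and-shift normalization: subtract the median and divide
    # by the mean absolute deviation, making the comparison affine-invariant.
    b = d.shape[0]
    flat = d.reshape(b, -1)
    median = flat.median(dim=1, keepdim=True).values
    mad = (flat - median).abs().mean(dim=1, keepdim=True)
    return ((flat - median) / (mad + eps)).reshape_as(d)


def perturbation_consistency_loss(pred_perturbed: torch.Tensor,
                                  pred_clean: torch.Tensor) -> torch.Tensor:
    # L1 distance between affine-normalized disparities of the perturbed-image
    # prediction and the clean-image prediction (used as a stop-gradient target).
    target = _normalize_disparity(pred_clean).detach()
    return (_normalize_disparity(pred_perturbed) - target).abs().mean()
```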
The overall loss is a weighted sum of the consistency, knowledge distillation, and spatial distance losses, with empirical results showing insensitivity to the precise weighting.
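Below is a hedged sketch of how such a spatial distance constraint could be computed, assuming each patch is represented by its normalized (row, column, pooled depth) coordinates and patch pairs are compared with Euclidean distances. The patch size, the min-max depth normalization, and the equal weighting of positional and depth terms are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def spatial_distance_matrix(depth: torch.Tensor, patch: int = 14) -> torch.Tensor:
    # depth: (B, 1, H, W). Pool to patch level, min-max normalize per image so
    # positional and depth terms share a comparable scale (an illustrative
    # choice), then build the (N x N) Euclidean distance matrix between all
    # patch pairs, each patch represented by (row, col, depth) coordinates.
    b = depth.shape[0]
    d = F.avg_pool2d(depth, patch)
    flat = d.flatten(1)
    d_min = flat.min(dim=1).values.view(b, 1, 1, 1)
    d_max = flat.max(dim=1).values.view(b, 1, 1, 1)
    d = (d - d_min) / (d_max - d_min + 1e-6)
    gh, gw = d.shape[-2:]
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, gh, device=depth.device),
        torch.linspace(0, 1, gw, device=depth.device),
        indexing="ij",
    )
    coords = torch.stack([ys, xs], dim=0).expand(b, -1, -1, -1)       # (B, 2, gh, gw)
    feats = torch.cat([coords, d], dim=1).flatten(2).transpose(1, 2)  # (B, N, 3)
    return torch.cdist(feats, feats)                                  # (B, N, N)


def spatial_distance_loss(student_perturbed: torch.Tensor,
                          teacher_clean: torch.Tensor,
                          patch: int = 14) -> torch.Tensor:
    # Penalize deviation of the student's patch-pair distance relations on the
    # perturbed image from the frozen teacher's relations on the clean image.
    m_student = spatial_distance_matrix(student_perturbed, patch)
    m_teacher = spatial_distance_matrix(teacher_clean.detach(), patch)
    return (m_student - m_teacher).abs().mean()
```

The total objective would then be a weighted sum along the lines of `total = l_consistency + lam_kd * l_distill + lam_sd * l_spatial`, where the placeholder weights reflect the reported insensitivity to their exact values.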
Implementation and Training
DepthAnything-AC is finetuned from DepthAnythingV2, using a ViT-S backbone and DPT decoder. The training set comprises 540K unlabeled images—less than 1% of the data used for the original DepthAnything series—demonstrating the efficiency of the proposed paradigm. Training is conducted for 20 epochs on 4 RTX 3090 GPUs, with standard AdamW optimization and moderate batch sizes. The encoder is kept frozen during finetuning, which ablation studies show is critical for maintaining robust feature representations.
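As an illustration of the frozen-encoder setup, the snippet below freezes a hypothetical `model.encoder` module and builds an AdamW optimizer over the remaining parameters; the attribute name, learning rate, and weight decay are assumptions, not values from the paper.

```python
import torch


def build_finetune_optimizer(model: torch.nn.Module,
                             lr: float = 1e-5,
                             weight_decay: float = 1e-2) -> torch.optim.AdamW:
    # Freeze the pre-trained ViT encoder so its representations are preserved
    # (model.encoder is a hypothetical attribute name for the backbone).
    for p in model.encoder.parameters():
        p.requires_grad = False
    # Optimize only the parameters that remain trainable (e.g. the DPT decoder).
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```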
Perturbations are applied probabilistically, with ablation studies indicating that a combination of all four types (lighting, weather, blur, contrast) yields the best robustness. The spatial distance constraint is implemented using Euclidean distances between patch pairs, and the loss is computed between the student’s perturbed predictions and the teacher’s unperturbed outputs.
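A simplified sketch of a probabilistic perturbation pipeline is shown below; the specific transforms (gamma shift, haze blend, box blur, contrast scaling) and their parameter ranges are crude stand-ins for the paper's lighting, weather, blur, and contrast perturbations, not a reproduction of them.

```python
import random

import torch
import torch.nn.functional as F


def perturb(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # Apply each perturbation family independently with probability p.
    # img is a (B, 3, H, W) tensor with values in [0, 1].
    out = img
    if random.random() < p:  # lighting: random gamma shift
        gamma = random.uniform(0.4, 2.5)
        out = out.clamp(0, 1) ** gamma
    if random.random() < p:  # weather: crude fog-like haze blend
        haze = random.uniform(0.2, 0.6)
        out = (1 - haze) * out + haze * torch.ones_like(out)
    if random.random() < p:  # blur: depthwise box blur as a simple proxy
        k = 5
        kernel = torch.ones(out.shape[1], 1, k, k,
                            dtype=out.dtype, device=out.device) / (k * k)
        out = F.conv2d(F.pad(out, (k // 2,) * 4, mode="reflect"),
                       kernel, groups=out.shape[1])
    if random.random() < p:  # contrast: scale around the per-image mean
        c = random.uniform(0.3, 1.0)
        mean = out.mean(dim=(-2, -1), keepdim=True)
        out = (out - mean) * c + mean
    return out.clamp(0, 1)
```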
Empirical Results
DepthAnything-AC is evaluated on a comprehensive suite of benchmarks:
- Multi-condition DA-2K: On this high-resolution, diverse benchmark with synthetic corruptions, DepthAnything-AC achieves the highest accuracy across all conditions, outperforming both foundation and robust MDE models.
- Real-World Adverse Conditions: On NuScenes-night, RobotCar-night, and DrivingStereo (rain, fog, cloud), DepthAnything-AC consistently surpasses prior models in both AbsRel and δ1 (both metrics are sketched after this list), with notable improvements in challenging nighttime and weather scenarios.
- Synthetic Corruptions (KITTI-C): The model maintains or exceeds state-of-the-art performance under darkness, snow, motion blur, and Gaussian noise.
- General Benchmarks: On KITTI, NYU-D, Sintel, ETH3D, and DIODE, DepthAnything-AC matches the performance of leading foundation models, confirming that robustness enhancements do not compromise generalization.
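For reference, the two headline metrics follow their standard formulations: AbsRel is the mean absolute relative error, and δ1 is the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. The sketch below assumes valid-pixel masking has already been applied.

```python
import torch


def abs_rel(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Mean absolute relative error: |pred - gt| / gt, averaged over valid pixels.
    return ((pred - gt).abs() / gt).mean()


def delta1(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 1.25) -> torch.Tensor:
    # Fraction of pixels where max(pred/gt, gt/pred) is below the threshold.
    ratio = torch.maximum(pred / gt, gt / pred)
    return (ratio < thresh).float().mean()
```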
Ablation studies further validate the necessity of each component, with the spatial distance constraint and perturbation-based consistency both contributing to improved robustness. The model’s feature representations are shown to be more resilient to corruption, as visualized in qualitative analyses.
Implications and Future Directions
DepthAnything-AC demonstrates that robust monocular depth estimation in open-world conditions can be achieved without extensive labeled data or domain-specific finetuning. The unsupervised consistency paradigm, combined with geometric priors, provides a scalable path for adapting foundation models to real-world deployment scenarios, including robotics, autonomous driving, and multi-modal AI systems.
The strong numerical results, particularly on modern, high-quality benchmarks, highlight the limitations of traditional datasets for evaluating foundation MDE models. The approach’s reliance on unlabeled data and synthetic perturbations suggests a practical route for continual adaptation as new environmental challenges emerge.
Future research may explore:
- Extending the spatial distance constraint to 3D geometric reasoning or multi-view consistency.
- Integrating the paradigm with generative or diffusion-based depth models for further robustness.
- Automated selection or generation of perturbation types tailored to specific deployment domains.
- Real-time adaptation and online learning in dynamic environments.
DepthAnything-AC sets a new standard for robust, generalizable monocular depth estimation, providing both methodological insights and practical tools for advancing scene understanding in adverse real-world conditions.