DepthAnything-AC: Robust Depth Estimation
- DepthAnything-AC is a robust depth estimation architecture that employs unsupervised regularization and physics-inspired augmentations to generalize across adverse environments.
- It integrates a transformer-based backbone with a dense prediction decoder to produce detailed, per-pixel depth maps while preserving semantic boundaries.
- The model achieves strong zero-shot accuracy on real-world, corrupted, and synthetic benchmarks, making it well suited to applications in autonomous navigation and robotics.
Depth Anything at Any Condition (DepthAnything-AC) refers to a foundation-level monocular depth estimation architecture and training paradigm designed to produce robust, generalizable, and detailed depth predictions under a wide spectrum of environmental, adverse, and sensor-induced conditions. DepthAnything-AC builds on the DepthAnythingV2 backbone, introducing unsupervised regularization, specialized geometric constraints, and targeted data augmentation to deliver consistent zero-shot accuracy across conventional, adverse-weather, corrupted, and synthetic benchmarks while requiring dramatically less supervision.
1. Architecture and Training Paradigm
DepthAnything-AC employs a transformer-based backbone (ViT-S) inherited from DepthAnythingV2, using a DPT (Dense Prediction Transformer) decoder for producing dense per-pixel depth predictions. The architectural core prioritizes effective patch-level feature representation and semantic boundary preservation under a variety of visual perturbations.
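As a schematic illustration of this encoder-decoder composition, the PyTorch sketch below pairs a patch-token transformer encoder with a simple dense head. `TinyViTEncoder`, `SimpleDPTHead`, and all layer sizes are illustrative stand-ins, not the actual DepthAnythingV2 modules; they are chosen only to mirror a ViT-S-style patch layout.

```python
# Minimal sketch of the patch-encoder + dense-decoder structure described above.
# All module names and sizes are illustrative assumptions, not the real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViTEncoder(nn.Module):
    """Stand-in patch encoder: 14x14 patches -> transformer token features."""
    def __init__(self, patch=14, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True),
            num_layers=2,
        )

    def forward(self, x):
        tokens = self.proj(x)                      # (B, dim, H/14, W/14)
        b, c, h, w = tokens.shape
        tokens = self.blocks(tokens.flatten(2).transpose(1, 2))
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class SimpleDPTHead(nn.Module):
    """Stand-in dense head: upsample the token grid to per-pixel depth."""
    def __init__(self, dim=384):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, feats, out_hw):
        depth = self.head(feats)
        return F.interpolate(depth, size=out_hw, mode="bilinear", align_corners=False)

encoder, head = TinyViTEncoder(), SimpleDPTHead()
img = torch.randn(1, 3, 224, 224)                    # side length divisible by 14
depth = head(encoder(img), out_hw=img.shape[-2:])    # (1, 1, 224, 224) relative depth
```

In the actual DPT design, the decoder fuses features from multiple transformer stages rather than only the final one, which is what preserves fine boundaries in the dense output.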
To circumvent the scarcity and unreliability of labeled/pseudo-labeled data in difficult conditions (e.g., severe weather, extreme illumination), DepthAnything-AC introduces an unsupervised consistency regularization paradigm:
- Perturbation-based Regularization: Each unlabeled sample is perturbed using stochastic, physics-inspired augmentations simulating darkness, fog, snow, blur, and contrast changes. Both weakly and strongly augmented versions are generated for each input.
- Consistency Objective: The network is trained to minimize the discrepancy between outputs from weakly and strongly perturbed versions of the same image, enforcing invariance to environmental corruption.
- Knowledge Distillation Loss: To ensure that general scene performance does not degrade, the original (frozen) foundation model’s predictions on undisturbed images act as a target for the student model, providing “anchor” supervision even when labels are unavailable in the corrupted domain.
The paradigm uses only moderate-scale unlabeled data (540k images), a significant reduction compared to prior works that typically require tens of millions of samples.
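A minimal training-step sketch of this paradigm, assuming PyTorch, is given below. `student`, `frozen_teacher`, `weak_aug`, `strong_aug`, and `affine_invariant_loss` are hypothetical names standing in for the components described above, and stopping gradients through the weak branch is one possible design choice rather than the paper's stated recipe.

```python
# Schematic unsupervised training step: consistency + anchor distillation.
# All callables are supplied by the caller; names are illustrative only.
import torch

def training_step(student, frozen_teacher, images,
                  weak_aug, strong_aug, affine_invariant_loss):
    x_weak = weak_aug(images)      # mild photometric perturbation
    x_strong = strong_aug(images)  # darkness / fog / snow / blur / contrast

    d_weak = student(x_weak)
    d_strong = student(x_strong)
    with torch.no_grad():
        d_anchor = frozen_teacher(x_weak)   # frozen foundation model as "anchor"

    # Invariance to corruption: the strong prediction should match the weak one.
    loss_cons = affine_invariant_loss(d_strong, d_weak.detach())
    # Retain general-scene behaviour by distilling from the frozen model.
    loss_kd = affine_invariant_loss(d_weak, d_anchor)
    # The spatial distance term (Section 2) would be added here as well.
    return loss_cons + loss_kd
```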
2. Spatial Distance Constraint and Geometric Prior
A central innovation is the Spatial Distance Constraint, which operationalizes spatial structure as an explicit objective during model training. This moves beyond conventional per-pixel loss and introduces patch-level geometry awareness:
- Spatial Distance Relation (SDR): For each image, relations between patch pairs are characterized both in pixel-space (Euclidean distance) and in disparity/depth prediction differences.
- Spatial Distance Loss: Enforces that the SDR relations predicted on strongly perturbed images remain consistent with those predicted by the frozen model on weakly augmented images.
This approach targets semantic boundary sharpness and fine structural detail, commonly vulnerable to adversarial or noisy input transformations.
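A minimal sketch of one plausible patch-level formulation is shown below, assuming PyTorch. It captures only the depth-difference half of the relation (the pixel-space Euclidean-distance term is omitted for brevity), and the pooling-based `patch_depth_relations` helper is an illustrative assumption rather than the paper's exact definition.

```python
# Illustrative patch-level spatial-distance relation and consistency loss.
import torch
import torch.nn.functional as F

def patch_depth_relations(depth, patch=14):
    """Pool a depth map into patches and return pairwise depth differences."""
    # depth: (B, 1, H, W) -> patch means: (B, N) with N = (H/patch) * (W/patch)
    pooled = F.avg_pool2d(depth, kernel_size=patch, stride=patch).flatten(1)
    # Pairwise differences between every pair of patches: (B, N, N)
    return pooled.unsqueeze(2) - pooled.unsqueeze(1)

def spatial_distance_loss(depth_student, depth_anchor, patch=14):
    """Match the student's patch-pair relations to the anchor model's."""
    rel_s = patch_depth_relations(depth_student, patch)
    rel_a = patch_depth_relations(depth_anchor, patch)
    return (rel_s - rel_a).abs().mean()

# Example: two (1, 1, 224, 224) depth maps yield a scalar loss
d_s, d_a = torch.rand(1, 1, 224, 224), torch.rand(1, 1, 224, 224)
loss = spatial_distance_loss(d_s, d_a)
```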
3. Training Losses and Optimization
The total training objective integrates three components:
- Unsupervised Consistency Loss ($\mathcal{L}_{\mathrm{cons}}$): Affine-invariant discrepancy between weakly and strongly perturbed predictions.
- Knowledge Distillation Loss ($\mathcal{L}_{\mathrm{KD}}$): Affine-invariant discrepancy between the new and frozen model outputs on weakly augmented data.
- Spatial Distance Loss ($\mathcal{L}_{\mathrm{SD}}$): Patch-level geometric consistency as described above.
The aggregated loss is
$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{cons}} + \lambda_{2}\,\mathcal{L}_{\mathrm{KD}} + \lambda_{3}\,\mathcal{L}_{\mathrm{SD}},$$
where the weights $\lambda_{i}$ balance the three terms.
The affine-invariant loss between a prediction $d$ and a target $d^{*}$ is defined as
$$\mathcal{L}_{\mathrm{affine}}(d, d^{*}) = \frac{1}{HW}\sum_{i=1}^{HW}\left|\hat{d}_{i} - \hat{d}^{*}_{i}\right|, \qquad \hat{d}_{i} = \frac{d_{i} - \mu(d)}{\sigma(d)},$$
with normalization via the mean $\mu(d)$ and the mean absolute deviation $\sigma(d) = \frac{1}{HW}\sum_{i=1}^{HW}\left|d_{i} - \mu(d)\right|$.
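The affine-invariant term translates directly into a few lines of PyTorch. The sketch below follows the mean / mean-absolute-deviation normalization stated above, with a small `eps` added for numerical stability (an implementation assumption).

```python
# Affine-invariant loss: compare two depth maps up to a scale/shift transform.
import torch

def affine_invariant_loss(pred, target, eps=1e-6):
    def normalize(d):
        mu = d.mean(dim=(-2, -1), keepdim=True)                   # per-image mean
        sigma = (d - mu).abs().mean(dim=(-2, -1), keepdim=True)   # mean absolute deviation
        return (d - mu) / (sigma + eps)
    return (normalize(pred) - normalize(target)).abs().mean()
```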
4. Robust Data Augmentation for Coverage of Diverse Conditions
The perturbation schedule includes:
- Illumination changes: Simulating night/darkness and sensor noise.
- Weather conditions: Fog generated with diamond-square noise; snow via spatially distributed Gaussian noise.
- Blurs: Motion and zoom blurs achieved through controlled filtering.
- Contrast shifts: Centered transformations simulating exposure and color variation.
Augmentations are applied stochastically, ensuring the model is exposed to, and forced to develop invariance against, a wide suite of adverse real-world and sensor-induced corruptions.
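The snippet below sketches such a stochastic schedule with simple image-space stand-ins, assuming PyTorch/torchvision: a uniform white blend approximates fog instead of diamond-square noise, Gaussian blur stands in for motion/zoom blur, and all magnitudes are placeholder assumptions rather than the paper's calibrated settings.

```python
# Illustrative stochastic perturbation schedule (simplified stand-ins).
import random
import torch
import torchvision.transforms.functional as TF

def strong_perturb(img):
    """img: (3, H, W) float tensor in [0, 1]; applies one random corruption."""
    choice = random.choice(["dark", "fog", "blur", "contrast"])
    if choice == "dark":                        # low illumination plus sensor noise
        img = img * 0.25 + 0.02 * torch.randn_like(img)
    elif choice == "fog":                       # crude fog: blend toward white
        alpha = random.uniform(0.3, 0.7)
        img = (1 - alpha) * img + alpha * torch.ones_like(img)
    elif choice == "blur":                      # Gaussian blur as a blur stand-in
        img = TF.gaussian_blur(img, kernel_size=[9, 9])
    else:                                       # contrast / exposure shift
        img = TF.adjust_contrast(img, random.uniform(0.5, 1.5))
    return img.clamp(0, 1)
```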
5. Empirical Performance and Generalization
DepthAnything-AC demonstrates leading or competitive zero-shot accuracy in a suite of challenging benchmarks:
- DA-2K Multi-Condition: Highest accuracy across snow, fog, blur, and darkness corruptions relative to all prior monocular and foundation models.
- Real-World Datasets (NuScenes-night, RobotCar-night, DrivingStereo-rain/fog/cloud): Best or near-best absolute relative error (AbsRel) and threshold accuracy ($\delta_1$), with notable gains in boundary preservation (see paper Table 2); a minimal sketch of these metrics follows this list.
- Synthetic Corruption (KITTI-C): Small but consistent improvements over leading foundation models under controlled corruption.
- General Scene Benchmarks (KITTI, NYU-D, ETH3D, DIODE, Sintel): Maintains parity with specialized foundation models, indicating no trade-off in performance under unperturbed conditions.
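For reference, the two headline metrics reduce to short formulas. The NumPy sketch below is a minimal illustration of AbsRel and $\delta_1$, not the benchmarks' official evaluation code.

```python
# Standard monocular-depth evaluation metrics (minimal illustration).
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error over valid (gt > 0) pixels."""
    mask = gt > 0
    return np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask])

def delta1(pred, gt):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return np.mean(ratio < 1.25)
```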
Ablation studies confirm that perturbation-based consistency and the spatial distance constraint are critical for the observed improvements, particularly at semantic object boundaries and in the presence of severe environmental or sensor noise. Freezing the encoder during fine-tuning is essential for avoiding overfitting and maximizing generalization.
6. Practical Relevance, Applications, and Accessibility
DepthAnything-AC is designed to deliver robust monocular depth cues across domains commonly encountered in robotics, autonomous systems, and general computer vision:
- Autonomous navigation: Reliable estimation in night, rain, fog, or mixed-adverse scenarios.
- Robotics: Maintains accuracy under lighting and sensor degradation, essential for manipulation and SLAM.
- Multimodal content understanding: Provides dense depth cues for downstream AI models in content creation, AR, VR, and vision-language tasks.
The approach is highly data-efficient; it reaches or exceeds benchmarks set by earlier depth foundation models with less than 1% of their training data (~540k images). All code, trained models, and supplementary material are accessible at the project web page and GitHub repository.
| Model Component | Description/Setting |
|---|---|
| Backbone | ViT-S encoder, DPT decoder (DepthAnythingV2) |
| Loss | Affine-invariant, consistency, distillation, SDR |
| Fine-tuning Data | 540k unlabeled images (random scenes) |
| Evaluation Benchmarks | DA-2K, KITTI, NYU-D, NuScenes-night, KITTI-C, etc. |
| Practical Highlights | Boundary/detail robustness; all-condition handling |
| Code/Project | Webpage and open-source on GitHub |
7. Accessibility and Future Directions
DepthAnything-AC is fully open-sourced: all models, codebase, and usage instructions are hosted at https://ghost233lism.github.io/depthanything-AC-page and https://github.com/HVision-NKU/DepthAnythingAC. The approach fosters extensibility, supporting ongoing improvement via later backbones, tailored augmentations, and integration into larger multimodal or perception frameworks.
The empirical results indicate no trade-off between robustness to adverse conditions and standard scene accuracy, validating the paradigm’s practicality for universal monocular depth estimation under real-world, open-set, multi-condition scenarios—a central goal for the "Depth Anything at Any Condition" vision.