DepthAnything-AC: Robust Depth Estimation
- The paper introduces an unsupervised finetuning paradigm that leverages perturbation consistency and spatial distance constraints to achieve strong zero-shot depth estimation under adverse conditions.
- It pairs a frozen ViT-S encoder with a DPT decoder to reduce reliance on labeled data, while patch-level geometric modeling preserves sharp object boundaries.
- Empirical results show superior performance on challenging benchmarks—including adverse weather and low-light scenarios—with data efficiency from using only 540,000 unlabeled images.
DepthAnything-AC, short for "Depth Anything at Any Condition," refers to a foundation monocular depth estimation (MDE) model designed to generalize robustly across a vast spectrum of environmental conditions, including adverse weather, illumination variations, and sensor-induced distortions. Built atop the DepthAnything V2 architecture, the model introduces unsupervised finetuning techniques and novel geometric constraints that enable strong zero-shot performance on both challenging and standard benchmarks, all while requiring only moderate quantities of unlabeled general-domain data.
1. Foundation Model Architecture and Training Paradigm
DepthAnything-AC is based on the DepthAnything V2 family, utilizing a Vision Transformer Small (ViT-S) backbone as encoder (frozen during finetuning) and a DPT (Dense Prediction Transformer) decoder. The pipeline is specifically designed to avoid reliance on vast labeled datasets or pseudo-labels, which are often unreliable under severe real-world degradations. Instead, it employs an unsupervised, perturbation-based consistency regularization approach, supplemented by a novel spatial distance constraint, all trained with a combined, modular loss.
The key components are:
- Encoder: Pre-trained and frozen ViT-S to preserve general-purpose visual features, thus enhancing robustness when adapting to new conditions.
- Decoder: Standard DPT head, supporting dense depth prediction for all input pixels.
- Training: A dual-branch data pipeline where, for each input image, one branch applies standard augmentations (clean), and the other applies strong, domain-relevant perturbations simulating real-world corruptions (e.g., darkness, rain, blur).
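The dual-branch pipeline can be sketched as follows, assuming a PyTorch/torchvision setup; the specific transforms and parameters below are illustrative stand-ins for the paper's perturbation set (which simulates darkness, weather, and blur), not the released implementation.

```python
import torchvision.transforms as T

# Branch 1: standard "clean" augmentations (geometry + tensor conversion only).
clean_aug = T.Compose([
    T.RandomResizedCrop(518, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# Branch 2: strong photometric perturbations applied on top of the clean view,
# roughly mimicking low light (darkening), haze (contrast loss), and blur.
# Purely photometric, so the two views stay pixel-aligned for consistency training.
strong_perturb = T.Compose([
    T.ColorJitter(brightness=(0.2, 0.6), contrast=(0.3, 0.8)),
    T.GaussianBlur(kernel_size=9, sigma=(1.0, 3.0)),
])

def make_views(pil_image):
    """Return (clean view, perturbed view) for one unlabeled training image."""
    x_clean = clean_aug(pil_image)      # (3, 518, 518) float tensor
    x_pert = strong_perturb(x_clean)    # same geometry, corrupted appearance
    return x_clean, x_pert
```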
2. Unsupervised Consistency Regularization
To address data scarcity and the infeasibility of obtaining reliable annotations for highly degraded conditions, the model leverages unlabeled general-domain imagery. Its central training paradigm enforces consistency of depth predictions between the cleanly augmented and heavily perturbed versions of the same input.
- Perturbation-driven consistency: For each image $x$, two views are generated:
  - $x_{\text{clean}}$: clean augmentations only.
  - $x_{\text{pert}}$: additional synthetic corruption (e.g., simulating adverse weather, strong blur, contrast changes).
- Consistency Loss: $\mathcal{L}_{\text{cons}} = \rho\big(f_\theta(x_{\text{pert}}),\, f_\theta(x_{\text{clean}})\big)$, where $\rho$ is an affine-invariant loss allowing depth consistency up to scale and shift, so the model learns relative depth unaffected by absolute perturbations.
- Knowledge Distillation on Clean Data: Prevents the model from degrading on in-domain inputs by aligning its outputs on clean images with those of the initial frozen model: $\mathcal{L}_{\text{KD}} = \rho\big(f_\theta(x_{\text{clean}}),\, f_{\theta_0}(x_{\text{clean}})\big)$, where $f_{\theta_0}$ is the frozen base model.
- Overall Loss Function: $\mathcal{L} = \mathcal{L}_{\text{cons}} + \lambda_{\text{KD}}\,\mathcal{L}_{\text{KD}} + \lambda_{\text{SDC}}\,\mathcal{L}_{\text{SDC}}$, where $\lambda_{\text{KD}}$ and $\lambda_{\text{SDC}}$ are weighting hyperparameters and $\mathcal{L}_{\text{SDC}}$ is the spatial distance constraint introduced in Section 3.
This approach, requiring only 540,000 unlabeled images for finetuning, achieves broad robustness and data efficiency, contrasting with the labor-intensive, 60M+ image pretraining of previous models.
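As a concrete illustration of these terms, the sketch below implements an affine-invariant loss $\rho$ via per-image scale-and-shift alignment (in the MiDaS style); the function names, tensor shapes, and the use of mean absolute error are assumptions for illustration, not the paper's exact code.

```python
import torch

def align_scale_shift(pred, target):
    """Per-image least-squares scale s and shift b such that s*pred + b ≈ target.

    pred, target: (B, H, W) relative depth maps; returns the aligned
    prediction flattened to (B, H*W).
    """
    p = pred.flatten(1)
    t = target.flatten(1)
    p_mean = p.mean(dim=1, keepdim=True)
    t_mean = t.mean(dim=1, keepdim=True)
    var = ((p - p_mean) ** 2).mean(dim=1, keepdim=True).clamp_min(1e-6)
    s = ((p - p_mean) * (t - t_mean)).mean(dim=1, keepdim=True) / var
    b = t_mean - s * p_mean
    return s * p + b

def affine_invariant_loss(pred, target):
    """rho(pred, target): mean absolute error after aligning pred to target."""
    aligned = align_scale_shift(pred, target)
    return (aligned - target.flatten(1)).abs().mean()
```

Under this sketch, the consistency term becomes `affine_invariant_loss(model(x_pert), model(x_clean).detach())` and the distillation term becomes `affine_invariant_loss(model(x_clean), teacher(x_clean))`, with the teacher held fixed.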
3. Spatial Distance Constraint for Patch-level Geometry
A core innovation is the spatial distance constraint, which regularizes not individual pixel values but relative geometric relationships between patches in the depth map, addressing shortcomings in boundary sharpness and preserving spatial structure under adverse conditions.
- Spatial Distance Relation (SDR) Matrix ($R$): models both spatial proximity and depth similarity for all patch pairs.
  - Position Relation ($P$): $P_{ij} = \lVert \mathbf{c}_i - \mathbf{c}_j \rVert_2$ encodes the spatial distance between the centers $\mathbf{c}_i, \mathbf{c}_j$ of patches $i$ and $j$.
  - Depth Relation ($D$): $D_{ij} = \lvert d_i - d_j \rvert$ encodes the absolute disparity difference between the predicted depths of the two patches.
  - Combined Relation: $P$ and $D$ are fused into the relation matrix $R$, which jointly captures where patches lie and how their depths differ.
- Spatial Distance Constraint Loss: $\mathcal{L}_{\text{SDC}} = \lVert R^{\text{pert}} - R^{\text{clean}} \rVert_1$, computed between the relation matrices of the perturbed-view and clean-view predictions.
This encourages the geometry and structure of the perturbed-image depth map to match that of the clean input, fostering robust object structure awareness and sharper semantic boundaries.
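A minimal sketch of such a patch-level constraint is given below, assuming square non-overlapping patches and an L1 match between relation matrices; the exact way the paper fuses the position and depth relations may differ, so the proximity weighting here is only illustrative.

```python
import torch
import torch.nn.functional as F

def patch_relation_matrix(depth, patch=14):
    """Build a patch-pair relation matrix from a (B, H, W) depth prediction.

    H and W are assumed divisible by `patch`. Returns a (B, P, P) matrix that
    combines the spatial distance between patch centers with the absolute
    difference of the patches' mean depths.
    """
    B, H, W = depth.shape
    # Mean depth per patch: (B, P)
    d_patch = F.avg_pool2d(depth.unsqueeze(1), patch).flatten(1)

    # Patch-center grid coordinates: (P, 2)
    hh, ww = H // patch, W // patch
    ys, xs = torch.meshgrid(torch.arange(hh), torch.arange(ww), indexing="ij")
    centers = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float().to(depth.device)

    # Position relation: Euclidean distance between patch centers, (P, P)
    pos_rel = torch.cdist(centers, centers)

    # Depth relation: absolute difference of patch mean depths, (B, P, P)
    depth_rel = (d_patch.unsqueeze(2) - d_patch.unsqueeze(1)).abs()

    # Illustrative fusion: emphasize depth differences between nearby patches.
    pos_weight = 1.0 / (1.0 + pos_rel)
    return depth_rel * pos_weight.unsqueeze(0)

def spatial_distance_loss(d_pert, d_clean, patch=14):
    """L1 match between the relation matrices of perturbed and clean predictions."""
    r_pert = patch_relation_matrix(d_pert, patch)
    r_clean = patch_relation_matrix(d_clean, patch).detach()
    return (r_pert - r_clean).abs().mean()
```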
4. Empirical Performance and Benchmark Evaluations
DepthAnything-AC demonstrates strong zero-shot generalization and state-of-the-art performance across various evaluation settings:
- DA-2K and Corrupted DA-2K (dark, fog, snow, blur): Outperforms all prior foundation and robust MDE models on accuracy (δ1), e.g., $0.953$ on clean, $0.923$ on dark, $0.929$ on fog.
- Real-World Adverse Benchmarks (NuScenes-night, RobotCar-night, DrivingStereo-rain/fog/cloud): Achieves higher δ1 and lower AbsRel scores than competitors, especially under challenging nighttime or weathered scenes.
- Synthetic Corruption Benchmarks (KITTI-C): Matches or slightly exceeds top models under severe synthetic perturbations.
- General Datasets (KITTI, NYUv2, Sintel, ETH3D, DIODE): Performance remains competitive with state-of-the-art, indicating no trade-off in generalizability to standard conditions.
- Data Efficiency: Robustness is achieved with only 0.8% as much finetuning data as used in previous foundation depth models.
Ablation studies confirm that each component (perturbation consistency, knowledge distillation, and especially the spatial distance constraint) contributes incrementally and cumulatively to robustness and qualitative performance. Freezing the encoder is also critical for preserving generalizable features.
5. Applications and Broader Impact
The DepthAnything-AC methodology enables robust monocular depth estimation for diverse downstream applications:
- Autonomous Driving: Maintains reliability across rain, snow, night, and sensor-induced noise.
- Robotics: Ensures stable depth for navigation and interaction under variable real-world conditions.
- Augmented Reality and 3D Reconstruction: Provides sharp depth boundaries for object segmentation and scene composition.
- AI-Generated Content: Supports realistic spatial modeling in synthetic digital scenes.
- Multi-modal AI Systems: Enhances scene understanding in vision-language models and agents that require geometric awareness.
The spatial distance constraint in particular suggests that patch-wise geometric modeling may be broadly valuable for other dense prediction tasks that require fidelity at object boundaries and in low-texture regions.
6. Methodological Innovations and Efficiency
| Component | Description | Benefit |
| --- | --- | --- |
| Perturbation Consistency | Enforces consistent predictions on clean and corrupted views of the same image | Robustness to real-world degradations |
| Knowledge Distillation | Aligns outputs with the frozen base model on clean data | Prevents loss of generalization |
| Spatial Distance Constraint | Regularizes patch-wise geometric relations | Sharper object boundaries |
| Data Efficiency | Adapts with only ~0.8% of the data used by prior foundation depth models | Low-cost, scalable finetuning |
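To show how these pieces might compose, the following sketch outlines a single unsupervised finetuning step with the encoder kept frozen, reusing the hypothetical `affine_invariant_loss` and `spatial_distance_loss` helpers from the earlier sketches; the `model.encoder` attribute and the loss weights are assumptions, not the released implementation.

```python
import torch

def finetune_step(model, teacher, optimizer, x_clean, x_pert,
                  lam_kd=1.0, lam_sdc=1.0):
    """One unsupervised finetuning step; loss weights are illustrative."""
    # Keep the pretrained ViT-S encoder frozen to preserve general features
    # (assumes the model exposes an `encoder` submodule).
    for p in model.encoder.parameters():
        p.requires_grad_(False)

    d_clean = model(x_clean)                 # (B, H, W) relative depth
    d_pert = model(x_pert)
    with torch.no_grad():
        d_teacher = teacher(x_clean)         # frozen base model prediction

    loss = (affine_invariant_loss(d_pert, d_clean.detach())        # consistency
            + lam_kd * affine_invariant_loss(d_clean, d_teacher)   # distillation
            + lam_sdc * spatial_distance_loss(d_pert, d_clean))    # spatial constraint

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```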
DepthAnything-AC sets a precedent for scalable and efficient transfer of foundation models to open-world scenarios and previously underrepresented conditions, with no need for paired corrupt/clean data or labeled depth during adaptation.
7. Future Prospects
- Extension to Other Dense Prediction Tasks: The demonstrated benefit of spatial-geometric regularization via SDR may motivate research in semantic segmentation, normal estimation, and related tasks.
- Open-world Model Adaptation: The scalable, unsupervised finetuning paradigm is adaptable for future foundation models, facilitating practical deployments in new environments with minimal annotation labor.
- Benchmark Development: The effectiveness of DepthAnything-AC on modern, high-quality, multi-condition datasets (e.g., DA-2K) highlights the importance of such resources for progress in robust perception research.
DepthAnything-AC exemplifies a practical, theoretically motivated approach to delivering robust monocular depth estimation under any condition. Its architecture, training strategy, and loss design are directly substantiated by the technical content of the referenced work.