Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (2401.10891v2)

Published 19 Jan 2024 in cs.CV

Abstract: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.

Authors (6)
  1. Lihe Yang (12 papers)
  2. Bingyi Kang (39 papers)
  3. Zilong Huang (43 papers)
  4. Xiaogang Xu (63 papers)
  5. Jiashi Feng (297 papers)
  6. Hengshuang Zhao (118 papers)
Citations (381)

Summary

  • The paper demonstrates that scaling up unlabeled data for monocular depth estimation yields a robust foundation model across diverse scenes.
  • The paper employs innovative data augmentation and feature alignment techniques to imbue the model with strong semantic priors and enhanced generalization.
  • The resulting model achieves state-of-the-art performance on benchmarks like NYUv2 and KITTI, while showing promise as a versatile multi-task encoder.

Overview

In computer vision, monocular depth estimation (MDE) plays a crucial role across a wide array of applications, including robotics, autonomous driving, and virtual reality. The work "Depth Anything" marks a significant stride in the field, proposing a simple yet effective approach to building a foundation model that handles any image under any circumstances by leveraging large-scale unlabeled data.

Methodology

The core proposition of "Depth Anything" is to scale up the dataset by mining massive volumes of unlabeled images (~62M), which are easy to collect, cover a diverse range of scenes, and thereby improve the model's generalization. To exploit such data effectively, the authors adopt two strategies:

  1. Data Augmentation for Robust Representations: Strong perturbations, namely color distortions and spatial distortion (CutMix), are applied to unlabeled images during the re-training phase to create a more challenging optimization target, pushing the model to actively acquire extra visual knowledge and robust representations (see the first sketch after this list).
  2. Semantic Priors from Pre-trained Encoders: An auxiliary supervision mechanism compels the model to inherit semantic knowledge from a pre-trained encoder, replacing the traditional auxiliary semantic segmentation task. The authors opt for a feature alignment loss so that the model captures informative semantic signals without compromising the part-level discriminative representations crucial for depth estimation (see the second sketch below).
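
The following is a minimal PyTorch-style sketch of the first strategy, assuming a frozen teacher that pseudo-labels clean unlabeled images and a student trained on strongly perturbed versions of the same images. The augmentation parameters, function names, and loss interface are illustrative, not the authors' exact implementation.

```python
import random
import torch
import torchvision.transforms as T

# Strong color perturbations applied only to the student's input
# (illustrative values; images are assumed to be float tensors in [0, 1]).
color_aug = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=5),
])

def cutmix(images, labels):
    """Paste a random rectangle from a shuffled batch into each image and
    apply the same paste to the pseudo depth labels (shape (B, H, W))."""
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    cut_h = int(h * random.uniform(0.3, 0.5))
    cut_w = int(w * random.uniform(0.3, 0.5))
    y, x = random.randint(0, h - cut_h), random.randint(0, w - cut_w)
    images, labels = images.clone(), labels.clone()
    images[:, :, y:y + cut_h, x:x + cut_w] = images[perm, :, y:y + cut_h, x:x + cut_w]
    labels[:, y:y + cut_h, x:x + cut_w] = labels[perm, y:y + cut_h, x:x + cut_w]
    return images, labels

def unlabeled_step(teacher, student, images, depth_loss):
    """One training step on unlabeled images: the teacher sees the clean
    image, the student sees a strongly perturbed one."""
    with torch.no_grad():
        pseudo_depth = teacher(images)                    # pseudo labels on clean input
    strong = color_aug(images)                            # color distortion
    strong, pseudo_depth = cutmix(strong, pseudo_depth)   # spatial distortion (CutMix)
    pred = student(strong)
    return depth_loss(pred, pseudo_depth)                 # e.g. an affine-invariant depth loss
```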
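
The second strategy can be sketched as an auxiliary feature-alignment objective, assuming a frozen semantic encoder (e.g., DINOv2) provides target features at the same resolution as the depth model's features. The tolerance margin, which stops pulling pixels that are already well aligned so that part-level detail useful for depth is preserved, is shown with an illustrative value.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feats, frozen_feats, margin=0.85):
    """Align the depth model's features with those of a frozen semantic
    encoder, ignoring pixels whose cosine similarity already exceeds the
    tolerance margin.

    student_feats, frozen_feats: (B, C, H, W) feature maps, assumed to have
    been projected/interpolated to a common shape beforehand.
    """
    cos = F.cosine_similarity(student_feats, frozen_feats, dim=1)  # (B, H, W)
    loss_per_pixel = 1.0 - cos
    mask = (cos < margin).float()  # only pull pixels that are not yet aligned
    return (loss_per_pixel * mask).sum() / mask.sum().clamp(min=1.0)
```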

Results

The "Depth Anything" model demonstrates remarkable generalization abilities across various public datasets and everyday photos. When fine-tuning this model with specific metric depth information from well-known datasets such as NYUv2 and KITTI, it achieved state-of-the-art (SOTA) results, surpassing previous models significantly. Additionally, by coupling this improved depth model with a controller (ControlNet), enhanced image synthesis results were obtained, showcasing the practical applicability of the method.

Implications

Beyond MDE, the pre-trained encoder of the "Depth Anything" model, thanks to the feature alignment strategy, holds substantial potential as a universal multi-task encoder for various perception tasks in computer vision (a hypothetical reuse sketch follows). The model paves the way for robust vision systems that understand complex scenarios with scarce or noisy labels, expanding the horizons for AI systems to perceive and interact with their environment more effectively.
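
A hypothetical sketch of such reuse, assuming the frozen encoder returns a spatial feature map of known channel width; the class and head names are stand-ins, not the released API.

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Frozen shared encoder with small per-task heads (illustrative)."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the pre-trained backbone frozen
            p.requires_grad = False
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.seg_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.encoder(x)               # (B, C, H', W') feature map assumed
        return {
            "depth": self.depth_head(feats),
            "segmentation": self.seg_head(feats),
        }
```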

Concluding Thoughts

The "Depth Anything" model represents a significant advancement in utilizing unlabeled images to improve the performance of monocular depth estimation. Its impressive zero-shot learning capabilities, coupled with the model's versatility as a pre-trained encoder for downstream tasks, mark a pivotal moment in the development of foundational models in computer vision. The release of this model is a step towards addressing the pervasive challenge of data scarcity and variability in real-world applications.
