Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (2401.10891v2)

Published 19 Jan 2024 in cs.CV

Abstract: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.

Authors (6)
  1. Lihe Yang (12 papers)
  2. Bingyi Kang (39 papers)
  3. Zilong Huang (43 papers)
  4. Xiaogang Xu (63 papers)
  5. Jiashi Feng (297 papers)
  6. Hengshuang Zhao (118 papers)
Citations (381)

Summary

  • The paper demonstrates that scaling up unlabeled data for monocular depth estimation yields a robust foundation model across diverse scenes.
  • The paper employs innovative data augmentation and feature alignment techniques to imbue the model with strong semantic priors and enhanced generalization.
  • The resulting model achieves state-of-the-art performance on benchmarks like NYUv2 and KITTI, while showing promise as a versatile multi-task encoder.

Overview

In computer vision, monocular depth estimation (MDE) plays a crucial role across a wide array of applications, including robotics, autonomous driving, and virtual reality. The work "Depth Anything" marks a significant stride in the field, proposing a simple yet effective approach to building a foundation model that handles any image under any circumstances by leveraging large-scale unlabeled data.

Methodology

The core proposition of "Depth Anything" is to scale up the dataset by mining massive volumes of unlabeled images (~62M), which are easy to collect, cover a diverse range of scenes, and thereby improve the model's generalization. To exploit such data effectively, the authors adopt two strategies:

  1. Data Augmentation for Robust Representations: Strong perturbations, namely color distortions and spatial distortion (CutMix), are applied to unlabeled images during the re-training phase to create a more challenging optimization target, pushing the model to actively acquire extra visual knowledge and robust representations (see the first sketch after this list).
  2. Semantic Priors from Pre-trained Encoders: An auxiliary supervision mechanism compels the model to inherit semantic knowledge from a pre-trained encoder, replacing the traditional auxiliary semantic segmentation task. The authors opt for a feature alignment loss so that the model captures informative semantic signals without compromising the part-level discriminative representations crucial for depth estimation (see the second sketch below).
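
The following is a minimal PyTorch-style sketch of the first strategy, assuming a frozen teacher that pseudo-labels clean unlabeled images and a student trained on strongly perturbed versions of the same images. The augmentation parameters, function names, and loss interface are illustrative, not the authors' exact implementation.

```python
import random
import torch
import torchvision.transforms as T

# Strong color perturbations applied only to the student's input
# (illustrative values; images are assumed to be float tensors in [0, 1]).
color_aug = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=5),
])

def cutmix(images, labels):
    """Paste a random rectangle from a shuffled batch into each image and
    apply the same paste to the pseudo depth labels (shape (B, H, W))."""
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    cut_h = int(h * random.uniform(0.3, 0.5))
    cut_w = int(w * random.uniform(0.3, 0.5))
    y, x = random.randint(0, h - cut_h), random.randint(0, w - cut_w)
    images, labels = images.clone(), labels.clone()
    images[:, :, y:y + cut_h, x:x + cut_w] = images[perm, :, y:y + cut_h, x:x + cut_w]
    labels[:, y:y + cut_h, x:x + cut_w] = labels[perm, y:y + cut_h, x:x + cut_w]
    return images, labels

def unlabeled_step(teacher, student, images, depth_loss):
    """One training step on unlabeled images: the teacher sees the clean
    image, the student sees a strongly perturbed one."""
    with torch.no_grad():
        pseudo_depth = teacher(images)                    # pseudo labels on clean input
    strong = color_aug(images)                            # color distortion
    strong, pseudo_depth = cutmix(strong, pseudo_depth)   # spatial distortion (CutMix)
    pred = student(strong)
    return depth_loss(pred, pseudo_depth)                 # e.g. an affine-invariant depth loss
```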
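
The second strategy can be sketched as an auxiliary feature-alignment objective, assuming a frozen semantic encoder (e.g., DINOv2) provides target features at the same resolution as the depth model's features. The tolerance margin, which stops pulling pixels that are already well aligned so that part-level detail useful for depth is preserved, is shown with an illustrative value.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feats, frozen_feats, margin=0.85):
    """Align the depth model's features with those of a frozen semantic
    encoder, ignoring pixels whose cosine similarity already exceeds the
    tolerance margin.

    student_feats, frozen_feats: (B, C, H, W) feature maps, assumed to have
    been projected/interpolated to a common shape beforehand.
    """
    cos = F.cosine_similarity(student_feats, frozen_feats, dim=1)  # (B, H, W)
    loss_per_pixel = 1.0 - cos
    mask = (cos < margin).float()  # only pull pixels that are not yet aligned
    return (loss_per_pixel * mask).sum() / mask.sum().clamp(min=1.0)
```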

Results

The "Depth Anything" model demonstrates remarkable generalization abilities across various public datasets and everyday photos. When fine-tuning this model with specific metric depth information from well-known datasets such as NYUv2 and KITTI, it achieved state-of-the-art (SOTA) results, surpassing previous models significantly. Additionally, by coupling this improved depth model with a controller (ControlNet), enhanced image synthesis results were obtained, showcasing the practical applicability of the method.

Implications

Beyond MDE, the pre-trained encoder of the "Depth Anything" model, thanks to the feature alignment strategy, holds substantial potential as a universal multi-task encoder for various perception tasks in computer vision (a hypothetical reuse sketch follows). The model paves the way for robust vision systems that understand complex scenarios with scarce or noisy labels, expanding the horizons for AI systems to perceive and interact with their environment more effectively.
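
A hypothetical sketch of such reuse, assuming the frozen encoder returns a spatial feature map of known channel width; the class and head names are stand-ins, not the released API.

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Frozen shared encoder with small per-task heads (illustrative)."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the pre-trained backbone frozen
            p.requires_grad = False
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.seg_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.encoder(x)               # (B, C, H', W') feature map assumed
        return {
            "depth": self.depth_head(feats),
            "segmentation": self.seg_head(feats),
        }
```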

Concluding Thoughts

The "Depth Anything" model represents a significant advancement in utilizing unlabeled images to improve the performance of monocular depth estimation. Its impressive zero-shot learning capabilities, coupled with the model's versatility as a pre-trained encoder for downstream tasks, mark a pivotal moment in the development of foundational models in computer vision. The release of this model is a step towards addressing the pervasive challenge of data scarcity and variability in real-world applications.
