Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (2401.10891v2)
Abstract: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model that handles any image under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M images), which significantly enlarges data coverage and thus reduces generalization error. We investigate two simple yet effective strategies that make this data scaling promising. First, a more challenging optimization target is created by leveraging data augmentation tools, compelling the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed that enforces the model to inherit rich semantic priors from pre-trained encoders. We extensively evaluate its zero-shot capabilities on six public datasets and randomly captured photos, where it demonstrates impressive generalization ability. Further, by fine-tuning it with metric depth information from NYUv2 and KITTI, we set new state-of-the-art results. Our better depth model also yields a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
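Since the abstract names the two strategies only at a high level, the sketch below illustrates how they might combine in one self-training step: pseudo labels from a teacher on unlabeled images, strong perturbation (color jitter and CutMix) applied consistently to inputs and targets, and a cosine feature-alignment loss against a frozen pre-trained encoder such as DINOv2. This is a minimal PyTorch sketch under those assumptions; the module names, the `extract_features` hook, and the simple L1 depth loss are hypothetical stand-ins, not the released implementation (the paper uses an affine-invariant depth loss).

```python
# Hypothetical sketch of the two strategies from the abstract:
# (1) a harder optimization target on unlabeled images via strong perturbation,
# (2) auxiliary feature alignment to a frozen pre-trained encoder.
import torch
import torch.nn.functional as F

def rand_box_mask(h, w):
    """Binary (1, 1, h, w) mask with one random half-size rectangle set to 1."""
    mask = torch.zeros(1, 1, h, w)
    ch, cw = h // 2, w // 2
    y0 = torch.randint(0, h - ch + 1, (1,)).item()
    x0 = torch.randint(0, w - cw + 1, (1,)).item()
    mask[:, :, y0:y0 + ch, x0:x0 + cw] = 1
    return mask

def training_step(student, teacher, frozen_encoder, unlabeled, alpha=0.1):
    """One self-training step on a batch of unlabeled images (B, 3, H, W)."""
    with torch.no_grad():
        pseudo = teacher(unlabeled)              # (B, 1, H, W) pseudo depth labels

    # Strong color perturbation: a simple stand-in for the paper's color jitter.
    strong = torch.clamp(unlabeled + 0.1 * torch.randn_like(unlabeled), 0, 1)

    # CutMix across the batch: mix each image with its neighbor, and mix the
    # pseudo labels with the same mask so input and target stay consistent.
    mask = rand_box_mask(*unlabeled.shape[-2:]).to(unlabeled)
    strong = strong * (1 - mask) + strong.roll(1, dims=0) * mask
    pseudo = pseudo * (1 - mask) + pseudo.roll(1, dims=0) * mask

    pred = student(strong)
    loss_depth = F.l1_loss(pred, pseudo)         # L1 for brevity; the paper
                                                 # uses an affine-invariant loss

    # Auxiliary semantic prior: align student features with a frozen encoder
    # (e.g., DINOv2) via cosine similarity.
    with torch.no_grad():
        sem_feat = frozen_encoder(strong)        # (B, N, C) token features
    stu_feat = student.extract_features(strong)  # hypothetical feature hook
    loss_feat = 1 - F.cosine_similarity(stu_feat, sem_feat, dim=-1).mean()

    return loss_depth + alpha * loss_feat
```

Reusing the same CutMix mask for both the perturbed images and the pseudo labels is what keeps the harder target well-posed: the student must predict the mixed depth map from the mixed input rather than memorize the teacher's output.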
Authors: Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao