Harnessing Diffusion Models for Visual Perception with Meta Prompts (2312.14733v1)
Abstract: Generative pre-training for vision models has long been an open problem. At present, text-to-image (T2I) diffusion models demonstrate remarkable proficiency in generating high-definition images that match textual inputs, a feat made possible by pre-training on large-scale image-text pairs. This raises a natural question: can diffusion models be used to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme that harnesses a diffusion model for visual perception. Our key insight is to introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract features suited to perception. The effect of meta prompts is two-fold. First, as a direct replacement for the text embeddings in the T2I model, they activate task-relevant features during feature extraction. Second, they are used to re-arrange the extracted features, ensuring that the model focuses on the features most pertinent to the task at hand. Additionally, we design a recurrent refinement training strategy that fully leverages the properties of diffusion models, yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. Our method sets new performance records for depth estimation on NYU Depth V2 and KITTI, and for semantic segmentation on CityScapes. It also attains results comparable to the current state of the art for semantic segmentation on ADE20K and pose estimation on COCO, further demonstrating its robustness and versatility.
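The two roles of meta prompts described above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's actual code: the class name, prompt count, and dimensions are illustrative, and the "re-arrangement" is modeled here as a softmax affinity between prompts and per-pixel features.

```python
import torch
import torch.nn as nn

class MetaPromptConditioner(nn.Module):
    """Illustrative sketch of the meta-prompt idea (assumed details).

    Role 1: the learnable prompts stand in for the text embeddings that
    condition a T2I diffusion UNet's cross-attention layers.
    Role 2: the same prompts re-weight the extracted feature map so the
    model focuses on task-relevant content.
    """

    def __init__(self, num_prompts: int = 50, dim: int = 768):
        super().__init__()
        # Learnable embeddings replacing the text-encoder output.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def conditioning(self, batch_size: int) -> torch.Tensor:
        # Shaped like a text encoder's output: (B, num_tokens, dim),
        # ready to feed into the UNet's cross-attention.
        return self.prompts.unsqueeze(0).expand(batch_size, -1, -1)

    def rearrange(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) with C == dim. Compute a prompt-to-pixel
        # affinity and normalize over prompts, yielding prompt-indexed
        # attention maps of shape (B, num_prompts, H, W).
        b, c, h, w = feats.shape
        flat = feats.flatten(2)                      # (B, C, H*W)
        affinity = torch.matmul(self.prompts, flat)  # (B, N, H*W)
        attn = affinity.softmax(dim=1)
        return attn.view(b, -1, h, w)

# Usage with toy dimensions:
cond = MetaPromptConditioner(num_prompts=8, dim=32)
feats = torch.randn(2, 32, 16, 16)
print(cond.conditioning(2).shape)   # torch.Size([2, 8, 32])
print(cond.rearrange(feats).shape)  # torch.Size([2, 8, 16, 16])
```

The sketch shows only the interface: in the actual method, the conditioning tensor would be consumed by the pre-trained diffusion backbone's cross-attention, and the rearranged features would feed a task-specific decoder.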
Authors: Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang