
Implicit and Explicit Language Guidance for Diffusion-based Visual Perception (2404.07600v3)

Published 11 Apr 2024 in cs.CV

Abstract: Text-to-image diffusion models have shown a powerful ability for conditional image synthesis. With large-scale vision-language pre-training, diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts. However, adapting a pre-trained diffusion model for visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch utilizes the ground-truth labels of the corresponding images as text prompts to condition the feature extraction of the diffusion model. During training, we jointly train the diffusion model by sharing the model weights between these two branches; as a result, the implicit and explicit branches jointly guide feature learning. During inference, we employ only the implicit branch for the final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks: semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, our IEDP achieves an mIoU$^{\text{ss}}$ score of 55.9% on the ADE20K validation set, outperforming the baseline method VPD by 2.2%. For depth estimation, our IEDP outperforms VPD with a relative gain of 11.0%.
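
The two-branch training scheme described in the abstract can be sketched as follows. This is a minimal, illustrative PyTorch sketch under stated assumptions, not the authors' code: `FrozenImageEncoder`, `SharedDiffusionFeatures`, the additive conditioning, and all dimensions are simplified stand-ins for the frozen CLIP image encoder and the text-conditioned Stable Diffusion UNet feature extractor the paper actually uses.

```python
# Minimal sketch of IEDP's two-branch joint training (assumptions marked).
import torch
import torch.nn as nn

EMB_DIM = 768  # illustrative text-embedding width (assumption)

class FrozenImageEncoder(nn.Module):
    """Implicit branch: maps an image directly to 'implicit text
    embeddings', so no explicit text prompt is needed."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, EMB_DIM),
        )
        for p in self.parameters():
            p.requires_grad_(False)  # frozen, as in the paper

    def forward(self, images):                     # images: (B, 3, H, W)
        return self.backbone(images).unsqueeze(1)  # (B, 1, EMB_DIM)

class SharedDiffusionFeatures(nn.Module):
    """Shared-weight feature extractor plus a task head. The real model
    conditions UNet cross-attention on the text embeddings; here a
    simple additive fusion plays that role (assumption)."""
    def __init__(self, num_classes=150):  # 150 = ADE20K classes
        super().__init__()
        self.visual = nn.Conv2d(3, EMB_DIM, 3, padding=1)
        self.head = nn.Conv2d(EMB_DIM, num_classes, 1)

    def forward(self, images, text_emb):
        feats = self.visual(images)                         # (B, D, H, W)
        feats = feats + text_emb.mean(1)[:, :, None, None]  # condition
        return self.head(feats)                             # (B, C, H, W)

def train_step(model, image_enc, images, label_prompt_emb, labels, opt):
    """One joint step; both branches share `model`'s weights."""
    ce = nn.CrossEntropyLoss()
    # Implicit branch: image-derived embeddings, no ground-truth text.
    loss_implicit = ce(model(images, image_enc(images)), labels)
    # Explicit branch: embeddings of ground-truth class-name prompts.
    loss_explicit = ce(model(images, label_prompt_emb), labels)
    loss = loss_implicit + loss_explicit  # equal weighting is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

At inference only the implicit path runs, e.g. `logits = model(images, image_enc(images))`, so neither ground-truth labels nor hand-written prompts are required, matching the abstract's description.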

Authors (5)
  1. Hefeng Wang (16 papers)
  2. Jiale Cao (38 papers)
  3. Jin Xie (76 papers)
  4. Aiping Yang (6 papers)
  5. Yanwei Pang (67 papers)
Citations (1)