
VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (2309.01141v4)

Published 3 Sep 2023 in cs.CV

Abstract: Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
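The core idea can be illustrated with a minimal sketch: score each region proposal by how well the frozen diffusion model denoises it when conditioned on the referring expression, then pick the proposal with the lowest denoising error. The sketch below assumes Stable Diffusion v1.5 loaded through Hugging Face `diffusers`; the single fixed timestep, the helper names (`denoising_error`, `ground`), and the per-proposal preprocessing are illustrative choices, not the paper's exact region-scoring procedure (which combines cropped and masked inputs to capture local and global context).

```python
# Illustrative diffusion-based region scoring in the spirit of VGDiffZero.
# Assumes Stable Diffusion v1.5 via Hugging Face `diffusers`; the single
# timestep and helper names are hypothetical, not the paper's exact method.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
device = "cuda" if torch.cuda.is_available() else "cpu"

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

@torch.no_grad()
def denoising_error(image, expression, timestep=500, seed=0):
    """Score one preprocessed proposal (3x512x512 tensor in [-1, 1]) against
    a referring expression: lower noise-prediction error = better match."""
    tokens = tokenizer(expression, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt").to(device)
    text_emb = text_encoder(tokens.input_ids)[0]

    # Encode the proposal into the latent space and add noise at one timestep.
    latents = vae.encode(image.unsqueeze(0).to(device)).latent_dist.mean
    latents = latents * vae.config.scaling_factor
    g = torch.Generator(device).manual_seed(seed)  # same noise for every proposal
    noise = torch.randn(latents.shape, generator=g, device=device)
    t = torch.tensor([timestep], device=device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Text-conditioned noise prediction; its error is the grounding score.
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    return torch.nn.functional.mse_loss(pred, noise).item()

def ground(proposal_images, expression):
    """Return the index of the proposal best matching the expression."""
    scores = [denoising_error(img, expression) for img in proposal_images]
    return min(range(len(scores)), key=scores.__getitem__)
```

Fixing the noise seed keeps scores comparable across proposals; in practice one would average the error over several timesteps and combine scores from both cropped (local) and masked full-image (global) views of each proposal, as the abstract describes.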

Authors (5)
  1. Xuyang Liu (23 papers)
  2. Siteng Huang (31 papers)
  3. Yachen Kang (9 papers)
  4. Honggang Chen (21 papers)
  5. Donglin Wang (103 papers)
Citations (10)