
VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (2309.01141v4)

Published 3 Sep 2023 in cs.CV

Abstract: Large-scale text-to-image diffusion models have shown impressive capabilities on generative tasks by leveraging the strong vision-language alignment acquired during pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
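The core idea admits a short sketch: score each region proposal by how well a frozen text-to-image diffusion model denoises it when conditioned on the referring expression, then pick the best-scoring proposal. The code below is a minimal illustration assuming a Stable-Diffusion-style latent UNet and noise scheduler in the style of Hugging Face diffusers; the function names, the crop/mask pair of views standing in for "local" and "global" context, and the simple sum of the two scores are illustrative assumptions, not the authors' exact implementation.

import torch

@torch.no_grad()
def denoising_score(latent, text_emb, unet, scheduler, n_samples=8):
    # Average noise-prediction error under text conditioning.
    # A lower error suggests the region matches the expression better.
    losses = []
    for _ in range(n_samples):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                          device=latent.device)
        noise = torch.randn_like(latent)
        noisy = scheduler.add_noise(latent, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        losses.append(torch.mean((pred - noise) ** 2))
    # Negate so that a higher score means a better match.
    return -torch.stack(losses).mean()

def ground(proposal_latents, text_emb, unet, scheduler):
    # proposal_latents: one (crop_latent, masked_latent) pair per region
    # proposal. The crop captures the proposal in isolation (local view);
    # the masked full image keeps its surroundings (global view). Both
    # views and their equal-weight sum are assumptions of this sketch.
    scores = [denoising_score(c, text_emb, unet, scheduler)
              + denoising_score(m, text_emb, unet, scheduler)
              for c, m in proposal_latents]
    return int(torch.stack(scores).argmax())  # index of best proposal

In practice the expression would be embedded once with the diffusion model's text encoder and each proposal encoded into latent space with its VAE; averaging the denoising error over several sampled timesteps, as above, is the standard way to reduce the variance of this kind of diffusion-based score.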

Authors (5)
  1. Xuyang Liu (23 papers)
  2. Siteng Huang (31 papers)
  3. Yachen Kang (9 papers)
  4. Honggang Chen (21 papers)
  5. Donglin Wang (103 papers)
Citations (10)
