VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (2309.01141v4)
Abstract: Large-scale text-to-image diffusion models have shown impressive capabilities on generative tasks by leveraging strong vision-language alignment learned during pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and compute. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding, without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong zero-shot visual grounding performance. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
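The abstract's core idea — scoring each isolated region proposal by how well a pre-trained diffusion model can denoise it when conditioned on the referring expression, using both a local view (the cropped proposal) and a global view (the full image with everything else masked) — can be sketched as below. This is a minimal illustration, not the paper's implementation (see the linked repository for that): `denoise_error` is a hypothetical stand-in for a diffusion model's conditional noise-prediction loss, and the mixing weight `alpha` is an assumed parameter.

```python
import random
import zlib

def denoise_error(image_region: str, text: str, trials: int = 4) -> float:
    # Hypothetical stand-in for a diffusion model's conditional
    # noise-prediction error: a real system would noise the region's
    # latent and measure the UNet's epsilon-prediction loss conditioned
    # on the text embedding. Here we just return a deterministic
    # pseudo-random value so the sketch is runnable.
    seed = zlib.crc32(f"{image_region}|{text}".encode())
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(trials)) / trials

def score_proposal(crop: str, masked_full: str, text: str,
                   alpha: float = 0.5) -> float:
    # Lower denoising error means better text-region alignment, so we
    # negate it to get a score. The local (crop) and global (masked
    # full image) contexts are combined with an assumed weight alpha.
    local_err = denoise_error(crop, text)
    global_err = denoise_error(masked_full, text)
    return -(alpha * local_err + (1 - alpha) * global_err)

def ground(proposals, text: str) -> int:
    # proposals: list of (crop, masked_full_image) pairs, e.g. from an
    # off-the-shelf detector. Return the index of the best-scoring one.
    return max(range(len(proposals)),
               key=lambda i: score_proposal(*proposals[i], text))
```

In a real pipeline, the proposals would come from a pre-trained detector, and each crop/masked image would be encoded into the diffusion model's latent space before scoring; the zero-shot property comes from reusing the frozen generative model's alignment rather than training a grounding head.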
Authors: Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang