VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (2309.01141v4)
Abstract: Large-scale text-to-image diffusion models have shown impressive capabilities on generative tasks by leveraging strong vision-language alignment learned during pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and compute. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding, without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong zero-shot visual grounding performance. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
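The abstract's core idea — scoring each isolated region proposal by how well a pre-trained diffusion model can denoise it when conditioned on the referring expression, using both a local view (the cropped proposal) and a global view (the full image with everything else masked) — can be sketched as below. This is a minimal illustration, not the paper's implementation (see the linked repository for that): `denoise_error` is a hypothetical stand-in for a diffusion model's conditional noise-prediction loss, and the mixing weight `alpha` is an assumed parameter.

```python
import random
import zlib

def denoise_error(image_region: str, text: str, trials: int = 4) -> float:
    # Hypothetical stand-in for a diffusion model's conditional
    # noise-prediction error: a real system would noise the region's
    # latent and measure the UNet's epsilon-prediction loss conditioned
    # on the text embedding. Here we just return a deterministic
    # pseudo-random value so the sketch is runnable.
    seed = zlib.crc32(f"{image_region}|{text}".encode())
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(trials)) / trials

def score_proposal(crop: str, masked_full: str, text: str,
                   alpha: float = 0.5) -> float:
    # Lower denoising error means better text-region alignment, so we
    # negate it to get a score. The local (crop) and global (masked
    # full image) contexts are combined with an assumed weight alpha.
    local_err = denoise_error(crop, text)
    global_err = denoise_error(masked_full, text)
    return -(alpha * local_err + (1 - alpha) * global_err)

def ground(proposals, text: str) -> int:
    # proposals: list of (crop, masked_full_image) pairs, e.g. from an
    # off-the-shelf detector. Return the index of the best-scoring one.
    return max(range(len(proposals)),
               key=lambda i: score_proposal(*proposals[i], text))
```

In a real pipeline, the proposals would come from a pre-trained detector, and each crop/masked image would be encoded into the diffusion model's latent space before scoring; the zero-shot property comes from reusing the frozen generative model's alignment rather than training a grounding head.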
Authors: Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang