GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (2312.15043v1)

Published 22 Dec 2023 in cs.CV

Abstract: Visual grounding, a crucial vision-language task that involves understanding the visual context described by a query expression, requires a model to capture interactions between objects as well as spatial and attribute information. However, annotation data for visual grounding are limited because the annotation process is time-consuming and labor-intensive, which constrains trained models from generalizing to broader domains. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data, both of which are more readily obtainable and cover a broader domain than visual grounding annotations. GroundVLP introduces a fusion mechanism that combines the heatmap from GradCAM with the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on the RefCOCO/+/g datasets, surpassing the prior zero-shot state of the art by approximately 28% on the test splits of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to, or even better than, some non-VLP-based supervised models on the Flickr30k Entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.
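The fusion mechanism can be pictured as scoring each detector proposal by how much GradCAM attention it encloses and returning the highest-scoring box. The sketch below illustrates that idea under simple assumptions: `fuse_gradcam_with_proposals` and its `area_penalty` exponent are hypothetical names for illustration, not the paper's actual scoring function; consult the linked repository for the real implementation.

```python
import numpy as np

def fuse_gradcam_with_proposals(heatmap, boxes, area_penalty=0.5):
    """Score detector proposals against a GradCAM heatmap; return the best box.

    heatmap      : (H, W) array of non-negative GradCAM activations.
    boxes        : list of (x1, y1, x2, y2) proposals from an open-vocabulary
                   detector, in pixel coordinates.
    area_penalty : hypothetical exponent discounting large boxes, so a box
                   covering the whole image does not trivially win.
    """
    scores = []
    for (x1, y1, x2, y2) in boxes:
        region = heatmap[int(y1):int(y2), int(x1):int(x2)]
        if region.size == 0:
            scores.append(-np.inf)
            continue
        # Heatmap energy inside the box, normalized by a power of the box area.
        scores.append(region.sum() / (region.size ** area_penalty))
    best = int(np.argmax(scores))
    return boxes[best], scores[best]

# Example: a 100x100 heatmap peaked inside the second proposal.
H = np.zeros((100, 100))
H[40:60, 40:60] = 1.0
proposals = [(0, 0, 30, 30), (35, 35, 65, 65), (0, 0, 100, 100)]
box, score = fuse_gradcam_with_proposals(H, proposals)
print(box)  # (35, 35, 65, 65)
```

The area normalization in this sketch is one simple way to balance the two signals: without it, the full-image box always accumulates the most heatmap energy, while with too strong a penalty, tiny boxes over the activation peak dominate.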

Authors (4)
  1. Haozhan Shen (8 papers)
  2. Tiancheng Zhao (48 papers)
  3. Mingwei Zhu (10 papers)
  4. Jianwei Yin (71 papers)
Citations (5)