Modularized Zero-shot VQA with Pre-trained Models (2305.17369v2)

Published 27 May 2023 in cs.CV and cs.MM

Abstract: Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.
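The sketch below is a rough, illustrative rendering of the pipeline the abstract describes: decompose a question into sub-reasoning steps and route each step to a frozen pre-trained model (PTM) that already has the required skill, with no fine-tuning. It is not the authors' implementation; the module names, the toy decomposition, and the choice of which PTM backs each module are assumptions made purely for illustration.

    # Minimal sketch (assumed structure, not the paper's code) of modularized
    # zero-shot VQA: parse a question into sub-reasoning steps, then execute
    # each step with a frozen pre-trained model wrapped as a module.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Step:
        """One sub-reasoning step: a skill name plus its arguments."""
        skill: str            # e.g. "detect", "relate", "query_attr"
        args: Dict[str, str]  # e.g. {"object": "dog"}

    def decompose(question: str) -> List[Step]:
        """Hypothetical question decomposer. A real system would derive the
        reasoning chain from the question; here we hard-code a toy chain for
        'What color is the ball left of the dog?'."""
        return [
            Step("detect", {"object": "dog"}),
            Step("relate", {"relation": "left of", "anchor": "dog"}),
            Step("query_attr", {"attribute": "color"}),
        ]

    def run(question: str, image, modules: Dict[str, Callable]) -> str:
        """Execute the chain, threading intermediate results between modules.
        Each module is assumed to wrap a frozen PTM (e.g. an open-vocabulary
        detector for 'detect', an image-text matcher for 'relate' and
        'query_attr'); every call is zero-shot, with no adaptation."""
        state = {"image": image}
        for step in decompose(question):
            state = modules[step.skill](state, step.args)
        return state["answer"]

Because each intermediate step produces an inspectable result (detected regions, selected objects, candidate attributes), this kind of decomposition is what gives the approach its claimed interpretability advantage over end-to-end zero-shot baselines.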

Authors (2)
  1. Rui Cao
  2. Jing Jiang