Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches (2403.11317v1)

Published 17 Mar 2024 in cs.CL and cs.CV

Abstract: Two approaches have emerged for feeding images into LLMs. The first captions images into natural language. The second maps image feature embeddings into the LLM's embedding space and passes the mapped embeddings directly to the LLM. Most recent few-shot multimodal work reports performance using architectures built on variations of one of these two approaches, but overlooks an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B-parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regime, how the in-context examples are selected determines which approach is better.
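
To make the two interfaces concrete, here is a minimal sketch using PyTorch and Hugging Face Transformers. It is illustrative rather than the paper's implementation: the BLIP captioner, the CLIP ViT-B/32 vision encoder, the single trainable linear mapping layer, and the prompt templates are all assumptions; only the frozen Flan-T5 XL LLM matches the paper's setup.

```python
# Illustrative sketch of the two ways to feed an image to a frozen LLM.
# Assumptions (not from the paper): BLIP as the captioner, CLIP ViT-B/32 as
# the vision encoder, a single linear layer as the mapping network, and the
# prompt templates below.
import torch
from transformers import (
    AutoProcessor,
    BlipForConditionalGeneration,
    CLIPImageProcessor,
    CLIPVisionModel,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

# Frozen LLM (the paper's choice: Flan-T5 XL, 3B parameters).
tok = T5Tokenizer.from_pretrained("google/flan-t5-xl")
llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl").eval()

# Approach 1: caption the image, then query the LLM with plain text.
cap_proc = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").eval()

def answer_via_caption(image, question):
    cap_ids = captioner.generate(**cap_proc(images=image, return_tensors="pt"))
    caption = cap_proc.decode(cap_ids[0], skip_special_tokens=True)
    prompt = f"Context: {caption}\nQuestion: {question}\nShort answer:"
    out = llm.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=10)
    return tok.decode(out[0], skip_special_tokens=True)

# Approach 2: map visual embeddings directly into the LLM's embedding space.
img_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
# Map from the vision hidden size to the T5 embedding size; the LLM and the
# vision encoder stay frozen, so only this layer would be trained.
mapper = torch.nn.Linear(vision.config.hidden_size, llm.config.d_model)

def answer_via_embeddings(image, question):
    patches = vision(**img_proc(images=image, return_tensors="pt")).last_hidden_state
    soft_prompt = mapper(patches)                      # (1, n_patches, d_model)
    q_ids = tok(f"Question: {question}\nShort answer:", return_tensors="pt").input_ids
    q_emb = llm.get_input_embeddings()(q_ids)          # (1, n_tokens, d_model)
    out = llm.generate(inputs_embeds=torch.cat([soft_prompt, q_emb], dim=1),
                       max_new_tokens=10)
    return tok.decode(out[0], skip_special_tokens=True)
```

In the few-shot regimes the paper studies, in-context examples would additionally be prepended to the prompt (as caption/question/answer text in the first approach, or as interleaved soft prompts and token embeddings in the second); per the abstract, how those examples are selected determines which interface performs better.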

Authors (4)
  1. Igor Sterner (5 papers)
  2. Weizhe Lin (23 papers)
  3. Jinghong Chen (24 papers)
  4. Bill Byrne (57 papers)
Citations (2)