SADL: An Effective In-Context Learning Method for Compositional Visual QA (2407.01983v1)

Published 2 Jul 2024 in cs.CV

Abstract: Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA. When prompted with a few demonstrations of image-question-answer triplets, LVLMs have demonstrated the ability to discern underlying patterns and transfer this latent knowledge to answer new questions about unseen images without the need for expensive supervised fine-tuning. However, designing effective vision-language prompts, especially for compositional questions, remains poorly understood. Adapting language-only ICL techniques may not necessarily work because we need to bridge the visual-linguistic semantic gap: symbolic concepts must be grounded in visual content, which does not share the syntactic structures of language. This paper introduces SADL, a new visual-linguistic prompting framework for the task. SADL revolves around three key components: SAmpling, Deliberation, and Pseudo-Labeling of image-question pairs. Given an image-question query, we sample image-question pairs from the training data that are in semantic proximity to the query. To address the compositional nature of questions, the deliberation step decomposes complex questions into a sequence of subquestions. Finally, the sequence is progressively annotated one subquestion at a time to generate a sequence of pseudo-labels. We investigate the behaviors of SADL under OpenFlamingo on large-scale Visual QA datasets, namely GQA, GQA-OOD, CLEVR, and CRIC. The evaluation demonstrates the critical roles of sampling in the neighborhood of the image, the decomposition of complex questions, and the accurate pairing of the subquestions and labels. These findings do not always align with those from language-only ICL, suggesting fresh insights into vision-language settings.
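The abstract describes a three-stage prompting pipeline: sample semantically close demonstrations, decompose each compositional question, then pseudo-label the subquestions progressively to build the final in-context prompt. The following is a minimal sketch of how such a pipeline could be wired together; the helper callables (embed, decompose, answer), the similarity measure, and the prompt format are illustrative assumptions based on the abstract, not the authors' released implementation.

from typing import Callable, List, Sequence, Tuple

def sadl_prompt(
    query_image: object,                          # query image, in whatever form the LVLM accepts
    query_question: str,
    train_pairs: Sequence[Tuple[object, str]],    # (image, question) pairs from the training data
    embed: Callable[[object, str], List[float]],  # assumed joint image-question embedder
    decompose: Callable[[str], List[str]],        # assumed splitter: compositional question -> subquestions
    answer: Callable[[str, object, str], str],    # assumed LVLM call: (context, image, subquestion) -> answer
    k: int = 4,
) -> str:
    """Build a SADL-style prompt: SAmpling, Deliberation, pseudo-Labeling."""

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb + 1e-8)

    # 1) SAmpling: keep the k training pairs closest to the query in embedding space.
    q_emb = embed(query_image, query_question)
    demos = sorted(train_pairs, key=lambda p: -cosine(embed(*p), q_emb))[:k]

    prompt_parts: List[str] = []
    for image, question in demos:
        # 2) Deliberation: decompose the compositional question into subquestions.
        subquestions = decompose(question)
        # 3) Pseudo-labeling: annotate one subquestion at a time, feeding earlier
        #    subquestion-answer pairs back in as context for the next one.
        context = ""
        lines: List[str] = []
        for sq in subquestions:
            pseudo_label = answer(context, image, sq)
            lines.append(f"Q: {sq} A: {pseudo_label}")
            context = " ".join(lines)
        prompt_parts.append("<image> " + " ".join(lines))

    # The annotated demonstrations precede the query, which the LVLM then completes.
    return "\n".join(prompt_parts) + f"\n<image> Q: {query_question} A:"

The key design point the abstract emphasizes is that sampling is done in the joint image-question neighborhood (not by question text alone) and that subquestions and pseudo-labels must stay accurately paired as the prompt is built up.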

Authors (5)
  1. Long Hoang Dang
  2. Thao Minh Le
  3. Vuong Le
  4. Tu Minh Phuong
  5. Truyen Tran