Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering (2407.20563v1)
Abstract: Visual question answering (VQA) is the task of providing accurate answers to natural language questions based on visual input. Programmatic VQA (PVQA) models have recently been gaining attention; they use large language models (LLMs) to formulate executable programs that address questions requiring complex visual reasoning. However, enabling LLMs to understand how image processing modules are used, and to generate the corresponding code, remains challenging. To overcome these challenges, this paper introduces PyramidCoder, a novel prompting framework for PVQA models. PyramidCoder consists of three hierarchical levels, each serving a distinct purpose: query rephrasing, code generation, and answer aggregation. Notably, PyramidCoder uses a single frozen LLM and pre-defined prompts at each level, eliminating the need for additional training and ensuring flexibility across LLM architectures. Compared to the state-of-the-art PVQA model, our approach improves accuracy by at least 0.5% on the GQA dataset, 1.4% on the VQAv2 dataset, and 2.9% on the NLVR2 dataset.
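To make the three-level design concrete, below is a minimal sketch of how such a prompting pipeline could be wired together. It is not the paper's implementation: the helper names (`llm`, `execute_program`), the prompt wording, the sampling counts, and the majority-vote fallback used at the aggregation step are assumptions introduced here for illustration; per the abstract, the actual framework prompts the same single frozen LLM with a pre-defined prompt at every level, including aggregation.

```python
# Illustrative sketch of a three-level, PyramidCoder-style prompting pipeline.
# The names and prompts below (llm, execute_program, prompt strings, sampling
# counts) are hypothetical stand-ins, not the paper's actual API.
from collections import Counter

def answer_question(question, image, llm, execute_program,
                    n_rephrasings=3, n_programs=2):
    """Answer a visual question with a single frozen LLM prompted at three levels."""
    # Level 1: query rephrasing -- ask the LLM for paraphrases of the question
    # so the code generator sees several phrasings of the same query.
    rephrasings = [question] + [
        llm(f"Rephrase this visual question, preserving its meaning: {question}")
        for _ in range(n_rephrasings - 1)
    ]

    # Level 2: code generation -- prompt the same LLM (with a prompt describing
    # the available image-processing modules) to write an executable program
    # for each rephrasing, then run the program on the image.
    candidates = []
    for q in rephrasings:
        for _ in range(n_programs):
            program = llm(
                "Using the provided visual modules, write a Python program "
                f"that answers: {q}"
            )
            try:
                candidates.append(execute_program(program, image))
            except Exception:
                continue  # discard programs that fail to execute

    # Level 3: answer aggregation -- here a simple majority vote over the
    # candidate answers; the paper instead prompts the same frozen LLM to
    # select the final answer.
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]
```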
- “GQA: A new dataset for real-world visual reasoning and compositional question answering,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6904–6913.
- “A corpus for reasoning about natural language grounded in photographs,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 6418–6428.
- “Grounded language-image pre-training,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- “Plug-and-play VQA: Zero-shot VQA by conjoining large pre-trained models with zero training,” in Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP Findings), 2022.
- “ViperGPT: Visual inference via Python execution for reasoning,” arXiv preprint arXiv:2303.08128, 2023.
- T. Gupta and A. Kembhavi, “Visual programming: Compositional visual reasoning without training,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- “Modular visual question answering via code generation,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
- “Chain-of-thought prompting elicits reasoning in large language models,” in Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
- “Tree of Thoughts: Deliberate problem solving with large language models,” in Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), 2023.
- “Ask your neurons: A neural-based approach to answering questions about images,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2015, pp. 1–9.
- “A diagnostic study of visual question answering with analogical reasoning,” in Proc. IEEE International Conference on Image Processing (ICIP), 2021, pp. 2463–2467.
- H. Zhang and W. Wu, “Context relation fusion model for visual question answering,” in Proc. IEEE International Conference on Image Processing (ICIP), 2022, pp. 2112–2116.
- “Stacked attention networks for image question answering,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077–6086.
- A. Sarkar and M. Rahnemoonfar, “Grad-CAM aware supervised attention for visual question answering for post-disaster damage assessment,” in Proc. IEEE International Conference on Image Processing (ICIP), 2022, pp. 3783–3787.
- “Ques-to-visual guided visual question answering,” in Proc. IEEE International Conference on Image Processing (ICIP), 2022, pp. 4193–4197.
- “Text-guided object detector for multi-modal video question answering,” in Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 1032–1042.
- “Unsupervised vision-and-language pre-training without parallel images and captions,” in Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
- “Vision and text transformer for predicting answerability on visual question answering,” in Proc. IEEE International Conference on Image Processing (ICIP), 2021, pp. 934–938.
- “Language models are few-shot learners,” in Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), 2020, pp. 1877–1901.
- “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
- R. Li et al., “StarCoder: May the source be with you!” arXiv preprint arXiv:2305.06161, 2023.
- “Code Llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
- “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
- “Self-consistency improves chain of thought reasoning in language models,” in Proc. International Conference on Learning Representations (ICLR), 2023.
- “Automatic chain of thought prompting in large language models,” in Proc. International Conference on Learning Representations (ICLR), 2023.
- “ReAct: Synergizing reasoning and acting in language models,” in Proc. International Conference on Learning Representations (ICLR), 2023.
- “Inferring and executing programs for visual reasoning,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
- “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
- “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. International Conference on Machine Learning (ICML), 2022, pp. 12888–12900.