Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA Requiring Diverse World Knowledge (2401.10712v5)
Abstract: With the breakthrough of multi-modal LLMs, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become an increasingly important testbed for developing AI models. However, equipping AI models with robust cross-modality reasoning ability remains challenging, since human cognition has not been understood systematically. In this paper, we argue that if we can collect as many visual clues in the given image as possible, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs in images and sending them into multi-modal LLMs as prompts. We call the proposed method Q&A Prompts. Specifically, we first use the image-answer pairs and the corresponding questions in the training set as inputs and outputs to train a visual question generation model. Then, we use an image tagging model to identify various instances and send the packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal LLMs to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, Q&A Prompts achieves substantial improvements on challenging visual question answering datasets that require reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
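The abstract describes a three-stage pipeline: train a visual question generation (VQG) model on (image, answer) → question triples, use an image tagging model to mine one question-answer pair per tag, and serialize those pairs as prompts for a multi-modal LLM. The sketch below is a minimal illustration of that data flow only; `tag_image`, `generate_question`, and the plain-text prompt serialization are hypothetical stand-ins for the paper's tagging model, VQG model, and visual-aware prompting module, not a released API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One mined visual clue: a generated question whose answer is an image tag."""
    question: str
    answer: str


def tag_image(image_path: str) -> List[str]:
    # Hypothetical stand-in for the image tagging model used in the paper
    # (a Recognize-Anything-style tagger); returns tags found in the image.
    return ["dog", "frisbee", "park"]  # toy output for illustration only


def generate_question(image_path: str, answer: str) -> str:
    # Hypothetical stand-in for the visual question generation model, which the
    # paper trains on (image, answer) -> question triples from the VQA training set.
    return f"What in the image is a {answer}?"  # toy output for illustration only


def mine_qa_prompts(image_path: str) -> List[QAPair]:
    # Stage 2: generate one question per extracted tag, with the tag as its answer.
    return [QAPair(generate_question(image_path, tag), tag)
            for tag in tag_image(image_path)]


def build_final_prompt(question: str, qa_prompts: List[QAPair]) -> str:
    # Stage 3 (text side only): place the mined Q&A pairs ahead of the target
    # question. The paper instead encodes them with a visual-aware prompting
    # module before the multi-modal LLM; plain text is used here for clarity.
    context = "\n".join(f"Q: {p.question} A: {p.answer}" for p in qa_prompts)
    return f"{context}\nQ: {question} A:"


if __name__ == "__main__":
    qa_prompts = mine_qa_prompts("example.jpg")
    print(build_final_prompt("What game is being played in the picture?", qa_prompts))
```

A real implementation would replace the stubs with the trained tagging and VQG models and feed the resulting prompt, together with visual features, to a pre-trained multi-modal LLM such as those cited below.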
- Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- VQA: Visual question answering. In ICCV, 2015.
- Language models are few-shot learners. In NeurIPS, 2020.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. In ICCV, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- ConceptBert: Concept-aware representation for visual question answering. In EMNLP, 2020.
- Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- KAT: A knowledge augmented transformer for vision-and-language. In NAACL, 2022.
- From images to textual prompts: Zero-shot VQA with frozen large language models. In CVPR, 2023.
- A unified end-to-end retriever-reader framework for knowledge-based VQA. In ACM MM, 2022.
- DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
- PromptCap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699, 2022.
- REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In CVPR, 2023.
- Tag2Text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023.
- Perceiver IO: A general architecture for structured inputs & outputs. In ICLR, 2022.
- CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
- Webly supervised concept expansion for general purpose vision models. In ECCV, 2022.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. In NeurIPS, 2023.
- REVIVE: Regional visual representation matters in knowledge-based visual question answering. In NeurIPS, 2022.
- Visual instruction tuning. In NeurIPS, 2023.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Decoupled weight decay regularization. In ICLR, 2019.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
- Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
- Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
- KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In CVPR, 2021.
- Linearly mapping from image to text space. In ICLR, 2023.
- VisualCOMET: Reasoning about the dynamic context of a still image. In ECCV, 2020.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP, 2019.
- A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
- Prompting large language models with answer heuristics for knowledge-based visual question answering. In CVPR, 2023.
- How much can CLIP benefit vision-and-language tasks? In ICLR, 2022.
- ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
- Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
- Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, 2014.
- Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
- FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427, 2017.
- An empirical study of GPT-3 for few-shot knowledge-based VQA. In AAAI, 2022.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023.
- Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors: Weifeng Ge, Haibo Wang