Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering (2303.01903v3)

Published 3 Mar 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the question, hence restricting the performance of their models. Recent works have resorted to using a powerful LLM as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the blind LLM as the provided textual input is insufficient to depict the required visual information to answer the question. In this paper, we present Prophet -- a conceptually simple, flexible, and general framework designed to prompt LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and question, thus generating a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. To demonstrate the generality of our approach, we instantiate Prophet with the combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial and open-source ones).

A Comprehensive Examination of Prophet: Enhancing Knowledge-Based Visual Question Answering with LLMs

The paper "Prophet: Prompting LLMs with Complementary Answer Heuristics for Knowledge-based Visual Question Answering" presents a novel framework for improving the performance of knowledge-based Visual Question Answering (VQA) by leveraging the inherent capabilities of LLMs. Using Prophet, the researchers aim to overcome the limitations of existing methods that either rely extensively on external knowledge bases (KBs) or do not fully capitalize on the reasoning power of LLMs.

Methodology

The framework consists of two primary stages: Answer Heuristics Generation and Heuristics-enhanced Prompting.

  1. Answer Heuristics Generation:
    • Prophet begins by training a baseline VQA model on a specific knowledge-based VQA dataset. Notably, this model is trained without any external knowledge, so it serves as a straightforward, task-specific baseline.
    • From this trained model, Prophet extracts two types of complementary answer heuristics:
      • Answer Candidates are a list of potential answers for a given question-image pair, ranked by their associated confidence scores.
      • Answer-aware Examples are in-context examples selected because their latent answer representations are similar to that of the test sample.
    • Because the framework can be instantiated with different discriminative and generative VQA models, such as MCAN (discriminative) and mPLUG (generative), it can yield diverse heuristics.
  2. Heuristics-enhanced Prompting:
    • This stage formats a prompt that encodes the extracted heuristics and feeds it into an LLM to infer the final answer (a minimal sketch of both stages follows this list).
    • Packing the answer candidates, with their confidence scores, and the answer-aware examples into a structured prompt supplies the LLM with question-relevant visual information it would otherwise lack, helping it produce more accurate predictions.
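
Below is a minimal sketch of the two stages, assuming a trained VQA model that exposes an answer probability distribution (`probs` over `answer_vocab`) and latent answer features for each sample. The function names, the similarity measure, and the prompt template are illustrative paraphrases of the procedure described above, not the authors' released code.

```python
import numpy as np

def answer_candidates(probs, answer_vocab, k=10):
    """Top-k answer candidates and their confidence scores from the VQA model's output."""
    top = np.argsort(probs)[::-1][:k]
    return [(answer_vocab[i], float(probs[i])) for i in top]

def answer_aware_examples(test_feat, train_feats, train_samples, n=16):
    """Select in-context examples whose answer-latent features are closest to the test sample's."""
    sims = train_feats @ test_feat / (
        np.linalg.norm(train_feats, axis=1) * np.linalg.norm(test_feat) + 1e-8
    )
    nearest = np.argsort(sims)[::-1][:n]
    return [train_samples[i] for i in nearest]

def format_block(caption, question, candidates, answer=None):
    """One context/question/candidates/answer block of the prompt."""
    cands = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    block = f"Context: {caption}\nQuestion: {question}\nCandidates: {cands}\nAnswer:"
    return block + (f" {answer}\n" if answer is not None else "")

def build_prompt(instruction, examples, caption, question, candidates):
    """Jointly encode both answer heuristics into a single formatted prompt for the LLM."""
    blocks = [format_block(ex["caption"], ex["question"], ex["candidates"], ex["answer"])
              for ex in examples]
    return "\n".join([instruction] + blocks + [format_block(caption, question, candidates)])
```

The resulting string is sent to the LLM as a few-shot prompt, and the model's completion is taken as the final answer.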

Results and Discussion

Prophet's performance was evaluated on four challenging datasets, OK-VQA, A-OKVQA, ScienceQA, and TextVQA, each requiring a different type of external knowledge. The experiments indicate that Prophet consistently outperforms prior state-of-the-art models across these tasks, with particularly large margins over approaches that rely on direct multimodal pretraining or on simpler LLM prompting methods such as PICa.

A key strength of Prophet lies in its versatility and scalability. It achieves notable performance even when instantiated with different combinations of VQA models and both commercial (e.g., GPT-3) and open-source LLMs (e.g., LLaMA). Importantly, the work highlights that Prophet can adapt to various types of knowledge tasks, thus demonstrating its potential as a flexible, generalizable framework in multimodal learning.

Implications and Future Directions

Prophet underscores the critical role of answer heuristics, which supply question-aware information, in activating the full potential of LLMs for knowledge-based tasks. By focusing on the fusion of answer heuristics and LLM reasoning, it offers new insight into how LLMs can be leveraged beyond their conventional language-processing functions.

However, there remains room for further exploration. For instance, future research could refine the heuristics-generation process or improve its computational efficiency. Extending Prophet to larger VQA datasets or integrating it with emerging LLM architectures could also yield further gains on multimodal tasks.

Overall, Prophet is a significant contribution to knowledge-based VQA, illustrating how an LLM, when supplied with well-structured, answer-aware prompts, can markedly improve interpretative and reasoning capabilities in domains that depend on external knowledge.

References (70)
  1. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020, pp. 1877–1901.
  2. J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” in NeurIPS, 2022.
  3. L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li et al., “Florence: A new foundation model for computer vision,” arXiv preprint arXiv:2111.11432, 2021.
  4. W. Wang, H. Bao, L. Dong, and F. Wei, “Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,” in NeurIPS, 2021.
  5. P. Wang, Q. Wu, C. Shen, A. Dick, and A. Van Den Hengel, “Fvqa: Fact-based visual question answering,” IEEE TPAMI, vol. 40, no. 10, pp. 2413–2427, 2017.
  6. P. Wang, Q. Wu, C. Shen, A. R. Dick, and A. van den Hengel, “Explicit knowledge-based reasoning for visual question answering,” in IJCAI, 2017.
  7. K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “Ok-vqa: A visual question answering benchmark requiring external knowledge,” in CVPR, 2019, pp. 3195–3204.
  8. D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi, “A-okvqa: A benchmark for visual question answering using world knowledge,” in ECCV. Springer, 2022, pp. 146–162.
  9. H. Liu and P. Singh, “Conceptnet: a practical commonsense reasoning tool-kit,” BT technology journal, vol. 22, no. 4, pp. 211–226, 2004.
  10. Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, “Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering,” in IJCAI, 2020, pp. 1097–1103.
  11. K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, “Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based vqa,” in CVPR, 2021, pp. 14111–14121.
  12. J. Wu, J. Lu, A. Sabharwal, and R. Mottaghi, “Multi-modal answer validation for knowledge-based vqa,” in AAAI, 2022, pp. 2712–2721.
  13. F. Gao, Q. Ping, G. Thattai, A. Reganti, Y. N. Wu, and P. Natarajan, “Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering,” in CVPR, 2022, pp. 5067–5077.
  14. Y. Ding, J. Yu, B. Liu, Y. Hu, M. Cui, and Q. Wu, “Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering,” in CVPR, 2022, pp. 5089–5098.
  15. Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An empirical study of gpt-3 for few-shot knowledge-based vqa,” in AAAI, 2022, pp. 3081–3089.
  16. L. Gui, B. Wang, Q. Huang, A. Hauptmann, Y. Bisk, and J. Gao, “Kat: A knowledge augmented transformer for vision-and-language,” NAACL, 2021.
  17. Y. Lin, Y. Xie, D. Chen, Y. Xu, C. Zhu, and L. Yuan, “REVIVE: Regional visual representation matters in knowledge-based visual question answering,” in NeurIPS, 2022.
  18. Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, “Deep modular co-attention networks for visual question answering,” in CVPR, 2019, pp. 6281–6290.
  19. C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao et al., “mplug: Effective and efficient vision-language learning by cross-modal skip-connections,” in EMNLP, 2022, pp. 7241–7259.
  20. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  21. G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only,” arXiv preprint arXiv:2306.01116, 2023.
  22. Z. Shao, Z. Yu, M. Wang, and J. Yu, “Prompting large language models with answer heuristics for knowledge-based visual question answering,” in CVPR, 2023, pp. 14974–14983.
  23. P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” in NeurIPS, 2022, pp. 2507–2521.
  24. A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” in CVPR, 2019, pp. 8317–8326.
  25. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR, 2018, pp. 6077–6086.
  26. P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, “Vinvl: Revisiting visual representations in vision-language models,” in CVPR, 2021, pp. 5579–5588.
  27. S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K.-W. Chang, Z. Yao, and K. Keutzer, “How much can clip benefit vision-and-language tasks?” ICLR, 2022.
  28. R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in ICCV, 2017, pp. 804–813.
  29. J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” NeurIPS, vol. 31, 2018.
  30. L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attention network for visual question answering,” in ICCV, 2019, pp. 10313–10322.
  31. J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” ICML, 2022.
  32. J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in NeurIPS, 2019.
  33. H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” EMNLP, 2019.
  34. Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in ECCV, 2020, pp. 104–120.
  35. Y. Cui, Z. Yu, C. Wang, Z. Zhao, J. Zhang, M. Wang, and J. Yu, “Rosita: Enhancing vision-and-language semantic alignments via cross-and intra-modal knowledge integration,” in ACM MM, 2021, pp. 797–806.
  36. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
  37. P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, “OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in ICML, 2022, pp. 21218–23340.
  38. J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” TMLR, 2022.
  39. J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in CVPR, 2017, pp. 2901–2910.
  40. D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR, 2019, pp. 6700–6709.
  41. D. Vrandečić and M. Krötzsch, “Wikidata: A free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.
  42. M. Luo, Y. Zeng, P. Banerjee, and C. Baral, “Weakly-supervised visual-retriever-reader for knowledge-based question answering,” EMNLP, pp. 6417–6431, 2021.
  43. Y. Hu, H. Hua, Z. Yang, W. Shi, N. A. Smith, and J. Luo, “Promptcap: Prompt-guided image captioning for vqa with gpt-3,” in ICCV, 2023, pp. 2963–2975.
  44. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019, pp. 4171–4186.
  45. I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NeurIPS, 2014.
  46. J. Yu, J. Li, Z. Yu, and Q. Huang, “Multimodal transformer with multi-view visual representation for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4467–4480, 2020.
  47. Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering,” in CVPR, 2017.
  48. Z. Yang, Y. Lu, J. Wang, X. Yin, D. Florencio, L. Wang, C. Zhang, L. Zhang, and J. Luo, “Tap: Text-aware pre-training for text-vqa and text-caption,” in CVPR, 2021, pp. 8751–8761.
  49. A. F. Biten, R. Litman, Y. Xie, S. Appalaraju, and R. Manmatha, “Latr: Layout-aware transformer for scene-text vqa,” in CVPR, 2022, pp. 16548–16558.
  50. A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, and D. Karatzas, “Scene text visual question answering,” in ICCV, 2019, pp. 4291–4301.
  51. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021, pp. 8748–8763.
  52. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” IJCV, vol. 123, no. 1, pp. 32–73, 2017.
  53. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  54. Mistral AI, “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
  55. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” in NeurIPS, 2022, pp. 27730–27744.
  56. F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, “Conceptbert: Concept-aware representation for visual question answering,” in EMNLP, 2020, pp. 489–498.
  57. J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi, “Unified-io: A unified model for vision, language, and multi-modal tasks,” arXiv preprint arXiv:2206.08916, 2022.
  58. X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer et al., “Pali: A jointly-scaled multilingual language-image model,” in ICLR, 2023.
  59. Y. Guo, L. Nie, Y. Wong, Y. Liu, Z. Cheng, and M. Kankanhalli, “A unified end-to-end retriever-reader framework for knowledge-based vqa,” in ACM MM, 2022, pp. 2061–2069.
  60. H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023.
  61. P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-and-play compositional reasoning with large language models,” arXiv preprint arXiv:2304.09842, 2023.
  62. W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023.
  63. R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” arXiv preprint arXiv:2303.16199, 2023.
  64. Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” arXiv preprint arXiv:2302.00923, 2023.
  65. R. Hu, A. Singh, T. Darrell, and M. Rohrbach, “Iterative answer prediction with pointer-augmented multimodal transformers for textvqa,” in CVPR, 2020.
  66. J. Wang, M. Gao, Y. Hu, R. R. Selvaraju, C. Ramaiah, R. Xu, J. F. JaJa, and L. S. Davis, “Tag: Boosting text-vqa via text-aware visual question-answer generation,” 2022.
  67. J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, Y. Dan, C. Zhao, G. Xu, C. Li, J. Tian et al., “mplug-docowl: Modularized multimodal large language model for document understanding,” arXiv preprint arXiv:2307.02499, 2023.
  68. W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models,” arXiv preprint arXiv:2307.05973, 2023.
  69. J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v,” arXiv preprint arXiv:2310.11441, 2023.
  70. J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” arXiv preprint arXiv:2104.09864, 2021.
Authors (5)
  1. Zhou Yu (206 papers)
  2. Xuecheng Ouyang (2 papers)
  3. Zhenwei Shao (3 papers)
  4. Meng Wang (1063 papers)
  5. Jun Yu (232 papers)
Citations (10)