Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models (2407.15346v1)

Published 22 Jul 2024 in cs.CV, cs.CL, and cs.MM

Abstract: Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and an external knowledge base with the original complex question, then generate answers with LLMs. However, since the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the "forward-only" answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answering quality. To cope with these limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses the LLM's feedback to specify the required knowledge. Specifically, DKA requires LLMs to specify what knowledge they need to answer the question and to decompose the original complex question into two simple sub-questions: an image-based sub-question and a knowledge-based sub-question. We then use the two sub-questions to retrieve knowledge from the image and the knowledge base, respectively. In this way, the two knowledge acquisition models can each focus on the content that corresponds to them and avoid interference from irrelevant elements in the original complex question, which helps provide more precise knowledge and better align with the knowledge needs of LLMs to yield correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at https://github.com/Lackel/DKA.
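The pipeline described in the abstract reduces to three steps: the LLM decomposes the complex question into an image-based sub-question and a knowledge-based sub-question, each sub-question drives its own retriever, and the LLM then answers with the combined evidence. Below is a minimal sketch of that flow, assuming hypothetical helpers `ask_llm`, `caption_image`, and `search_knowledge_base` that stand in for an LLM client, a vision model, and a knowledge-base retriever; it is an illustration of the idea, not the authors' implementation.

```python
# Sketch of a DKA-style disentangled KVQA pipeline (illustrative only).
# `ask_llm`, `caption_image`, and `search_knowledge_base` are assumed callables,
# not part of the paper's released code.
from typing import Callable

def answer_kvqa(question: str,
                image_path: str,
                ask_llm: Callable[[str], str],
                caption_image: Callable[[str, str], str],
                search_knowledge_base: Callable[[str], str]) -> str:
    # Step 1: the LLM specifies the knowledge it needs by decomposing the
    # complex question into an image-based and a knowledge-based sub-question.
    decomposition = ask_llm(
        "Decompose the question into an image-based sub-question and a "
        "knowledge-based sub-question, one per line.\n"
        f"Question: {question}"
    )
    image_subq, knowledge_subq = decomposition.strip().splitlines()[:2]

    # Step 2: acquire each kind of knowledge with the matching sub-question,
    # so neither retriever is distracted by irrelevant parts of the question.
    visual_facts = caption_image(image_path, image_subq)   # from the image
    world_facts = search_knowledge_base(knowledge_subq)    # from the knowledge base

    # Step 3: the LLM answers the original question given both knowledge sources.
    return ask_llm(
        f"Visual facts: {visual_facts}\n"
        f"World knowledge: {world_facts}\n"
        f"Question: {question}\nAnswer briefly:"
    )
```

Because the framework is training-free, the design space is limited to the decomposition prompt and the choice of retrieval models; no parameters are updated.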

Authors (10)
  1. Wenbin An (14 papers)
  2. Feng Tian (122 papers)
  3. Jiahao Nie (17 papers)
  4. Wenkai Shi (9 papers)
  5. Haonan Lin (16 papers)
  6. Yan Chen (272 papers)
  7. QianYing Wang (27 papers)
  8. Yaqiang Wu (12 papers)
  9. Guang Dai (38 papers)
  10. Ping Chen (123 papers)
Citations (3)
