
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering (2402.12728v2)

Published 20 Feb 2024 in cs.CV, cs.AI, cs.CL, cs.IR, and cs.LG

Abstract: Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been made to leverage LLMs as an implicit knowledge source, this remains challenging because LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs, and LLMs, cannot be readily aligned in complex scenarios. To tackle these issues, we present MAIL, a novel modality-aware integration with LLMs for KVQA. It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. The shared mentioned entities in the two graphs serve as mediums that bridge a tight inter-modal exchange, while insightful intra-modal learning is maximally preserved by constraining the fusion within the mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24x fewer resources.
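The fusion design sketched in the abstract, cross-modal exchange constrained to the shared "medium" entities while each graph otherwise keeps its own message passing, can be illustrated with a minimal sketch. This is an exploratory simplification, not the paper's implementation: the class name MediumBridgedFusion, the mean-neighbour aggregation, and the simple averaging exchange at the mediums are all assumptions made for exposition.

```python
# Illustrative sketch only: fuse two graphs that share a subset of nodes
# (the mentioned entities). Names, dimensions, and update rules are assumed.
import torch
import torch.nn as nn


def mean_aggregate(x, adj):
    """One round of mean-neighbour aggregation (intra-modal message passing)."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    return adj @ x / deg


class MediumBridgedFusion(nn.Module):
    """Two weight-separated branches (pseudo-siamese in spirit); cross-modal
    exchange happens only at the shared 'medium' nodes, so intra-modal
    structure is preserved everywhere else."""

    def __init__(self, dim):
        super().__init__()
        self.scene_enc = nn.Linear(dim, dim)    # scene-graph branch
        self.concept_enc = nn.Linear(dim, dim)  # concept-graph branch

    def forward(self, x_s, adj_s, x_c, adj_c, med_s, med_c):
        # Intra-modal learning: each graph aggregates over its own edges.
        h_s = torch.relu(self.scene_enc(mean_aggregate(x_s, adj_s)))
        h_c = torch.relu(self.concept_enc(mean_aggregate(x_c, adj_c)))
        # Inter-modal exchange, constrained to the shared medium entities:
        # average the two views of each medium node and write the result back.
        fused = 0.5 * (h_s[med_s] + h_c[med_c])
        h_s = h_s.clone()
        h_c = h_c.clone()
        h_s[med_s] = fused
        h_c[med_c] = fused
        return h_s, h_c


# Toy usage: 5 scene-graph nodes, 6 concept-graph nodes, 2 shared mediums.
dim = 16
model = MediumBridgedFusion(dim)
x_s, x_c = torch.randn(5, dim), torch.randn(6, dim)
adj_s, adj_c = torch.ones(5, 5), torch.ones(6, 6)
med_s = torch.tensor([0, 1])  # indices of the mediums in the scene graph
med_c = torch.tensor([2, 3])  # indices of the same entities in the concept graph
h_s, h_c = model(x_s, adj_s, x_c, adj_c, med_s, med_c)
print(h_s.shape, h_c.shape)  # torch.Size([5, 16]) torch.Size([6, 16])
```

In the toy usage, med_s and med_c index the same two mentioned entities in the scene graph and the concept graph respectively; only those rows are mixed across modalities, which mirrors the stated goal of a tight inter-modal exchange with intra-modal learning preserved elsewhere.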

Authors (6)
  1. Junnan Dong (14 papers)
  2. Qinggang Zhang (19 papers)
  3. Huachi Zhou (5 papers)
  4. Daochen Zha (56 papers)
  5. Pai Zheng (7 papers)
  6. Xiao Huang (112 papers)
Citations (2)