MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering (2503.18491v3)

Published 24 Mar 2025 in cs.CL

Abstract: Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. The GNN adds depth to structured inference, enabling relational reasoning beyond what LVLMs provide on their own. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.
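The three-stage pipeline in the abstract can be sketched in miniature. This is an illustrative assumption, not the authors' implementation: the toy knowledge base, the keyword-based retrieval, the type-to-relation mapping, and the mean message-passing GNN are all stand-ins for the paper's actual components.

```python
# Hypothetical sketch of MAGIC-VQA's three stages; all names, data
# structures, and heuristics here are illustrative assumptions.

def retrieve_commonsense(question, kb):
    # Stage 1: explicit knowledge integration -- toy keyword match
    # standing in for retrieval from an external commonsense source.
    return [f for f in kb if f["subject"] in question.lower()]

def refine_by_type(facts, question_type):
    # Stage 2: by-type post-processing -- keep only facts whose relation
    # suits the question type (e.g. "why" questions want causal facts).
    wanted = {"why": {"Causes"}, "where": {"AtLocation"}, "what": {"UsedFor"}}
    return [f for f in facts if f["relation"] in wanted.get(question_type, set())]

def gnn_augment(facts, num_rounds=2):
    # Stage 3: implicit knowledge augmentation -- mean message passing
    # over a graph whose nodes are the entities in the retained facts.
    nodes = {e for f in facts for e in (f["subject"], f["object"])}
    emb = {n: [float(len(n)), 1.0] for n in nodes}  # toy initial features
    edges = [(f["subject"], f["object"]) for f in facts]
    for _ in range(num_rounds):
        new = {}
        for n in nodes:
            # Aggregate the node's own state with both in- and out-neighbors.
            neigh = [emb[v] for u, v in edges if u == n]
            neigh += [emb[u] for u, v in edges if v == n]
            neigh.append(emb[n])
            new[n] = [sum(c) / len(neigh) for c in zip(*neigh)]
        emb = new
    return emb

kb = [
    {"subject": "umbrella", "relation": "UsedFor", "object": "rain"},
    {"subject": "umbrella", "relation": "AtLocation", "object": "closet"},
]
facts = retrieve_commonsense("what is the umbrella for?", kb)
facts = refine_by_type(facts, "what")
emb = gnn_augment(facts)  # embeddings for the question-relevant subgraph
```

The resulting node embeddings would, in the full framework, be fed alongside the image and question into the LVLM's reasoning step; here they simply show how the refined facts become a structured graph signal.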

Authors (4)
  1. Shuo Yang (244 papers)
  2. Siwen Luo (14 papers)
  3. Soyeon Caren Han (48 papers)
  4. Eduard Hovy (115 papers)