MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering (2503.18491v3)

Published 24 Mar 2025 in cs.CL

Abstract: Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. The GNN adds depth to structured inference, enabling relational reasoning beyond what LVLMs provide on their own. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.
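The three-stage pipeline in the abstract can be sketched in miniature. This is an illustrative assumption, not the authors' implementation: the toy knowledge base, the keyword-based retrieval, the type-to-relation mapping, and the mean message-passing GNN are all stand-ins for the paper's actual components.

```python
# Hypothetical sketch of MAGIC-VQA's three stages; all names, data
# structures, and heuristics here are illustrative assumptions.

def retrieve_commonsense(question, kb):
    # Stage 1: explicit knowledge integration -- toy keyword match
    # standing in for retrieval from an external commonsense source.
    return [f for f in kb if f["subject"] in question.lower()]

def refine_by_type(facts, question_type):
    # Stage 2: by-type post-processing -- keep only facts whose relation
    # suits the question type (e.g. "why" questions want causal facts).
    wanted = {"why": {"Causes"}, "where": {"AtLocation"}, "what": {"UsedFor"}}
    return [f for f in facts if f["relation"] in wanted.get(question_type, set())]

def gnn_augment(facts, num_rounds=2):
    # Stage 3: implicit knowledge augmentation -- mean message passing
    # over a graph whose nodes are the entities in the retained facts.
    nodes = {e for f in facts for e in (f["subject"], f["object"])}
    emb = {n: [float(len(n)), 1.0] for n in nodes}  # toy initial features
    edges = [(f["subject"], f["object"]) for f in facts]
    for _ in range(num_rounds):
        new = {}
        for n in nodes:
            # Aggregate the node's own state with both in- and out-neighbors.
            neigh = [emb[v] for u, v in edges if u == n]
            neigh += [emb[u] for u, v in edges if v == n]
            neigh.append(emb[n])
            new[n] = [sum(c) / len(neigh) for c in zip(*neigh)]
        emb = new
    return emb

kb = [
    {"subject": "umbrella", "relation": "UsedFor", "object": "rain"},
    {"subject": "umbrella", "relation": "AtLocation", "object": "closet"},
]
facts = retrieve_commonsense("what is the umbrella for?", kb)
facts = refine_by_type(facts, "what")
emb = gnn_augment(facts)  # embeddings for the question-relevant subgraph
```

The resulting node embeddings would, in the full framework, be fed alongside the image and question into the LVLM's reasoning step; here they simply show how the refined facts become a structured graph signal.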

Authors (4)
  1. Shuo Yang (244 papers)
  2. Siwen Luo (14 papers)
  3. Soyeon Caren Han (48 papers)
  4. Eduard Hovy (115 papers)