Knowledge-Enhanced VQA Architectures
- These architectures enhance VQA by integrating external knowledge through enriched prompts, explicit knowledge graph fusion, and retrieval-augmented reasoning.
- They improve accuracy by dynamically filtering and fusing multimodal cues, with gains of up to 7.6% on benchmark datasets.
- These architectures are significant for enabling commonsense reasoning and precise fact retrieval in complex visual query scenarios.
Knowledge-Enhanced Visual Question Answering (VQA) architectures are designed to augment standard vision-language models with external knowledge sources, facilitating answers to queries that require information beyond what is directly depicted in an image. These architectures have evolved to encompass implicit, explicit, and retrieval-augmented knowledge integration, addressing the increasing complexity of VQA tasks spanning commonsense reasoning, encyclopedic facts, and specialized domains.
1. Knowledge-Enrichment Paradigms and Model Components
Recent advances distinguish three principal paradigms in knowledge-enhanced VQA: (1) knowledge-enriched prompting and fusion, (2) explicit knowledge graph integration, and (3) retrieval-augmented generation from multimodal knowledge bases.
Knowledge-enriched prompting methods, such as LG-VQA (Ghosal et al., 2023), supplement the (image, question) pair with automatically generated textual guides, including image captions, rationales, scene graph triplets, or object lists. These are concatenated with candidate answers to form enriched prompts, subsequently fused within a pretrained backbone such as CLIP or BLIP-2 using matching or cross-modal fusion modules. Notably, CLIP's scoring metric is

$$s(I, T) = \frac{\hat{v}^{\top}\hat{t}}{\tau},$$

where $\hat{v}$ is the $\ell_2$-normalized image embedding, $\hat{t}$ is the $\ell_2$-normalized text embedding, and $\tau$ is a learnable temperature. BLIP-2 contextualizes both standard input and guided prompts via its Q-Former and FlanT5 LLM, combining feature sets through concatenation, difference, and product.
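For concreteness, the following sketch illustrates prompt-guided scoring with an off-the-shelf CLIP checkpoint: generated guides are concatenated with the question and each candidate answer, and the resulting texts are scored against the image. The prompt template, guide text, image path, and checkpoint are illustrative assumptions, not LG-VQA's exact configuration.

```python
# Sketch: score candidate answers with CLIP after enriching the text prompt
# with generated guides (caption, rationale, scene-graph objects).
# Assumes the Hugging Face "openai/clip-vit-base-patch32" checkpoint; the
# prompt template and inputs below are illustrative, not LG-VQA's format.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
question = "What season is it?"
guides = "Caption: a person shovels snow off a driveway. Objects: shovel, snow, coat."
candidates = ["spring", "summer", "autumn", "winter"]

# Enriched prompt: question + guides + one candidate answer per text string.
texts = [f"{question} {guides} Answer: {c}" for c in candidates]

inputs = processor(text=texts, images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the temperature-scaled similarity between the
# L2-normalized image embedding and each enriched text embedding.
probs = out.logits_per_image.softmax(dim=-1).squeeze(0)
print({c: round(p.item(), 3) for c, p in zip(candidates, probs)})
```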
Explicit knowledge integration architectures, e.g., MAGIC-VQA (Yang et al., 24 Mar 2025), retrieve structured triplets (subject, relation, object) from knowledge graphs such as ATOMIC2020, scoring their relevance via cross-modal embedding similarity and filtering by commonsense type. Further contextualization and reasoning employ GNN modules that operate over graphs unifying modality-specific and knowledge nodes, generating GNN-augmented confidence distributions for LVLM-based answer synthesis.
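A minimal sketch of this triplet-filtering step, assuming a frozen text encoder (standing in for a BLIP-2-style embedder) and illustrative threshold and top-k values; it is not MAGIC-VQA's actual scoring module.

```python
# Sketch: filter KG triplets by cross-modal embedding similarity.
# `encode_text` and `query_emb` stand in for a frozen encoder over the
# (image, question) pair; the 0.35 threshold and top_k are illustrative.
import torch
import torch.nn.functional as F

def rank_triplets(query_emb: torch.Tensor,
                  triplets: list[tuple[str, str, str]],
                  encode_text,
                  threshold: float = 0.35,
                  top_k: int = 5):
    """Return the top triplets whose cosine similarity to the query exceeds `threshold`."""
    texts = [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triplets]
    trip_emb = encode_text(texts)                                          # (N, d)
    sims = F.cosine_similarity(trip_emb, query_emb.unsqueeze(0), dim=-1)   # (N,)
    keep = [(t, sim.item()) for t, sim in zip(triplets, sims) if sim >= threshold]
    keep.sort(key=lambda x: x[1], reverse=True)
    return keep[:top_k]
```

The retained triplets would then be serialized into the prompt and/or attached as knowledge nodes for the subsequent GNN stage.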
Retrieval-augmented generation (RAG) frameworks, such as EchoSight (Yan et al., 17 Jul 2024) and mKG-RAG (Yuan et al., 7 Aug 2025), perform large-scale retrieval from external knowledge bases (e.g., Wikipedia) using a visual encoder (EVA-CLIP-8B, BLIP-2) for coarse selection, followed by multimodal reranking or graph retrieval. These systems deliver downstream context (wiki sections, entity facts, graph elements) to an answer generator, typically an LLM, which synthesizes the response via next-token prediction.
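The two-stage retrieve-then-rerank pattern can be sketched as follows, assuming precomputed unit-norm knowledge-base embeddings and a placeholder reranking function; the specific retrievers and rerankers of EchoSight and mKG-RAG are not reproduced here.

```python
# Sketch: coarse retrieval over a precomputed knowledge-base index followed by
# reranking of the shortlist. `kb_embs` are assumed to be L2-normalized article
# embeddings; `rerank_score` stands in for a multimodal reranker (e.g. a
# Q-Former scoring question + section text) and is a hypothetical callable.
import torch

def retrieve_and_rerank(query_emb, kb_embs, kb_texts, question,
                        rerank_score, coarse_k=50, final_k=5):
    # Stage 1: cosine similarity via dot product (embeddings unit-normalized).
    sims = kb_embs @ query_emb                                   # (num_entries,)
    top = torch.topk(sims, k=min(coarse_k, sims.numel())).indices.tolist()

    # Stage 2: rerank the shortlisted entries with a finer cross-modal scorer.
    scored = [(i, rerank_score(question, kb_texts[i])) for i in top]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [kb_texts[i] for i, _ in scored[:final_k]]
```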
2. Knowledge Acquisition, Selection, and Filtering
Effective knowledge-enhanced architectures address both the acquisition and filtering of relevant information. LG-VQA (Ghosal et al., 2023) leverages rationales, captions, and scene graphs produced by vision-language models, while MAGIC-VQA (Yang et al., 24 Mar 2025) applies type-specific post-processing to optimize triplet contextual relevance, using dynamic ratio splits (e.g., 0.7 PE, 0.15 EC, 0.15 SI for ScienceQA) and threshold-based relevance tagging.
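A hedged sketch of type-ratio knowledge selection with threshold-based tagging; the budget, threshold, and selection logic are illustrative assumptions, while the ratios follow the ScienceQA split quoted above.

```python
# Sketch: allocate a fixed knowledge budget across commonsense types using
# dataset-specific ratios (0.7 PE / 0.15 EC / 0.15 SI for ScienceQA, per the
# text above), then keep only triplets whose relevance clears a threshold.
# The budget, threshold, and selection logic are illustrative.
def select_by_type(scored_triplets, ratios, budget=20, threshold=0.3):
    """scored_triplets: dict mapping type -> list of (triplet, score), sorted descending."""
    selected = []
    for ktype, ratio in ratios.items():
        quota = max(1, round(budget * ratio))
        pool = [(t, s) for t, s in scored_triplets.get(ktype, []) if s >= threshold]
        selected.extend(t for t, _ in pool[:quota])
    return selected

scienceqa_ratios = {"PE": 0.70, "EC": 0.15, "SI": 0.15}
```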
Multi-step, graph-based approaches such as DMMGR (Li et al., 2022) iteratively query KB memory modules, attending over key-value slots that store triplet embeddings and dynamically recalibrating the query vector. mKG-RAG (Yuan et al., 7 Aug 2025) employs dual-stage retrieval: candidate documents are first identified via visual/textual similarity, followed by graph-based retrieval of relevant entities and relations from multimodal knowledge graphs constructed offline.
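The iterative memory-guided reading described for DMMGR can be sketched as repeated attention over key-value triplet memory with query recalibration; the GRU-based update, scaling, and dimensions below are assumptions rather than the paper's exact design.

```python
# Sketch: iterative key-value memory reads over stored triplet embeddings with
# dynamic query recalibration. The GRU update rule and step count are
# illustrative choices, not DMMGR's exact architecture.
import torch
import torch.nn as nn

class MemoryReader(nn.Module):
    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(dim, dim)   # recalibrates the query each step

    def forward(self, query, mem_keys, mem_values):
        # query: (B, d); mem_keys / mem_values: (B, M, d)
        for _ in range(self.steps):
            attn = torch.softmax(
                torch.einsum("bd,bmd->bm", query, mem_keys) / mem_keys.size(-1) ** 0.5,
                dim=-1,
            )
            read = torch.einsum("bm,bmd->bd", attn, mem_values)
            query = self.update(read, query)  # fold retrieved knowledge back into the query
        return query
```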
Redundancy and noise mitigation frameworks such as KF-VQA (Liu et al., 11 Sep 2025) introduce selective filtering post-retrieval. These systems distill image-question pairs into low-noise queries, extract concise knowledge segments via VLM-LLM collaboration, and incorporate external knowledge only when answer-confidence thresholds warrant it. This strategy is quantitatively demonstrated to outperform both prior training-based and prompt-only architectures by up to +2.8% accuracy on OK-VQA.
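A minimal sketch of confidence-gated knowledge incorporation in the spirit of this strategy; `vlm_answer`, `distill_query`, and `retrieve_snippets` are hypothetical callables, and the threshold is illustrative.

```python
# Sketch: fall back to external knowledge only when the model's own answer is
# uncertain. All callables below are hypothetical stand-ins for the VLM/LLM
# components described in the text; the 0.8 threshold is illustrative.
def answer_with_selective_knowledge(image, question,
                                    vlm_answer, distill_query, retrieve_snippets,
                                    conf_threshold=0.8):
    answer, confidence = vlm_answer(image, question)      # internal attempt first
    if confidence >= conf_threshold:
        return answer                                      # skip retrieval entirely

    query = distill_query(image, question)                 # low-noise search query
    snippets = retrieve_snippets(query)                    # concise knowledge segments
    grounded, _ = vlm_answer(image, question, context=snippets)
    return grounded
```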
3. Fusion and Reasoning Mechanisms
Fusion strategies encompass both attention-based and graph-based reasoning. LG-VQA performs fusion via simple concatenation into the backbone's text encoder (CLIP) or via two-pass feature mixing in BLIP-2. MAGIC-VQA unifies enriched prompts with GNN reasoning, aggregating explicit triplet relevance and implicit confidence via message passing over a graph of modality and knowledge nodes, yielding structurally aware predictions.
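In its simplest form, the structural aggregation underlying such GNN fusion reduces to message passing over a joint graph of modality and knowledge nodes; the mean-aggregation step below is a didactic sketch rather than any specific model's layer.

```python
# Sketch: one round of mean-aggregation message passing over a small graph that
# mixes modality nodes (image, question) with knowledge nodes (triplets).
# A real GNN module would stack learned layers; this only shows the aggregation.
import torch

def message_pass(node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # node_feats: (N, d); adj: (N, N) binary adjacency including self-loops.
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return adj @ node_feats / deg          # each node averages itself and its neighbors
```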
Structural reasoning is realized extensively in mKG-RAG (Yuan et al., 7 Aug 2025) through subgraph retrieval and traversal: retrieved multimodal KG subgraphs are expanded to their l-hop neighborhoods via graph traversal and integrated as factual or textual context for RAG generators. DMMGR (Li et al., 2022) executes iterative, memory-enhanced reasoning cycles combining dynamic knowledge representations with multi-head graph attention over spatially aware object graphs.
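The l-hop neighborhood expansion amounts to bounded breadth-first traversal over the knowledge graph, as in the following sketch (the dict-based adjacency representation is an assumption for illustration, not mKG-RAG's storage format).

```python
# Sketch: expand retrieved seed entities to their l-hop neighborhood via
# breadth-first traversal over a plain adjacency map.
from collections import deque

def expand_l_hop(seeds: set[str], adjacency: dict[str, set[str]], l: int) -> set[str]:
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == l:
            continue                         # stop expanding beyond l hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited
```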
Fine-grained retrieval, late-interaction mechanisms (UniRVQA (Deng et al., 5 Apr 2025)), and reflective answering allow models to calibrate answer quality and adapt to their knowledge boundaries. UniRVQA merges retrieval and generation within shared parameters, triggers knowledge retrieval only when internal predictions fail validation, and utilizes ColBERT-style late interaction for token-level matching.
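ColBERT-style late interaction scores a query-document pair by summing, over query tokens, each token's maximum similarity to any document token; the sketch below assumes unit-normalized token embeddings and illustrative shapes.

```python
# Sketch: ColBERT-style late interaction (MaxSim). Each query token takes its
# maximum similarity over document tokens, and the per-token maxima are summed.
import torch

def late_interaction_score(query_tokens: torch.Tensor,
                           doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (Lq, d); doc_tokens: (Ld, d), both L2-normalized.
    sim = query_tokens @ doc_tokens.T        # (Lq, Ld) token-level similarities
    return sim.max(dim=-1).values.sum()      # MaxSim per query token, then sum
```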
4. Quantitative Performance and Ablations
Systematic benchmarks across A-OKVQA, ScienceQA, VQAv2, InfoSeek, E-VQA, and domain datasets (e.g., DermaGraph) confirm substantial gains for knowledge-enhanced architectures. LG-VQA (Ghosal et al., 2023) improves CLIP accuracy on A-OKVQA by +7.6% (zero-shot: 58.5% → guided: 75.98%) and BLIP-2 by +4.8%. MAGIC-VQA shows accuracy gains of up to 9 points on ScienceQA when substituting base VLPM embeddings with GNN node embeddings.
Dynamic knowledge infusion in OFA (Jhalani et al., 14 Jun 2024) yields +4.75% EM over prior SOTA, with absolute improvements of +6.48% (KVQA) and +5.95% (CRIC-VQA) for dynamically thresholded multi-hop facts. mKG-RAG (Yuan et al., 7 Aug 2025) establishes new SOTA on E-VQA, with LoRA adapters improving accuracy to 41.4% (InfoSeek) and 38.4% (E-VQA, single-hop).
Domain-specific frameworks, e.g., Med-GRIM (Madavan et al., 20 Jul 2025), achieve closed-ended accuracy of up to 87.5% on VQA-RAD by leveraging prompt-engineered graph retrieval and staged agent workflows at a fraction of standard computational cost. Ablation analyses confirm that knowledge filtering, graph fusion, multimodal integration, and prompt engineering robustly increase accuracy while reducing hallucination and overfitting.
5. Robustness, Limitations, and Future Directions
Knowledge-enhanced VQA systems exhibit several robust traits: model-agnostic integration via guiding prompts (LG-VQA), reduction of retriever-generator overhead through unified calibration (UniRVQA), mitigation of knowledge-induced noise by redundancy filtering and selective fusion (KF-VQA), and dynamic adaptation through self-reflective mechanisms.
Limitations persist. Fine-grained entity disambiguation (UniRVQA, MAGIC-VQA, Med-GRIM) is challenging for visually similar objects; hallucinations may arise if symbolic knowledge or external KBs are noisy, incomplete, or not contextually aligned. High compute requirements for offline KG construction (mKG-RAG), external knowledge coverage gaps, and domain-specificity (e.g., scene graphs in surgical VQA (Yuan et al., 2023)) necessitate scalable engineering for broader deployment.
Emerging extensions target joint retriever–generator optimization, continuous-valued knowledge alignment, scaling multimodal KGs to billion-scale, dynamic guidance selection, advanced domain-specific knowledge injection, multi-hop navigation, and human-in-the-loop interactive adaptation (Deng et al., 24 Apr 2025). Further integration of LLMs as implicit knowledge bases and reasoning agents, careful balance of precision versus recall in context selection, and the development of unified, bias-controlled benchmarks are key research priorities.
6. Representative Benchmarks, Architectures, and Case Studies
| Model | Knowledge Source(s) | Fusion Mechanism | SOTA Dataset(s) | Performance Highlight |
|---|---|---|---|---|
| LG-VQA (Ghosal et al., 2023) | Rationales, captions, scene graphs | Prompt-guided concatenation or 2-pass fusion | A-OKVQA, ScienceQA, VSR, IconQA | +7.6% (CLIP), +4.8% (BLIP-2) |
| MAGIC-VQA (Yang et al., 24 Mar 2025) | ATOMIC2020 KG, BLIP2 Embeddings | Explicit (filter+GNN); Implicit augmentation | ScienceQA, TextVQA, MMMU | All CS injection: up to +6.3% |
| EchoSight (Yan et al., 17 Jul 2024) | Wiki images+text, Wikipedia | Visual retrieval + multimodal reranking + RAG | E-VQA, InfoSeek | 41.8% E-VQA, 31.3% InfoSeek |
| mKG-RAG (Yuan et al., 7 Aug 2025) | Multimodal knowledge graph | Dual-stage retrieval + KG subgraph RAG | E-VQA, InfoSeek | 41.4% InfoSeek (LoRA tuned) |
| KF-VQA (Liu et al., 11 Sep 2025) | Google search corpus + VLM/LLM | Redundancy filtering, selective fusion | OK-VQA, A-OKVQA | 63.2% OK-VQA (training-free) |
| Med-GRIM (Madavan et al., 20 Jul 2025) | DermaGraph, BIND Dense Encodings | Graph RAG, modular agent workflow | VQA-RAD, DermaGraph | 87.5% VQA-RAD, 83.3% DermaGraph |
| UniRVQA (Deng et al., 5 Apr 2025) | External KB, MLLM pretrained | Unified retrieval-generation-reflection | OK-VQA, InfoSeek | +4.8% OK-VQA, +7.5% InfoSeek |
| DMMGR (Li et al., 2022) | KB triplet memory | Multi-step memory-guided graph attention | KRVQR, FVQA | 31.4% KRVQR, 81.2% FVQA (SOTA) |
7. Synthesis and Prospects
Knowledge-enhanced VQA architectures represent a convergence of robust multimodal representation, dynamic and precision-tuned knowledge retrieval, selective fusion, and flexible, context-sensitive reasoning mechanisms. Empirical gains on challenging benchmarks underscore the value of both explicit graph-based and implicit prompt-based enrichment, and the trend toward larger, role-modular models decouples representation learning from knowledge integration.
Critical ongoing directions include further reducing knowledge-induced noise via adaptive dynamic fusion, extending scalable multimodal KG construction, domain-general benchmarking, and integrating human feedback mechanisms to ensure alignment and explainability in real-world deployments. The intersection of robust explicit reasoning (via KGs and graphs) with implicit LLM-driven generative modeling positions knowledge-enhanced VQA as a core frontier in multimodal AI system design.