MR-MKG: Multimodal Reasoning with MMKGs

Updated 30 April 2026

MR-MKG is a framework that integrates image-text subgraphs from multimodal knowledge graphs, significantly reducing hallucinations in LLM reasoning.
The system employs modules such as visual/text encoders, RGAT, and cross-modal alignment to effectively fuse symbolic and perceptual evidence.
MR-MKG demonstrates state-of-the-art performance on tasks like ScienceQA with over 92% accuracy and notable improvements in analogical reasoning.

MR-MKG, or Multimodal Reasoning with Multimodal Knowledge Graphs, denotes a methodological framework for enhancing the factuality and robustness of multimodal LLMs by providing access to both symbolic and perceptual evidence via multimodal knowledge graphs (MMKGs). In contrast to unimodal KG approaches (textual-only KGs), MR-MKG directly grounds reasoning on integrated image-text subgraphs, reducing model hallucination and supporting complex cross-modal inference tasks in domains such as science question answering and analogical reasoning (Lee et al., 2024).

1. Motivation: Multimodal Hallucination and KG Limitations

Hallucination is a persistent failure mode in multimodal LLMs: models confidently generate incorrect facts when prompted with image-and-text queries, largely because of limits in parametric memorization and outdated or missing internal knowledge. Text-only KG-augmented approaches mitigate some knowledge deficits but cannot use the visual cues integral to many reasoning problems. For example, a textual KG cannot express perceptual features such as shape or texture that multimodal tasks depend on (e.g., inferring the state “Utah” by its triangular shape in an image) (Lee et al., 2024). MR-MKG addresses these deficiencies by retrieving and integrating subgraphs from MMKGs, in which each node carries both textual and visual instantiations of concepts and entities.

2. MR-MKG System Architecture

MR-MKG comprises a pipeline of six modules that compose and align evidence from question text, input images, and subgraphs retrieved from an MMKG:

Language Encoder: Input text (question/context/choices) is embedded using the LLM’s frozen word embedding layer, forming $H_T \in \mathbb{R}^{L \times d}$, where $L$ is the sequence length and $d$ the embedding dimension. All LLM parameters remain frozen.
Visual Encoder and Adapter: Images are processed via a pre-trained visual backbone (e.g., CLIP ViT-L/32) to produce patch-level features $X_I \in \mathbb{R}^{M \times d_\text{vis}}$. A projection $W_I$ and bias $b_I$ map these to $H_I$, after which selective attention lets the text tokens attend over the image features via $\mathrm{Softmax}(H_T H_I^\top/\sqrt{d}) H_I$, yielding image-contextualized token features $H_I'$.
MMKG Subgraph Retrieval: Entity mentions from the question are matched to the MMKG. The system retrieves the top-$K$ relevant nodes and expands them with their one-hop neighbors, yielding a subgraph of $N$ triples that carries both visual and textual context.
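Relational Graph Attention Network (RGAT): The retrieved MMKG subgraph is encoded via a relational GAT, where at each layer node updates are mediated by attention-weighted incoming neighbor states and edge-type embeddings. For node $i$ at layer $\ell$:

$$h_i^{(\ell+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(\ell)}\, W^{(\ell)} \big(h_j^{(\ell)} + r_{ij}\big)\Big)$$

where $\mathcal{N}(i)$ is the set of incoming neighbors of node $i$, $r_{ij}$ is the embedding of the edge type linking $j$ to $i$, and the attention $\alpha_{ij}^{(\ell)}$ normalizes compatibility between source/target states and relation embedding. Multihead mechanisms allow the model to attend over diverse subgraph semantics.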

Knowledge Adapter and Fusion: RGAT node embeddings are linearly projected into the LLM space. Selective attention enables either text or vision encodings to attend over KG nodes, fusing symbolic and perceptual context.
Cross-modal Alignment Module: For entities present in both text and image form, a triplet loss minimizes the (squared Euclidean) distance between matched modalities and maximizes it for mismatches, refining cross-modal entity alignment:

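$$\mathcal{L}_{\text{align}} = \max\big(0,\ \lVert e_I - e_T \rVert_2^2 - \lVert e_I - e_N \rVert_2^2 + m\big)$$

with $e_I$, $e_T$ denoting aligned image/text node embeddings, $e_N$ a negative (mismatched) embedding, and $m$ the margin.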
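As a minimal sketch of the alignment objective described above, assuming a standard margin-based triplet loss over squared Euclidean distances: the function names, the margin default, and the toy vectors below are illustrative and are not taken from Lee et al. (2024).

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_alignment_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the matched pair together and push
    the mismatched pair at least `margin` further apart.

    `margin=1.0` is an assumed default, not a value reported in the source.
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Toy embeddings (made up for illustration).
image_emb = [1.0, 0.0]     # image view of an entity
text_emb = [1.0, 0.5]      # text view of the same entity
other_text = [-1.0, 0.0]   # text view of a different entity

# Matched pair is already much closer than the mismatch: no penalty.
loss_aligned = triplet_alignment_loss(image_emb, text_emb, other_text)
# Swapping positive and negative roles produces a large penalty.
loss_swapped = triplet_alignment_loss(image_emb, other_text, text_emb)
```

The loss is zero once the matched pair is closer than the mismatched pair by at least the margin, so only poorly aligned entities contribute gradient.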

The outputs from the three branches (the text embeddings $H_T$, the image-contextualized features $H_I'$, and the adapted knowledge-graph features) are concatenated and provided as the prompt to the frozen LLM decoder. Only the adapters, RGAT, alignment head, and output softmax are trainable, amounting to ~2.25% of the total parameter budget in FLAN-T5-11B (Lee et al., 2024).
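The selective-attention step and the final concatenation can be illustrated with a small pure-Python sketch. It assumes the keys double as values, as in the formula $\mathrm{Softmax}(H_T H_I^\top/\sqrt{d}) H_I$, and that concatenation happens along the sequence axis; the helper names and toy matrices are hypothetical, and the knowledge-graph branch is omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def selective_attention(queries, keys):
    """Compute Softmax(Q K^T / sqrt(d)) K, one output row per query row.

    `queries` is L x d (e.g. H_T) and `keys` is M x d (e.g. H_I); the keys
    also serve as values, matching the formula in the text.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)]
        )
    return attended


# Toy inputs: 2 text tokens, 3 image patches, d = 2 (values are made up).
H_T = [[1.0, 0.0], [0.0, 1.0]]
H_I = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

H_I_prime = selective_attention(H_T, H_I)  # L x d, image-contextualized
prompt = H_T + H_I_prime                   # concatenate along the sequence axis
```

Because the output has one row per text token, the image evidence is expressed at the text sequence length regardless of how many patches the visual backbone emits.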

3. Training Regimes and Objectives

MR-MKG training is conducted in two principal stages:

Pretraining: The model is first trained on an MMKG-grounded visual QA corpus of approximately 18k Visual Genome-based QA samples; for each region-based question-answer pair, a local MMKG subgraph is retrieved. The primary loss is an autoregressive next-token likelihood for answer generation, augmented by the cross-modal alignment objective. Only lightweight adapters, RGAT, the alignment projection, and softmax are updated.
Task-Specific Fine-Tuning: For downstream datasets (e.g., ScienceQA, MARS), MR-MKG is fine-tuned with rationale/answer sequence generation, updating the same adapters and task heads. All backbone LLM layers remain frozen, ensuring parameter-efficient transfer. The combined loss is

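$$\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda\,\mathcal{L}_{\text{align}}$$

where $\mathcal{L}_{\text{gen}}$ is the autoregressive generation loss, $\mathcal{L}_{\text{align}}$ is the cross-modal alignment (triplet) loss, and $\lambda$ is a weighting coefficient.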
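A small numeric sketch may help make the combined objective concrete. It assumes the generation term is the mean negative log-likelihood of the gold tokens and that the alignment term is added with a scalar weight; the weight `lam`, its default, and the function names are hypothetical, since the source does not give them.

```python
import math


def generation_nll(token_probs):
    """Mean negative log-likelihood of the gold tokens.

    `token_probs[t]` is the probability the decoder assigned to the correct
    token at step t (values in (0, 1]).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_loss(token_probs, align_loss, lam=0.1):
    """Generation loss plus a weighted alignment term.

    `lam` is a hypothetical weighting coefficient; the source does not
    state its value.
    """
    return generation_nll(token_probs) + lam * align_loss


# Two gold tokens predicted with probability 0.5 and 0.25 (made-up values).
gold_token_probs = [0.5, 0.25]
total = combined_loss(gold_token_probs, align_loss=2.0, lam=0.1)
```

In this sketch both terms would update the same small set of trainable modules, since the backbone LLM stays frozen in both training stages.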

4. Empirical Results and Ablations

MR-MKG demonstrates state-of-the-art performance under parameter-efficient constraints:

ScienceQA: MR-MKG with FLAN-T5-11B (248M trained parameters) achieves over 92% accuracy, surpassing previous adapter-based SOTA models (LaVIN, LLaVA) and matching or outpacing full fine-tuning strategies (Lee et al., 2024). Adding the MMKG improves accuracy over text-only KGs; cross-modal alignment and MMKG pretraining yield further incremental gains.
MARS (multimodal reasoning): On Visual-LLaMA-2 7B, MR-MKG improves on MKGformer in both Hits@1 and MRR, indicating substantial improvement in analogical reasoning tasks.
Ablations: Ablating each component confirms additive value: moving from visual-only adapters to a text KG, then adding the MMKG, and finally adding cross-modal alignment raises ScienceQA accuracy at each step. Performance peaks for KG subgraphs of moderate size; larger contexts introduce distractors.
Further Analysis: RGAT outperforms GAT and vanilla GNN variants (+0.9% ScienceQA). Text-only query retrieval is most effective on ScienceQA due to its textual focus.
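To illustrate what separates a relational GAT from a plain GAT, the sketch below runs one simplified relation-aware attention update in pure Python. It is a stand-in under stated assumptions rather than the layer used by Lee et al. (2024): single head, no learned projections or nonlinearity, an additive message (source state plus relation embedding), and dot-product scoring. All names and toy vectors are invented.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def relational_attention_layer(node_states, edges, relation_emb):
    """One simplified relation-aware attention update.

    node_states:  {node: vector}
    edges:        list of (source, relation, target) triples
    relation_emb: {relation: vector}, same dimension as the node vectors

    Each target node attends over its incoming edges. The message on an edge
    is source_state + relation_embedding; its score is the dot product with
    the target state. Learned projections and the nonlinearity are omitted.
    """
    incoming = {}
    for src, rel, dst in edges:
        message = add(node_states[src], relation_emb[rel])
        incoming.setdefault(dst, []).append(message)

    updated = {}
    for node, state in node_states.items():
        messages = incoming.get(node)
        if not messages:
            updated[node] = list(state)  # no incoming edges: keep the state
            continue
        scores = [dot(state, m) for m in messages]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        updated[node] = [
            sum((e / total) * m[k] for e, m in zip(exps, messages))
            for k in range(len(state))
        ]
    return updated


# Toy subgraph (entities, relations, and vectors are invented).
states = {"moon": [1.0, 0.0], "earth": [0.0, 1.0], "sun": [1.0, 1.0]}
relations = {"orbits": [0.5, 0.0], "illuminates": [0.0, 0.5]}
triples = [("moon", "orbits", "earth"), ("sun", "illuminates", "earth")]

new_states = relational_attention_layer(states, triples, relations)
```

Dropping `relation_emb` from the message reduces this to ordinary neighbor attention, which is the distinction the RGAT versus GAT comparison turns on.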

5. Relation to Other Frameworks

MR-MKG aligns with a wider trend of expanding LLMs’ reasoning capabilities through KG and multimodal augmentation, but differs in its direct integration of image-text subgraphs:

SLIF-MR (Guo et al., 14 Jul 2025): Treats multimodal/relational graphs dynamically for recommendation, enforcing semantic consistency; however, it does not address open-ended reasoning nor multimodal analogy.
M$^3$KG-RAG (Park et al., 23 Dec 2025): Employs multi-agent construction and modality-wise retrieval from MMKGs for audio-visual RAG; introduces GRASP for redundant triplet pruning.
MMKGR (Zheng et al., 2022): Focuses on multi-hop reasoning over MKGs, employing RL to traverse multi-modal entity paths, relevant for explainable KG completion.
JMAC (Tong et al., 2022): Targets joint multilingual KG completion and alignment using relation-aware GNNs for tasks such as entity matching and triple prediction, not requiring direct visual evidence.
MKG-Rank (Li et al., 20 Mar 2025): Demonstrates the use of word-level translated medical KGs for multilingual QA, but is limited to text-only KGs.

Compared to these, MR-MKG uniquely combines MMKG retrieval, relational GAT encoding, and cross-modal contrastive learning for general multimodal LLM grounding (Lee et al., 2024).

6. Limitations and Future Directions

MR-MKG’s strengths reside in its mitigation of hallucination, parameter efficiency, modular structure, and improved cross-modal grounding. Key limitations include:

Retrieval Sensitivity: Inadequate or imprecise MMKG retrieval limits performance. End-to-end learning of retrievers or incorporation of dense passage retrieval techniques is a critical open direction.
Coverage and Domain Transfer: Public MMKGs remain incomplete or biased. Onboarding new domains requires scalable methods for constructing or aligning MMKGs.
Computational Overhead: RGAT inference on large subgraphs introduces compute costs; optimizing for latency and efficiency remains important.
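Since retrieval quality and subgraph size both appear among these limitations, a toy version of the retrieval step (top-$K$ seed entities, one-hop expansion, a cap on triples) may be useful for orientation. Word-overlap scoring, the default limits, and the tiny knowledge graph are assumptions made for illustration; they are not the retriever described by Lee et al. (2024).

```python
def retrieve_subgraph(question, triples, top_k=2, max_triples=10):
    """Pick the top_k entities by word overlap with the question, then keep
    triples that touch any of them (a one-hop expansion), capped at
    max_triples.

    Word-overlap scoring and both default limits are illustrative
    assumptions, not the retriever described in the source.
    """
    question_words = set(question.lower().split())
    entities = sorted({e for head, _, tail in triples for e in (head, tail)})

    def overlap(entity):
        return len(question_words & set(entity.lower().split()))

    # Highest overlap first; ties broken alphabetically for determinism.
    ranked = sorted(entities, key=lambda e: (-overlap(e), e))
    seeds = {e for e in ranked[:top_k] if overlap(e) > 0}

    subgraph = [t for t in triples if t[0] in seeds or t[2] in seeds]
    return subgraph[:max_triples]


# Toy knowledge graph (invented for illustration).
kg = [
    ("red panda", "lives in", "bamboo forest"),
    ("red panda", "has image", "red_panda.jpg"),
    ("bamboo forest", "located in", "himalayas"),
    ("snow leopard", "lives in", "himalayas"),
]

result = retrieve_subgraph("where does the red panda live", kg, top_k=1)
```

A query with no lexical match returns an empty subgraph here, which shows in miniature why imprecise retrieval caps downstream performance and why learned or dense retrievers are listed as an open direction.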

Potential avenues include integrating differentiable retrieval, dynamic MMKG construction during inference, scaling to higher-parameter LLMs, expanding modality support (audio, video), and embedding regularization via TransE-style losses during pretraining (Lee et al., 2024).

7. Applications and Impact

MR-MKG establishes a technical foundation for robust multimodal LLM reasoning across diverse domains that require both symbolic and perceptual knowledge. Its efficacy is demonstrated in science QA, analogical reasoning, and scenarios where grounding in updated, multimodal external knowledge is essential. The modular and parameter-efficient architecture supports deployment in resource-constrained or continually evolving environments, and the pattern of integrating MMKG evidence is becoming standard for hallucination reduction and reliable cross-modal inference in next-generation LLMs (Lee et al., 2024).