MegaRAG: Multimodal Retrieval-Augmented Generation

Updated 8 April 2026

MegaRAG is a multimodal Retrieval-Augmented Generation system that builds detailed Multimodal Knowledge Graphs (MMKGs) to capture textual, visual, and spatial cues.
It integrates MMKG construction, unified multimodal retrieval, and decoupled two-stage answer generation to enable cross-modal and hierarchical reasoning over complex documents.
Experimental evaluations demonstrate that MegaRAG outperforms conventional RAG and KG-based approaches on both global and fine-grained QA benchmarks.

MegaRAG is a multimodal Retrieval-Augmented Generation (RAG) system designed to answer questions over complex, visually rich, and long-form documents by leveraging automatically constructed Multimodal Knowledge Graphs (MMKGs). MegaRAG extends conventional RAG frameworks by incorporating visual and spatial modalities throughout the knowledge graph construction, retrieval, and answer generation stages, enabling cross-modal and hierarchical reasoning. Experimental evaluations demonstrate that MegaRAG achieves substantial improvements over previous RAG and KG-based approaches across a spectrum of global and fine-grained question answering benchmarks (Hsiao et al., 26 Nov 2025).

1. System Architecture and Pipeline

MegaRAG operates in three principal stages: MMKG construction, multimodal retrieval, and MMKG-augmented generation. The system orchestrates these phases to facilitate structured and contextually rich reasoning over documents with substantial visual content and complex structure.

MMKG Construction:

Documents are parsed into $N$ pages. For each page $i$, the system extracts textual content ($T_i$), figure images ($F_i$), table images ($B_i$), and a rendered layout image ($I_i$). A Multimodal LLM (GPT-4o-mini) processes all modalities in parallel per page to produce initial entity nodes ($E_i^0$; including text spans, figures, tables) and relation edges ($R_i^0$; e.g., “illustrates,” “supports”). Page-level graphs are merged by aligning entity names and types into an initial global MMKG $\mathcal{G}^0$. A refinement step retrieves a one-hop subgraph $\mathcal{G}^0_i$ around each page’s entities and relations; a second LLM pass incorporates missing multimodal links (e.g., connecting a chart to a supporting textual claim), resulting in a refined MMKG $\mathcal{G}^1$.
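The merge of page-level graphs by entity name and type can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary schema, the `Entity` class, and the lowercase name normalization are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    descriptions: list = field(default_factory=list)
    pages: set = field(default_factory=set)


def merge_page_graphs(page_graphs):
    """Merge per-page (entities, relations) pairs into one global graph.

    Hypothetical schema: each entity is a dict with "name", "type" and
    "description"; each relation is a (source, label, target) tuple.
    """
    entities = {}      # (normalized name, type) -> Entity
    relations = set()  # deduplicated (source, label, target) triples
    for page_no, (page_entities, page_relations) in enumerate(page_graphs):
        for e in page_entities:
            # Entities that share a name and a type collapse into one node.
            key = (e["name"].strip().lower(), e["type"])
            node = entities.setdefault(key, Entity(e["name"], e["type"]))
            node.descriptions.append(e["description"])
            node.pages.add(page_no)
        for src, label, dst in page_relations:
            relations.add((src.strip().lower(), label, dst.strip().lower()))
    return entities, relations


# Toy input: two pages that both mention the same concept.
pages = [
    ([{"name": "Revenue Chart", "type": "figure", "description": "Bar chart of revenue."},
      {"name": "Q3 Growth", "type": "concept", "description": "Growth claim on page 0."}],
     [("Revenue Chart", "illustrates", "Q3 Growth")]),
    ([{"name": "q3 growth", "type": "concept", "description": "Restated in the summary."}],
     []),
]
entities, relations = merge_page_graphs(pages)
print(len(entities), sorted(entities[("q3 growth", "concept")].pages))  # 2 [0, 1]
```

Keying on both name and type keeps a figure and a concept with the same label apart, while descriptions from different pages accumulate on the shared node.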

Indexing and Retrieval:

Entities, relations, and page images are indexed using the GME (Qwen2-VL) multimodal embedder, which maps each item to an embedding vector. A user query is decomposed into low-level (entity) and high-level (concept) keywords, which are embedded and aggregated. Retrieval selects the top-$k$ entities and relations by cosine similarity, with graph expansion to one-hop neighbors. In parallel, the top-$m$ relevant pages are retrieved via image embedding comparison.

MMKG-Augmented Generation:

Answer generation proceeds in two distinct stages to mitigate unimodal bias. Stage one generates a visual answer $A_v$ from the retrieved pages and a KG-based answer $A_g$ from a serialized KG subgraph. Stage two fuses $A_v$ and $A_g$ using a dedicated fusion prompt, producing the final answer $A$.

2. Multimodal Knowledge Graph Construction

The MMKG encodes information from textual, visual, and spatial cues:

Textual Entities: Spans in $T_i$ (e.g., definitions, concepts).
Visual Entities: Each figure in $F_i$ and each table in $B_i$ is represented as an individual node.
Spatial Cues: The layout image $I_i$ is used for spatial reasoning; it is not mapped directly as a node but provides context.
Entity–Relation Extraction: For each page, initial extraction is performed by the Multimodal LLM over $T_i$, $F_i$, $B_i$, and $I_i$, yielding $E_i^0$ and $R_i^0$. Merging aligns across pages by entity name and type, with descriptions and keywords aggregated.
Refinement: A subgraph $\mathcal{G}^0_i$ is retrieved, and the LLM is called again to add missing cross-modal relations, yielding refined entities $E_i^1$ and relations $R_i^1$. Merging all per-page outputs produces the refined MMKG $\mathcal{G}^1$.

This two-pass, page-parallel construction is designed to capture both local and global multimodal relations while respecting LLM context window limitations.
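The one-hop subgraph used as context for the refinement pass can be sketched as below. The triple representation and the function name `one_hop_subgraph` are assumptions made for illustration; the source does not specify how the subgraph is stored or retrieved.

```python
def one_hop_subgraph(relations, seed_entities):
    """Return the seed entities, their direct neighbors, and the connecting edges.

    relations: iterable of (source, label, target) triples (hypothetical format).
    """
    seeds = set(seed_entities)
    # Keep every edge that touches at least one seed entity.
    edges = [(s, r, t) for (s, r, t) in relations if s in seeds or t in seeds]
    nodes = set(seeds)
    for s, _, t in edges:
        nodes.update((s, t))
    return nodes, edges


global_relations = [
    ("fig1", "illustrates", "claim A"),
    ("claim A", "supports", "conclusion"),
    ("table2", "reports", "metric X"),
]
nodes, edges = one_hop_subgraph(global_relations, {"claim A"})
print(sorted(nodes))  # ['claim A', 'conclusion', 'fig1']
```

Restricting the second pass to this neighborhood keeps the prompt small, which is consistent with the stated goal of respecting context window limits.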

3. Multimodal Retrieval Mechanisms

Retrieval in MegaRAG utilizes a unified multimodal embedding space. The system embeds both query keywords (low- and high-level) and MMKG graph components via the GME:

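Embedding Computation:

The aggregated query keywords $q$ and each graph component or page image $x$ are mapped to vectors by the GME encoder:

$$\mathbf{v}_q = \mathrm{GME}(q), \qquad \mathbf{v}_x = \mathrm{GME}(x)$$

Scoring:

Entities and relations are ranked by cosine similarity:

$$s(q, x) = \frac{\mathbf{v}_q^\top \mathbf{v}_x}{\lVert \mathbf{v}_q \rVert \, \lVert \mathbf{v}_x \rVert}$$

Retrieval:

The top-$k$ entities and relations are selected, expanded to one-hop graph neighborhoods, and augmented by the top-$m$ retrieved pages using image embeddings.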

All retrieval operations are conducted in the same embedding space, enabling joint retrieval across textual and visual modalities and facilitating efficient graph-based expansion for context.
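A minimal sketch of this retrieval step, in plain Python, is given below. It assumes embeddings are already computed and stored as lists of floats; the toy vectors, the names `retrieve` and `top_k`, and the cutoffs `k` and `m` are hypothetical. For brevity it ranks entities only, whereas the system described above ranks relations as well.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, item_vecs, k):
    """Names of the k items whose vectors are most similar to the query."""
    ranked = sorted(item_vecs, key=lambda name: cosine(query_vec, item_vecs[name]), reverse=True)
    return ranked[:k]


def retrieve(query_vec, entity_vecs, relations, page_vecs, k=1, m=1):
    hits = top_k(query_vec, entity_vecs, k)
    hit_set = set(hits)
    # One-hop expansion: edges touching a hit, plus the entities they reach.
    edges = [(s, r, t) for (s, r, t) in relations if s in hit_set or t in hit_set]
    neighbors = {n for s, _, t in edges for n in (s, t)} - hit_set
    # Page retrieval runs against the same embedding space.
    pages = top_k(query_vec, page_vecs, m)
    return hits, sorted(neighbors), edges, pages


entity_vecs = {"revenue chart": [1.0, 0.0], "q3 growth": [0.8, 0.6], "staff table": [0.0, 1.0]}
page_vecs = {"page 1": [0.9, 0.2], "page 2": [0.1, 0.9]}
relations = [("revenue chart", "illustrates", "q3 growth"), ("staff table", "lists", "headcount")]

hits, neighbors, edges, pages = retrieve([1.0, 0.1], entity_vecs, relations, page_vecs)
print(hits, neighbors, pages)  # ['revenue chart'] ['q3 growth'] ['page 1']
```

Because entities and pages share one vector space, the same `top_k` helper serves both lookups.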

4. Answer Generation via Prompt Engineering

Answer generation leverages prompt engineering rather than architectural changes. The system serializes MMKG subgraphs as lists of "Entity–Relation–Entity" triples with brief descriptions. Pages and images form separate context blocks.
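One way to serialize a retrieved subgraph into a text block of triples with brief descriptions is sketched below. The arrow notation, the fallback text, and the function name `serialize_subgraph` are assumptions; the source does not give the exact prompt format.

```python
def serialize_subgraph(edges, descriptions):
    """Render triples, then one description line per entity, as plain text."""
    lines = [f"({src}) -[{rel}]-> ({dst})" for src, rel, dst in edges]
    lines.append("")  # blank line between triples and descriptions
    entity_names = sorted({name for src, _, dst in edges for name in (src, dst)})
    for name in entity_names:
        lines.append(f"{name}: {descriptions.get(name, 'no description')}")
    return "\n".join(lines)


kg_context = serialize_subgraph(
    [("revenue chart", "illustrates", "q3 growth")],
    {"revenue chart": "Bar chart of quarterly revenue."},
)
print(kg_context)
```

The resulting string can be placed in the prompt as the KG context block, separate from the page images.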

The generation protocol is as follows:

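Stage 1:
- Visual answer: $A_v = \mathrm{LLM}(q, P)$, generated from the query $q$ and the retrieved page images $P$.
- KG-based answer: $A_g = \mathrm{LLM}(q, S)$, generated from the query $q$ and the serialized MMKG subgraph $S$.
Stage 2:
- Final answer: $A = \mathrm{LLM}(q, A_v, A_g)$, produced with the fusion prompt.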

Explicit two-stage fusion supports decoupled text/visual reasoning and mitigates unimodal bias, as confirmed by ablation studies. No modifications to transformer attention or model architecture are required.
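The two-stage protocol can be sketched as three model calls. In this illustration `generate` is a hypothetical callable standing in for whatever multimodal LLM client is used, and the prompt wording is invented; only the control flow (two independent drafts, then fusion) follows the description above.

```python
def two_stage_answer(query, page_images, kg_context, generate):
    """Two independent drafts (visual, graph) followed by a fusion call.

    generate(prompt, images) -> str is a hypothetical model interface.
    """
    # Stage 1a: draft answer from the retrieved page images only.
    visual_answer = generate(
        f"Answer from the attached page images only.\nQuestion: {query}",
        page_images,
    )
    # Stage 1b: draft answer from the serialized knowledge graph only.
    kg_answer = generate(
        f"Answer from this knowledge graph only.\n{kg_context}\nQuestion: {query}",
        [],
    )
    # Stage 2: fuse the two drafts into the final answer.
    fusion_prompt = (
        "Combine the two draft answers into one final answer.\n"
        f"Question: {query}\n"
        f"Visual draft: {visual_answer}\n"
        f"Graph draft: {kg_answer}"
    )
    return generate(fusion_prompt, [])


# Stub model that records its calls, so the flow can be inspected offline.
calls = []


def fake_generate(prompt, images):
    calls.append((prompt, len(images)))
    return f"draft{len(calls)}"


final = two_stage_answer(
    "What drove Q3 growth?",
    ["page_3.png"],
    "(revenue chart) -[illustrates]-> (q3 growth)",
    fake_generate,
)
print(final)  # draft3
```

Keeping the drafts separate means neither modality's evidence can crowd out the other before fusion, which is the stated motivation for the decoupled design.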

5. Experimental Evaluation

MegaRAG's efficacy is demonstrated on a comprehensive suite of global and local QA datasets, covering both textual and multimodal content:

Dataset / Task	MegaRAG Performance	Comparison with Baselines
UltraDomain (Global QA)	Avg. win rates: Comprehensiveness 59.0%, Diversity 71.4%, Empowerment 74.8%, Overall 71.8%	Outperformed best baseline
Multimodal Global QA (DLCV, World History, Environmental Report, GenAI)	Avg. win rates: Comprehensiveness 83.3%, Diversity 92.7%, Empowerment 84.7%, Overall 89.5%	Outperformed best baseline
SlideVQA(2k) (Local QA)	Accuracy: 64.85%	LightRAG: 27.66%
RealMMBench (Local QA)	Accuracy: FinReport 39.51%, FinSlides 58.37%, TechReport 51.51%, TechSlides 60.86%	Substantially above baselines

Evaluation on global QA uses pairwise LLM assessment (Comprehensiveness, Diversity, Empowerment, Overall); local QA measures matching accuracy judged by GPT-4.1-mini over held-out questions. MegaRAG consistently outperforms previous RAG-based baselines.
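A win rate under pairwise assessment is the share of comparisons, per criterion, in which the judge prefers one system. The sketch below shows that arithmetic only. The judgment list is made-up toy data, not results from the paper, and the function name `win_rates` is hypothetical.

```python
from collections import Counter


def win_rates(judgments, system):
    """Percentage of pairwise comparisons won by `system`, per criterion.

    judgments: iterable of (criterion, winner) pairs from an LLM judge.
    """
    wins, totals = Counter(), Counter()
    for criterion, winner in judgments:
        totals[criterion] += 1
        if winner == system:
            wins[criterion] += 1
    return {c: 100.0 * wins[c] / totals[c] for c in totals}


# Toy judgments for illustration only.
toy_judgments = [
    ("Overall", "MegaRAG"), ("Overall", "MegaRAG"),
    ("Overall", "baseline"), ("Overall", "MegaRAG"),
    ("Diversity", "MegaRAG"), ("Diversity", "baseline"),
]
rates = win_rates(toy_judgments, "MegaRAG")
print(rates)  # {'Overall': 75.0, 'Diversity': 50.0}
```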

6. Component Ablations and Analysis

Ablation studies systematically disable core components:

A1 (text-only KG): Removing figures, tables, and page images results in near-zero win rates on slide-centric benchmarks, verifying the necessity of truly multimodal construction.
A2 (disable MMKG retrieval): Relying solely on page-level retrieval leads to complete performance collapse (MegaRAG beats A2 at ≈100% win rate), affirming the centrality of structured graph-based retrieval.
A3 (single-pass generation): Merging text and visual reasoning into one stage reduces Diversity and Empowerment metrics by 14–25 points, demonstrating the value of decoupled two-stage answer synthesis.

These results underscore that integrated multimodal graph construction, retrieval, and decoupled generation are all necessary for strong performance on visually rich and conceptually complex document QA tasks.

7. Limitations and Future Directions

MegaRAG's scalability is constrained by reliance on two LLM passes per document page and by the computational demands of GME-based embedding and retrieval. Only a single round of MMKG refinement is performed; further iterative refinement may enhance cross-page coherence. No end-to-end fine-tuning of components is used, suggesting that joint optimization via retrieval-guided LM fine-tuning could yield additional improvements.

Suggested avenues for future research include: (i) iterative graph refinement using learned graph embeddings (e.g., TransE), (ii) mediation of retrieval signals via graph neural networks, and (iii) exploration of longer context windows or retrieval caching to accommodate larger corpora and documents.

In summary, MegaRAG demonstrates that incorporation of MMKGs, unified multimodal retrieval, and a structured, two-stage generation procedure provides state-of-the-art performance on both global and fine-grained QA over complex multimodal documents (Hsiao et al., 26 Nov 2025).


References

Hsiao et al. (2025). MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation.
