MegaRAG: Multimodal Retrieval-Augmented Generation

Updated 8 April 2026
  • MegaRAG is a multimodal Retrieval-Augmented Generation system that builds detailed Multimodal Knowledge Graphs (MMKGs) to capture textual, visual, and spatial cues.
  • It integrates MMKG construction, unified multimodal retrieval, and decoupled two-stage answer generation to enable cross-modal and hierarchical reasoning over complex documents.
  • Experimental evaluations demonstrate that MegaRAG outperforms conventional RAG and KG-based approaches on both global and fine-grained QA benchmarks.

MegaRAG is a multimodal Retrieval-Augmented Generation (RAG) system designed to answer questions over complex, visually rich, and long-form documents by leveraging automatically constructed Multimodal Knowledge Graphs (MMKGs). MegaRAG extends conventional RAG frameworks by incorporating visual and spatial modalities throughout the knowledge graph construction, retrieval, and answer generation stages, enabling cross-modal and hierarchical reasoning. Experimental evaluations demonstrate that MegaRAG achieves substantial improvements over previous RAG and KG-based approaches across a spectrum of global and fine-grained question answering benchmarks (Hsiao et al., 26 Nov 2025).

1. System Architecture and Pipeline

MegaRAG operates in three principal stages: MMKG construction, multimodal retrieval, and MMKG-augmented generation. The system orchestrates these phases to facilitate structured and contextually rich reasoning over documents with substantial visual content and complex structure.

MMKG Construction:

Documents are parsed into $N$ pages. For each page $i$, the system extracts textual content ($T_i$), figure images ($F_i$), table images ($B_i$), and a rendered layout image ($I_i$). A Multimodal LLM (GPT-4o-mini) processes all modalities in parallel per page to produce initial entity nodes ($E_i^0$; including text spans, figures, tables) and relation edges ($R_i^0$; e.g., “illustrates,” “supports”). Page-level graphs are merged by aligning entity names and types into an initial global MMKG $\mathcal{G}^0$. A refinement step retrieves a one-hop subgraph $\mathcal{G}^0_i$ around each page’s entities and relations; a second LLM pass incorporates missing multimodal links (e.g., connecting a chart to a supporting textual claim), resulting in a refined MMKG.
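The merge step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Entity` class, the `entity_key` normalization (case-insensitive, whitespace-trimmed), and the tuple layout for relations are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    descriptions: list = field(default_factory=list)


def entity_key(name, type_):
    # Hypothetical normalization: match names and types case-insensitively.
    return (name.strip().lower(), type_.strip().lower())


def merge_page_graphs(pages):
    """Merge per-page (entities, relations) pairs into one global graph.

    Each relation is a (src_name, src_type, label, dst_name, dst_type) tuple.
    """
    entities = {}
    relations = set()
    for page_entities, page_relations in pages:
        for ent in page_entities:
            key = entity_key(ent.name, ent.type)
            if key in entities:
                # Same name and type seen on another page: aggregate descriptions.
                entities[key].descriptions.extend(ent.descriptions)
            else:
                entities[key] = Entity(ent.name, ent.type, list(ent.descriptions))
        for src_name, src_type, label, dst_name, dst_type in page_relations:
            relations.add(
                (entity_key(src_name, src_type), label, entity_key(dst_name, dst_type))
            )
    return entities, relations
```

Keying on the (name, type) pair keeps a figure and a text span with the same name as separate nodes, while mentions of the same figure on different pages collapse into one node with aggregated descriptions.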

Indexing and Retrieval:

Entities, relations, and page images are indexed using the GME (Qwen2-VL) multimodal embedder, which maps each item to an embedding vector. A user query is decomposed into low-level (entity) and high-level (concept) keywords, which are embedded and aggregated. Retrieval selects the top-ranked entities and relations by cosine similarity, then expands the selection to their one-hop neighbors in the graph. In parallel, the most relevant pages are retrieved by comparing image embeddings.

MMKG-Augmented Generation:

Answer generation proceeds in two distinct stages, mitigating unimodal bias. Stage one generates a visual answer from the retrieved pages and a KG-based answer from a serialized KG subgraph. Stage two fuses the two intermediate answers using a dedicated fusion prompt, producing the final answer.

2. Multimodal Knowledge Graph Construction

The MMKG encodes information from textual, visual, and spatial cues:

  • Textual Entities: Spans in $T_i$ (e.g., definitions, concepts).
  • Visual Entities: Each figure in $F_i$ and each table in $B_i$ is represented as an individual node.
  • Spatial Cues: The layout image $I_i$ is used for spatial reasoning; it is not mapped directly as a node but provides context.
  • Entity–Relation Extraction: For each page, the Multimodal LLM performs the initial extraction of entities $E_i^0$ and relations $R_i^0$. Merging aligns entities across pages by name and type, with descriptions and keywords aggregated.
  • Refinement: A one-hop subgraph $\mathcal{G}^0_i$ is retrieved, and the LLM is called again to add missing cross-modal relations, yielding refined entities and relations. Merging all per-page outputs produces the refined MMKG.

This two-pass, page-parallel construction is designed to capture both local and global multimodal relations while respecting LLM context window limitations.
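The refinement pass can be illustrated with a small sketch. It assumes relations are stored as (source, label, target) triples, and `propose_links` is a hypothetical stand-in for the second LLM call; neither detail comes from the paper.

```python
def one_hop_subgraph(page_entities, relations):
    """Return the relations touching any page entity, plus all their endpoints."""
    seeds = set(page_entities)
    sub_relations = [r for r in relations if r[0] in seeds or r[2] in seeds]
    sub_entities = set(seeds)
    for src, _label, dst in sub_relations:
        sub_entities.update((src, dst))
    return sub_entities, sub_relations


def refine_graph(pages, relations, propose_links):
    """Second pass: collect extra relations proposed for each page.

    `pages` is a list of entity-name collections, one per page.
    `propose_links(page_entities, sub_entities, sub_relations)` is a
    hypothetical callable standing in for the LLM; it returns new triples.
    """
    refined = set(relations)
    for page_entities in pages:
        # Each page sees only its own neighborhood of the initial graph,
        # which keeps the prompt small and lets pages be processed independently.
        sub_entities, sub_relations = one_hop_subgraph(page_entities, relations)
        refined.update(propose_links(page_entities, sub_entities, sub_relations))
    return refined
```

Because every page reads from the initial graph rather than from other pages' refinements, the loop body has no cross-page dependency and could be run in parallel.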

3. Multimodal Retrieval Mechanisms

Retrieval in MegaRAG utilizes a unified multimodal embedding space. The system embeds both query keywords (low- and high-level) and MMKG graph components via the GME:

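  • Embedding Computation:

GME encodes the query keywords and every entity, relation, and page image into vectors in one shared space.

  • Scoring:

Entities and relations are ranked by the cosine similarity between their embeddings and the query embedding.

  • Retrieval:

The top-ranked entities and relations are selected, expanded to their one-hop graph neighborhoods, and augmented by the top-ranked pages retrieved using image embeddings.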

All retrieval operations are conducted in the same embedding space, enabling joint retrieval across textual and visual modalities and facilitating efficient graph-based expansion for context.
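A toy version of the score-then-expand step is sketched below. The hand-written vectors stand in for GME embeddings, the cutoff `k` is a hypothetical parameter, and for brevity only entities are ranked (the system described above also ranks relations and pages).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, entity_vecs, edges, k):
    """Pick the k entities most similar to the query, then add one-hop neighbors.

    `entity_vecs` maps entity name -> embedding; `edges` holds
    (source, label, target) triples.
    """
    ranked = sorted(
        entity_vecs,
        key=lambda name: cosine(query_vec, entity_vecs[name]),
        reverse=True,
    )
    seeds = ranked[:k]
    expanded = set(seeds)
    for src, _label, dst in edges:
        if src in seeds or dst in seeds:
            expanded.update((src, dst))
    return seeds, expanded
```

Expansion here pulls in entities that are not themselves similar to the query but are connected to one that is, which is how a chart can bring its supporting textual claim into the context.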

4. Answer Generation via Prompt Engineering

Answer generation leverages prompt engineering rather than architectural changes. The system serializes MMKG subgraphs as lists of "Entity–Relation–Entity" triples with brief descriptions. Pages and images form separate context blocks.
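One plausible way to serialize a subgraph into such a context block is sketched here. The exact text layout (the `-[label]->` arrow, the section headers) is an assumption for illustration; the source does not specify the format.

```python
def serialize_subgraph(entities, relations):
    """Render a subgraph as plain text for a prompt context block.

    `entities` maps entity name -> brief description;
    `relations` is a list of (source, label, target) triples.
    """
    lines = ["Entities:"]
    for name in sorted(entities):
        lines.append(f"- {name}: {entities[name]}")
    lines.append("Relations:")
    for src, label, dst in relations:
        lines.append(f"- {src} -[{label}]-> {dst}")
    return "\n".join(lines)
```

Sorting entity names makes the output deterministic, so the same subgraph always yields the same prompt text.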

The generation protocol is as follows:

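  1. Stage 1:
    • A visual answer is generated from the retrieved page images.
    • A KG-based answer is generated from the serialized MMKG subgraph.
  2. Stage 2:
    • The two intermediate answers are fused by a dedicated fusion prompt into the final answer.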

Explicit two-stage fusion supports decoupled text/visual reasoning and mitigates unimodal bias, as confirmed by ablation studies. No modifications to transformer attention or model architecture are required.
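The control flow of the two-stage protocol can be sketched as below. The `llm` argument is a hypothetical callable taking a prompt and a list of image references, and the prompt wording is placeholder text, not the prompts used in the paper.

```python
def two_stage_answer(question, page_images, kg_text, llm):
    """Generate per-modality answers separately, then fuse them.

    `llm(prompt, images)` is a hypothetical callable returning a string.
    """
    # Stage 1: two independent calls, each seeing only one kind of evidence.
    visual_answer = llm(
        f"Answer using only the attached pages.\nQuestion: {question}",
        page_images,
    )
    kg_answer = llm(
        f"Answer using only this knowledge graph.\n{kg_text}\nQuestion: {question}",
        [],
    )
    # Stage 2: a separate fusion call combines the two intermediate answers.
    fusion_prompt = (
        f"Question: {question}\n"
        f"Answer from pages: {visual_answer}\n"
        f"Answer from knowledge graph: {kg_answer}\n"
        "Combine these into one final answer."
    )
    return llm(fusion_prompt, [])
```

Keeping the stage-one calls separate means neither evidence type can crowd the other out of a shared context, which is the decoupling the ablation in Section 6 tests.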

5. Experimental Evaluation

MegaRAG's efficacy is demonstrated on a comprehensive suite of global and local QA datasets, covering both textual and multimodal content:

| Dataset / Task | MegaRAG Performance | Baseline Comparison |
|---|---|---|
| UltraDomain (Global QA) | Avg. win rates: Comp. 59.0%, Div. 71.4%, Emp. 74.8%, Overall 71.8% | Outperformed best baseline |
| Multimodal Global QA (DLCV, World History, Environmental Report, GenAI) | Avg. win rates: Comp. 83.3%, Div. 92.7%, Emp. 84.7%, Overall 89.5% | Outperformed best baseline |
| SlideVQA(2k) (Local QA) | 64.85% accuracy | LightRAG: 27.66% |
| RealMMBench (Local QA) | FinReport 39.51%, FinSlides 58.37%, TechReport 51.51%, TechSlides 60.86% | Substantially above baselines |

Evaluation on global QA uses pairwise LLM assessment (Comprehensiveness, Diversity, Empowerment, Overall); local QA measures matching accuracy judged by GPT-4.1-mini over held-out questions. MegaRAG consistently outperforms previous RAG-based baselines.
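For readers unfamiliar with pairwise win rates, the arithmetic can be sketched as follows. The verdict format is an assumption for illustration; how the paper handles ties or answer ordering is not stated here.

```python
from collections import Counter


def win_rates(judgments, system="MegaRAG"):
    """Percentage of pairwise comparisons won by `system`, per criterion.

    `judgments` is a list of dicts mapping criterion name -> name of the
    system the judge preferred for one question.
    """
    wins = Counter()
    totals = Counter()
    for verdict in judgments:
        for criterion, winner in verdict.items():
            totals[criterion] += 1
            if winner == system:
                wins[criterion] += 1
    return {criterion: 100.0 * wins[criterion] / totals[criterion] for criterion in totals}
```

A win rate above 50% on a criterion means the judge preferred the system's answer over the baseline's on most questions.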

6. Component Ablations and Analysis

Ablation studies systematically disable core components:

  • A1 (text-only KG): Removing figures, tables, and page images results in near-zero win rates on slide-centric benchmarks, verifying the necessity of truly multimodal construction.
  • A2 (disable MMKG retrieval): Relying solely on page-level retrieval leads to complete performance collapse (MegaRAG beats A2 at ≈100% win rate), affirming the centrality of structured graph-based retrieval.
  • A3 (single-pass generation): Merging text and visual reasoning into one stage reduces Diversity and Empowerment metrics by 14–25 points, demonstrating the value of decoupled two-stage answer synthesis.

These results underscore that integrated multimodal graph construction, retrieval, and decoupled generation are all necessary for strong performance on visually rich and conceptually complex document QA tasks.

7. Limitations and Future Directions

MegaRAG's scalability is constrained by reliance on two LLM passes per document page and by the computational demands of GME-based embedding and retrieval. Only a single round of MMKG refinement is performed; further iterative refinement may enhance cross-page coherence. No end-to-end fine-tuning of components is used, suggesting that joint optimization via retrieval-guided LM fine-tuning could yield additional improvements.

Possible avenues for future research include: (i) iterative graph refinement using learned graph embeddings (e.g., TransE), (ii) mediation of retrieval signals via graph neural networks, and (iii) exploration of longer context windows or retrieval caching to accommodate larger corpora and documents.

In summary, MegaRAG demonstrates that incorporation of MMKGs, unified multimodal retrieval, and a structured, two-stage generation procedure provides state-of-the-art performance on both global and fine-grained QA over complex multimodal documents (Hsiao et al., 26 Nov 2025).
