MegaRAG: Multimodal Retrieval-Augmented Generation
- MegaRAG is a multimodal Retrieval-Augmented Generation system that builds detailed Multimodal Knowledge Graphs (MMKGs) to capture textual, visual, and spatial cues.
- It integrates MMKG construction, unified multimodal retrieval, and decoupled two-stage answer generation to enable cross-modal and hierarchical reasoning over complex documents.
- Experimental evaluations demonstrate that MegaRAG outperforms conventional RAG and KG-based approaches on both global and fine-grained QA benchmarks.
MegaRAG is a multimodal Retrieval-Augmented Generation (RAG) system designed to answer questions over complex, visually rich, and long-form documents by leveraging automatically constructed Multimodal Knowledge Graphs (MMKGs). MegaRAG extends conventional RAG frameworks by incorporating visual and spatial modalities throughout the knowledge graph construction, retrieval, and answer generation stages, enabling cross-modal and hierarchical reasoning. Experimental evaluations demonstrate that MegaRAG achieves substantial improvements over previous RAG and KG-based approaches across a spectrum of global and fine-grained question answering benchmarks (Hsiao et al., 26 Nov 2025).
1. System Architecture and Pipeline
MegaRAG operates in three principal stages: MMKG construction, multimodal retrieval, and MMKG-augmented generation. Together, these stages support structured, context-rich reasoning over documents with substantial visual content and complex structure.
MMKG Construction:
Documents are parsed into pages. For each page $p_i$, the system extracts textual content ($T_i$), figure images ($F_i$), table images ($B_i$), and a rendered layout image ($I_i$). A Multimodal LLM (GPT-4o-mini) processes all modalities in parallel per page to produce initial entity nodes ($E_i^0$; including text spans, figures, tables) and relation edges ($R_i^0$; e.g., “illustrates,” “supports”). Page-level graphs are merged by aligning entity names and types into an initial global MMKG $G^0$. A refinement step retrieves a one-hop subgraph around each page’s entities and relations; a second LLM pass incorporates missing multimodal links (e.g., connecting a chart to a supporting textual claim), resulting in a refined MMKG $G^1$.
Indexing and Retrieval:
Entities, relations, and page images are indexed using the GME (Qwen2-VL) multimodal embedder, producing an embedding vector $\mathbf{e}_x$ for each indexed item $x$. A user query is decomposed into low-level (entity) and high-level (concept) keywords, which are embedded and aggregated. Retrieval selects the top-$k$ entities and relations by cosine similarity, with graph expansion to one-hop neighbors. In parallel, the top-$m$ relevant pages are retrieved via image embedding comparison.
MMKG-Augmented Generation:
Answer generation proceeds in two distinct stages to mitigate unimodal bias. Stage one generates a visual answer $A_v$ from the retrieved pages and a KG-based answer $A_{kg}$ from a serialized KG subgraph. Stage two fuses $(A_v, A_{kg})$ using a dedicated fusion prompt, producing the final answer $A$.
2. Multimodal Knowledge Graph Construction
The MMKG $G = (E, R)$, with entity set $E$ and relation set $R$, encodes information from textual, visual, and spatial cues:
- Textual Entities: Spans in the page text $T_i$ (e.g., definitions, concepts).
- Visual Entities: Each figure image in $F_i$ and each table image in $B_i$ is represented as an individual node.
- Spatial Cues: The layout image $I_i$ is used for spatial reasoning; it is not mapped directly as a node but provides context.
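- Entity–Relation Extraction: For each page $p_i$, the Multimodal LLM performs initial extraction, producing the entities and relations that maximize $P(E_i^0, R_i^0 \mid T_i, F_i, B_i, I_i)$, i.e., their likelihood given the page's text, figures, tables, and layout image. Merging aligns entities across pages by name and type, with descriptions and keywords aggregated.
- Refinement: A one-hop subgraph $G_i^{\text{sub}}$ around each page's entities and relations is retrieved from the initial global graph, and the LLM is called again to add missing cross-modal relations, yielding refined entities $E_i^1$ and relations $R_i^1$. Merging all per-page outputs produces the refined MMKG $G^1$.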
This two-pass, page-parallel construction is designed to capture both local and global multimodal relations while respecting LLM context window limitations.
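The merge step can be illustrated with a minimal sketch. It assumes each page-level graph is a plain dictionary of entities and relations and that alignment is an exact match on a case- and whitespace-normalized (name, type) pair. The data layout, the `normalize` helper, and the function name `merge_page_graphs` are hypothetical and chosen for illustration; the source does not specify how MegaRAG implements alignment.

```python
def normalize(name, etype):
    """Build an alignment key: lowercase, with whitespace runs collapsed."""
    return (" ".join(name.lower().split()), etype.lower().strip())


def merge_page_graphs(page_graphs):
    """Merge per-page graphs into one global graph.

    Each page graph is assumed to look like:
      {"entities": [{"name", "type", "description", "keywords"}, ...],
       "relations": [{"source": (name, type), "label", "target": (name, type)}, ...]}
    Entities sharing a normalized (name, type) key are merged, and their
    descriptions and keywords are aggregated.
    """
    entities = {}
    relations = set()
    for graph in page_graphs:
        for ent in graph["entities"]:
            key = normalize(ent["name"], ent["type"])
            merged = entities.setdefault(
                key,
                {"name": ent["name"], "type": ent["type"],
                 "descriptions": [], "keywords": set()},
            )
            description = ent.get("description")
            if description and description not in merged["descriptions"]:
                merged["descriptions"].append(description)
            merged["keywords"].update(ent.get("keywords", []))
        for rel in graph["relations"]:
            source = normalize(*rel["source"])
            target = normalize(*rel["target"])
            # A set removes relations that are repeated across pages.
            relations.add((source, rel["label"], target))
    return entities, relations
```

Exact-match alignment is the simplest possible choice; it will not merge paraphrased names such as "Fig. 2" and "Figure 2", which is one kind of gap a second refinement pass could help close.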
3. Multimodal Retrieval Mechanisms
Retrieval in MegaRAG utilizes a unified multimodal embedding space. The system embeds both query keywords (low- and high-level) and MMKG graph components via the GME:
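- Embedding Computation:
$\mathbf{e}_x = \mathrm{GME}(x)$, where $x$ is an entity, a relation, a page image, or a query keyword
- Scoring:
Entities and relations are ranked by cosine similarity:
$\mathrm{sim}(q, x) = \dfrac{\mathbf{e}_q \cdot \mathbf{e}_x}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_x \rVert}$
- Retrieval:
Top-$k$ entities and relations are selected, expanded to one-hop graph neighborhoods, and augmented by the top-$m$ retrieved pages using image embeddings.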
All retrieval operations are conducted in the same embedding space, enabling joint retrieval across textual and visual modalities and facilitating efficient graph-based expansion for context.
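The retrieval flow described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not MegaRAG's implementation: embeddings are given as lists of floats (in the real system they would come from the GME embedder), the aggregated query keywords are represented by a single query vector, only entities are indexed for brevity, and the function names `cosine`, `top_k`, `expand_one_hop`, and `retrieve` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec, index, k):
    """Return the k keys of `index` whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda key: cosine(query_vec, index[key]), reverse=True)
    return ranked[:k]


def expand_one_hop(seeds, edges):
    """Add every node that shares an edge with a seed node."""
    seeds = set(seeds)
    expanded = set(seeds)
    for source, _label, target in edges:
        if source in seeds or target in seeds:
            expanded.update((source, target))
    return expanded


def retrieve(query_vec, entity_index, page_index, edges, k, m):
    """Top-k entities with one-hop expansion, plus top-m pages."""
    seeds = top_k(query_vec, entity_index, k)
    return {
        "entities": expand_one_hop(seeds, edges),
        "pages": top_k(query_vec, page_index, m),
    }
```

Because entities and pages are scored against the same query vector, the sketch mirrors the shared embedding space: no separate text and image retrievers are needed.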
4. Answer Generation via Prompt Engineering
Answer generation leverages prompt engineering rather than architectural changes. The system serializes MMKG subgraphs as lists of "Entity–Relation–Entity" triples with brief descriptions. Pages and images form separate context blocks.
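As a rough illustration of the serialization step, the sketch below renders a subgraph as one line per triple, attaching a brief description to each entity when one is available. The line format, the `|` separator, and the function name `serialize_subgraph` are assumptions made for this example; the source does not give the exact prompt layout.

```python
def serialize_subgraph(triples, descriptions):
    """Render (head, relation, tail) triples as one text line each.

    `descriptions` maps an entity name to a brief description; entities
    without a description are rendered by name only.
    """
    def render(entity):
        note = descriptions.get(entity)
        return f"{entity} ({note})" if note else entity

    return "\n".join(
        f"{render(head)} | {relation} | {render(tail)}"
        for head, relation, tail in triples
    )
```

For instance, the triple `("Figure 3", "illustrates", "revenue growth")` with a description for `Figure 3` becomes a single line that an LLM can read without any graph-specific tooling.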
The generation protocol is as follows:
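- Stage 1:
  - $A_v = \mathrm{MLLM}(q, P_q)$: the visual answer, generated from the query $q$ and the retrieved pages $P_q$
  - $A_{kg} = \mathrm{MLLM}(q, G_q)$: the KG-based answer, generated from the query and the serialized MMKG subgraph $G_q$
- Stage 2:
  - $A = \mathrm{MLLM}(q, A_v, A_{kg})$: the final answer, produced by the fusion prompt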
Explicit two-stage fusion supports decoupled text/visual reasoning and mitigates unimodal bias, as confirmed by ablation studies. No modifications to transformer attention or model architecture are required.
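The two-stage protocol can be sketched as three calls to the same model. Everything named here is an assumption for illustration: `mllm(prompt, images)` is a hypothetical callable that returns a string, the three prompt templates are placeholders rather than MegaRAG's actual prompts, and the sketch passes only the two draft answers to the fusion stage.

```python
VISUAL_PROMPT = "Answer using only the attached page images.\nQuestion: {question}"
KG_PROMPT = (
    "Answer using only these knowledge-graph triples.\n{kg}\nQuestion: {question}"
)
FUSION_PROMPT = (
    "Combine the two draft answers into one final answer.\n"
    "Question: {question}\nVisual draft: {visual}\nKG draft: {kg}"
)


def two_stage_answer(question, page_images, kg_context, mllm):
    """Generate a visual draft and a KG draft, then fuse them.

    `mllm(prompt, images)` is a hypothetical callable returning a string.
    """
    # Stage 1: two independent drafts, one per evidence source.
    visual_draft = mllm(VISUAL_PROMPT.format(question=question), page_images)
    kg_draft = mllm(KG_PROMPT.format(question=question, kg=kg_context), [])
    # Stage 2: fuse the drafts. In this sketch the fusion call sees only
    # the drafts, not the raw pages or triples.
    fusion_prompt = FUSION_PROMPT.format(
        question=question, visual=visual_draft, kg=kg_draft
    )
    return mllm(fusion_prompt, [])
```

Keeping the drafts separate until the final call is what makes the reasoning decoupled: neither evidence source can crowd the other out of a shared context window during stage one.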
5. Experimental Evaluation
MegaRAG's efficacy is demonstrated on a comprehensive suite of global and local QA datasets, covering both textual and multimodal content:
| Dataset / Task | MegaRAG Performance | Comparison with Baselines |
|---|---|---|
| UltraDomain (Global QA) | Avg. win rates: Comp. 59.0%, Div. 71.4%, Emp. 74.8%, Overall 71.8% | Outperforms the best baseline |
| Multimodal Global QA (DLCV, World History, Environmental Report, GenAI) | Avg. win rates: Comp. 83.3%, Div. 92.7%, Emp. 84.7%, Overall 89.5% | Outperforms the best baseline |
| SlideVQA(2k) (Local QA) | Accuracy: 64.85% | LightRAG: 27.66% |
| RealMMBench (Local QA) | Accuracy: FinReport 39.51%, FinSlides 58.37%, TechReport 51.51%, TechSlides 60.86% | Substantially above baselines |
Evaluation on global QA uses pairwise LLM assessment along four criteria: Comprehensiveness (Comp.), Diversity (Div.), Empowerment (Emp.), and Overall. Local QA is scored by answer-matching accuracy, judged by GPT-4.1-mini over held-out questions. MegaRAG consistently outperforms previous RAG-based baselines.
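To make the win-rate metric concrete, the sketch below aggregates pairwise judge verdicts into a per-criterion percentage. The verdict format (a criterion paired with the name of the winning system) and the function name `win_rates` are assumptions for this example, and the sketch counts any verdict that does not name the system, including a tie, as a non-win; the source does not describe how ties are handled.

```python
from collections import Counter


def win_rates(verdicts, system):
    """Percentage of pairwise comparisons won by `system`, per criterion.

    `verdicts` is an iterable of (criterion, winner) pairs, one per
    judged comparison.
    """
    totals = Counter()
    wins = Counter()
    for criterion, winner in verdicts:
        totals[criterion] += 1
        if winner == system:
            wins[criterion] += 1
    return {c: 100.0 * wins[c] / totals[c] for c in totals}
```

Under this definition, a win rate above 50% on a criterion means the system was preferred in the majority of comparisons against that baseline.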
6. Component Ablations and Analysis
Ablation studies systematically disable core components:
- A1 (text-only KG): Removing figures, tables, and page images results in near-zero win rates on slide-centric benchmarks, verifying the necessity of truly multimodal construction.
- A2 (disable MMKG retrieval): Relying solely on page-level retrieval leads to complete performance collapse (MegaRAG beats A2 at ≈100% win rate), affirming the centrality of structured graph-based retrieval.
- A3 (single-pass generation): Merging text and visual reasoning into one stage reduces Diversity and Empowerment metrics by 14–25 points, demonstrating the value of decoupled two-stage answer synthesis.
These results underscore that integrated multimodal graph construction, retrieval, and decoupled generation are all necessary for strong performance on visually rich and conceptually complex document QA tasks.
7. Limitations and Future Directions
MegaRAG's scalability is constrained by reliance on two LLM passes per document page and by the computational demands of GME-based embedding and retrieval. Only a single round of MMKG refinement is performed; further iterative refinement may enhance cross-page coherence. No end-to-end fine-tuning of components is used, suggesting that joint optimization via retrieval-guided LM fine-tuning could yield additional improvements.
Possible avenues for future research include: (i) iterative graph refinement using learned graph embeddings (e.g., TransE), (ii) mediation of retrieval signals via graph neural networks, and (iii) exploration of longer context windows or retrieval caching to accommodate larger corpora and documents.
In summary, MegaRAG demonstrates that incorporation of MMKGs, unified multimodal retrieval, and a structured, two-stage generation procedure provides state-of-the-art performance on both global and fine-grained QA over complex multimodal documents (Hsiao et al., 26 Nov 2025).