- The paper introduces MMORE, a modular and scalable RAG pipeline that unifies extraction from over fifteen file types with robust OCR and layout analysis.
- The paper demonstrates near-linear scaling and up to 155% efficiency gains through distributed processing, validating its high throughput on complex documents.
- The paper shows improved downstream QA performance, with effective retrieval-augmented inference boosting biomedical task accuracy and outperforming existing pipelines.
MMORE: A Scalable Pipeline for Massive Multimodal Open RAG and Extraction
Motivation and Context
The MMORE pipeline addresses a critical bottleneck in the deployment of retrieval-augmented generation (RAG) systems: the ingestion, transformation, and retrieval of knowledge from heterogeneous, real-world document formats at scale. The proliferation of unstructured and multimodal data—spanning PDFs, spreadsheets, presentations, images, audio, and video—has outpaced the capabilities of existing document processing and RAG frameworks, which are typically constrained in modality coverage, limited in throughput, or restricted by closed-source licensing. As the supply of high-quality, human-generated text data approaches exhaustion under current LLM scaling trends, robust, format-agnostic preprocessing and retrieval workflows become essential for both model verifiability and continued performance improvements.
System Architecture
MMORE is designed as a modular, distributed pipeline that unifies extraction, transformation, embedding, and retrieval for over fifteen file types. The architecture is built for extensibility and high-throughput parallelization, leveraging Dask for distributed execution across CPUs and GPUs, and supporting deployment from single workstations to large Kubernetes clusters.
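The map-gather pattern behind this parallelization can be sketched with the standard library as a dependency-free stand-in for Dask (MMORE itself uses Dask, whose `Client.map`/`gather` follow the same pattern and scale the identical code across nodes); the `extract_file` function and file names below are illustrative assumptions, not MMORE's actual API:

```python
# Sketch of per-file parallel extraction using the standard library as a
# stand-in for Dask's distributed map-gather execution model.
from concurrent.futures import ThreadPoolExecutor

def extract_file(path: str) -> dict:
    # Hypothetical stand-in for format-specific extraction
    # (OCR for PDFs, Whisper transcription for audio, ...).
    return {"path": path, "text": f"extracted text from {path}"}

files = ["report.pdf", "slides.pptx", "talk.mp3"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # One task per file; results come back in input order.
    results = list(pool.map(extract_file, files))
```

With Dask, the pool would be replaced by a `distributed.Client` pointed at a local or Kubernetes-backed cluster, leaving the map-gather structure unchanged.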
The core processing module standardizes heterogeneous content into a unified JSON-based format, the MultimodalSample, which interleaves plain text with modality placeholders and maintains a registry of extracted modalities (e.g., images, audio snippets). This design enables downstream tasks—such as multimodal pre-training or RAG—to operate on tightly linked text and non-text elements. Extraction leverages open-source tools: Surya for PDF/OCR, Whisper for audio, and standard Python libraries for office formats. The processor interface is abstracted to facilitate rapid addition of new file types via lightweight subclassing.
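The MultimodalSample abstraction can be pictured as follows; the field names and the `<attachment>` placeholder token are illustrative assumptions, not MMORE's exact schema:

```python
# Illustrative sketch of a MultimodalSample-style record: plain text is
# stored with inline placeholders, and a parallel list registers the
# extracted modalities in order of appearance.
from dataclasses import dataclass, field

@dataclass
class Modality:
    type: str   # e.g. "image", "audio"
    value: str  # path or URI of the extracted asset

@dataclass
class MultimodalSample:
    text: str                                   # text with <attachment> markers
    modalities: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

sample = MultimodalSample(
    text="Quarterly results are shown in <attachment> below.",
    modalities=[Modality(type="image", value="figures/q3_chart.png")],
    metadata={"source": "report.pdf", "page": 4},
)
# Each placeholder corresponds to exactly one registered modality.
assert sample.text.count("<attachment>") == len(sample.modalities)
```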
Distributed and Extensible Processing
MMORE natively supports both intra-node and inter-node parallelism, automatically exploiting all available hardware resources. Two processing modes are provided: a default mode prioritizing accuracy (including OCR and layout analysis), and a fast mode optimized for speed (e.g., omitting OCR). This allows users to balance fidelity and throughput according to application requirements. The pipeline is designed for maintainability and community-driven extension, with each file type handled by a modular processor.
RAG Pipeline
The RAG subsystem is composed of three decoupled components:
- Post-processing: Applies high-throughput filtering (e.g., via datatrove) and supports named-entity recognition (NER), chunking, and tagging.
- Indexing and Retrieval: Implements a hybrid dense-sparse indexing strategy, storing both lexical (sparse) and semantic (dense) embeddings for each document. This duality enables both interpretable keyword search and neural similarity retrieval.
- Integrated RAG Service: Exposes both interactive API and batch endpoints, with configurable model, prompt, and index parameters. The system is agnostic to downstream LLMs and can be integrated with external frameworks.
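The hybrid dense-sparse strategy above can be sketched as a weighted fusion of a lexical score and an embedding similarity. The toy scorers below (term overlap for sparse, a hashed bag-of-words cosine for dense) are deliberately simplified stand-ins for BM25 and a neural embedder, not MMORE's actual index:

```python
# Toy sketch of hybrid retrieval: fuse a sparse lexical score with a
# dense cosine similarity using a mixing weight alpha.
import math
import zlib
from collections import Counter

def sparse_score(query: str, doc: str) -> float:
    # Lexical overlap: fraction of query terms present in the document.
    q, d = query.lower().split(), set(doc.lower().split())
    return sum(t in d for t in q) / len(q)

def embed(text: str, dim: int = 64) -> list:
    # Deterministic hashed bag-of-words vector, standing in for a
    # neural embedding.
    v = [0.0] * dim
    for tok, n in Counter(text.lower().split()).items():
        v[zlib.crc32(tok.encode()) % dim] += n
    return v

def dense_score(query: str, doc: str) -> float:
    q, d = embed(query), embed(doc)
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def hybrid_rank(query: str, docs: list, alpha: float = 0.5) -> list:
    # alpha=1.0 is pure keyword search; alpha=0.0 is pure dense retrieval.
    scored = [(alpha * sparse_score(query, d) + (1 - alpha) * dense_score(query, d), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

docs = ["aspirin reduces fever",
        "transformers process tokens",
        "fever treated with aspirin"]
ranked = hybrid_rank("aspirin for fever", docs)
```

Storing both representations per document, as MMORE does, lets the same index serve interpretable keyword queries and semantic similarity queries without re-embedding.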
Empirical Evaluation
Processing Efficiency and Accuracy
MMORE is benchmarked against Docling, a popular open-source ingestion pipeline, on both efficiency and extraction fidelity. In synthetic scaling experiments on documents of up to 720 pages, MMORE demonstrates near-linear scaling in distributed mode, achieving a 3.8x speedup over single-node execution and a 45% reduction in processing time compared to Docling in default mode. The fast mode, which omits OCR, yields a 155% efficiency gain.
Extraction accuracy is evaluated on Project Gutenberg books, comparing BLEU, ROUGE-L, and character error rate (CER). On clean, digital PDFs, all systems perform comparably (CER < 2.5%). On scanned, image-based PDFs, MMORE achieves a CER of 2.95%, substantially outperforming Docling (CER 55.18%), indicating robust OCR and layout handling. The fast mode, as expected, fails on OCR-dependent documents.
Downstream QA Performance
On the PubMedQA biomedical QA benchmark, MMORE's RAG pipeline is evaluated using Meditron3-8B and Meditron3-70B models. Retrieval-augmented inference consistently improves accuracy as the number of retrieved documents (k) increases. For Meditron3-70B, accuracy rises from 80.2% (no retrieval) to 82.0% (k=3), demonstrating effective domain-specific context injection. These results validate the pipeline's utility for high-stakes, domain-specific QA tasks.
Implementation Considerations
- Resource Requirements: Distributed processing is optimized for multi-GPU and multi-node environments, but the pipeline can be configured for single-node or CPU-only execution. Batch size and resource allocation are user-configurable to match the available hardware.
- Extensibility: Adding new file types requires implementing a processor subclass that outputs the standardized MultimodalSample format. The modular design supports community contributions and long-term maintainability.
- Deployment: MMORE is suitable for both research and production settings, supporting deployment on local machines, on-premise clusters, or cloud-based Kubernetes environments. The open-source codebase facilitates reproducibility and integration with existing RAG and LLM frameworks.
- Limitations: While MMORE demonstrates strong performance on supported modalities, further benchmarking on a broader set of real-world documents is necessary to fully validate generalization. Multilingual and privacy-preserving processing are identified as future directions.
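The extensibility point described above might look like the following; the `Processor` interface, its method names, and the dict-shaped sample are hypothetical approximations of MMORE's actual abstractions:

```python
# Sketch of adding a new file type via lightweight subclassing: a
# processor declares the extensions it accepts and emits a
# MultimodalSample-style dict per input file.
from abc import ABC, abstractmethod
from pathlib import Path

class Processor(ABC):
    extensions: tuple = ()  # file suffixes this processor accepts

    def accepts(self, path: str) -> bool:
        return Path(path).suffix.lower() in self.extensions

    @abstractmethod
    def process(self, path: str) -> dict:
        """Return a MultimodalSample-style dict for one input file."""

class MarkdownProcessor(Processor):
    extensions = (".md", ".markdown")

    def process(self, path: str) -> dict:
        text = Path(path).read_text(encoding="utf-8")
        return {"text": text, "modalities": [], "metadata": {"source": path}}

proc = MarkdownProcessor()
assert proc.accepts("notes.md") and not proc.accepts("talk.mp3")
```

A dispatcher can then route each input file to the first registered processor whose `accepts` returns True, so contributing a new format touches only the new subclass.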
Implications and Future Directions
MMORE provides a robust, extensible foundation for deploying task-agnostic RAG systems over diverse, real-world multimodal data. Its hybrid retrieval strategy and modular architecture position it as a practical solution for both enterprise and research applications requiring verifiable, context-grounded LLM outputs. The demonstrated improvements in both processing throughput and extraction fidelity, particularly for OCR-heavy documents, address key limitations of prior open-source pipelines.
Future work will focus on expanding modality coverage (e.g., improved audiovisual alignment), supporting multilingual retrieval, and enabling federated or privacy-sensitive deployments. As LLMs are increasingly deployed in high-stakes domains, pipelines like MMORE will be essential for ensuring the verifiability, transparency, and scalability of retrieval-augmented systems.
Conclusion
MMORE advances the state of the art in multimodal document ingestion and retrieval-augmented generation by providing a scalable, open-source pipeline with broad modality support, high throughput, and strong extraction fidelity. Its modular, distributed design enables deployment across a range of environments and use cases, supporting both interactive and batch RAG workflows. The empirical results demonstrate clear advantages over existing open-source alternatives, particularly in handling complex, OCR-dependent documents and improving downstream QA accuracy. MMORE establishes a flexible platform for future research and deployment of verifiable, multimodal LLM applications.