- The paper introduces MMORE, a modular and scalable RAG pipeline that unifies extraction from over fifteen file types with robust OCR and layout analysis.
- The paper demonstrates near-linear scaling and up to 155% efficiency gains through distributed processing, validating its high throughput on complex documents.
- The paper shows improved downstream QA performance, with effective retrieval-augmented inference boosting biomedical task accuracy and outperforming existing pipelines.
MMORE: A Scalable Pipeline for Massive Multimodal Open RAG and Extraction
Motivation and Context
The MMORE pipeline addresses a critical bottleneck in the deployment of retrieval-augmented generation (RAG) systems: the ingestion, transformation, and retrieval of knowledge from heterogeneous, real-world document formats at scale. The proliferation of unstructured and multimodal data—spanning PDFs, spreadsheets, presentations, images, audio, and video—has outpaced the capabilities of existing document processing and RAG frameworks, which are typically constrained in modality coverage, limited in throughput, or restricted by closed-source licensing. As the supply of high-quality, human-generated text data approaches exhaustion under current LLM scaling trends, robust, format-agnostic preprocessing and retrieval workflows become essential for both model verifiability and continued performance improvements.
System Architecture
MMORE is designed as a modular, distributed pipeline that unifies extraction, transformation, embedding, and retrieval for over fifteen file types. The architecture is built for extensibility and high-throughput parallelization, leveraging Dask for distributed execution across CPUs and GPUs, and supporting deployment from single workstations to large Kubernetes clusters.
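The map-gather pattern behind this parallelization can be sketched with the standard library as a dependency-free stand-in for Dask (MMORE itself uses Dask, whose `Client.map`/`gather` follow the same pattern and scale the identical code across nodes); the `extract_file` function and file names below are illustrative assumptions, not MMORE's actual API:

```python
# Sketch of per-file parallel extraction using the standard library as a
# stand-in for Dask's distributed map-gather execution model.
from concurrent.futures import ThreadPoolExecutor

def extract_file(path: str) -> dict:
    # Hypothetical stand-in for format-specific extraction
    # (OCR for PDFs, Whisper transcription for audio, ...).
    return {"path": path, "text": f"extracted text from {path}"}

files = ["report.pdf", "slides.pptx", "talk.mp3"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # One task per file; results come back in input order.
    results = list(pool.map(extract_file, files))
```

With Dask, the pool would be replaced by a `distributed.Client` pointed at a local or Kubernetes-backed cluster, leaving the map-gather structure unchanged.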
The core processing module standardizes heterogeneous content into a unified JSON-based format, the MultimodalSample, which interleaves plain text with modality placeholders and maintains a registry of extracted modalities (e.g., images, audio snippets). This design enables downstream tasks—such as multimodal pre-training or RAG—to operate on tightly linked text and non-text elements. Extraction leverages open-source tools: Surya for PDF/OCR, Whisper for audio, and standard Python libraries for office formats. The processor interface is abstracted to facilitate rapid addition of new file types via lightweight subclassing.
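The MultimodalSample abstraction can be pictured as follows; the field names and the `<attachment>` placeholder token are illustrative assumptions, not MMORE's exact schema:

```python
# Illustrative sketch of a MultimodalSample-style record: plain text is
# stored with inline placeholders, and a parallel list registers the
# extracted modalities in order of appearance.
from dataclasses import dataclass, field

@dataclass
class Modality:
    type: str   # e.g. "image", "audio"
    value: str  # path or URI of the extracted asset

@dataclass
class MultimodalSample:
    text: str                                   # text with <attachment> markers
    modalities: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

sample = MultimodalSample(
    text="Quarterly results are shown in <attachment> below.",
    modalities=[Modality(type="image", value="figures/q3_chart.png")],
    metadata={"source": "report.pdf", "page": 4},
)
# Each placeholder corresponds to exactly one registered modality.
assert sample.text.count("<attachment>") == len(sample.modalities)
```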
Distributed and Extensible Processing
MMORE natively supports both intra-node and inter-node parallelism, automatically exploiting all available hardware resources. Two processing modes are provided: a default mode prioritizing accuracy (including OCR and layout analysis), and a fast mode optimized for speed (e.g., omitting OCR). This allows users to balance fidelity and throughput according to application requirements. The pipeline is designed for maintainability and community-driven extension, with each file type handled by a modular processor.
RAG Pipeline
The RAG subsystem is composed of three decoupled components:
- Post-processing: Applies high-throughput filtering (e.g., via datatrove) and supports named-entity recognition (NER), chunking, and tagging.
- Indexing and Retrieval: Implements a hybrid dense-sparse indexing strategy, storing both lexical (sparse) and semantic (dense) embeddings for each document. This duality enables both interpretable keyword search and neural similarity retrieval.
- Integrated RAG Service: Exposes both interactive API and batch endpoints, with configurable model, prompt, and index parameters. The system is agnostic to downstream LLMs and can be integrated with external frameworks.
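The hybrid dense-sparse strategy above can be sketched as a weighted fusion of a lexical score and an embedding similarity. The toy scorers below (term overlap for sparse, a hashed bag-of-words cosine for dense) are deliberately simplified stand-ins for BM25 and a neural embedder, not MMORE's actual index:

```python
# Toy sketch of hybrid retrieval: fuse a sparse lexical score with a
# dense cosine similarity using a mixing weight alpha.
import math
import zlib
from collections import Counter

def sparse_score(query: str, doc: str) -> float:
    # Lexical overlap: fraction of query terms present in the document.
    q, d = query.lower().split(), set(doc.lower().split())
    return sum(t in d for t in q) / len(q)

def embed(text: str, dim: int = 64) -> list:
    # Deterministic hashed bag-of-words vector, standing in for a
    # neural embedding.
    v = [0.0] * dim
    for tok, n in Counter(text.lower().split()).items():
        v[zlib.crc32(tok.encode()) % dim] += n
    return v

def dense_score(query: str, doc: str) -> float:
    q, d = embed(query), embed(doc)
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def hybrid_rank(query: str, docs: list, alpha: float = 0.5) -> list:
    # alpha=1.0 is pure keyword search; alpha=0.0 is pure dense retrieval.
    scored = [(alpha * sparse_score(query, d) + (1 - alpha) * dense_score(query, d), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

docs = ["aspirin reduces fever",
        "transformers process tokens",
        "fever treated with aspirin"]
ranked = hybrid_rank("aspirin for fever", docs)
```

Storing both representations per document, as MMORE does, lets the same index serve interpretable keyword queries and semantic similarity queries without re-embedding.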
Empirical Evaluation
Processing Efficiency and Accuracy
MMORE is benchmarked against Docling, a popular open-source ingestion pipeline, on both efficiency and extraction fidelity. In synthetic scaling experiments on documents of up to 720 pages, MMORE demonstrates near-linear scaling in distributed mode, achieving a 3.8x speedup over single-node execution and a 45% reduction in processing time compared to Docling in default mode. The fast mode, which omits OCR, yields a 155% efficiency gain.
Extraction accuracy is evaluated on Project Gutenberg books, comparing BLEU, ROUGE-L, and character error rate (CER). On clean, digital PDFs, all systems perform comparably (CER < 2.5%). On scanned, image-based PDFs, MMORE achieves a CER of 2.95%, substantially outperforming Docling (CER 55.18%), indicating robust OCR and layout handling. The fast mode, as expected, fails on OCR-dependent documents.
Downstream QA Performance
On the PubMedQA biomedical QA benchmark, MMORE's RAG pipeline is evaluated using Meditron3-8B and Meditron3-70B models. Retrieval-augmented inference consistently improves accuracy as the number of retrieved documents (k) increases. For Meditron3-70B, accuracy rises from 80.2% (no retrieval) to 82.0% (k=3), demonstrating effective domain-specific context injection. These results validate the pipeline's utility for high-stakes, domain-specific QA tasks.
Implementation Considerations
- Resource Requirements: Distributed processing is optimized for multi-GPU and multi-node environments, but the pipeline can be configured for single-node or CPU-only execution. Batch size and resource allocation are user-configurable to match the available hardware.
- Extensibility: Adding new file types requires implementing a processor subclass that outputs the standardized MultimodalSample format. The modular design supports community contributions and long-term maintainability.
- Deployment: MMORE is suitable for both research and production settings, supporting deployment on local machines, on-premise clusters, or cloud-based Kubernetes environments. The open-source codebase facilitates reproducibility and integration with existing RAG and LLM frameworks.
- Limitations: While MMORE demonstrates strong performance on supported modalities, further benchmarking on a broader set of real-world documents is necessary to fully validate generalization. Multilingual and privacy-preserving processing are identified as future directions.
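The extensibility point described above might look like the following; the `Processor` interface, its method names, and the dict-shaped sample are hypothetical approximations of MMORE's actual abstractions:

```python
# Sketch of adding a new file type via lightweight subclassing: a
# processor declares the extensions it accepts and emits a
# MultimodalSample-style dict per input file.
from abc import ABC, abstractmethod
from pathlib import Path

class Processor(ABC):
    extensions: tuple = ()  # file suffixes this processor accepts

    def accepts(self, path: str) -> bool:
        return Path(path).suffix.lower() in self.extensions

    @abstractmethod
    def process(self, path: str) -> dict:
        """Return a MultimodalSample-style dict for one input file."""

class MarkdownProcessor(Processor):
    extensions = (".md", ".markdown")

    def process(self, path: str) -> dict:
        text = Path(path).read_text(encoding="utf-8")
        return {"text": text, "modalities": [], "metadata": {"source": path}}

proc = MarkdownProcessor()
assert proc.accepts("notes.md") and not proc.accepts("talk.mp3")
```

A dispatcher can then route each input file to the first registered processor whose `accepts` returns True, so contributing a new format touches only the new subclass.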
Implications and Future Directions
MMORE provides a robust, extensible foundation for deploying task-agnostic RAG systems over diverse, real-world multimodal data. Its hybrid retrieval strategy and modular architecture position it as a practical solution for both enterprise and research applications requiring verifiable, context-grounded LLM outputs. The demonstrated improvements in both processing throughput and extraction fidelity, particularly for OCR-heavy documents, address key limitations of prior open-source pipelines.
Future work will focus on expanding modality coverage (e.g., improved audiovisual alignment), supporting multilingual retrieval, and enabling federated or privacy-sensitive deployments. As LLMs are increasingly deployed in high-stakes domains, pipelines like MMORE will be essential for ensuring the verifiability, transparency, and scalability of retrieval-augmented systems.
Conclusion
MMORE advances the state of the art in multimodal document ingestion and retrieval-augmented generation by providing a scalable, open-source pipeline with broad modality support, high throughput, and strong extraction fidelity. Its modular, distributed design enables deployment across a range of environments and use cases, supporting both interactive and batch RAG workflows. The empirical results demonstrate clear advantages over existing open-source alternatives, particularly in handling complex, OCR-dependent documents and improving downstream QA accuracy. MMORE establishes a flexible platform for future research and deployment of verifiable, multimodal LLM applications.