- The paper introduces FlexRAG, a novel framework that improves reproducibility and sharing in retrieval-augmented generation systems.
- It offers a unified configuration system, comprehensive tooling, and support for text, multimodal, and web-based retrieval.
- Empirical studies show FlexRAG’s superior efficiency and resource savings, while its modular design enables seamless component-level modifications for advanced research.
FlexRAG is presented as an open-source framework designed to address persistent challenges in existing Retrieval-Augmented Generation (RAG) frameworks, including difficulties in algorithm reproduction and sharing, lack of support for new techniques, and high system overhead (arXiv:2506.12494). The framework aims to facilitate research and prototyping by providing a flexible and comprehensive environment for building, evaluating, and deploying RAG systems. FlexRAG supports text-based, multimodal, and web-based RAG, offering end-to-end lifecycle support with efficient asynchronous processing and persistent caching.
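To make the asynchronous-processing and persistent-caching ideas concrete, here is a minimal, self-contained sketch of an async retrieval call whose results are cached on disk. All names (`persistent_cache`, `retrieve`) are illustrative assumptions for this sketch, not FlexRAG's actual API.

```python
# Hypothetical sketch: persistent caching around an async retrieval call.
# The decorator and function names are illustrative, not FlexRAG's real API.
import asyncio
import hashlib
import json
import os
import shelve
import tempfile

CACHE_PATH = os.path.join(tempfile.gettempdir(), "flexrag_demo_cache")

def persistent_cache(path: str):
    """Cache async function results on disk, keyed by a hash of the arguments."""
    def decorator(fn):
        async def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                json.dumps([args, kwargs], sort_keys=True, default=str).encode()
            ).hexdigest()
            with shelve.open(path) as db:
                if key in db:          # cache hit: skip the expensive call
                    return db[key]
            result = await fn(*args, **kwargs)
            with shelve.open(path) as db:
                db[key] = result       # persist for future runs
            return result
        return wrapper
    return decorator

@persistent_cache(CACHE_PATH)
async def retrieve(query: str, top_k: int = 3):
    # Stand-in for a real retriever; a deployment would query an index or API.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"doc-{i}-for-{query}" for i in range(top_k)]

results = asyncio.run(retrieve("what is RAG?"))
```

Because the cache is keyed on the arguments and stored on disk, repeated evaluation runs over the same benchmark queries avoid redundant retrieval work across process restarts.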
The framework distinguishes itself with four key features:
- Research-Oriented Design: FlexRAG provides a unified configuration system and standardized evaluation processes to ensure fair comparisons and reproducibility. Integration with Hugging Face Hub allows researchers to share retrievers, fostering collaboration.
- Extensive Infrastructure and Tooling: It includes comprehensive documentation, pre-built retrievers on Hugging Face Hub, and a command-line toolkit for data preprocessing, retriever construction, evaluation, and GUI development.
- Comprehensive Technical Support: FlexRAG supports diverse RAG scenarios, including text, multimodal, and web-based retrieval, along with end-to-end pipeline support from document parsing to chunking.
- Superior Performance: The modular design, asynchronous functions, and persistent caching enhance throughput and retrieval efficiency. Advanced indexing techniques such as memory mapping and optimized IVFPQ parameters significantly reduce CPU and memory consumption, bringing resource usage in large-scale retrieval tasks down to as little as one-tenth that of comparable frameworks.
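The paper mentions empirically tuned IVFPQ parameters but does not spell out its formulas; the sketch below shows one common heuristic (number of inverted lists scaling with the square root of the corpus size, as in standard Faiss guidance). The function and constants here are assumptions for illustration, not FlexRAG's actual tuning rule.

```python
# Illustrative heuristic (not FlexRAG's exact formula) for choosing IVFPQ
# parameters from corpus size, following common Faiss guidance.
import math

def ivfpq_params(num_vectors: int, dim: int, target_code_bytes: int = 64):
    nlist = max(1, round(math.sqrt(num_vectors)))  # coarse clusters ~ sqrt(N)
    m = min(target_code_bytes, dim)                # PQ sub-quantizers
    while dim % m != 0:                            # dim must divide evenly by m
        m -= 1
    nprobe = max(1, nlist // 32)                   # lists probed per query
    return {"nlist": nlist, "m": m, "nprobe": nprobe}

# Wikipedia-scale example: ~21M passages embedded at 768 dimensions.
params = ivfpq_params(num_vectors=21_000_000, dim=768)
```

Larger `nlist` shrinks each inverted list (faster scans per probe) at the cost of coarser quantization; `nprobe` then trades recall against latency at query time.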
The architecture of FlexRAG is composed of twelve core modules categorized into four functional groups: models, retrievers, system development, and evaluation, along with auxiliary tools.
Models: These are fundamental components used throughout the RAG pipeline.
- Encoders: Convert queries or documents into dense vectors for similarity search. FlexRAG supports text encoders (e.g., BERT (Devlin et al., 2018), Contriever (Izacard et al., 2021), DPR (Karpukhin et al., 2020), Dragon (Lin et al., 2023), E5 (Wang et al., 2022)) and multimodal encoders (e.g., CLIP (Radford et al., 2021)), and allows integration with commercial APIs or local backends such as Ollama and Sentence-Transformers.
- Rerankers: Reorder the initial list of retrieved documents to improve relevance and filter noise before passing to the generator. Supported types include cross-encoder (e.g., BGE M3 (Chen et al., 5 Feb 2024)), late-interaction (e.g., ColBERT (Khattab et al., 2020), ColBERTv2 (Santhanam et al., 2021), Jina-ColBERT (Jha et al., 29 Aug 2024)), T5-style (e.g., RankT5 (Nogueira et al., 2020)), and GPT-style (e.g., RankLLM (Sun et al., 2023)) rerankers. API-based rerankers are also supported.
- Generators: Synthesize natural language responses using retrieved documents and user queries. FlexRAG includes support for traditional LLMs (e.g., Qwen2 (Yang et al., 15 Jul 2024), Llama 3 (Grattafiori et al., 31 Jul 2024)) and Vision LLMs (VLMs) (e.g., Qwen2-VL (Wang et al., 18 Sep 2024), PaliGemma 2 (Steiner et al., 4 Dec 2024)), with options for commercial APIs or fast inference engines like vLLM and Ollama.
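The three model roles above (encoder, reranker, generator) can be captured as interchangeable interfaces, which is what makes component-level swapping possible. The sketch below uses Python `Protocol`s with illustrative names; this is a conceptual rendering, not FlexRAG's actual class hierarchy.

```python
# Conceptual sketch of the three model roles as swappable interfaces.
# Class and method names are assumptions, not FlexRAG's real API.
from typing import Protocol, Sequence

class Encoder(Protocol):
    def encode(self, texts: Sequence[str]) -> list[list[float]]: ...

class Reranker(Protocol):
    def rerank(self, query: str, docs: Sequence[str]) -> list[str]: ...

class Generator(Protocol):
    def generate(self, query: str, contexts: Sequence[str]) -> str: ...

class LengthReranker:
    """Toy reranker preferring shorter documents (stand-in for a cross-encoder)."""
    def rerank(self, query: str, docs: Sequence[str]) -> list[str]:
        return sorted(docs, key=len)

ranked = LengthReranker().rerank("q", ["a long document here", "short"])
```

Any class satisfying the protocol can be dropped into the pipeline, so replacing, say, a cross-encoder reranker with a late-interaction one requires no changes elsewhere.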
Retrievers: Essential for finding relevant information.
- Web Retrievers: Fetch information directly from the internet using components like Web Seeker (search interface or crawler), Web Downloader, and Web Reader (extracting content from HTML). Built-in examples include SimpleWebRetriever and WikipediaRetriever. This is particularly useful for accessing up-to-date information.
- FlexRetriever: An in-house retriever supporting MultiField and MultiIndex paradigms, allowing flexible index creation and hybrid retrieval strategies on local knowledge bases. It supports sparse (e.g., BM25S (Lù, 4 Jul 2024)) and dense (e.g., Faiss (Douze et al., 16 Jan 2024), ScaNN (arXiv:1908.10396)) retrieval. Its efficiency is enhanced by memory mapping and optimized IVFPQ indexing based on empirical formulas (Aumüller et al., 2018), contributing to lower memory overhead. Integration with Hugging Face Hub facilitates sharing.
- API-Based Retriever: Allows integrating with external retrieval systems via APIs, such as Typesense and ElasticSearch.
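The hybrid retrieval mentioned for FlexRetriever combines sparse and dense rankings. One common fusion method, sketched below, is reciprocal rank fusion (RRF); FlexRAG's actual fusion strategy is not specified here, so treat this as an illustrative stand-in.

```python
# A hedged sketch of hybrid retrieval via reciprocal rank fusion (RRF),
# one common way to merge sparse (BM25) and dense rankings.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by sum of 1/(k + rank) over all input rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]   # e.g., BM25 order
dense = ["d3", "d1", "d4"]    # e.g., dense-index order
fused = rrf_fuse([sparse, dense])
```

RRF needs only ranks, not raw scores, so it sidesteps the problem that BM25 and cosine-similarity scores live on incomparable scales.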
System Development: Modules for building the complete RAG pipeline.
- Preprocessors: Handle data preparation and structuring of the knowledge base. This includes Document Parser (for various formats like PDF, DOCX, HTML), Chunker (segmenting content), and Knowledge Preprocessor (cleaning and optimizing content).
- Refiners: Enhance retrieved contexts. Modules include Prompt Squeezer (optimizing prompts, e.g., using LLMLingua (Jiang et al., 2023), LongLLMLingua (Jiang et al., 2024), LLMLingua-2 (Pan et al., 2024)), Context Repacker (reorganizing context), and Context Summarizer (condensing context, e.g., using RECOMP (Xu et al., 2023), SuRe (Kim et al., 17 Apr 2024), Compress Context (Li et al., 2023)). These address issues like context length and noise (as discussed in (Shi et al., 2023, Zhang et al., 4 May 2024)).
- Assistants: Encapsulate the entire RAG pipeline, providing a standardized interface for user interaction and evaluation. FlexRAG provides built-in assistants such as ModularAssistant and OnlineAssistant.
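The assistant abstraction above can be sketched as a small class that chains the pipeline stages (retrieve, refine, generate). The class and parameter names below are assumptions for illustration, not the real ModularAssistant API.

```python
# Illustrative sketch of an assistant chaining retrieve -> refine -> generate.
# Names are hypothetical, not FlexRAG's actual ModularAssistant interface.
class SketchAssistant:
    def __init__(self, retriever, refiner, generator):
        self.retriever = retriever
        self.refiner = refiner
        self.generator = generator

    def answer(self, query: str) -> str:
        contexts = self.retriever(query)        # fetch candidate passages
        contexts = self.refiner(contexts)       # e.g., repack / summarize
        return self.generator(query, contexts)  # synthesize the answer

assistant = SketchAssistant(
    retriever=lambda q: [f"passage about {q}", "unrelated passage"],
    refiner=lambda ctxs: ctxs[:1],              # keep the top passage only
    generator=lambda q, ctxs: f"Answer to '{q}' using: {ctxs[0]}",
)
reply = assistant.answer("FlexRAG")
```

Because each stage is just a callable, swapping a component (say, a different refiner) changes one constructor argument and nothing else, which is the property the paper's component-level experiments rely on.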
Evaluation: Tools and resources for assessing RAG system performance.
- Tasks: Support evaluation on various benchmarks covering multi-turn dialogue, single-turn question answering, specialized tasks, and retrieval tasks (e.g., KILT (Petroni et al., 2020), FlashRAG (Jin et al., 22 May 2024), MTRAG (Katsis et al., 7 Jan 2025), MTEB (Muennighoff et al., 2022), RAGBenchSurvey [2024.ccfbd.102]). Pre-configured retrievers for the Wikipedia knowledge base are provided on Hugging Face Hub.
- Metrics: Support retrieval metrics (e.g., Success Rate) and generation metrics (e.g., F1, Exact Match, sacreBLEU, ROUGE), implemented with libraries such as pytrec_eval. LLM-as-a-Judge metrics are also supported.
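For concreteness, here is a compact, self-contained implementation of the two generation metrics named above, Exact Match and token-level F1, using the standard SQuAD-style answer normalization (lowercasing, stripping punctuation and articles). This is the conventional definition, not code taken from FlexRAG.

```python
# Standard QA evaluation metrics: Exact Match and token-level F1,
# with SQuAD-style answer normalization.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())  # overlapping tokens
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

em_score = exact_match("The Eiffel Tower", "eiffel tower")   # 1.0
f1_score = f1("Eiffel Tower in Paris", "the Eiffel Tower")   # 2/3
```

Exact Match is all-or-nothing after normalization, while F1 gives partial credit for token overlap, which is why both are usually reported together.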
An empirical study demonstrates the modularity and flexibility of FlexRAG's ModularAssistant. Experiments on Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and PopQA (Mallen et al., 2022) show that modifying individual components (retriever, indexer, reranker, generator) significantly impacts overall RAG performance, highlighting the framework's utility for component-level research and comparison.
Resource overhead analysis on the MS MARCO Passage Retrieval task (Bajaj et al., 2016) comparing FlexRAG and FlashRAG (Jin et al., 22 May 2024) shows FlexRAG's superior efficiency. It exhibits significantly lower average wall-clock time, total CPU time, average memory usage, and total memory usage, often by several times and up to an order of magnitude, particularly benefiting from the memory-mapping mechanism and optimized dense index parameters.
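The memory-mapping mechanism credited above can be illustrated with a minimal stdlib-only sketch: embedding vectors are stored in a flat binary file and individual records are read through `mmap`, so the OS pages data in on demand instead of loading the whole matrix into RAM. This shows only the general mechanism, not FlexRAG's implementation.

```python
# Illustrative sketch of memory-mapped embedding storage: vectors are read
# on demand rather than loaded wholesale (general mechanism, not FlexRAG code).
import mmap
import struct
import tempfile

DIM = 4  # tiny dimension for demonstration
vectors = [[float(i * DIM + j) for j in range(DIM)] for i in range(3)]

# Write vectors as contiguous float64 records in a flat binary file.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    for vec in vectors:
        f.write(struct.pack(f"{DIM}d", *vec))
    path = f.name

def read_vector(mm: mmap.mmap, idx: int) -> list[float]:
    """Read one vector by offset; only the touched pages are faulted in."""
    offset = idx * DIM * 8  # 8 bytes per float64
    return list(struct.unpack(f"{DIM}d", mm[offset:offset + DIM * 8]))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    v1 = read_vector(mm, 1)
    mm.close()
```

Since only the pages actually touched by a query are resident, peak memory tracks the working set rather than the full index size, which is consistent with the memory savings the comparison reports.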
In comparison with existing frameworks like LangChain, LlamaIndex, FlashRAG (Jin et al., 22 May 2024), RAGLab (Zhang et al., 21 Aug 2024), AutoRAG (Kim et al., 28 Oct 2024), AutoRAG-HP (Fu et al., 27 Jun 2024), RaLLe (Hoshi et al., 2023), LocalRQA (Yu et al., 1 Mar 2024), and EasyRAG (Feng et al., 14 Oct 2024), FlexRAG positions itself as a research-oriented, comprehensive, and efficient solution. While frameworks like LangChain and LlamaIndex are feature-rich, FlexRAG focuses on reproducibility and sharing. It offers broader technical support (including web access) and integrated preprocessing compared to some lighter frameworks, while its modularity allows for efficient customization. The performance analysis confirms its efficiency advantages.
FlexRAG is available as an open-source project on GitHub, providing a practical toolkit for researchers and developers to build, evaluate, and deploy advanced RAG systems effectively.