
FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation (2506.12494v2)

Published 14 Jun 2025 in cs.CL and cs.IR

Abstract: Retrieval-Augmented Generation (RAG) plays a pivotal role in modern LLM applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce \textbf{FlexRAG}, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at \href{https://github.com/ictnlp/FlexRAG}{https://github.com/ictnlp/FlexRAG}.

Summary

  • The paper introduces FlexRAG, a novel framework that improves reproducibility and sharing in retrieval-augmented generation systems.
  • It offers a unified configuration system, comprehensive tooling, and support for text, multimodal, and web-based retrieval.
  • Empirical studies show FlexRAG’s superior efficiency and resource savings, enabling seamless component-level modifications for advanced research.

FlexRAG is presented as an open-source framework designed to address persistent challenges in existing Retrieval-Augmented Generation (RAG) frameworks, including difficulties in algorithm reproduction and sharing, lack of support for new techniques, and high system overhead (2506.12494). The framework aims to facilitate research and prototyping by providing a flexible and comprehensive environment for building, evaluating, and deploying RAG systems. FlexRAG supports text-based, multimodal, and network-based RAG, offering end-to-end lifecycle support with efficient asynchronous processing and persistent caching.

The framework distinguishes itself with four key features:

  • Research-Oriented Design: FlexRAG provides a unified configuration system and standardized evaluation processes to ensure fair comparisons and reproducibility. Integration with Hugging Face Hub allows researchers to share retrievers, fostering collaboration.
  • Extensive Infrastructure and Tooling: It includes comprehensive documentation, pre-built retrievers on Hugging Face Hub, and a command-line toolkit for data preprocessing, retriever construction, evaluation, and GUI development.
  • Comprehensive Technical Support: FlexRAG supports diverse RAG scenarios, including text, multimodal, and web-based retrieval, along with end-to-end pipeline support from document parsing to chunking.
  • Superior Performance: The modular design, asynchronous functions, and persistent caching enhance throughput and retrieval efficiency. Advanced indexing techniques such as memory mapping and optimized IVFPQ parameters significantly reduce CPU and memory consumption, with large-scale retrieval tasks using as little as one-tenth the resources of comparable frameworks.
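The asynchronous processing and persistent caching credited above for the throughput gains can be sketched in plain Python. This is a minimal illustration, not FlexRAG's actual API: the `retrieve` function and the cache layout are hypothetical, and the in-memory dict stands in for a persistent store (e.g., sqlite or shelve).

```python
import asyncio
import hashlib

CACHE = {}  # stand-in for a persistent cache backed by disk

async def retrieve(query: str) -> list[str]:
    """Hypothetical retrieval call; results are cached so later
    identical queries skip the expensive lookup entirely."""
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in CACHE:
        return CACHE[key]
    await asyncio.sleep(0.01)  # stands in for network / index latency
    results = [f"doc about {query}"]
    CACHE[key] = results
    return results

async def batch_retrieve(queries: list[str]) -> list[list[str]]:
    # asynchronous processing: all queries are issued concurrently
    # rather than one after another
    return await asyncio.gather(*(retrieve(q) for q in queries))

results = asyncio.run(batch_retrieve(["rag", "faiss", "rag"]))
```

With concurrent dispatch, total wall-clock time for a batch approaches the latency of the single slowest query instead of the sum over all queries.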

The architecture of FlexRAG is composed of twelve core modules categorized into four functional groups: models, retrievers, system development, and evaluation, along with auxiliary tools.

Models: These are fundamental components used throughout the RAG pipeline.

  • Encoders: Convert queries or documents into dense vectors for similarity search. FlexRAG supports text encoders (e.g., BERT (1810.04805), Contriever (2112.09118), DPR (2004.04906), Dragon (2302.07452), E5 (2212.03533)) and multimodal encoders (e.g., CLIP (2103.00020)), and allows integration with commercial APIs or deployment frameworks like Ollama and Sentence-BERT.
  • Rerankers: Reorder the initial list of retrieved documents to improve relevance and filter noise before passing to the generator. Supported types include cross-encoder (e.g., BGE M3 (2402.03216)), late-interaction (e.g., ColBERT (2004.12832), ColBERTv2 (2112.01488), Jina-ColBERT (2408.16672)), T5-style (e.g., RankT5 (2003.06713)), and GPT-style (e.g., RankLLM (2304.09542)) rerankers. API-based rerankers are also supported.
  • Generators: Synthesize natural language responses using retrieved documents and user queries. FlexRAG includes support for traditional LLMs (e.g., Qwen2 (2407.10671), Llama 3 (2407.21783)) and Vision LLMs (VLMs) (e.g., Qwen2-VL (2409.12191), PaliGemma 2 (2412.03555)), with options for commercial APIs or fast inference engines like vLLM and Ollama.
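The encoder's role in the pipeline can be illustrated with a toy dense-retrieval step: documents and the query are mapped to vectors, and relevance is scored by cosine similarity. The bag-of-words `embed` below is a deterministic stand-in for a neural encoder such as Contriever or E5; everything here is illustrative, not FlexRAG's interface.

```python
import math

def build_vocab(texts: list[str]) -> dict[str, int]:
    """Map each token to a vector dimension (stand-in for a learned encoder)."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Toy bag-of-words embedding; a real encoder (BERT, E5, CLIP, ...)
    would produce a dense semantic vector instead."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "paris is the capital of france",
    "the eiffel tower stands in paris",
    "bananas are yellow",
]
vocab = build_vocab(docs)
qv = embed("capital of france", vocab)
ranked = sorted(docs, key=lambda d: cosine(embed(d, vocab), qv), reverse=True)
```

A reranker would then re-score the top of `ranked` with a more expensive model (cross-encoder, late-interaction) before the generator consumes the context.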

Retrievers: Essential for finding relevant information.

  • Web Retrievers: Fetch information directly from the internet using components like Web Seeker (search interface or crawler), Web Downloader, and Web Reader (extracting content from HTML). Built-in examples include SimpleWebRetriever and WikipediaRetriever. This is particularly useful for accessing up-to-date information.
  • FlexRetriever: An in-house retriever supporting MultiField and MultiIndex paradigms, allowing flexible index creation and hybrid retrieval strategies on local knowledge bases. It supports sparse (e.g., BM25S (2407.03618)) and dense (e.g., Faiss (2401.08281), ScaNN (1908.10396)) retrieval. Its efficiency is enhanced by memory mapping and optimized IVFPQ indexing based on empirical formulas (1807.05614), contributing to lower memory overhead. Integration with Hugging Face Hub facilitates sharing.
  • API-Based Retriever: Allows integrating with external retrieval systems via APIs, such as Typesense and ElasticSearch.
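The IVFPQ tuning mentioned above can be illustrated with a common rule of thumb for Faiss-style inverted-file indexes: pick the number of clusters `nlist` on the order of `4 * sqrt(N)` and probe only a small fraction of them at query time, while product quantization compresses each float32 vector into a few bytes. The constants below are conventional heuristics for illustration, not FlexRAG's exact empirical formulas.

```python
import math

def ivfpq_params(n_vectors: int, dim: int, code_bytes: int = 16) -> dict:
    """Rough IVFPQ sizing: nlist ~ 4*sqrt(N) (a common Faiss guideline),
    nprobe as a small fraction of nlist, and a compression estimate
    against a raw float32 index."""
    nlist = max(1, int(4 * math.sqrt(n_vectors)))
    nprobe = max(1, nlist // 32)          # probe ~3% of the clusters
    index_bytes = n_vectors * code_bytes  # PQ codes only; centroids omitted
    raw_bytes = n_vectors * dim * 4       # float32 baseline
    return {"nlist": nlist, "nprobe": nprobe,
            "compression_ratio": raw_bytes / index_bytes}

# hypothetical Wikipedia-scale corpus of 21M passages with 768-d embeddings
params = ivfpq_params(n_vectors=21_000_000, dim=768)
```

At 16 bytes per code versus 3 KB per raw vector, the compressed index is roughly 200x smaller, which is where much of the memory saving in large-scale retrieval comes from.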

System Development: Modules for building the complete RAG pipeline.

  • Preprocessors: Handle data preparation and structuring of the knowledge base. This includes Document Parser (for various formats like PDF, DOCX, HTML), Chunker (segmenting content), and Knowledge Preprocessor (cleaning and optimizing content).
  • Refiners: Enhance retrieved contexts. Modules include Prompt Squeezer (optimizing prompts, e.g., using LLMLingua [2023.emnlp-main.825], LongLLMLingua [2024.acl-long.91], LLMLingua-2 [2024.findings-acl.57]), Context Repacker (reorganizing context), and Context Summarizer (condensing context, e.g., using RECOMP (2310.04408), SuRe (2404.13081), Compress Context (2310.06201)). These address issues like context length and noise (as discussed in (2302.00093, 2405.02659)).
  • Assistants: Encapsulate the entire RAG pipeline, providing a standardized interface for user interaction and evaluation. FlexRAG provides built-in assistants like ModularAssistant and OnelineAssistant.
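The Chunker's job can be sketched as a fixed-size splitter with overlap, so that context straddling a boundary appears in both neighbouring chunks. Real chunkers are typically sentence- or token-aware; the word-based version and parameters below are purely illustrative.

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of `size` words, carrying
    `overlap` words of context over between neighbours."""
    assert 0 <= overlap < size
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
pieces = chunk(doc, size=200, overlap=50)
```

Here a 500-word document yields three chunks, with the last 50 words of each chunk repeated at the start of the next.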

Evaluation: Tools and resources for assessing RAG system performance.

  • Tasks: Support evaluation on various benchmarks covering multi-turn dialogue, single-turn question answering, specialized tasks, and retrieval tasks (e.g., KILT (2009.02252), FlashRAG (2405.13576), MTRAG (2501.03468), MTEB (2210.07316), RAGBenchSurvey [2024.ccfbd.102]). Pre-configured retrievers for the Wikipedia knowledge base are provided on Hugging Face Hub.
  • Metrics: Support retrieval metrics (e.g., Success Rate) and generation metrics (e.g., F1, Exact Match using pytrec_eval, sacreBLEU, Rouge). LLM-as-a-Judge metrics are also supported.
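The token-level F1 and Exact Match metrics used for QA evaluation can be written compactly. This follows the standard SQuAD-style definition (lowercasing, whitespace tokenization), which may differ in normalization details from FlexRAG's implementation.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized prediction equals the normalized answer."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("Paris", "paris")
score = f1("the capital is Paris", "Paris")
```

For the example above, the single gold token is recovered (recall 1.0) out of four predicted tokens (precision 0.25), giving F1 = 0.4.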

An empirical study demonstrates the modularity and flexibility of FlexRAG's ModularAssistant. Experiments on Natural Questions (NQ) [2019.tacl.10], TriviaQA [2017.P17-1147], and PopQA (2212.10511) show that modifying individual components (retriever, indexer, re-ranker, generator) significantly impacts overall RAG performance, highlighting the framework's utility for component-level research and comparison.

Resource overhead analysis on the MS MARCO Passage Retrieval task (1611.09268) comparing FlexRAG and FlashRAG (2405.13576) shows FlexRAG's superior efficiency. It exhibits significantly lower average wall-clock time, total CPU time, average memory usage, and total memory usage, often by several times and up to an order of magnitude, benefiting in particular from the memory-mapping mechanism and optimized dense index parameters.
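The memory-mapping mechanism behind these savings can be illustrated with the stdlib `mmap` module: vectors live in a flat binary file and individual rows are decoded on demand, so the OS pages in only the bytes actually touched instead of loading the whole index into RAM. The file layout below is illustrative, not FlexRAG's on-disk format.

```python
import mmap
import os
import struct
import tempfile

DIM = 4
VEC = struct.Struct(f"<{DIM}f")  # one little-endian float32 row

# Write a small flat index file: row i holds the vector [i, i, i, i].
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(VEC.pack(*([float(i)] * DIM)))

def read_row(mm: mmap.mmap, row: int) -> tuple:
    """Fetch one vector by offset without reading the rest of the file."""
    off = row * VEC.size
    return VEC.unpack(mm[off:off + VEC.size])

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    v = read_row(mm, 42)
    mm.close()
```

Resident memory stays roughly constant regardless of index size, which matches the order-of-magnitude memory reductions reported for large-scale retrieval.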

In comparison with existing frameworks like LangChain, LlamaIndex, FlashRAG (2405.13576), RAGLab (2408.11381), AutoRAG (2410.20878), AutoRAG-HP (2406.19251), RaLLe (2308.10633), LocalRQA (2403.00982), and EasyRAG (2410.10315), FlexRAG positions itself as a research-oriented, comprehensive, and efficient solution. While frameworks like LangChain and LlamaIndex are feature-rich, FlexRAG focuses on reproducibility and sharing. It offers broader technical support (including web access) and integrated preprocessing compared to some lighter frameworks, while its modularity allows for efficient customization. The performance analysis confirms its efficiency advantages.

FlexRAG is available as an open-source project on GitHub, providing a practical toolkit for researchers and developers to build, evaluate, and deploy advanced RAG systems effectively.
