- The paper introduces RAGRoute, a novel mechanism using a lightweight neural network to dynamically select relevant data sources for efficient federated retrieval-augmented generation (RAG).
- RAGRoute significantly reduces the number of queries sent to data sources (e.g., 77.5% reduction on MMLU) and communication volume while maintaining high retrieval recall and negligible impact on end-to-end RAG accuracy.
- This work demonstrates the inefficiency of querying all data sources in federated RAG setups and underscores the importance of query-aware routing strategies for scaling RAG across distributed environments.
The paper introduces RAGRoute, a novel mechanism designed to enhance the efficiency of federated retrieval-augmented generation (RAG) search across multiple data sources. Traditional RAG workflows often rely on a single vector database, which becomes impractical when information is distributed across various repositories. RAGRoute addresses this limitation by dynamically selecting relevant data sources at query time, leveraging a lightweight neural network classifier to minimize query overhead and improve retrieval efficiency.
The authors highlight the importance of addressing the limitations of LLMs, such as their tendency to hallucinate and to generate inconsistent responses. RAG is presented as a method of grounding model responses in external knowledge sources by retrieving relevant documents and incorporating them into the LLM prompt. The paper argues that federated RAG search, which queries multiple data sources, offers advantages over using a single database, including eliminating the need for data migration and enabling seamless extension of existing databases.
The core problem addressed in the paper is the resource selection mechanism in federated RAG search. Indiscriminately querying all available data sources can lead to increased chances of LLM hallucination and significant overhead in terms of communication volume and computational cost. RAGRoute addresses this problem by learning a routing policy tailored to the structure of the data sources and the nature of the queries.
The design of RAGRoute involves a workflow that includes converting a user query into an embedding, forwarding it to a router to determine relevant data sources, querying the selected data sources, refining the results, retrieving associated data chunks, and feeding the query and data chunks to the LLM for response generation. The query router is implemented as a shallow neural network with fully connected layers, inspired by practices in Mixture of Experts (MoE) models and federated ensembles.
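The workflow above can be sketched in a few lines of Python. This is a minimal illustration with stand-in components, not the paper's implementation: `embed` is a toy word-hashing embedder, `DataSource` is a stub vector store, and the trained router is represented by a caller-supplied `is_relevant` predicate.

```python
import numpy as np

def embed(query: str) -> np.ndarray:
    """Toy stand-in embedder: bucket words into a fixed-size vector.
    A real system would use a sentence-embedding model."""
    vec = np.zeros(8)
    for word in query.split():
        vec[sum(map(ord, word)) % 8] += 1.0
    return vec

class DataSource:
    """Stub for one federated data source holding embedded chunks."""
    def __init__(self, name, chunk_embs, chunks):
        self.name = name
        self.chunk_embs = chunk_embs  # shape: (n_chunks, dim)
        self.chunks = chunks

    def top_k(self, q, k=2):
        """Return the k chunks whose embeddings are closest to the query."""
        dists = np.linalg.norm(self.chunk_embs - q, axis=1)
        return [self.chunks[i] for i in np.argsort(dists)[:k]]

def route_and_retrieve(query, sources, is_relevant, k=2):
    """Embed the query, route it only to sources the router deems
    relevant, and merge the retrieved chunks. In the full pipeline,
    these chunks plus the query are fed to the LLM for generation."""
    q = embed(query)
    selected = [s for s in sources if is_relevant(q, s)]
    return [c for s in selected for c in s.top_k(q, k)]
```

The key saving is in `selected`: instead of fanning the query out to every source, only the predicted-relevant subset is queried.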
During the training phase, the RAGRoute router learns to make routing decisions by analyzing query-data source relevance. A set of training query embeddings is sent to all data sources to obtain relevant embeddings, and a binary relevance indicator is assigned to each query-data source pair. The model takes five features as input: the query embedding, the centroid of the queried data source, the distance between the query embedding and the centroid, the number of items in the queried data source, and the density of the queried data source. During the inference phase, the trained model efficiently routes incoming user queries to relevant data sources.
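The five-feature input described above can be assembled as follows. This is a sketch of the feature construction only; the paper does not spell out how density is computed, so mean distance to the centroid is assumed here as a stand-in.

```python
import numpy as np

def router_features(query_emb: np.ndarray, source_embs: np.ndarray) -> np.ndarray:
    """Build the router's input for one (query, data source) pair:
    query embedding, source centroid, query-centroid distance,
    item count, and density (assumed: mean distance to centroid)."""
    centroid = source_embs.mean(axis=0)
    distance = float(np.linalg.norm(query_emb - centroid))
    n_items = float(len(source_embs))
    density = float(np.linalg.norm(source_embs - centroid, axis=1).mean())
    return np.concatenate([query_emb, centroid, [distance, n_items, density]])
```

During training, each feature vector is paired with the binary relevance label for that query-source pair; the shallow classifier then learns to predict relevance from these features alone, without contacting the source.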
The authors evaluated the effectiveness and efficiency of RAGRoute using the MIRAGE and MMLU benchmarks. The MIRAGE benchmark, designed for medical question answering, includes 7663 questions and five corpora. The MMLU benchmark evaluates LLM systems across tasks ranging from elementary mathematics to legal reasoning, with experiments conducted on ten subject-specific subsets.
The paper presents several key results. First, RAGRoute consistently achieves high retrieval recall across all benchmarks. Second, the router demonstrates high classification performance in predicting the relevance of corpora for a given query, with accuracy ranging from 85.6% for MIRAGE (Top 32) to 90.1% for MMLU. Third, RAGRoute significantly reduces the number of queries sent to data sources compared to querying all of them: for MMLU, the reduction is 77.5%, decreasing the number of queries from 13890 to 3126. Fourth, RAGRoute decreases communication volume, with a 76.2% reduction for the MMLU benchmark. Finally, the routing inference time is minimal, with negligible impact on the end-to-end latency of queries.
Recall = RelevantChunksRetrieved / TotalRelevantChunks
Where:
- RelevantChunksRetrieved is the number of relevant data chunks retrieved.
- TotalRelevantChunks is the total number of relevant data chunks.
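The recall metric defined above is straightforward to compute; a minimal implementation over chunk identifiers might look like this (the ground-truth set here is taken to be the chunks retrieved when querying all sources, as in the paper's evaluation setup):

```python
def retrieval_recall(retrieved_ids, relevant_ids):
    """Recall = RelevantChunksRetrieved / TotalRelevantChunks.
    `relevant_ids` is the ground-truth set of relevant chunks;
    `retrieved_ids` is what the routed retrieval returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # convention: nothing relevant to miss
    hits = len(relevant & set(retrieved_ids))
    return hits / len(relevant)
```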
The end-to-end accuracy results show that RAGRoute has a marginal impact on achieved RAG accuracy. On the MIRAGE benchmark, accuracy without RAG is 67.0%; traditional RAG raises it to 72.2%, and RAGRoute achieves 72.24%. On the MMLU benchmark, accuracy is 43.59% with a single database and 43.29% with RAGRoute.
The paper compares RAGRoute with related work on RAG over multiple data sources and on ML-assisted resource selection, distinguishing it by its lightweight neural network classifier and its focus on minimizing query overhead while maintaining high retrieval quality. Privacy-focused federated search methods such as Raffle, C-FedRag, and FRAG could potentially adopt RAGRoute's routing to gain its efficiency benefits alongside their privacy guarantees.
In conclusion, the authors present RAGRoute as an efficient routing mechanism for federated RAG that dynamically selects relevant data sources, reduces query overhead, and maintains high retrieval quality. The results confirm that querying all data sources is often unnecessary, underscoring the importance of query-aware retrieval strategies in RAG workflows.