Lightweight Query Routing for Adaptive RAG: A Baseline Study on RAGRouter-Bench

Published 3 Apr 2026 in cs.IR, cs.CL, and cs.LG | (2604.03455v1)

Abstract: Retrieval-Augmented Generation pipelines span a wide range of retrieval strategies that differ substantially in token cost and capability. Selecting the right strategy per query is a practical efficiency problem, yet no routing classifiers have been trained on RAGRouter-Bench (Wang et al., 2026), a recently released benchmark of 7,727 queries spanning four knowledge domains, each annotated with one of three canonical query types: factual, reasoning, and summarization. We present the first systematic evaluation of lightweight classifier-based routing on this benchmark. Five classical classifiers are evaluated under three feature regimes, namely, TF-IDF, MiniLM sentence embeddings (Reimers and Gurevych, 2019), and hand-crafted structural features, yielding 15 classifier-feature combinations. Our best configuration, TF-IDF with an SVM, achieves a macro-averaged F1 of 0.928 and an accuracy of 93.2%, while simulating 28.1% token savings relative to always using the most expensive paradigm. Lexical TF-IDF features outperform semantic sentence embeddings by 3.1 macro-F1 points, suggesting that surface keyword patterns are strong predictors of query-type complexity. Domain-level analysis reveals that medical queries are hardest to route and legal queries most tractable. These results establish a reproducible query-side baseline and highlight the gap that corpus-aware routing must close.


Authors (2)

Summary

The paper evaluates lightweight classifier-based query routing for adaptive RAG, using query text alone to select a retrieval paradigm.
It shows that TF-IDF features with an SVM outperform semantic and structural features, achieving 0.928 macro-F1 and 28.1% simulated token savings.
The study underscores that combining query-side signals with corpus-side features could further optimize efficiency and answer quality in RAG systems.

Lightweight Classifier-Based Query Routing for Adaptive RAG: Baseline Evaluation on RAGRouter-Bench

Problem Motivation and Background

Retrieval-Augmented Generation (RAG) pipelines are widely adopted to ground LLM outputs in external corpora across diverse domains. However, existing systems tend to apply one retrieval strategy uniformly to every query, disregarding the heterogeneous cost and capability profiles of different RAG paradigms. The best paradigm for a query is not merely a function of query type but also reflects query-corpus interactions, which leaves an open efficiency question: can lightweight query-side classifiers cost-effectively route queries to appropriate RAG strategies?

RAGRouter-Bench (Wang et al., 30 Jan 2026) provides a rigorously annotated benchmark for RAG routing, comprising 7,727 queries across four domains (Wikipedia, literature, legal, and medical). Each query is labeled as factual, reasoning, or summarization, and evaluated with five distinct RAG paradigms. Prior work in adaptive retrieval (e.g., Adaptive-RAG [jeong2024adaptiverag]) and LLM routing (e.g., RouteLLM [ong2025routellm], FrugalGPT [chen2023frugalgpt]) has demonstrated the potential of classifier-based routing, but none of it has established systematic query-side baseline performance on publicly labeled datasets like RAGRouter-Bench.

Methodological Approach

This study presents the first suite of lightweight classifier baselines for query routing on RAGRouter-Bench. It crosses five classical classifiers (Logistic Regression, SVM, Random Forest, KNN, MLP) with three feature regimes, giving 15 combinations: lexical (TF-IDF), semantic (MiniLM sentence embeddings), and structural (hand-crafted query metrics). All features are extracted from query text alone and require no retrieval or LLM computation, so the results serve as strict query-side baselines.
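As an illustration of what "query text alone" can mean for two of the three feature regimes, here is a minimal pure-Python sketch of a TF-IDF vectorizer and a structural feature extractor. It is not the study's implementation. The tokenizer, the cue-word lists `WH_WORDS` and `SUMMARY_CUES`, and the specific structural features are assumptions chosen for the example, since the summary does not list the hand-crafted metrics that were actually used. The embedding regime is omitted because it needs an external model.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")

# Hypothetical cue lists, chosen only for this illustration.
WH_WORDS = {"who", "what", "when", "where", "which", "why", "how"}
SUMMARY_CUES = {"summarize", "summarise", "summary", "overview", "outline"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def fit_idf(queries):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(queries)
    df = Counter()
    for q in queries:
        df.update(set(tokenize(q)))  # count each term once per query
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(query, idf):
    """Sparse, L2-normalized TF-IDF vector; terms unseen at fit time are dropped."""
    tf = Counter(t for t in tokenize(query) if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def structural_features(query):
    """A few hand-crafted query metrics (assumed, not the study's feature set)."""
    tokens = tokenize(query)
    return {
        "n_tokens": len(tokens),
        "n_chars": len(query),
        "n_commas": query.count(","),
        "starts_with_wh": int(bool(tokens) and tokens[0] in WH_WORDS),
        "has_summary_cue": int(any(t in SUMMARY_CUES for t in tokens)),
    }


if __name__ == "__main__":
    queries = [
        "Who wrote Hamlet?",
        "Summarize the main themes of Hamlet",
        "Why did the court reject the appeal, and what precedent applied?",
    ]
    idf = fit_idf(queries)
    print(tfidf_vector(queries[0], idf))
    print(structural_features(queries[2]))
```

Either feature dictionary can be fed to any classical classifier after conversion to a fixed-length vector. Terms that appear in fewer queries receive a larger IDF weight, which is one way surface keywords can become discriminative.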

The query type annotation is used as the routing target. A literature-motivated mapping is adopted: factual (single-hop) queries route to NaiveRAG, reasoning (multi-hop) queries to HybridRAG, and summarization queries to IterativeRAG. Cost savings are simulated using empirically established paradigm token cost ratios. Savings are measured against always using the most expensive paradigm and compared with a perfect-label reference.
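The mapping and the savings simulation can be sketched as follows. The type-to-paradigm mapping comes from the text. The relative costs in `RELATIVE_COST` are assumptions: the value 0.40 for NaiveRAG is chosen so that routing everything to the cheapest paradigm gives the 60.0% savings quoted in the results (assuming NaiveRAG is that paradigm), while the value for HybridRAG is a placeholder, because the actual cost ratios are not given in this summary.

```python
# Mapping stated in the text: query type -> RAG paradigm.
TYPE_TO_PARADIGM = {
    "factual": "NaiveRAG",
    "reasoning": "HybridRAG",
    "summarization": "IterativeRAG",
}

# Token cost relative to the most expensive paradigm (= 1.0).
# ASSUMPTION: illustrative values only; HybridRAG's 0.70 is a placeholder.
RELATIVE_COST = {
    "NaiveRAG": 0.40,
    "HybridRAG": 0.70,
    "IterativeRAG": 1.00,
}


def route(query_type):
    """Return the paradigm a predicted query type is routed to."""
    return TYPE_TO_PARADIGM[query_type]


def token_savings(predicted_types, cost=RELATIVE_COST):
    """Fraction of tokens saved versus always using the most expensive paradigm."""
    if not predicted_types:
        raise ValueError("predicted_types must not be empty")
    baseline = max(cost.values()) * len(predicted_types)
    spent = sum(cost[route(t)] for t in predicted_types)
    return 1.0 - spent / baseline


if __name__ == "__main__":
    predicted = ["factual", "factual", "reasoning", "summarization"]
    true = ["factual", "reasoning", "reasoning", "summarization"]
    print("classifier savings:", token_savings(predicted))
    print("perfect-label savings:", token_savings(true))
```

Passing the true labels instead of the predictions gives the perfect-label reference. Savings depend only on which paradigm each query is sent to, not on whether the prediction was correct, which is why savings have to be read together with macro-F1.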

Models are trained and evaluated with 5-fold stratified cross-validation. Macro-F1 is reported alongside accuracy because the labels are imbalanced. Token savings are computed post hoc from the predicted paradigms.
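To make the evaluation protocol concrete, here is a small pure-Python sketch of stratified fold assignment and of macro-F1 and accuracy. It is a simplified stand-in, not the study's code: it assigns folds by deterministic round-robin within each class, whereas a real run would typically shuffle with a fixed seed. The toy label distribution in the usage section is invented for the example.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Split indices into k folds with roughly equal class proportions.

    Simplification: round-robin within each class, no shuffling.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy, imbalanced labels (invented for illustration).
    labels = ["factual"] * 10 + ["reasoning"] * 5 + ["summarization"] * 5
    for fold in stratified_folds(labels, k=5):
        print(sorted(labels[i] for i in fold))

    always_factual = ["factual"] * len(labels)
    print("accuracy:", accuracy(labels, always_factual))
    print("macro-F1:", macro_f1(labels, always_factual))
```

On the toy labels, a predictor that always outputs the majority class reaches 50% accuracy but a much lower macro-F1, because the two minority classes each contribute an F1 of zero. This is the reason macro-F1 is the more informative metric under label imbalance.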

Empirical Results

TF-IDF features with an SVM yield the best overall performance: a macro-averaged F1 of 0.928 and 93.2% accuracy, with simulated token savings of 28.1% relative to routing every query to IterativeRAG. Lexical features outperform semantic MiniLM embeddings by 3.1 macro-F1 points and hand-crafted structural features by 14 points.

Macro-F1 and accuracy are consistently highest for legal-domain queries and lowest for medical-domain queries. Legal queries exhibit formulaic lexical patterns conducive to high-precision routing; in contrast, medical queries are more challenging due to homogeneous corpus structure and less discriminative surface vocabulary.

A joint analysis of cost and accuracy shows the tradeoff. The majority-class baseline collapses all queries to the cheapest paradigm, achieving maximal savings (60.0%) but minimal routing quality (macro-F1 0.231), which confirms that optimizing for cost in isolation degrades answer quality. Perfect-label routing yields 35.2% savings, and high-accuracy classifiers recover 78–80% of that figure.
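The arithmetic behind these figures can be checked with a short sketch. The recovery ratio uses only numbers quoted above. The second calculation is an inference, not a reported result: it assumes the majority-class baseline always predicts one class, so that class has recall 1 and precision equal to its prevalence p, giving F1 = 2p / (1 + p), while the other classes score zero. Under that assumption, the quoted macro-F1 implies a rough majority-class share.

```python
def savings_recovery(classifier_savings, perfect_label_savings):
    """Share of the perfect-label savings that a classifier recovers."""
    return classifier_savings / perfect_label_savings


def majority_macro_f1(p, n_classes):
    """Macro-F1 of always predicting a class with prevalence p.

    The predicted class has precision p and recall 1, so F1 = 2p / (1 + p);
    every other class has F1 = 0.
    """
    return (2 * p / (1 + p)) / n_classes


def implied_majority_share(macro_f1, n_classes):
    """Invert majority_macro_f1: the prevalence p implied by a macro-F1 value."""
    s = macro_f1 * n_classes  # F1 of the single predicted class
    return s / (2 - s)


if __name__ == "__main__":
    # 28.1% (best classifier) and 35.2% (perfect labels) are quoted in the text.
    print("recovery:", round(savings_recovery(28.1, 35.2), 3))
    # Inferred, not reported: prevalence consistent with macro-F1 0.231, 3 classes.
    print("implied majority share:", round(implied_majority_share(0.231, 3), 3))
```

The best configuration's recovery lands at the top of the quoted 78–80% range, and the implied majority share is a little over half of the queries. Treat the latter only as a consistency check on the quoted numbers.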

Implications and Limitations

These findings establish reproducible baselines quantifying the routing signal available solely from query text. The strong performance of TF-IDF features highlights that surface keyword patterns are robust indicators of query-type complexity in this benchmark. However, the gap to the perfect-label reference and the domain-wise performance disparities reinforce the benchmark’s core assertion: query-side signals alone are insufficient, and optimal routing fundamentally depends on query-corpus interactions.

The study deliberately excludes corpus-side structural and semantic features, marking these baselines as a lower bound for routing performance. The type-to-paradigm mapping, while literature-backed, is a simplification; broader corpus-aware routing policies may further improve efficiency and quality. Evaluation is restricted to the four benchmark domains; generalization to other corpora remains unexplored.

Future Directions

Advancements in adaptive RAG routing should incorporate corpus-side properties (structural metrics, dispersion, hubness, intrinsic dimension) as additional features, potentially through hybrid approaches that combine query and corpus signals. Further research may explore domain-adaptive transformers, multi-modal routing classifiers, or meta-learning strategies to dynamically calibrate retrieval paradigms. Out-of-distribution generalization and robustness to emerging domains and query classes are critical next steps.

Conclusion

This baseline study systematically evaluates lightweight query routing classifiers on RAGRouter-Bench, demonstrating strong macro-F1 and substantial simulated cost reductions using query text alone. TF-IDF features are more predictive than semantic embeddings, driven by discriminative surface vocabulary. The analysis underscores the need to integrate corpus-side signals for better paradigm selection and a better cost-quality tradeoff. These baselines are a reference point for future work on adaptive RAG routing architectures that leverage richer contextual information for efficient, domain-sensitive retrieval augmentation.

Citation: "Lightweight Query Routing for Adaptive RAG: A Baseline Study on RAGRouter-Bench" (2604.03455)
