TagRouter: Efficient LLM Routing
- TagRouter is a training-free routing framework that orchestrates multiple large language models using semantic tag representations for enhanced open-domain text generation.
- It leverages lightweight tag generation, scoring, and decision modules to dynamically assign queries based on model performance and cost efficiency.
- Empirical results show TagRouter achieves a 6.15% relative gain in accept rate (about 4.8 percentage points) and up to 17% cost reduction compared to the single-LLM baseline.
TagRouter is a training-free model routing framework that orchestrates multiple LLMs through learned tag representations for efficient open-domain text generation. It seeks to maximize system-level response quality and minimize cost by dynamically allocating each user query to the most appropriate LLM from a heterogeneous pool, using a lightweight, semantic tagging mechanism. TagRouter empirically outperforms 13 baselines, achieving higher accept rates and substantial cost reductions, and supports practical ensembling of diverse, potentially black-box, LLMs in a unified, scalable architecture (Chen et al., 14 Jun 2025).
1. Open-Domain Model Routing: Problem and Objectives
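Consider a pool of LLMs, $\mathcal{M} = \{M_1, \dots, M_K\}$, where each $M_k$ has a known per-token cost $c_k$ and distinct response strengths. Given a batch of user queries $Q = \{q_1, \dots, q_N\}$, the routing objective is to construct a policy $\pi: Q \to \mathcal{M}$ that assigns each query $q_i$ to a model $\pi(q_i)$, such that system-wide accuracy and resource cost are jointly optimized.
Accuracy is measured via accept rate (AR), defined as the fraction of model responses that receive “win” or “tie” labels in pairwise judgments against a high-quality reference LLM. Let $r_i \in \{\text{win}, \text{tie}, \text{loss}\}$ be the judged outcome for the response of $\pi(q_i)$ to $q_i$, and let the indicator $\mathbb{1}[r_i \in \{\text{win}, \text{tie}\}]$ equal 1 if the result is “win” or “tie” and 0 otherwise. The core metrics are formalized as:
- Accept Rate (AR):
$$\mathrm{AR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[r_i \in \{\text{win}, \text{tie}\}\big]$$
- Relative Cost:
$$\mathrm{Cost}_{\mathrm{rel}} = \frac{\sum_{i=1}^{N} \mathrm{cost}(\pi(q_i), q_i)}{\sum_{i=1}^{N} \mathrm{cost}(M_{\mathrm{ref}}, q_i)}$$
where $\mathrm{cost}(M, q)$ is the cost of answering $q$ with model $M$ and $M_{\mathrm{ref}}$ is the reference LLM.
The target is high $\mathrm{AR}$ with low $\mathrm{Cost}_{\mathrm{rel}}$.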
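As a minimal sketch of the two metrics, the following Python computes an accept rate from judged outcomes and a cost ratio against an always-reference baseline. The function names and the per-query cost lists are illustrative assumptions, not part of the original framework.

```python
from typing import Sequence

ACCEPTED = {"win", "tie"}  # outcome labels that count as accepted


def accept_rate(outcomes: Sequence[str]) -> float:
    """Fraction of judged outcomes labelled 'win' or 'tie'."""
    if not outcomes:
        raise ValueError("need at least one judged outcome")
    return sum(o in ACCEPTED for o in outcomes) / len(outcomes)


def relative_cost(routed_costs: Sequence[float],
                  reference_costs: Sequence[float]) -> float:
    """Total cost of the routed system divided by the total cost of
    sending every query to the reference LLM (hypothetical per-query costs)."""
    if len(routed_costs) != len(reference_costs):
        raise ValueError("cost lists must cover the same queries")
    return sum(routed_costs) / sum(reference_costs)


print(accept_rate(["win", "tie", "loss", "tie"]))                  # 0.75
print(relative_cost([1.0, 0.25, 0.25, 1.0], [1.0, 1.0, 1.0, 1.0]))  # 0.625
```

In this toy case half of the queries go to a model costing a quarter of the reference, so the system runs at 62.5% of the reference cost.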
2. Tag Construction and Semantic Representation
TagRouter maps each input query $q$ to a compact set of semantic tags, $T(q)$, selected from a normalized vocabulary $\mathcal{V}$. Tag construction involves several processing stages:
- Raw Tag Generation: For each query $q$, a strong LLM (ERNIE-4.0-Turbo) generates coarse-grained tags via a system instruction, accumulating on the order of 14,000 unique raw tags in the BCUQ corpus.
- Normalization: Tags occurring fewer than five times are filtered. Aggregation removes punctuation and normalizes casing. Tag semantics are further clustered: tags are embedded using a PhraseBERT encoder $e(\cdot)$; DBSCAN clusters the embeddings, and nearest pairs are merged within clusters until each contains at least two exemplars (per Algorithm 1 (Chen et al., 14 Jun 2025)).
- Mathematical Representation: The resulting vocabulary is $\mathcal{V}$ (size 1,601 for BCUQ). For a query $q$, a tag set $T(q) \subseteq \mathcal{V}$ of size $k$ is produced by the TagGenerator model. At routing time, a generated tag $t$ may be realigned to its nearest neighbor in $\mathcal{V}$ using cosine similarity:
$$\hat{t} = \arg\max_{v \in \mathcal{V}} \cos\big(e(t), e(v)\big)$$
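The frequency filter and the cosine realignment step can be sketched in plain Python as below. This is an illustration under stated assumptions: tags are normalized before counting (the source does not fix the order), the toy 2-D vectors stand in for PhraseBERT embeddings, and the DBSCAN clustering and merging stage is omitted. All function names are hypothetical.

```python
import math
import string
from collections import Counter

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(tag: str) -> str:
    """Strip punctuation and lowercase a raw tag."""
    return tag.translate(_PUNCT).lower().strip()


def build_vocabulary(raw_tags, min_count=5):
    """Keep normalized tags that occur at least `min_count` times."""
    counts = Counter(normalize(t) for t in raw_tags)
    return {t for t, c in counts.items() if t and c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(tag_vec, vocab_vecs):
    """Return the vocabulary tag whose embedding is closest to `tag_vec`."""
    return max(vocab_vecs, key=lambda v: cosine(tag_vec, vocab_vecs[v]))


vocab = build_vocabulary(["Math!"] * 3 + ["math"] * 2 + ["rare"])
print(vocab)  # {'math'}
print(align([1.0, 0.1], {"coding": [1.0, 0.0], "poetry": [0.0, 1.0]}))  # coding
```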
3. Inference Pipeline and Routing Algorithm
The TagRouter algorithm consists of three key modules:
3.1 TAGGENERATOR
Maps each query $q$ to a tag set $T(q)$ using the trained tag generator model.
3.2 TAGSCORER
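Associates each model $M_k$ and tag $t$ with a precomputed performance score $s(M_k, t)$, computed offline from pairwise judgments using:
- $n^{\text{win}}_{k,t}$, $n^{\text{tie}}_{k,t}$, $n^{\text{loss}}_{k,t}$: counts of “win,” “tie,” or “loss” for $M_k$ on queries with tag $t$
- $w_{\text{win}}$, $w_{\text{tie}}$, $w_{\text{loss}}$: weights on the three outcomes (default values are specified in Chen et al., 14 Jun 2025)
- a tag-level weighting term that emphasizes rare but reliable tags
For a query $q$, the model-level score aggregates the per-tag scores over its tag set: $S(M_k, q) = \sum_{t \in T(q)} s(M_k, t)$.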
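To make the lookup-table idea concrete, here is a hedged sketch that builds a (model, tag) score table from judged outcomes. The weights in `ASSUMED_WEIGHTS`, the normalization by the total count, and the omission of the rare-tag emphasis term are all illustrative assumptions; they are not the paper's scoring formula or its default values.

```python
from collections import defaultdict

# Hypothetical outcome weights, chosen only for illustration.
ASSUMED_WEIGHTS = {"win": 1.0, "tie": 0.5, "loss": -1.0}


def build_tag_scores(judgments, weights=ASSUMED_WEIGHTS):
    """Build a {(model, tag): score} table.

    `judgments` is an iterable of (model, tags, outcome) triples, where
    `outcome` is 'win', 'tie', or 'loss' against the reference LLM.
    """
    counts = defaultdict(lambda: {"win": 0, "tie": 0, "loss": 0})
    for model, tags, outcome in judgments:
        for tag in tags:
            counts[(model, tag)][outcome] += 1

    scores = {}
    for key, c in counts.items():
        total = sum(c.values())
        # Weighted outcome counts, averaged over all judgments for this pair.
        scores[key] = sum(weights[o] * n for o, n in c.items()) / total
    return scores


judgments = [
    ("A", ["math"], "win"),
    ("A", ["math"], "loss"),
    ("A", ["code"], "tie"),
    ("B", ["math"], "win"),
]
table = build_tag_scores(judgments)
print(table[("A", "math")], table[("A", "code")], table[("B", "math")])  # 0.0 0.5 1.0
```

Because the table is a plain dictionary, scoring at routing time reduces to a handful of lookups per query.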
3.3 TAGDECIDER
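Selects the model $M^{*} = \arg\max_{M_k \in \mathcal{M}} S(M_k, q)$, where $S(M_k, q)$ is the query-level score of model $M_k$. For two-model cost-aware routing, the differential between the two candidates' scores,
$$\Delta S(q) = S(M_1, q) - S(M_2, q),$$
is compared to a threshold $\theta$; adjusting $\theta$ enables a continuous AR vs. cost trade-off.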
Pseudocode Outline:
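Input: query $q$, model pool $\mathcal{M}$, tag vocabulary $\mathcal{V}$, precomputed tag scores, threshold $\theta$
1. Generate the tag set $T(q)$ with TAGGENERATOR and realign each tag to its nearest neighbor in $\mathcal{V}$.
2. For each model in $\mathcal{M}$, look up its per-tag scores with TAGSCORER and aggregate them into a query-level score.
3. With TAGDECIDER, return the model with the highest query-level score; in the two-model case, compare the score differential to $\theta$ and return the corresponding model.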
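The decision step can be sketched in Python as follows, assuming the tag scores are already available as a dictionary keyed by (model, tag). The sign convention (cheap score minus expensive score, route to the cheaper model when the differential reaches the threshold) and the default `theta=0.0` are assumptions made for this illustration; the function names are hypothetical.

```python
def query_score(tag_scores, model, tags):
    """Sum the precomputed per-tag scores of `model`; unseen pairs score 0."""
    return sum(tag_scores.get((model, t), 0.0) for t in tags)


def decide(tag_scores, models, tags):
    """Multi-model case: pick the model with the highest query-level score."""
    return max(models, key=lambda m: query_score(tag_scores, m, tags))


def decide_two_model(tag_scores, cheap, expensive, tags, theta=0.0):
    """Two-model case: route to the cheaper model when its score advantage
    over the expensive model is at least `theta` (assumed convention)."""
    delta = query_score(tag_scores, cheap, tags) - query_score(tag_scores, expensive, tags)
    return cheap if delta >= theta else expensive


tag_scores = {
    ("cheap", "math"): 0.2, ("cheap", "code"): 0.6,
    ("big", "math"): 0.9, ("big", "code"): 0.5,
}
print(decide(tag_scores, ["cheap", "big"], ["math", "code"]))                 # big
print(decide_two_model(tag_scores, "cheap", "big", ["code"]))                 # cheap
print(decide_two_model(tag_scores, "cheap", "big", ["math", "code"], -1.0))   # cheap
```

Under this convention, lowering `theta` sends more queries to the cheaper model, which is how a single scalar trades accept rate against cost.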
4. Theoretical Properties and Cost-Performance Guarantees
TagRouter's empirical nature is complemented by formal characterizations:
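- Accept Rate at Threshold $\theta$:
$$\mathrm{AR}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\text{the response of } \pi_\theta(q_i) \text{ to } q_i \text{ is judged “win” or “tie”}\big]$$
- AUC/PAUC: Treating the probability $p(\theta)$ that the routing policy $\pi_\theta$ selects an expensive LLM as the $x$-axis and $\mathrm{AR}$ as the $y$-axis:
$$\mathrm{AUC} = \int_0^1 \mathrm{AR}(p)\,dp$$
Partial AUC above the always-LLM baseline $\mathrm{AR}_{\mathrm{ref}}$:
$$\mathrm{PAUC} = \int_0^1 \max\{0,\ \mathrm{AR}(p) - \mathrm{AR}_{\mathrm{ref}}\}\,dp$$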
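- Efficiency Guarantee: If the score differential $\Delta S(q)$ is an unbiased predictor of the true quality difference between the candidate models, then:
$$\mathrm{Cost}(\pi_\theta) \le (1 - \epsilon)\,\mathrm{Cost}(M_{\mathrm{ref}}), \qquad \mathrm{AR}(\pi_\theta) \ge \mathrm{AR}_{\mathrm{ref}} - \delta$$
with empirically observed $\epsilon \approx 0.17$ (17% cost reduction) and negligible $\delta$.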
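The area metrics above can be approximated numerically from a handful of operating points. The sketch below uses the trapezoidal rule on (fraction routed to the expensive LLM, accept rate) pairs; the sample curve is invented for illustration, and the clipped-trapezoid PAUC is only an approximation wherever the curve crosses the baseline between two samples.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (fraction routed to the expensive LLM, accept rate)


def _trapezoid(points: List[Point]) -> float:
    """Trapezoidal-rule area under a piecewise-linear curve."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def auc(points: List[Point]) -> float:
    """Area under the accept-rate curve."""
    return _trapezoid(points)


def pauc(points: List[Point], baseline_ar: float) -> float:
    """Area of the part of the curve lying above the always-expensive baseline."""
    clipped = [(x, max(0.0, y - baseline_ar)) for x, y in points]
    return _trapezoid(clipped)


# Hypothetical curve: cheap-only AR 0.70, mixed routing 0.80, expensive-only 0.78.
curve = [(0.0, 0.70), (0.5, 0.80), (1.0, 0.78)]
print(round(auc(curve), 4))         # 0.77
print(round(pauc(curve, 0.78), 4))  # 0.01
```

A positive PAUC indicates that some routing mix beats sending everything to the expensive model, which is the regime the results table reports for TagRouter.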
5. Empirical Evaluation and Benchmark Results
5.1 Datasets and Model Pool
- Models: ERNIE-3.5 (“reference” LLM), EBspeed (cheaper), GLM4-9B, Qwen2.5-7B, EBspeedX
- Benchmarks: BCUQ (95,559 real-user queries, 8 categories), Alpaca (51,014 synthetic), Dolly (14,013 crowd-sourced)
5.2 Baselines
Comparison baselines include single-model (EB3.5, EBspeed), routing-after-inference (FrugalGPT, PairRanker, Blending), routing-before-inference (RouteLLM-BERT, RouteLLM-SWR, RouteLLM-MF, RouterBench-KNN/MLP, FORC), and tag-based variants.
5.3 Metrics
Evaluated on AR (%), $\Delta$AR (relative change in AR with respect to EB3.5, in %), Relative Cost (EB3.5=1.0), AUC, PAUC, and GPT-Rank.
5.4 BCUQ Results
| Method | AR (%) | ΔAR (%, relative) | Cost | AUC (%) | PAUC (%) |
|---|---|---|---|---|---|
| ERNIE-3.5 | 78.76 | — | 1.400 | — | — |
| FrugalGPT | 78.88 | +0.15 | 1.324 | 70.11 | 0.01 |
| PairRanker | 78.76 | +0.00 | 1.212 | 72.17 | 0.00 |
| RouteLLM-MF | 80.34 | +2.01 | 1.197 | 73.94 | 0.12 |
| RouterBench-KNN | 80.45 | +2.15 | 1.196 | 75.15 | 0.40 |
| FORC | 81.80 | +3.86 | 1.182 | 75.73 | 0.76 |
| Best Tag-based | 82.02 | +4.14 | 1.180 | 76.08 | 0.76 |
| TagRouter | 83.60 | +6.15 | 1.164 | 76.10 | 1.46 |
TagRouter achieves a 6.15% relative gain in AR (83.60% vs. 78.76%, i.e. 4.84 percentage points) with a 17.2% cost reduction relative to the single-LLM baseline, and the highest AUC and PAUC among all methods. It demonstrates superior AUC in 7 of 8 BCUQ categories and scales robustly with increased model pool size: AUC grows from 0.7610 (2 candidates) to 0.8043 (5 candidates) while holding AR effectively constant. Routing between similarly sized models yields +6 pp AR at −14% cost.
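The ΔAR column in the BCUQ table can be reproduced from the AR column as a relative uplift over the ERNIE-3.5 accept rate of 78.76%. The short check below recomputes it for the tabulated rows; the helper name is hypothetical.

```python
BASELINE_AR = 78.76  # ERNIE-3.5 accept rate (%) from the BCUQ table

# method -> (AR %, reported delta AR) as listed in the BCUQ table
ROWS = {
    "FrugalGPT": (78.88, 0.15),
    "RouteLLM-MF": (80.34, 2.01),
    "RouterBench-KNN": (80.45, 2.15),
    "FORC": (81.80, 3.86),
    "Best Tag-based": (82.02, 4.14),
    "TagRouter": (83.60, 6.15),
}


def relative_uplift(ar: float, baseline: float = BASELINE_AR) -> float:
    """Relative AR change over the baseline, in percent."""
    return (ar - baseline) / baseline * 100.0


for method, (ar, reported) in ROWS.items():
    print(f"{method}: {relative_uplift(ar):.2f} (reported {reported:.2f})")

# Absolute gain for TagRouter, in percentage points.
print(round(83.60 - BASELINE_AR, 2))  # 4.84
```

The recomputed values match the table to two decimals, which is why the 6.15 figure reads as a relative percentage and the absolute gain is 4.84 percentage points.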
6. Practical Considerations, Scalability, and Limitations
- Scalability and Evolution: TagRouter is intrinsically training-free. TAGSCORER and TAGDECIDER rely on lookup tables and threshold rules, not gradient-based optimization. When a new LLM ($M_{K+1}$) is added, it suffices to annotate roughly 1,000 sample queries for routing score estimation; TagRouter can immediately incorporate $M_{K+1}$ with no retraining.
- Latency and Resource Requirements: TAGGENERATOR comprises a 0.5B-parameter (500MB) LLM and a 33MB embedding model. Its non-repetitive inference minimizes latency.
- Cost Control: The scalar threshold $\theta$ provides a fine-grained AR vs. cost trade-off, with a practical default value reported by the authors.
- Comparison to Prior Methods: Training-based routers (RouteLLM-MF/BERT, RouterBench) require full retraining for candidate pool changes. Speculative/iterative methods (FrugalGPT, FORC) induce higher latency and redundant queries. TagRouter uniquely supports proprietary or black-box LLMs, is agnostic to model counts, and handles open-domain prompts.
- Limitations and Prospects: Language support (TagGenerator trained on English/Chinese) is currently limited but extensible. Large-scale evaluation leverages LLM-as-judge with high agreement (reported as Cohen’s $\kappa$ with EB4.0); further formal regret or PAC-style routing analyses remain open for future research.
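The cost-control point in the list above can be illustrated with a small threshold sweep. This is a sketch on invented data: `deltas` are hypothetical per-query score differentials (cheap minus expensive, an assumed convention), the boolean lists say whether each model's response would be accepted, and the per-query costs are made up.

```python
def sweep(deltas, cheap_ok, expensive_ok, cheap_cost, expensive_cost, thetas):
    """For each threshold, return (theta, accept rate, relative cost).

    A query goes to the cheap model when its differential is >= theta;
    cost is normalized by the always-expensive configuration.
    """
    n = len(deltas)
    rows = []
    for theta in thetas:
        use_cheap = [d >= theta for d in deltas]
        accepted = sum(c if u else e
                       for u, c, e in zip(use_cheap, cheap_ok, expensive_ok))
        cost = sum(cheap_cost if u else expensive_cost for u in use_cheap)
        rows.append((theta, accepted / n, cost / (n * expensive_cost)))
    return rows


rows = sweep(
    deltas=[0.5, -0.2, 0.1, -0.6],
    cheap_ok=[True, False, True, False],
    expensive_ok=[True, True, False, True],
    cheap_cost=0.25,
    expensive_cost=1.0,
    thetas=[-1.0, 0.0, 1.0],
)
for theta, ar, cost in rows:
    print(theta, ar, cost)
# -1.0 0.5 0.25
# 0.0 1.0 0.625
# 1.0 0.75 1.0
```

In this toy setting the middle threshold both raises the accept rate above the always-expensive configuration and lowers cost, which mirrors the qualitative behavior the section describes.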
7. Summary
TagRouter introduces a semantic tagging paradigm for ensemble LLM routing, enabling a dynamically extensible “super model” that adapts to the evolving LLM ecosystem. State-of-the-art accept rates (+6.15% relative to the single-LLM baseline) and substantial cost savings (−17.20%) are achieved without requiring per-candidate retraining, supporting deployment for cost-sensitive, open-domain text generation in practical real-world systems (Chen et al., 14 Jun 2025).