TagRouter: Efficient LLM Routing

Updated 22 May 2026

TagRouter is a training-free routing framework that orchestrates multiple large language models using semantic tag representations for enhanced open-domain text generation.
It leverages lightweight tag generation, scoring, and decision modules to dynamically assign queries based on model performance and cost efficiency.
Empirical results show TagRouter achieves a 6.15% relative gain in accept rate (about 4.8 percentage points) and up to 17% cost reduction compared to baseline methods.

TagRouter is a training-free model routing framework that orchestrates multiple LLMs through learned tag representations for efficient open-domain text generation. It seeks to maximize system-level response quality and minimize cost by dynamically allocating each user query to the most appropriate LLM from a heterogeneous pool, using a lightweight, semantic tagging mechanism. TagRouter empirically outperforms 13 baselines, achieving higher accept rates and substantial cost reductions, and supports practical ensembling of diverse, potentially black-box, LLMs in a unified, scalable architecture (Chen et al., 14 Jun 2025).

1. Open-Domain Model Routing: Problem and Objectives

Consider a pool of $K$ LLMs, $\mathcal{M} = \{M_1, \dots, M_K\}$, where each $M_i$ has a known per-token cost $c_i$ and distinct response strengths. Given a batch of user queries $Q = \{q_1, \dots, q_N\}$, the routing objective is to construct a policy $g: Q \to \mathcal{M}$ that assigns each query $q$ to a model $M^*(q)$ such that system-wide accuracy and resource cost are jointly optimized.

Accuracy is measured by the accept rate (AR): the fraction of model responses labeled “win” or “tie” in pairwise judgments against a high-quality reference LLM. Let $\mathrm{outcome}(M, q) \in \{\mathrm{win}, \mathrm{tie}, \mathrm{loss}\}$, and let the indicator $I_{\mathrm{win} \cup \mathrm{tie}}(M, q)$ equal $1$ if the outcome is “win” or “tie” and $0$ otherwise. The core metrics are formalized as:

Accept Rate (AR):

$\mathrm{AR}(g) = \frac{1}{N} \sum_{q \in Q} I_{\mathrm{win} \cup \mathrm{tie}}(g(q), q)$

Relative Cost:

The total cost incurred by the responses selected under $g$, expressed relative to the cost of the reference LLM.

The target is high $\mathrm{AR}$ with low relative cost.
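As a minimal sketch of how the two metrics can be computed from judged outcomes, the snippet below uses hypothetical data structures (`routing`, `outcomes`, `cost`) and toy values that are not from the source. It assumes a per-response cost is already available (for example, per-token price times token count) and that relative cost is the routed total divided by the total for sending every query to the reference model.

```python
from typing import Dict, Tuple

# Hypothetical inputs (illustrative only, not from the source):
#   routing:  query id -> name of the model the policy picked
#   outcomes: (query id, model name) -> "win" | "tie" | "loss" vs. the reference
#   cost:     (query id, model name) -> cost of that response (price x tokens)
Routing = Dict[str, str]
Outcomes = Dict[Tuple[str, str], str]
Costs = Dict[Tuple[str, str], float]


def accept_rate(routing: Routing, outcomes: Outcomes) -> float:
    """Fraction of queries whose routed response is judged win or tie."""
    accepted = sum(
        1 for q, model in routing.items() if outcomes[(q, model)] in ("win", "tie")
    )
    return accepted / len(routing)


def relative_cost(routing: Routing, cost: Costs, reference: str) -> float:
    """Cost of the routed responses divided by the all-reference cost."""
    routed = sum(cost[(q, model)] for q, model in routing.items())
    baseline = sum(cost[(q, reference)] for q in routing)
    return routed / baseline


routing = {"q1": "cheap", "q2": "strong", "q3": "cheap", "q4": "strong"}
outcomes = {
    ("q1", "cheap"): "win",
    ("q2", "strong"): "tie",
    ("q3", "cheap"): "loss",
    ("q4", "strong"): "win",
}
cost = {(q, "cheap"): 1.0 for q in routing}
cost.update({(q, "strong"): 4.0 for q in routing})

print(accept_rate(routing, outcomes))          # 0.75
print(relative_cost(routing, cost, "strong"))  # 0.625
```

In this toy case three of four routed responses are accepted, and routing half the queries to the cheaper model brings the cost to 62.5% of the all-reference total.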

2. Tag Construction and Semantic Representation

TagRouter maps each input query $q$ to a compact set of semantic tags, $T(q)$, selected from a normalized vocabulary $\mathcal{V}$. Tag construction involves several processing stages:

Raw Tag Generation: For each $q$, a strong LLM (ERNIE-4.0-Turbo) generates coarse-grained tags via a system instruction, accumulating roughly 14,000 unique raw tags in the BCUQ corpus.
Normalization: Tags occurring fewer than five times are filtered out. Aggregation removes punctuation and normalizes casing. Tags are then clustered semantically: each tag is embedded with a PhraseBERT encoder $e(\cdot)$, DBSCAN clusters the embeddings, and nearest pairs are merged within clusters until each contains at least two exemplars (per Algorithm 1 (Chen et al., 14 Jun 2025)).
Mathematical Representation: The resulting vocabulary $\mathcal{V}$ has 1,601 tags for BCUQ. For a query $q$, the TagGenerator model produces a tag set $T(q) = \{t_1, \dots, t_k\}$. At routing time, a generated tag $t$ may be realigned to its nearest neighbor in $\mathcal{V}$ using cosine similarity:

$\hat{t} = \arg\max_{v \in \mathcal{V}} \cos\big(e(t), e(v)\big)$
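The realignment step can be illustrated with a small nearest-neighbor lookup. This is only a sketch: the three-dimensional vectors stand in for real PhraseBERT embeddings, and the names `cosine`, `realign`, and `vocab` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def realign(tag_vec: Sequence[float], vocab: Dict[str, Sequence[float]]) -> str:
    """Return the vocabulary tag whose embedding is most similar to tag_vec."""
    return max(vocab, key=lambda name: cosine(tag_vec, vocab[name]))


# Toy embeddings standing in for encoder outputs (assumption for illustration).
vocab = {
    "coding": [1.0, 0.0, 0.0],
    "translation": [0.0, 1.0, 0.0],
    "poetry": [0.0, 0.0, 1.0],
}

# An out-of-vocabulary tag such as "programming" with a nearby embedding.
print(realign([0.9, 0.1, 0.0], vocab))  # coding
```

A production system would batch this as a matrix product over normalized embeddings; the loop form is kept here for readability.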

3. Inference Pipeline and Routing Algorithm

The TagRouter algorithm consists of three key modules:

3.1 TAGGENERATOR

Maps each query $q$ to a tag set $T(q)$ using the trained tag generator model.

3.2 TAGSCORER

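Associates each model $M_i$ and tag $t$ with a precomputed performance score $s(M_i, t)$. The score is built from:

$n_{\mathrm{win}}, n_{\mathrm{tie}}, n_{\mathrm{loss}}$: counts of “win,” “tie,” or “loss” for $M_i$ on queries with tag $t$
$w_{\mathrm{win}}, w_{\mathrm{tie}}, w_{\mathrm{loss}}$: weights applied to the three outcome counts
a tag-level weighting term that emphasizes rare but reliable tags

The exact expression and the default weights are given in Chen et al. (14 Jun 2025). For a query, the model-level score $S(M_i, q)$ aggregates $s(M_i, t)$ over the tags $t \in T(q)$.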
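To make the lookup-table idea concrete, here is a simplified sketch of a tag scorer. It is not the paper's formula: the outcome weights are hypothetical, the rare-tag weighting term is omitted, per-tag scores are normalized by the number of judged queries, and query-level scores are assumed to be a plain sum over tags.

```python
from typing import Dict, Iterable, Tuple

# (model, tag) -> (wins, ties, losses); toy numbers, not from the paper.
Counts = Dict[Tuple[str, str], Tuple[int, int, int]]
ScoreTable = Dict[Tuple[str, str], float]

# Hypothetical outcome weights; the source's defaults are not reproduced here.
W_WIN, W_TIE, W_LOSS = 1.0, 0.5, -1.0


def build_score_table(counts: Counts) -> ScoreTable:
    """Precompute one score per (model, tag) from judged outcome counts."""
    table: ScoreTable = {}
    for key, (wins, ties, losses) in counts.items():
        total = wins + ties + losses
        if total == 0:
            continue  # no evidence for this pair, so leave it out of the table
        table[key] = (W_WIN * wins + W_TIE * ties + W_LOSS * losses) / total
    return table


def query_score(model: str, tags: Iterable[str], table: ScoreTable) -> float:
    """Sum the model's tag scores; unseen (model, tag) pairs contribute 0."""
    return sum(table.get((model, tag), 0.0) for tag in tags)


counts = {
    ("strong", "coding"): (8, 1, 1),
    ("cheap", "coding"): (3, 2, 5),
    ("strong", "chitchat"): (4, 4, 2),
    ("cheap", "chitchat"): (5, 4, 1),
}
table = build_score_table(counts)
print(query_score("strong", ["coding", "chitchat"], table))
print(query_score("cheap", ["coding", "chitchat"], table))
```

Because the table is built once from counts, adding a model only requires new rows for that model; nothing already in the table changes.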

3.3 TAGDECIDER

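Selects the model with the highest query-level score, $M^*(q) = \arg\max_{M_i \in \mathcal{M}} S(M_i, q)$. For two-model cost-aware routing, the score differential between the two candidates,

$\Delta S(q) = S(M_1, q) - S(M_2, q)$

is compared to a threshold $\theta$; adjusting $\theta$ enables a continuous AR vs. cost trade-off.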

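Pseudocode Outline:

Step 1: Generate the tag set for the query with TAGGENERATOR.
Step 2: With TAGSCORER, look up the precomputed score of each candidate model for each tag and aggregate the scores per model.
Step 3: With TAGDECIDER, return the highest-scoring model or, in the two-model cost-aware setting, compare the score differential to the threshold.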
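The three modules can be wired together roughly as follows. This is a sketch under stated assumptions: `toy_tagger` is a keyword stub standing in for the trained tag generator, the score table holds made-up values, scores are summed over tags, and the two-model rule routes to the stronger model when its score lead exceeds the threshold (the direction of the comparison is an assumption).

```python
from typing import Callable, Dict, List, Sequence, Tuple

ScoreTable = Dict[Tuple[str, str], float]
Tagger = Callable[[str], List[str]]


def query_score(model: str, tags: Sequence[str], table: ScoreTable) -> float:
    """Aggregate precomputed tag scores for one model (sum is an assumption)."""
    return sum(table.get((model, tag), 0.0) for tag in tags)


def route_argmax(query: str, models: Sequence[str],
                 tagger: Tagger, table: ScoreTable) -> str:
    """Pick the candidate with the highest aggregated score."""
    tags = tagger(query)
    return max(models, key=lambda m: query_score(m, tags, table))


def route_two_model(query: str, strong: str, cheap: str,
                    tagger: Tagger, table: ScoreTable, theta: float) -> str:
    """Use the strong model only if its score lead exceeds theta."""
    tags = tagger(query)
    delta = query_score(strong, tags, table) - query_score(cheap, tags, table)
    return strong if delta > theta else cheap


def toy_tagger(query: str) -> List[str]:
    """Keyword stub standing in for the trained tag generator."""
    return ["coding"] if "python" in query.lower() else ["chitchat"]


# Made-up scores for illustration only.
table = {
    ("strong", "coding"): 0.75, ("cheap", "coding"): -0.10,
    ("strong", "chitchat"): 0.40, ("cheap", "chitchat"): 0.60,
}

print(route_two_model("Write a Python parser", "strong", "cheap",
                      toy_tagger, table, theta=0.0))   # strong
print(route_two_model("How was your day?", "strong", "cheap",
                      toy_tagger, table, theta=0.0))   # cheap
```

Raising `theta` sends more traffic to the cheap model, which is how a single scalar trades accept rate against cost.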

4. Theoretical Properties and Cost-Performance Guarantees

TagRouter's empirical nature is complemented by formal characterizations:

Accept Rate at Threshold $\theta$:

$\mathrm{AR}(\theta) = \frac{1}{N} \sum_{q \in Q} I_{\mathrm{win} \cup \mathrm{tie}}(g_\theta(q), q)$

AUC/PAUC: Treating the probability $p(\theta)$ that $g_\theta$ selects an expensive LLM as the $x$-axis and $\mathrm{AR}(\theta)$ as the $y$-axis:

$\mathrm{AUC} = \int_0^1 \mathrm{AR}(p)\, dp$

Partial AUC above the always-LLM baseline $\mathrm{AR}_{\mathrm{LLM}}$:

$\mathrm{PAUC} = \int_0^1 \max\big(0,\ \mathrm{AR}(p) - \mathrm{AR}_{\mathrm{LLM}}\big)\, dp$

Efficiency Guarantee: If the routing score is an unbiased predictor of the true response outcome, a bound holds that trades cost savings against accept rate, with an empirically observed cost reduction of about 17% and a negligible residual term.
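The AUC and PAUC metrics above can be approximated numerically from a threshold sweep. The sketch below assumes the sweep has already produced points (fraction of queries sent to the expensive LLM, accept rate), takes PAUC to mean the area of the curve lying above the always-expensive baseline, and uses the trapezoidal rule on clipped values, which is only a coarse approximation near points where the curve crosses the baseline. The curve values are invented for illustration.

```python
from typing import Sequence


def trapezoid(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Trapezoidal-rule integral of ys over xs (xs must be increasing)."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )


def auc(ps: Sequence[float], ars: Sequence[float]) -> float:
    """Area under the accept-rate curve over the expensive-model fraction."""
    return trapezoid(ps, ars)


def pauc(ps: Sequence[float], ars: Sequence[float], baseline: float) -> float:
    """Area between the curve and the baseline where the curve is higher."""
    clipped = [max(0.0, ar - baseline) for ar in ars]
    return trapezoid(ps, clipped)


# Invented sweep: p = share of queries routed to the expensive LLM.
ps = [0.0, 0.5, 1.0]
ars = [0.70, 0.82, 0.78]
baseline = ars[-1]  # accept rate when every query goes to the expensive LLM

print(round(auc(ps, ars), 4))             # 0.78
print(round(pauc(ps, ars, baseline), 4))  # 0.02
```

A positive PAUC means that some mixed routing setting beats sending everything to the expensive model, which is the regime the results table highlights.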

5. Empirical Evaluation and Benchmark Results

5.1 Datasets and Model Pool

Models: ERNIE-3.5 (“reference” LLM), EBspeed (cheaper), GLM4-9B, Qwen2.5-7B, EBspeedX
Benchmarks: BCUQ (95,559 real-user queries, 8 categories), Alpaca (51,014 synthetic), Dolly (14,013 crowd-sourced)

5.2 Baselines

Comparison baselines include single-model (EB3.5, EBspeed), routing-after-inference (FrugalGPT, PairRanker, Blending), routing-before-inference (RouteLLM-BERT, RouteLLM-SWR, RouteLLM-MF, RouterBench-KNN/MLP, FORC), and tag-based variants.

5.3 Metrics

Evaluated on AR (%), ΔAR (relative uplift in AR over EB3.5, in percent), Relative Cost (EB3.5=1.0), AUC, PAUC, and GPT-Rank.

5.4 BCUQ Results

Method	AR (%)	ΔAR (%)	Cost	AUC (%)	PAUC (%)
ERNIE-3.5	78.76	—	1.400	—	—
FrugalGPT	78.88	+0.15	1.324	70.11	0.01
PairRanker	78.76	+0.00	1.212	72.17	0.00
RouteLLM-MF	80.34	+2.01	1.197	73.94	0.12
RouterBench-KNN	80.45	+2.15	1.196	75.15	0.40
FORC	81.80	+3.86	1.182	75.73	0.76
Best Tag-based	82.02	+4.14	1.180	76.08	0.76
TagRouter	83.60	+6.15	1.164	76.10	1.46

TagRouter achieves a 6.15% relative gain in AR over the single-LLM baseline (78.76% to 83.60%, i.e. +4.84 percentage points) with a 17.2% cost reduction, and the highest AUC and PAUC among all methods. It attains the best AUC in 7 of 8 BCUQ categories and scales robustly with model pool size: AUC grows from 0.7610 (2 candidates) to 0.8043 (5 candidates) while AR stays effectively constant. Routing between similarly sized models yields +6 pp AR at −14% cost.
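The ΔAR column in the table is consistent with a relative uplift over the ERNIE-3.5 accept rate rather than a difference in percentage points. A quick arithmetic check, using only the AR and ΔAR values from the table (the helper name `relative_uplift` is ours):

```python
BASELINE_AR = 78.76  # ERNIE-3.5 accept rate (%) from the table

# method -> (AR %, reported ΔAR) as listed in the table
reported = {
    "FrugalGPT": (78.88, 0.15),
    "PairRanker": (78.76, 0.00),
    "RouteLLM-MF": (80.34, 2.01),
    "RouterBench-KNN": (80.45, 2.15),
    "FORC": (81.80, 3.86),
    "Best Tag-based": (82.02, 4.14),
    "TagRouter": (83.60, 6.15),
}


def relative_uplift(ar: float, baseline: float) -> float:
    """Relative improvement over the baseline accept rate, in percent."""
    return (ar - baseline) / baseline * 100


for method, (ar, delta) in reported.items():
    computed = round(relative_uplift(ar, BASELINE_AR), 2)
    absolute = round(ar - BASELINE_AR, 2)
    print(f"{method}: +{absolute} pp absolute, +{computed}% relative "
          f"(table: {delta})")
```

Every row reproduces the tabulated ΔAR under the relative definition; for TagRouter the absolute difference is 4.84 percentage points and the relative uplift is 6.15%.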

6. Practical Considerations, Scalability, and Limitations

Scalability and Evolution: TagRouter is intrinsically training-free. TAGSCORER and TAGDECIDER rely on lookup tables and threshold rules, not gradient-based optimization. When a new LLM ($M_{K+1}$) is added, it suffices to annotate roughly 1,000 sample queries for routing score estimation; TagRouter can then incorporate $M_{K+1}$ immediately, with no retraining.
Latency and Resource Requirements: TAGGENERATOR comprises a 0.5B-parameter (500MB) LLM and a 33MB embedding model. Its non-repetitive inference minimizes latency.
Cost Control: The scalar threshold $\theta$ provides a fine-grained AR vs. cost trade-off; a practical default value is reported in Chen et al. (14 Jun 2025).
Comparison to Prior Methods: Training-based routers (RouteLLM-MF/BERT, RouterBench) require full retraining for candidate pool changes. Speculative/iterative methods (FrugalGPT, FORC) induce higher latency and redundant queries. TagRouter uniquely supports proprietary or black-box LLMs, is agnostic to model counts, and handles open-domain prompts.
Limitations and Prospects: Language support is currently limited (TagGenerator is trained on English and Chinese) but extensible. Large-scale evaluation relies on LLM-as-judge labels with high agreement (measured by Cohen’s $\kappa$ with EB4.0); formal regret or PAC-style routing analyses remain open for future research.

7. Summary

TagRouter introduces a semantic tagging paradigm for ensemble LLM routing, enabling a dynamically extensible “super model” that adapts to the evolving LLM ecosystem. It achieves state-of-the-art accept rates (+6.15% relative uplift) and substantial cost savings (−17.20%) without per-candidate retraining, supporting deployment for cost-sensitive, open-domain text generation in real-world systems (Chen et al., 14 Jun 2025).


References (1)

Chen et al. (2025). TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks.
