Complexity Router: Cost-Quality Trade-Off

Updated 4 July 2026

A complexity router is a mechanism that directs inputs to different computational paths based on estimated difficulty, uncertainty, and cost, trading response quality against resource use.
Methodologies include threshold-based binary routing, uncertainty estimators, and knapsack selectors, each balancing token usage, latency, and monetary expense.
Evaluation regimes measure performance via cost–quality curves and scenario-specific metrics, addressing robustness, safety, and practical deployment challenges.

A complexity router is a routing or orchestration mechanism that assigns an input to different computational paths according to estimated difficulty, uncertainty, role, stage, or cost, with the goal of balancing quality against resources such as token usage, latency, privacy, or monetary cost. In LLM systems, this usually means sending simple queries to cheaper or smaller models and escalating difficult or safety-critical queries to stronger or more expensive models; in collaborative settings it can also mean selecting only the most relevant memory or context for each agent under a budget (Kassem et al., 20 Mar 2025, Shafran et al., 3 Jan 2025, Hu et al., 2024, Wu et al., 12 Feb 2026, Liu et al., 6 Aug 2025).

1. Formal problem formulations

A standard formulation treats routing as an $N$ -way decision over a model set $M = \{M_1, \dots, M_N\}$ , where the router chooses the model that optimizes a quality–cost trade-off. One paper states the routing objective as

$R^* = \arg\max_R \left( \lambda Q(R) - C(R) \right),$

where $Q(R)$ is response quality under routing policy $R$ , $C(R)$ is cost under $R$ , and $\lambda$ controls the trade-off (Kassem et al., 20 Mar 2025). RouterBench frames the router as a function

$R_\theta(x) \mapsto LLM_i \in L,$

and uses the cost-weighted score

$\text{performance score}_{ij} = \lambda \cdot P_{ij} - \text{cost}_j,$

with $P_{ij}$ the predicted performance of model $LLM_j$ on input $x_i$ (Hu et al., 2024).
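The cost-weighted decision rule can be illustrated with a minimal sketch. It assumes the per-model performance predictions and costs for one input are already available as plain lists; the function name `route_by_cost_weighted_score` and all numbers are hypothetical and are not taken from RouterBench.

```python
from typing import Sequence


def route_by_cost_weighted_score(
    predicted_performance: Sequence[float],
    costs: Sequence[float],
    lam: float,
) -> int:
    """Return the index j that maximizes lam * P_j - cost_j for one input."""
    if len(predicted_performance) != len(costs):
        raise ValueError("need one predicted performance and one cost per model")
    if not costs:
        raise ValueError("need at least one candidate model")
    scores = [lam * p - c for p, c in zip(predicted_performance, costs)]
    # max() keeps the first index on ties, so earlier models win ties.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical predictions and costs for a cheap, a mid-sized, and a large model.
perf = [0.60, 0.75, 0.90]
cost = [0.10, 0.30, 1.00]

choice_low = route_by_cost_weighted_score(perf, cost, lam=1.0)   # cost dominates
choice_mid = route_by_cost_weighted_score(perf, cost, lam=2.0)
choice_high = route_by_cost_weighted_score(perf, cost, lam=5.0)  # quality dominates
```

With these illustrative values, raising $\lambda$ moves the selected model from the cheapest to the most expensive, which is the willingness-to-pay behavior the score is meant to encode.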

Binary weak/strong routing is a common special case. "Rerouting LLM Routers" models a router with a scoring function $S(x)$ and threshold $\tau$,

$R(x) = \begin{cases} M_w & \text{if } S(x) < \tau, \\ M_s & \text{otherwise,} \end{cases}$

where $M_w$ and $M_s$ are weak and strong models. The same work formalizes a policy constraint on the fraction of strong-model calls,

$\Pr_x\left[ R(x) = M_s \right] \le \epsilon,$

so routing becomes a control-plane problem rather than only a classifier design problem (Shafran et al., 3 Jan 2025).
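One way to read the control-plane view is that the threshold is calibrated to respect a cap on strong-model calls rather than tuned for accuracy alone. The sketch below assumes a list of router scores on a calibration set, where a higher score favors the strong model; the helper names and the strict-inequality tie handling are illustrative choices, not the procedure of the cited paper.

```python
import math
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], max_strong_fraction: float) -> float:
    """Pick tau so that at most max_strong_fraction of scores exceed it."""
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= max_strong_fraction <= 1.0:
        raise ValueError("max_strong_fraction must lie in [0, 1]")
    ordered = sorted(scores, reverse=True)
    allowed = math.floor(max_strong_fraction * len(ordered))
    if allowed >= len(ordered):
        return -math.inf  # budget permits sending everything to the strong model
    # Scores strictly above ordered[allowed] number at most `allowed`,
    # so ties at the threshold never push the call rate over the cap.
    return ordered[allowed]


def route(score: float, tau: float) -> str:
    """Binary weak/strong decision with a strict comparison against tau."""
    return "strong" if score > tau else "weak"


# Hypothetical calibration scores and a 25% strong-call budget.
calibration = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6]
tau = calibrate_threshold(calibration, max_strong_fraction=0.25)
decisions = [route(s, tau) for s in calibration]
strong_fraction = decisions.count("strong") / len(decisions)
```

The cap holds on the calibration set only; under distribution shift or adversarial inputs the realized strong-call rate can drift, which is the integrity concern the section on robustness returns to.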

RouterXBench generalizes this into a thresholded edge–cloud routing policy over a router score $s(x)$,

$\pi_\tau(x) = \begin{cases} M_{\text{large}} & \text{if } s(x) \ge \tau, \\ M_{\text{small}} & \text{otherwise,} \end{cases}$

with realized performance

$\text{perf}_\tau(x) = \text{perf}\left( \pi_\tau(x), x \right).$

Varying $\tau$ traces a cost–performance curve $\left( \text{CR}(\tau), \text{Perf}(\tau) \right)$ over an evaluation set $D$ through the large-model call rate

$\text{CR}(\tau) = \frac{1}{|D|} \sum_{x \in D} \mathbb{1}\left[ \pi_\tau(x) = M_{\text{large}} \right]$

and the average performance

$\text{Perf}(\tau) = \frac{1}{|D|} \sum_{x \in D} \text{perf}_\tau(x),$

which makes scenario-specific operating regimes explicit (Wu et al., 12 Feb 2026).
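A threshold sweep of this kind can be sketched in a few lines. The example assumes each query has a router score (higher meaning "escalate") plus recorded per-query performance for the small and large model; the function name, the score direction, and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def cost_performance_curve(
    router_scores: Sequence[float],
    small_perf: Sequence[float],
    large_perf: Sequence[float],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (tau, large-model call rate, average performance) per threshold."""
    n = len(router_scores)
    if n == 0 or len(small_perf) != n or len(large_perf) != n:
        raise ValueError("inputs must be non-empty and the same length")
    curve = []
    for tau in thresholds:
        use_large = [score >= tau for score in router_scores]
        call_rate = sum(use_large) / n
        realized = [
            big if escalate else small
            for escalate, small, big in zip(use_large, small_perf, large_perf)
        ]
        curve.append((tau, call_rate, sum(realized) / n))
    return curve


# Toy data: the small model fails on the two queries the router scores highest.
scores = [0.1, 0.4, 0.6, 0.9]
small = [1.0, 1.0, 0.0, 0.0]
large = [1.0, 1.0, 1.0, 1.0]
curve = cost_performance_curve(scores, small, large, thresholds=[0.0, 0.5, 1.0])
```

In this toy case a 50% call rate already recovers full accuracy, which is the kind of operating point a scenario-specific metric would single out.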

Multi-agent and context-routing work casts the same idea as a constrained selection problem. RCR-Router selects a subset $C_i$ of the memory items $\mathcal{M}$ under a per-agent token budget $B_i$:

$\max_{C_i \subseteq \mathcal{M}} \sum_{m \in C_i} \text{score}(m) \quad \text{s.t.} \quad \sum_{m \in C_i} \text{tokens}(m) \le B_i.$

This is explicitly treated as a knapsack problem, with a greedy selector used for efficiency (Liu et al., 6 Aug 2025).
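A greedy selector for a token-budgeted knapsack can be sketched as follows. Ranking by score per token is a common knapsack heuristic and is an assumption here; the source does not specify the ordering RCR-Router uses, and the `MemoryItem` type and its values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MemoryItem:
    text: str
    score: float  # importance of the item for the current agent
    tokens: int   # token length of the item


def greedy_select(items: Sequence[MemoryItem], budget: int) -> List[MemoryItem]:
    """Add items in order of score density until the token budget is exhausted."""
    if any(item.tokens <= 0 for item in items):
        raise ValueError("every item needs a positive token count")
    ranked = sorted(items, key=lambda m: m.score / m.tokens, reverse=True)
    chosen: List[MemoryItem] = []
    used = 0
    for item in ranked:
        if used + item.tokens <= budget:
            chosen.append(item)
            used += item.tokens
    return chosen


# Hypothetical shared-memory entries.
memory = [
    MemoryItem("A", score=0.9, tokens=300),
    MemoryItem("B", score=0.6, tokens=100),
    MemoryItem("C", score=0.5, tokens=250),
    MemoryItem("D", score=0.2, tokens=50),
]
selected = greedy_select(memory, budget=450)
selected_names = [m.text for m in selected]
tokens_used = sum(m.tokens for m in selected)
```

Greedy selection runs in $O(n \log n)$ but is not guaranteed to be optimal for the 0/1 knapsack; it trades exactness for the efficiency the text mentions.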

2. Signals, architectures, and routing mechanisms

Existing complexity routers differ primarily in the signals they use to estimate difficulty. Preference-data-based routers learn from pairwise model comparisons. "How Robust Are Router-LLMs?" studies similarity-weighted ranking with Bradley–Terry, Matrix Factorization, a fine-tuned BERT classifier, and a causal LLM classifier trained on Arena battles and synthetic judgments; the intended prediction target is a win probability such as $P(\text{strong model wins} \mid x)$ (Kassem et al., 20 Mar 2025). RouterBench instead uses prompt embeddings with KNN or MLP predictors to estimate per-model success and then applies a cost-weighted decision rule (Hu et al., 2024).

Other routers avoid training or reduce dependence on raw output probabilities. CP-Router is training-free and model-agnostic: it extracts option probabilities from an LLM, builds conformal prediction sets with the nonconformity score

$s(x, y) = 1 - \hat{p}(y \mid x),$

forms

$\mathcal{C}(x) = \{ y : s(x, y) \le \hat{q} \}$

for a calibrated quantile $\hat{q}$, and routes by the set size $|\mathcal{C}(x)|$, typically accepting the LLM when the set is a singleton and escalating otherwise (Su et al., 26 May 2025). RouterXBench argues that internal hidden states before answer generation better capture uncertainty than output probabilities or external embeddings, and ProbeDirichlet therefore aggregates cross-layer hidden states with learned Dirichlet weights:

$h = \sum_{l=1}^{L} w_l \, h^{(l)},$

with $w \sim \text{Dirichlet}(\alpha)$ during training and $w = \alpha / \sum_{l} \alpha_l$ at inference (Wu et al., 12 Feb 2026).
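The conformal routing idea described above for multiple-choice inputs can be sketched with standard split conformal prediction. The sketch assumes the nonconformity score is one minus the probability of an option and that a held-out calibration set supplies scores for the true answers; the function names, the `"llm"`/`"lrm"` labels, and all probabilities are illustrative and not taken from CP-Router.

```python
import math
from typing import Dict, Sequence, Set


def conformal_quantile(calibration_scores: Sequence[float], alpha: float) -> float:
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points: sets contain every option
    return sorted(calibration_scores)[rank - 1]


def prediction_set(option_probs: Dict[str, float], q_hat: float) -> Set[str]:
    """Keep every option whose nonconformity score 1 - p is within q_hat."""
    return {option for option, p in option_probs.items() if 1.0 - p <= q_hat}


def route(option_probs: Dict[str, float], q_hat: float) -> str:
    """Accept the cheap model on singleton sets, escalate otherwise."""
    return "llm" if len(prediction_set(option_probs, q_hat)) == 1 else "lrm"


# Hypothetical calibration scores: 1 - p(true answer) on nine held-out items.
calibration = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70]
q_hat = conformal_quantile(calibration, alpha=0.2)

confident = {"A": 0.85, "B": 0.05, "C": 0.05, "D": 0.05}
ambiguous = {"A": 0.45, "B": 0.42, "C": 0.08, "D": 0.05}
easy_route = route(confident, q_hat)
hard_route = route(ambiguous, q_hat)
```

An empty prediction set is also escalated in this sketch, since it signals that no option is plausible at the chosen error level.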

In application-specific systems, the routing signal is often richer than generic complexity. RCR-Router scores memory items by role relevance, stage priority, and recency,

$\text{score}(m) = \text{RoleRelevance}(m) + \text{StagePriority}(m) + \text{Recency}(m),$

then greedily selects items under a strict token budget (Liu et al., 6 Aug 2025). WebRouter compresses web-agent prompts into a masked latent representation and trains with a cost-aware Variational Information Bottleneck objective that penalizes expected operational cost under the predicted routing distribution (Li et al., 13 Oct 2025). Router-Suggest uses a 768-dimensional EmbeddingGemma-300m embedding of a partially typed prefix and trains a lightweight classifier with a combined cross-entropy and expected-cost objective to choose between text-only models and VLMs (Mishra et al., 9 Jan 2026).

These designs suggest that a complexity router is less a single architecture than a family of decision layers: some are calibrated classifiers over query text, some are uncertainty estimators, and some are budgeted selectors over structured memory, tools, or experts.

3. Evaluation regimes and benchmark design

Evaluation has become a central research problem because router quality depends strongly on the operating regime. RouterBench introduces a cost–quality framework based on a dataset of over 405k inference outcomes from representative LLMs and evaluates routers in the cost–quality plane using the non-decreasing convex hull (NDCH) and AIQ, a normalized area-under-frontier metric. Its Oracle baseline routes every input to the best-performing LLM, with ties broken by the cheapest model (Hu et al., 2024).

RouterXBench argues that router evaluation should be split into three dimensions: router ability, scenario alignment, and cross-domain robustness. It uses AUROC for threshold-independent router ability, then defines deployment-sensitive metrics over the cost–performance curve: Low-band Performance Mean (LPM) for budget-sensitive use, High-band Call Rate (HCR) for accuracy-critical use, and Mid-band Performance Mean (MPM) for the trade-off region between them (Wu et al., 12 Feb 2026). This separates discrimination skill from threshold choice.
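Threshold-independent router ability can be made concrete with a small AUROC computation. The sketch assumes a label of 1 marks a query that should be escalated (for example, because the small model fails on it) and that a higher router score means "escalate"; this labeling convention and the toy data are assumptions, not RouterXBench's definitions.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. The pairwise loop is O(n^2), which is fine
    for a sketch but not for large evaluation sets.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy router scores; label 1 = query that should have been escalated.
router_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
should_escalate = [1, 0, 1, 0, 0]
ability = auroc(router_scores, should_escalate)
```

Because AUROC depends only on the ranking of scores, it is unchanged by any choice of threshold, which is why it can be reported separately from the band-specific metrics.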

DSC broadens evaluation further by making the failure surface explicit. It includes coding, translation, mathematics, human instructions, general knowledge, privacy, safety, and jailbreaking, and reports category-wise strong-model routing rates

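$\text{StrongRate}(c) = \frac{\#\{\text{queries in category } c \text{ routed to the strong model}\}}{\#\{\text{queries in category } c\}}$

and a keyword-sensitivity Flipping Rate

$\text{FR} = \frac{\#\{\text{queries whose routing decision changes after keyword insertion}\}}{\#\{\text{queries}\}}.$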

This exposes behavior that aggregate cost–accuracy numbers can hide, especially in safety-sensitive categories (Kassem et al., 20 Mar 2025).
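Both diagnostics are simple to compute once a router is available as a callable. The sketch below uses a deliberately naive keyword-triggered toy router to show how category shortcuts surface in the metrics; the router, the choice to append the keyword to the query, and the example queries are all hypothetical and do not reproduce the DSC setup.

```python
from typing import Callable, Sequence

Router = Callable[[str], str]


def strong_rate(router: Router, queries: Sequence[str]) -> float:
    """Fraction of queries the router sends to the strong model."""
    return sum(router(q) == "strong" for q in queries) / len(queries)


def flipping_rate(router: Router, queries: Sequence[str], keyword: str) -> float:
    """Fraction of queries whose decision changes when a keyword is appended."""
    flips = sum(router(q) != router(f"{q} {keyword}") for q in queries)
    return flips / len(queries)


def toy_router(query: str) -> str:
    """Hypothetical router that keys on surface vocabulary, not difficulty."""
    triggers = ("equation", "python")
    return "strong" if any(t in query.lower() for t in triggers) else "weak"


queries = [
    "What is the capital of France?",
    "Write a haiku about rain.",
    "Solve the equation x + 1 = 2.",
]
rate = strong_rate(toy_router, queries)
flip = flipping_rate(toy_router, queries, keyword="equation")
```

Here the trivially easy arithmetic query is escalated and two unrelated queries flip once the keyword is added, mirroring the category-driven behavior that aggregate cost–accuracy numbers would not reveal.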

Collaborative systems introduce another layer of evaluation. RCR-Router supplements standard QA metrics with an Answer Quality Score (AQS), implemented by an LLM judge returning a score in $Q(R)$ 9, because token savings alone do not capture whether routed context preserves explanation quality (Liu et al., 6 Aug 2025). Router-Suggest uses Partial-F1, Trigger Rate, and Typing Effort Saved rather than generic generation metrics, because the task is inline multimodal auto-completion rather than stand-alone answering (Mishra et al., 9 Jan 2026).

Taken together, these frameworks replace single-number router evaluation with frontier analysis, scenario slices, robustness tests, and task-aligned quality measures.

4. Robustness, safety, and failure modes

A major line of work shows that many routers do not estimate complexity in a robust sense. DSC finds category-driven misrouting: the MF router sends MT-Bench Math, SimpleCode, and SimpleMath to the strong model at 100%, and the BERT router sends SimpleCode at 100%, LeetCode at 100%, and SimpleMath at 98%, even though these subsets were intentionally simple. The same study reports that adding math or coding keywords causes routing decisions to flip with an average $R$ 0, indicating extreme prompt sensitivity rather than stable complexity estimation (Kassem et al., 20 Mar 2025).

Safety failures are more severe because they couple routing to model vulnerability. On AdvBench, the MF and BERT routers route harmful prompts predominantly to the weak LLM, with $R$ 1 and $R$ 2, while Mistral-7B shows 100% ASR across plain, GCG, and PAP attacks and GPT-4o shows 0% ASR on plain text and GCG but 60% ASR on PAP (Kassem et al., 20 Mar 2025). The same paper also reports that among harmful queries routed to weak LLMs, 45 of 48 were highly similar to training samples routed to the weak LLM, which it interprets as evidence of data-induced backdoor-like behavior.

"Rerouting LLM Routers" sharpens the threat model by defining LLM control plane integrity and introducing query-independent confounder gadgets: short token sequences that, when prefixed or suffixed to any query, push the router toward the strong model. In white-box settings on Pair 1, universal gadgets achieve 100% upgrade on MT-Bench for R_SW, R_MF, and R_CLS, and 98–100% upgrade on GSM8K for R_SW, R_MF, and R_CLS, with downgrade rates approximately 0% (Shafran et al., 3 Jan 2025). The same work shows that low-perplexity gadgets remain effective, so perplexity-based filtering is not a robust defense.

Commercial systems are affected as well. In the same study, gadgets increase Unify cost from $R$ 3 to $R$ 4– $R$ 5, a $R$ 6 increase, and can reroute OpenRouter traffic so that cost rises from $R$ 7 to $R$ 8, approximately $R$ 9 (Shafran et al., 3 Jan 2025). These attacks preserve answer quality on the measured benchmarks, so they function as control-plane arbitrage or cost-inflation mechanisms rather than ordinary prompt attacks.

The literature therefore identifies several recurring failure modes: category heuristics replacing true complexity estimation, brittle global thresholds, safety-blind routing to vulnerable models, and adversarial manipulation of the control layer itself.

5. Specialized complexity routers

Specialized routers instantiate the same core principle under domain-specific constraints: structured memory for multi-agent systems, uncertainty-calibrated escalation to reasoning models, multimodal latency budgets, or web-agent operating costs.

| System | Routing basis | Reported outcome |
| --- | --- | --- |
| RCR-Router | Role, stage, recency, token budget | Tokens reduced by 11–47%; HotPotQA F1 82.4 at 3.77K tokens |
| CP-Router | Conformal set size with FBE-selected $\alpha$ | Reduces token usage while maintaining or improving accuracy |
| Router-Suggest | Cost-sensitive routing between TAC and VLMs | 2.3x to 10x speedup over the best-performing VLM |
| WebRouter | Query-specific ca-VIB routing for web agents | 87.8% cost reduction vs GPT-4o with 3.8% accuracy drop |

RCR-Router replaces static or full-context routing in multi-agent LLM systems with a role- and stage-conditioned knapsack selector over a structured shared memory store. On HotPotQA, Full-Context uses 5.10K tokens with AQS 4.17 and F1 73.7, whereas RCR-Router uses 3.77K tokens with AQS 4.91 and F1 82.4; on 2WikiMultihop it uses 1.24K tokens versus 2.34K for Full-Context, with AQS 4.83 and F1 80.8 (Liu et al., 6 Aug 2025). The abstract summarizes the token reduction as “up to 30%,” while Table 2 reports reductions up to approximately 47.0% on 2WikiMultihop.
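The token figures quoted above can be checked with a short calculation. The helper name is illustrative; the inputs are the Full-Context and RCR-Router token counts (in thousands) stated in the preceding paragraph.

```python
def token_reduction_percent(full_context_tokens: float, routed_tokens: float) -> float:
    """Relative token saving of routed context versus full context, in percent."""
    return 100.0 * (full_context_tokens - routed_tokens) / full_context_tokens


hotpotqa = token_reduction_percent(5.10, 3.77)       # about 26.1%
two_wiki = token_reduction_percent(2.34, 1.24)       # about 47.0%
```

The 2WikiMultihop figure reproduces the roughly 47.0% reduction attributed to Table 2, while the HotPotQA reduction of about 26.1% falls within the "up to 30%" summary from the abstract.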

CP-Router addresses the specific problem that Large Reasoning Models often overthink simple prompts. It calibrates an LLM with conformal prediction, then routes to the LRM only when uncertainty is high. In the Llama-3.1-8B plus DeepSeek-R1-Distill-Llama-8B pairing, it reaches 60.0% accuracy on College Math with 8.8% TRR, compared to 58.8% for the LRM alone; in the Qwen-2.5-14B plus DeepSeek-R1-Distill-Qwen-14B pairing, it reaches 84.7% on High School Math with 7.3% TRR, compared to 84.3% for the LRM (Su et al., 26 May 2025).

Router-Suggest formulates multimodal auto-completion as a cost-sensitive routing problem between fast text-only models and slower VLMs. On MMDD, Router-4-L reaches PR-F1 0.240 at 0.351 s, compared with MiniCPM-V at PR-F1 0.247 and 2.080 s; Router-2-L reaches PR-F1 0.240 at 0.170 s, approximately 10x faster than MiniCPM-V (Mishra et al., 9 Jan 2026). The task-specific lesson is that complexity can depend on whether visual grounding materially changes the completion.

WebRouter treats web-agent prompts as noisy, long, query-specific objects and learns a compressed representation that is sufficient for routing but explicitly cost-sensitive. Averaged over five WebVoyager websites, BrowserUse+GPT-4o achieves 86.1% accuracy at $C(R)$ 10.18, and WebRouter achieves 82.3% at $0.12, which is the basis for the reported 87.8% cost reduction relative to GPT-4o with a 3.8% absolute accuracy drop (Li et al., 13 Oct 2025).

These systems show that complexity routing is not limited to choosing between two chat models. It also governs memory exposure, reasoning depth, modality use, and agent operating budgets.

The phrase also appears in other routing literatures, where “complexity” refers not to query difficulty but to routing-table growth, hardware overhead, or combinatorial path structure. In "Two Dimensional Router: Design and Implementation," separating TCAM and SRAM transforms the worst-case TCAM explosion

$O(N \times M)$

into linear TCAM usage

$O(N + M),$

where $N$ and $M$ are the numbers of destination and source prefixes, while keeping lookup time at one TCAM cycle plus three SRAM cycles (Yang et al., 2019). In waypoint routing, the “complexity landscape” depends sharply on whether graphs are directed or undirected, whether waypoints conserve flow, and whether the number of waypoints is fixed; exact polynomial-time algorithms coexist with strong NP-completeness results (Amiri et al., 2017).

At the microarchitectural level, "A Ring Router Microarchitecture for NoCs" replaces the conventional $C(R)$4 crossbar with a ring of exchanges, reducing latency, area, and power by 53%, 34%, and 27%, respectively, compared to the conventional design (Wu, 2020). In quantum and analog computing, router architectures shift complexity into hardware-level interaction control: the superconducting quantum router realizes one-step three-qubit GHZ preparation at 86.58(8)% fidelity and native three-qubit gates in approximately 40–50 ns (Wu et al., 16 Apr 2026), while the linear-optical quantum router uses a control photonic qubit to direct a signal photonic qubit coherently between output modes with heralded success probability

$C(R)$5

in the fixed-$C(R)$6 design (Lemr et al., 2013). Rotor-router networks, finally, present a deterministic routing model in which the odometer certificate can be verified in $C(R)$7 time, substantially faster than direct simulation (Propp, 2010).

This broader record suggests a recurring abstraction: a complexity router reorganizes a routing problem so that an otherwise explosive search, memory, or control space is replaced by a structured decision layer. In LLM systems that layer is a classifier, uncertainty estimator, or budgeted selector; in networking and hardware it is a table decomposition, topology transformation, or programmable interaction fabric.