Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

Published 8 May 2026 in cs.LG, cs.AI, and cs.CL | (2605.07395v1)

Abstract: Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling": queries that no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.


Authors (2)

Summary

The paper identifies that evaluation artifacts, not intrinsic model limitations, largely inflate unsolvability ceilings in multi-LLM routing.
It employs a dual-judge framework across over 206,000 query-model pairs from six benchmarks to distinguish genuine unsolvability from protocol-induced errors.
Artifacts such as truncation and output format mismatches misguide router training, causing majority-class collapse and loss of 13–17 pp routing gains.

Unsolvability Ceiling in Multi-LLM Routing: Deconstructing Evaluation Artifacts

Overview

The paper conducts an extensive empirical analysis of multi-LLM routing for cost-effective deployment, focusing on the so-called "unsolvability ceiling": the proportion of queries unsolvable by any model in a pool. Evaluating over 206,000 query-model pairs across six benchmarks and leveraging the Gemma 4 and Llama 3.1 families, the authors demonstrate that a substantial part of reported unsolvability is not due to fundamental model limits but to artifacts induced by the evaluation protocols. The work presents a decomposition framework that distinguishes genuine unsolvability from three dominant artifact types: evaluation misalignment, truncation, and output format mismatch. They further show how these artifacts misguide router training, resulting in high opportunity costs and unreliable headroom estimates, with strong implications for multi-LLM infrastructure in both research and deployment contexts.
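To make the definition concrete, here is a minimal sketch of how the unsolvability ceiling could be computed from per-model pass/fail flags. The function name, the model names, and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def unsolvability_ceiling(solved: Dict[str, List[bool]]) -> float:
    """Fraction of queries that no model in the pool solves.

    `solved` maps a model name to per-query pass/fail flags,
    with all lists in the same query order.
    """
    per_query = list(zip(*solved.values()))
    if not per_query:
        raise ValueError("no queries to evaluate")
    unsolved = sum(1 for flags in per_query if not any(flags))
    return unsolved / len(per_query)


# Hypothetical toy pool: 5 queries, 3 tiers (placeholder data).
pool = {
    "small":  [True, False, False, False, True],
    "medium": [True, True,  False, False, True],
    "large":  [True, True,  True,  False, True],
}
ceiling = unsolvability_ceiling(pool)
print(f"unsolvability ceiling: {ceiling:.0%}")
```

Only the fourth query fails on every tier in this toy pool, so the ceiling is one in five. Any artifact that flips a pass to a fail for every tier at once (for example, a shared truncation budget) raises this number without any change in model capability.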

Empirical Framework and Experimental Design

The study systematically evaluates four tiers of Gemma 4 (from 2B dense to 31B dense, including a 26B-A4B MoE whose active compute matches that of the 4B dense model) on dedicated H100s, and cross-validates findings using Llama 3.1-8B/70B under strictly greedy, non-thinking-mode decoding. Six diverse benchmarks are analyzed: MMLU, MedQA, Alpaca, ShareGPT, HumanEval, and MBPP, yielding a total of 206,756 query-model pairs.

Evaluation relies on both LLM-as-a-judge (Gemma 4 26B-A4B, 0/1/2 rubric) and exact-match metrics (for MMLU and MedQA). This dual-judge methodology is essential, as it enables attribution of observed failure rates to specific artifact classes, rather than to fundamental modeling deficits.
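One way to picture the attribution step is to bucket each response by whether the two signals agree. The sketch below assumes that only the top rubric score (2) counts as a judge pass; the summary does not state the paper's threshold, so `JUDGE_PASS` and the records are hypothetical.

```python
from collections import Counter

JUDGE_PASS = 2  # assumption: only the top 0/1/2 rubric score counts as solved


def agreement_bucket(judge_score: int, exact_match: bool) -> str:
    """Classify one response by judge / exact-match agreement."""
    judge_ok = judge_score >= JUDGE_PASS
    if judge_ok and exact_match:
        return "both_pass"
    if not judge_ok and not exact_match:
        return "both_fail"
    return "judge_only" if judge_ok else "exact_only"


# Hypothetical (judge_score, exact_match) records.
records = [(2, True), (0, True), (2, False), (1, True), (0, False)]
buckets = Counter(agreement_bucket(s, m) for s, m in records)
print(dict(buckets))
```

Responses in `exact_only` are candidates for judge under-rating (correct but terse or oddly formatted), while `judge_only` responses are candidates for over-rating (fluent but wrong).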

Artifact Decomposition and Unsolvability Inflation

Artifact 1: Evaluation Misalignment

Systematic divergence between LLM-judge scoring and exact-match criteria is observed, with directionality dependent on task. On MMLU, the judge consistently underrates model performance—up to 23.8 percentage points (pp) at 31B; for MedQA, large models are overrated by the judge relative to exact-match by 5–6 pp. The judge rewards fluency, reasoning presentation, and structure, occasionally granting partial or full credit to factually incorrect but well-argued outputs, while penalizing terse or nonstandard valid responses. This creates an artificial gap between measured and true unsolvability on both ends.
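The direction of the misalignment can be summarized as a signed gap in percentage points. The following is a minimal sketch with made-up flags; the sign convention (negative means the judge underrates) is an assumption chosen for illustration.

```python
from typing import Sequence


def accuracy(flags: Sequence[bool]) -> float:
    """Share of True flags."""
    return sum(flags) / len(flags)


def judge_gap_pp(judge_pass: Sequence[bool], exact_pass: Sequence[bool]) -> float:
    """Judge accuracy minus exact-match accuracy, in percentage points.

    Negative values mean the judge underrates the model relative to
    exact-match; positive values mean it overrates.
    """
    return 100.0 * (accuracy(judge_pass) - accuracy(exact_pass))


# Hypothetical per-query outcomes for one model on one benchmark.
judge_pass = [True, False, False, True]
exact_pass = [True, True, False, True]
gap = judge_gap_pp(judge_pass, exact_pass)
print(f"judge minus exact-match: {gap:+.1f} pp")
```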

Artifact 2: Truncation

High truncation rates under fixed generation budgets (65% for MMLU, 57% for MedQA) cause substantial output incompleteness, especially for knowledge-heavy tasks or smaller models that require more tokens to respond. Truncated responses not only score artificially low with the judge but also fail exact-match extraction, thus compounding unsolvability inflation and suppressing upper-bound estimates for routing gain.
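A truncation rate of this kind can be estimated from generation metadata. The sketch below assumes each generation records its token count and whether it ended with an end-of-sequence token; the `Generation` type, its field names, and the 256-token budget are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Generation:
    n_tokens: int          # number of generated tokens
    ended_with_eos: bool   # True if the model stopped on its own


def is_truncated(gen: Generation, max_new_tokens: int) -> bool:
    """A response is truncated if it used the full budget without stopping."""
    return gen.n_tokens >= max_new_tokens and not gen.ended_with_eos


def truncation_rate(gens: List[Generation], max_new_tokens: int) -> float:
    if not gens:
        raise ValueError("no generations to evaluate")
    return sum(is_truncated(g, max_new_tokens) for g in gens) / len(gens)


# Hypothetical generations under a 256-token budget.
gens = [
    Generation(256, False),  # cut off mid-answer
    Generation(120, True),   # finished early
    Generation(256, True),   # finished exactly at the budget
    Generation(256, False),  # cut off mid-answer
]
rate = truncation_rate(gens, max_new_tokens=256)
print(f"truncation rate: {rate:.0%}")
```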

Artifact 3: Output Format Mismatch

Automatic extraction of answers (particularly answer letters in multiple-choice formats) fails for 5.4–12.4% of MMLU responses and 0.6–5.2% of MedQA, with higher rates for smaller models. This disproportionately inflates apparent unsolvability and compresses performance gradients between tiers.
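Extraction failures of this kind are easy to reproduce with a simple pattern-based extractor. The patterns below are illustrative assumptions, not the paper's extraction rules; they accept phrasings such as "The answer is B." and bare letters such as "(C)", and return `None` for anything else.

```python
import re
from typing import List, Optional

# Assumed patterns for four-option multiple-choice answers.
_PATTERNS = [
    re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?\b"),
    re.compile(r"^\s*\(?([A-D])\)?[.:)]?\s*$"),
]


def extract_choice(text: str) -> Optional[str]:
    """Return the answer letter, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def extraction_failure_rate(responses: List[str]) -> float:
    return sum(extract_choice(r) is None for r in responses) / len(responses)


# Hypothetical responses.
responses = [
    "The answer is B.",
    "(C)",
    "I think the second option is most plausible because",
    "Answer: D",
]
failure_rate = extraction_failure_rate(responses)
print(f"extraction failures: {failure_rate:.0%}")
```

The third response may well be heading toward a correct choice, but it never names a letter, so exact-match scoring records it as a failure regardless of the model's underlying knowledge.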

Dual-Judge Validation and Corrected Ceiling Estimates

By grounding LLM-judge metrics with exact-match, the authors reveal strong bidirectional distortions. For MMLU, the judge underestimates correct answers by up to 23.8 pp at 31B, obscuring an actual routing gain of +28.9 pp (exact-match) compared to +15.9 pp (judge-derived). For MedQA, judge overestimation inflates the routing gain by 10.3 pp. The aggregate opportunity cost attributable to these artifacts is measured at 13–17 pp for knowledge-intensive tasks.

Cross-family analysis with Llama 3.1 and DeepSeek-Chat demonstrates that the knowledge task ceiling is model-training-specific, while for code generation and conversational tasks (e.g., ShareGPT), unsolvability is nearly universal and family-agnostic.

Router Training and the Collapse Phenomenon

Standard routers trained on artifact-inflated labels collapse to the majority class due to severe label imbalance (e.g., E2B optimal for 79.3% of queries). All three evaluated router architectures—logistic regression, MLP, DistilBERT—have near-zero recall on minority classes and fail to make effective tier-based routing decisions. Random-feature and shuffled-label controls confirm this is a byproduct of the oracle label marginal, not model expressivity or training pathologies.
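The collapse is easiest to see by scoring a router that always predicts the majority tier. In the sketch below, the 79% share for E2B follows the figure above, but the split among the remaining tiers and the "4B" tier label are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, List


def per_class_recall(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Recall for each class present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Oracle labels: 79% E2B; the minority split is a made-up placeholder.
labels = ["E2B"] * 79 + ["4B"] * 10 + ["26B-A4B"] * 6 + ["31B"] * 5

# A collapsed router predicts the majority class for every query.
majority = Counter(labels).most_common(1)[0][0]
preds = [majority] * len(labels)

acc = sum(t == p for t, p in zip(labels, preds)) / len(labels)
recall = per_class_recall(labels, preds)
macro_recall = sum(recall.values()) / len(recall)
print(f"accuracy={acc:.2f} macro_recall={macro_recall:.2f}")
```

Accuracy looks respectable because it equals the majority share, while macro recall sits at one over the number of tiers. A trained router whose numbers match this baseline, including under shuffled labels or random features, has learned the label marginal rather than query difficulty.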

A balanced classifier (cost-sensitive objective) partially mitigates the collapse, confirming that the routing challenge is not fundamental but protocol-induced. The dominant routing signal in non-cost-sensitive routers degenerates into length-based triage, lacking true difficulty discrimination. This is empirically linked to a missed routing gain of 13–17 pp across knowledge-intensive settings, attributable to misaligned oracle labels.
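One common cost-sensitive choice is inverse-frequency class weighting, where class c receives weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. The sketch below applies that heuristic; it is a standard technique swapped in for illustration, not necessarily the paper's exact objective, and the label counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: N / (K * n_c) for each class c."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical oracle labels with a 79% majority tier.
labels = ["E2B"] * 79 + ["4B"] * 10 + ["26B-A4B"] * 6 + ["31B"] * 5
weights = balanced_class_weights(labels)
for tier, w in weights.items():
    print(f"{tier:8s} weight={w:.3f}")
```

With these weights each tier contributes equally to the total loss, so a misrouted 31B query costs far more than a misrouted E2B query and always predicting the majority tier is no longer the cheapest solution.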

Latency, Cost, and Route Selection

Latency analysis reveals that the 26B-A4B MoE tier is a Pareto-optimal routing target within the Gemma 4 family: it offers a significant quality gain over its dense compute equivalents with near-identical inference cost and tighter tail latency, justifying its use for all but the hardest queries, which uniquely require the 31B dense model (only 5.5% of queries).
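Pareto optimality over cost and quality can be checked mechanically. In the sketch below, every (cost, quality) pair is a made-up placeholder chosen only to mirror the qualitative claim above (the MoE tier costs about the same as the 4B dense tier but scores higher); none of the numbers come from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (relative cost, quality)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and better on at least one."""
    (cost_a, qual_a), (cost_b, qual_b) = a, b
    return cost_a <= cost_b and qual_a >= qual_b and (cost_a < cost_b or qual_a > qual_b)


def pareto_frontier(tiers: Dict[str, Point]) -> List[str]:
    """Names of tiers that no other tier dominates."""
    return [
        name
        for name, point in tiers.items()
        if not any(dominates(other, point) for other in tiers.values())
    ]


# Placeholder values, NOT measurements from the paper.
tiers = {
    "E2B": (1.0, 0.60),
    "4B": (2.0, 0.66),
    "26B-A4B": (2.0, 0.78),
    "31B": (8.0, 0.82),
}
frontier = pareto_frontier(tiers)
print(frontier)
```

Under these placeholder values the 4B dense tier drops off the frontier because the MoE tier matches its cost at higher quality.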

Recommendations and Implications

Evaluation Protocols

Dual-judge validation using both LLM-as-a-judge and task-specific exact-match is essential for reliable unsolvability and routing gain estimation.
Truncation-aware budgets should keep response truncation below 5%, with rates explicitly reported.
Oracle label construction for router training must align with deployment objectives; if downstream tasks require terminal answer correctness, exact-match should be prioritized.
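The truncation recommendation can be turned into a budget-selection step. The sketch below assumes response lengths were first measured under a generous budget, then picks the smallest candidate budget that keeps the truncation rate below the target; the lengths and candidate budgets are hypothetical.

```python
from typing import List, Optional


def smallest_budget(
    lengths: List[int], candidates: List[int], max_rate: float = 0.05
) -> Optional[int]:
    """Smallest candidate budget whose truncation rate is below max_rate.

    `lengths` are token counts of complete (untruncated) responses.
    Returns None if no candidate is large enough.
    """
    if not lengths:
        raise ValueError("no response lengths provided")
    for budget in sorted(candidates):
        rate = sum(1 for n in lengths if n > budget) / len(lengths)
        if rate < max_rate:
            return budget
    return None


# Hypothetical length distribution: 20 responses.
lengths = [100] * 10 + [300] * 8 + [900] * 2
budget = smallest_budget(lengths, candidates=[256, 512, 1024])
print(f"selected budget: {budget} tokens")
```

Here a 512-token budget would still truncate 10% of responses, so the 5% target pushes the choice to 1024. Reporting the measured rate next to the chosen budget makes the trade-off auditable.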

Router Architectures

Cost-sensitive objectives (e.g., class weighting, minority oversampling) should be standard to avoid majority-class collapse.
Domain-detection as a pre-filter improves routing on knowledge-intensive tasks by circumventing label skew.
Oracle label auditing for evaluation misalignment can reduce the opportunity cost associated with artifact-induced headroom inflation.
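An oracle label audit can be sketched as follows: build the label (the cheapest tier that solves the query) once from judge outcomes and once from exact-match outcomes, then count how often the two disagree. The tier names, their cost ordering, and the toy rows are assumptions for illustration.

```python
from typing import Dict, List, Optional

# Assumed ordering, cheapest first; "4B" is a placeholder tier name.
TIERS = ["E2B", "4B", "26B-A4B", "31B"]


def oracle_label(solved_by_tier: Dict[str, bool]) -> Optional[str]:
    """Cheapest tier that solves the query, or None if no tier does."""
    for tier in TIERS:
        if solved_by_tier.get(tier, False):
            return tier
    return None


def label_flip_rate(
    judge_rows: List[Dict[str, bool]], exact_rows: List[Dict[str, bool]]
) -> float:
    """Share of queries whose oracle label changes with the label source."""
    flips = sum(oracle_label(j) != oracle_label(e) for j, e in zip(judge_rows, exact_rows))
    return flips / len(judge_rows)


# Hypothetical per-query outcomes from the two label sources.
judge_rows = [{"E2B": True}, {"E2B": False, "31B": True}, {}, {"4B": True}]
exact_rows = [{"E2B": True}, {"E2B": True}, {"26B-A4B": True}, {"4B": True}]
flip_rate = label_flip_rate(judge_rows, exact_rows)
print(f"oracle label flip rate: {flip_rate:.0%}")
```

A high flip rate signals that the router's training target depends more on the scoring protocol than on the queries themselves.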

Theoretical and Practical Implications

The work exposes fundamental methodological flaws in widely used multi-tier LLM routing evaluations. Assumptions regarding routing recoverability and cost-quality trade-offs are shown to be fragile under artifact-free measurement, casting doubt on headline cost reduction claims from earlier work (Chen et al., 2023; Lu et al., 2025). The artifact decomposition and correction framework should inform the design of future benchmarks, competition leaderboards, and modeling protocols. Future directions include artifact-robust metric designs, unified judge-exact-match meta-labels, and adaptive budget allocation.

Conclusion

This large-scale empirical study provides a detailed taxonomy and quantification of evaluation artifacts that inflate unsolvability ceilings in multi-LLM routing. The findings demonstrate that evaluation misalignment, truncation, and format mismatch—rather than core model limitations—govern much of the measured routing opportunity. These artifacts fundamentally distort both evaluation and training of routing models, resulting in substantial real-world opportunity costs and unreliable infrastructure-level conclusions. The recommended evaluation and router training protocols are essential for robust, artifact-resilient deployment and future algorithmic progress in multi-LLM systems.

References:

"FrugalGPT: How to Use LLMs While Reducing Cost and Improving Performance" (Chen et al., 2023)
"RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers" (Lu et al., 2025)
"RouteLLM: Learning to Route LLMs with Preference Data" (Ong et al., 2025)
"MoDEM: Mixture of Domain Expert Models" (Simonds et al., 2024)
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., 2023)