- The paper identifies that evaluation artifacts, not intrinsic model limitations, largely inflate unsolvability ceilings in multi-LLM routing.
- It employs a dual-judge framework across over 206,000 query-model pairs from six benchmarks to distinguish genuine unsolvability from protocol-induced errors.
- Artifacts such as truncation and output format mismatches misguide router training, causing majority-class collapse and loss of 13–17 pp routing gains.
Unsolvability Ceiling in Multi-LLM Routing: Deconstructing Evaluation Artifacts
Overview
The paper conducts an extensive empirical analysis of multi-LLM routing for cost-effective deployment, focusing on the "unsolvability ceiling": the proportion of queries that no model in a pool can solve. Evaluating over 206,000 query-model pairs across six benchmarks with the Gemma 4 and Llama 3.1 families, the authors demonstrate that a substantial part of reported unsolvability stems not from fundamental model limits but from artifacts induced by the evaluation protocols. The work presents a decomposition framework that distinguishes genuine unsolvability from three dominant artifact types: evaluation misalignment, truncation, and output format mismatch. The authors further show how these artifacts misguide router training, resulting in high opportunity costs and unreliable headroom estimates, with strong implications for multi-LLM infrastructure in both research and deployment contexts.
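To make the definition concrete, the ceiling can be computed from per-pair correctness labels. The sketch below is illustrative only: the function name, the `(query_id, model_name, solved)` record layout, and the toy data are assumptions, not the paper's code or results.

```python
from collections import defaultdict

def unsolvability_ceiling(results):
    """Fraction of queries that no model in the pool solved.

    results: iterable of (query_id, model_name, solved) tuples
    (hypothetical layout chosen for this sketch).
    """
    solved_by_any = defaultdict(bool)
    for query_id, _model, solved in results:
        solved_by_any[query_id] = solved_by_any[query_id] or solved
    if not solved_by_any:
        raise ValueError("no results supplied")
    unsolved = sum(1 for ok in solved_by_any.values() if not ok)
    return unsolved / len(solved_by_any)

# Toy data: only q2 is failed by every model in the pool.
results = [
    ("q1", "small", False), ("q1", "large", True),
    ("q2", "small", False), ("q2", "large", False),
    ("q3", "small", True),  ("q3", "large", True),
    ("q4", "small", False), ("q4", "large", True),
]
ceiling = unsolvability_ceiling(results)
print(f"unsolvability ceiling: {ceiling:.2f}")
```

Because the ceiling depends entirely on the `solved` labels, any artifact that flips a correct response to "unsolved" for every model raises it directly.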
Empirical Framework and Experimental Design
The study systematically evaluates four tiers of Gemma 4 (2B dense to 31B dense, including a 26B-A4B MoE whose active compute matches that of the 4B dense model) on dedicated H100s, and cross-validates findings using Llama 3.1-8B/70B under strictly greedy, non-thinking-mode decoding. Six diverse benchmarks are analyzed: MMLU, MedQA, Alpaca, ShareGPT, HumanEval, and MBPP, yielding a total of 206,756 query-model pairs.
Evaluation relies on both LLM-as-a-judge (Gemma 4 26B-A4B, 0/1/2 rubric) and exact-match metrics (for MMLU and MedQA). This dual-judge methodology is essential because it allows observed failure rates to be attributed to specific artifact classes rather than to fundamental modeling deficits.
Artifact Decomposition and Unsolvability Inflation
Artifact 1: Evaluation Misalignment
LLM-judge scoring diverges systematically from exact-match criteria, in a direction that depends on the task. On MMLU, the judge consistently underrates model performance, by up to 23.8 percentage points (pp) at 31B; on MedQA, the judge overrates large models relative to exact-match by 5–6 pp. The judge rewards fluency, reasoning presentation, and structure, occasionally granting partial or full credit to factually incorrect but well-argued outputs while penalizing terse or nonstandard valid responses. This creates an artificial gap between measured and true unsolvability in both directions.
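A minimal sketch of how such disagreement could be tallied is shown below. It assumes each response carries a judge score on the 0/1/2 rubric and a boolean exact-match flag, and that only a score of 2 counts as correct; that threshold, the function names, and the toy records are assumptions made for illustration.

```python
from collections import Counter

def compare_judges(records, pass_score=2):
    """Tally agreement between an LLM judge and exact-match.

    records: list of (judge_score, exact_match) pairs; pass_score is
    an assumed threshold for treating a judge score as "correct".
    """
    counts = Counter()
    for judge_score, exact_match in records:
        judge_ok = judge_score >= pass_score
        if judge_ok and exact_match:
            counts["agree_correct"] += 1
        elif exact_match:
            counts["judge_underrates"] += 1   # right answer, judge says no
        elif judge_ok:
            counts["judge_overrates"] += 1    # wrong answer, judge says yes
        else:
            counts["agree_incorrect"] += 1
    return counts

def judge_minus_exact_pp(records, pass_score=2):
    """Judge accuracy minus exact-match accuracy, in percentage points."""
    judge_hits = sum(1 for score, _ in records if score >= pass_score)
    exact_hits = sum(1 for _, em in records if em)
    return 100 * (judge_hits - exact_hits) / len(records)

# Toy records, not data from the paper.
records = [(2, True), (1, True), (0, True), (2, False), (0, False)]
counts = compare_judges(records)
gap = judge_minus_exact_pp(records)
print(dict(counts), gap)
```

A negative gap corresponds to the MMLU pattern (judge underrates), a positive gap to the MedQA pattern (judge overrates).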
Artifact 2: Truncation
High truncation rates under fixed generation budgets (65% for MMLU, 57% for MedQA) cause substantial output incompleteness, especially for knowledge-heavy tasks or smaller models that require more tokens to respond. Truncated responses not only score artificially low with the judge but also fail exact-match extraction, thus compounding unsolvability inflation and suppressing upper-bound estimates for routing gain.
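The truncation rate at a given budget, and the smallest budget that keeps it under a target, can be estimated from response lengths measured without a tight cap. The following is a rough sketch under that assumption; the function names and the toy length distribution are hypothetical.

```python
def truncation_rate(natural_lengths, budget):
    """Fraction of responses whose uncapped length exceeds the token budget."""
    over = sum(1 for n in natural_lengths if n > budget)
    return over / len(natural_lengths)

def budget_for_target(natural_lengths, target_rate=0.05):
    """Smallest observed length usable as a budget with truncation <= target."""
    for budget in sorted(set(natural_lengths)):
        if truncation_rate(natural_lengths, budget) <= target_rate:
            return budget
    return max(natural_lengths)  # unreachable in practice: max length truncates nothing

# Toy token counts (made up): most responses are short, a tail is long.
lengths = [120] * 90 + [300] * 6 + [800] * 4
rate_at_256 = truncation_rate(lengths, 256)
needed = budget_for_target(lengths, target_rate=0.05)
print(rate_at_256, needed)
```

Reporting `truncation_rate` alongside accuracy makes it visible when a low score reflects the budget rather than the model.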
Artifact 3: Output Format Mismatch
Automatic extraction of answers (particularly answer letters in multiple-choice formats) fails for 5.4–12.4% of MMLU responses and 0.6–5.2% of MedQA responses, with higher rates for smaller models. This disproportionately inflates apparent unsolvability and compresses performance gradients between tiers.
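The extraction step can be pictured as a small set of regular expressions, with any response that matches none of them counted as an extraction failure. The patterns below are assumptions chosen for illustration (four options A to D, two common phrasings); the paper's actual extractor is not specified here.

```python
import re

# Hypothetical patterns; a real extractor would likely need more cases.
_PATTERNS = [
    # "Answer: B", "The answer is (C)."
    re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])"),
    # A line consisting only of the letter, e.g. "D" or "(D)."
    re.compile(r"^\s*\(?([A-D])\)?\.?\s*$", re.MULTILINE),
]

def extract_choice(text):
    """Return the extracted answer letter, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None

def extraction_failure_rate(responses):
    """Fraction of responses from which no answer letter could be extracted."""
    failures = sum(1 for r in responses if extract_choice(r) is None)
    return failures / len(responses)

# Toy responses; the last one mimics a truncated output.
responses = [
    "Answer: B",
    "The answer is (C).",
    "D",
    "I think it depends on the context",
    "The correct option relates to mitochond",
]
extracted = [extract_choice(r) for r in responses]
failure_rate = extraction_failure_rate(responses)
print(extracted, failure_rate)
```

A response that fails extraction is scored as wrong under exact-match even when its content is correct, which is how format mismatch feeds the ceiling.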
Dual-Judge Validation and Corrected Ceiling Estimates
By grounding LLM-judge metrics with exact-match, the authors reveal strong bidirectional distortions. For MMLU, the judge underestimates correct answers by up to 23.8 pp at 31B, obscuring an actual routing gain of +28.9 pp (exact-match) compared to +15.9 pp (judge-derived). For MedQA, judge overestimation inflates the routing gain by 10.3 pp. The aggregate opportunity cost attributable to these artifacts is measured at 13–17 pp for knowledge-intensive tasks.
Cross-family analysis with Llama 3.1 and DeepSeek-Chat demonstrates that the ceiling on knowledge tasks is specific to model training, while for code generation and conversational tasks (e.g., ShareGPT), unsolvability is nearly universal and family-agnostic.
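As a rough illustration of why the two label sources yield different headroom, the sketch below computes an oracle routing gain under each. It assumes routing gain means oracle accuracy (any model in the pool correct) minus the accuracy of a fixed baseline model; that definition, the model names, and the toy labels are assumptions, not the paper's protocol or data.

```python
def oracle_routing_gain_pp(correct, baseline):
    """Oracle accuracy minus baseline accuracy, in percentage points.

    correct: dict mapping model name -> list of per-query booleans
    (all lists the same length); baseline: key of the reference model.
    """
    n = len(correct[baseline])
    oracle_hits = sum(1 for row in zip(*correct.values()) if any(row))
    baseline_hits = sum(1 for ok in correct[baseline] if ok)
    return 100 * (oracle_hits - baseline_hits) / n

# Toy labels for the same five queries under two scoring methods.
exact_match_labels = {
    "small": [True, False, False, True, False],
    "large": [True, True, False, False, True],
}
judge_labels = {
    "small": [True, False, False, True, False],
    "large": [True, False, False, False, True],  # judge rejects one correct answer
}
gain_exact = oracle_routing_gain_pp(exact_match_labels, baseline="small")
gain_judge = oracle_routing_gain_pp(judge_labels, baseline="small")
print(gain_exact, gain_judge, gain_exact - gain_judge)
```

With the figures reported for MMLU, the same subtraction gives 28.9 - 15.9 = 13.0 pp of routing gain hidden by judge labels, which sits at the lower end of the 13–17 pp range.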
Router Training and the Collapse Phenomenon
Standard routers trained on artifact-inflated labels collapse to the majority class due to severe label imbalance (e.g., E2B is optimal for 79.3% of queries). All three evaluated router architectures (logistic regression, MLP, and DistilBERT) have near-zero recall on minority classes and fail to make effective tier-based routing decisions. Random-feature and shuffled-label controls confirm that this is a byproduct of the oracle label marginal distribution, not of model expressivity or training pathologies.
A balanced classifier (cost-sensitive objective) partially mitigates the collapse, confirming that the routing challenge is not fundamental but protocol-induced. The dominant routing signal in non-cost-sensitive routers degenerates into length-based triage, lacking true difficulty discrimination. This is empirically linked to a missed routing gain of 13–17 pp across knowledge-intensive settings, attributable to misaligned oracle labels.
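The collapse and one common cost-sensitive remedy can be sketched in a few lines. The weights below use the standard "balanced" heuristic, n_samples / (n_classes * class_count); whether the paper uses exactly this weighting is not stated here, and the label distribution is a toy one in which only the roughly 79% majority share echoes the text.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)

# Toy oracle labels (hypothetical split across four tiers).
labels = ["E2B"] * 79 + ["4B"] * 10 + ["26B-A4B"] * 6 + ["31B"] * 5

# A collapsed router predicts the majority tier for every query.
majority = Counter(labels).most_common(1)[0][0]
collapsed_preds = [majority] * len(labels)
accuracy = sum(p == y for p, y in zip(collapsed_preds, labels)) / len(labels)
collapsed_macro_recall = macro_recall(labels, collapsed_preds)
weights = balanced_class_weights(labels)
print(accuracy, collapsed_macro_recall, weights)
```

The collapsed router looks acceptable on plain accuracy but has zero recall on every minority tier, which is why macro recall or per-class recall is the more informative check; the weights upweight rare tiers in the training loss.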
Latency, Cost, and Route Selection
Latency analysis reveals that the 26B-A4B MoE tier is a Pareto-optimal routing target within the Gemma 4 family: it offers a significant quality gain over its dense compute equivalents with near-identical inference cost and tighter tail latency, justifying its use except for the hardest queries that uniquely require the 31B dense model (only 5.5% of queries).
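Pareto optimality here means that no other tier is at least as cheap and at least as good while being strictly better on one axis. The sketch below checks that condition; the tier names and the cost and quality numbers are placeholders invented for illustration and are not measurements from the paper.

```python
def pareto_front(tiers):
    """Return names of tiers not dominated on (cost, quality).

    tiers: dict mapping name -> (cost, quality); lower cost and
    higher quality are better.
    """
    front = []
    for name, (cost, quality) in tiers.items():
        dominated = any(
            oc <= cost and oq >= quality and (oc < cost or oq > quality)
            for other, (oc, oq) in tiers.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Placeholder numbers only (relative cost, accuracy), NOT from the paper.
tiers = {
    "dense_small": (1.0, 0.50),
    "dense_mid":   (2.0, 0.60),
    "moe":         (2.0, 0.72),  # same cost as dense_mid, higher quality
    "dense_large": (8.0, 0.78),
}
front = pareto_front(tiers)
print(front)
```

In this toy setup the MoE tier dominates the dense tier of equal cost, mirroring the qualitative claim above, while the largest tier stays on the front for queries that need its extra quality.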
Recommendations and Implications
Evaluation Protocols
- Dual-judge validation using both LLM-as-a-judge and task-specific exact-match is essential for reliable unsolvability and routing gain estimation.
- Truncation-aware budgets should keep response truncation below 5%, with rates explicitly reported.
- Oracle label construction for router training must align with deployment objectives; if downstream tasks require terminal answer correctness, exact-match should be prioritized.
Router Architectures
- Cost-sensitive objectives (e.g., class weighting, minority oversampling) should be standard to avoid majority-class collapse.
- Domain-detection as a pre-filter improves routing on knowledge-intensive tasks by circumventing label skew.
- Oracle label auditing for evaluation misalignment can reduce the opportunity cost associated with artifact-induced headroom inflation.
Theoretical and Practical Implications
The work exposes fundamental methodological flaws in widely used multi-tier LLM routing evaluations. Assumptions regarding routing recoverability and cost-quality trade-offs are shown to be fragile under artifact-free measurement, casting doubt on headline cost reduction claims from earlier work (Chen et al., 2023; Lu et al., 30 Sep 2025). The artifact decomposition and correction framework should inform the design of future benchmarks, competition leaderboards, and modeling protocols. Future directions include artifact-robust metric designs, unified judge-exact-match meta-labels, and adaptive budget allocation.
Conclusion
This large-scale empirical study provides a detailed taxonomy and quantification of evaluation artifacts that inflate unsolvability ceilings in multi-LLM routing. The findings demonstrate that evaluation misalignment, truncation, and format mismatch—rather than core model limitations—govern much of the measured routing opportunity. These artifacts fundamentally distort both evaluation and training of routing models, resulting in substantial real-world opportunity costs and unreliable infrastructure-level conclusions. The recommended evaluation and router training protocols are essential for robust, artifact-resilient deployment and future algorithmic progress in multi-LLM systems.
References:
- "FrugalGPT: How to Use LLMs While Reducing Cost and Improving Performance" (Chen et al., 2023)
- "RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers" (Lu et al., 30 Sep 2025)
- "RouteLLM: Learning to Route LLMs with Preference Data" (Ong et al., 2025)
- "MoDEM: Mixture of Domain Expert Models" (Simonds et al., 2024)
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., 2023)