Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing (2508.12631v1)

Published 18 Aug 2025 in cs.CL

Abstract: Balancing performance and efficiency is a central challenge in LLM advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.


Summary

  • The paper presents the Avengers-Pro framework that ensembles heterogeneous LLMs to optimize the performance–efficiency trade-off via a tunable alpha parameter.
  • It employs semantic query embedding and clustering, reaching 66.66% average accuracy at performance-oriented settings and up to 63% cost reduction at efficiency-oriented settings, relative to the strongest single model.
  • The approach establishes a Pareto frontier in cost–accuracy, offering scalable, practical insights for real-world LLM deployment.

Performance–Efficiency Optimized Routing for LLMs: The Avengers-Pro Framework

Introduction

The paper "Beyond GPT-5: Making LLMs Cheaper and Better via Performance–Efficiency Optimized Routing" (2508.12631) presents Avengers-Pro, a test-time routing framework that ensembles heterogeneous LLMs to optimize the trade-off between performance (accuracy) and efficiency (cost). The framework generalizes the routing paradigm introduced in GPT-5, extending it to a broader set of models and enabling fine-grained control over the performance–efficiency balance via a tunable parameter α\alpha. Avengers-Pro demonstrates state-of-the-art results across six challenging benchmarks and eight leading LLMs, consistently achieving a Pareto frontier in the cost–accuracy space.

Methodology: Routing for Performance–Efficiency Trade-off

Avengers-Pro operates by embedding incoming queries, clustering them by semantic similarity, and routing each query to the most suitable model based on a performance–efficiency score. The routing process is formalized as follows:

  1. Query Embedding: Each query is encoded into a high-dimensional semantic vector using a text embedding model (Qwen3-embedding-8B, 4096 dimensions).
  2. Clustering: Queries are grouped into $k$ clusters ($k$-means, $k=60$), each representing a semantically coherent query type.
  3. Model Profiling: For each model $i$ and cluster $c_j$, performance ($p_j^i$) and efficiency ($q_j^i$, measured as cost) are estimated using labeled data.
  4. Performance–Efficiency Scoring: The score for model $i$ on cluster $c_j$ is computed as:

$$x_j^i = \alpha \, \tilde{p}_j^i + (1 - \alpha)\,\bigl(1 - \tilde{q}_j^i\bigr)$$

where $\alpha \in [0, 1]$ controls the trade-off, and $\tilde{p}_j^i$, $\tilde{q}_j^i$ are the normalized performance and cost of model $i$ on cluster $c_j$.

  5. Routing Decision: At inference, the query is assigned to its top-$p$ nearest clusters ($p=4$). The model with the highest aggregated score over these clusters is selected to generate the response.

This approach enables dynamic, query-specific model selection, leveraging the strengths of both high-capacity and efficient models.
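
The pipeline is compact enough to prototype in a few dozen lines. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it presumes query embeddings (the paper uses Qwen3-embedding-8B) and per-cluster accuracy/cost statistics are computed offline, uses per-cluster min-max normalization (the exact normalization scheme is not specified in this summary), and aggregates scores over the nearest clusters by simple summation; helper names such as `route_query` are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

# ---- Offline phase ---------------------------------------------------
# train_embeddings: (N, 4096) array of query embeddings.
# perf, cost: (n_models, k) arrays holding each model's accuracy and
#   average cost on the labeled training queries in each cluster.

def fit_clusters(train_embeddings, k=60, seed=0):
    """Group training queries into k semantically coherent clusters."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(train_embeddings)

def minmax(x):
    """Min-max normalize a vector; a constant vector maps to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# ---- Online phase ----------------------------------------------------
def route_query(query_emb, km, perf, cost, alpha=0.5, top_p=4):
    """Return the index of the model with the best aggregated
    performance-efficiency score over the top_p nearest clusters."""
    dists = np.linalg.norm(km.cluster_centers_ - query_emb, axis=1)
    nearest = np.argsort(dists)[:top_p]

    scores = np.zeros(perf.shape[0])
    for j in nearest:
        p_tilde = minmax(perf[:, j])   # normalized performance on cluster j
        q_tilde = minmax(cost[:, j])   # normalized cost on cluster j
        scores += alpha * p_tilde + (1 - alpha) * (1 - q_tilde)
    return int(np.argmax(scores))
```

To see the score in action: with $\alpha = 0.5$, a model whose normalized performance on a cluster is 0.9 and normalized cost is 0.4 scores $0.5 \cdot 0.9 + 0.5 \cdot (1 - 0.4) = 0.75$; sweeping $\alpha$ from 0 to 1 shifts weight from the cost term to the performance term.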

Experimental Setup

Avengers-Pro is evaluated on a suite of challenging benchmarks: GPQA-Diamond, Humanity's Last Exam, HealthBench, ARC-AGI, SimpleQA, LiveCodeBench, and $\tau^2$-bench. The ensemble comprises eight models from four families: GPT-5-chat, GPT-5-medium, Claude-4.1-opus, Claude-4-sonnet, Gemini-2.5-pro, Gemini-2.5-flash, Qwen3, and Qwen3-thinking. All models are accessed via the OpenRouter API, ensuring standardized cost accounting and a uniform interface.
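
Because every ensemble member is served through OpenRouter, dispatching a routed query needs only an OpenAI-compatible client. The sketch below is illustrative rather than the authors' pipeline: the model slugs are hypothetical placeholders (consult OpenRouter's model catalog for current identifiers), and `route_query` refers to the helper sketched in the previous section.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

# One slug per ensemble member, in the same order as the profiling arrays.
# These slugs are illustrative placeholders, not verified identifiers.
MODEL_SLUGS = [
    "openai/gpt-5-chat",
    "openai/gpt-5",
    "anthropic/claude-opus-4.1",
    # ... remaining ensemble members
]

def answer(query, query_emb, km, perf, cost, alpha=0.5):
    """Route the query, then call the selected model via OpenRouter."""
    idx = route_query(query_emb, km, perf, cost, alpha=alpha)
    resp = client.chat.completions.create(
        model=MODEL_SLUGS[idx],
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```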

Results: Performance and Efficiency Trade-offs

Avengers-Pro consistently outperforms the strongest single model (GPT-5-medium) in both accuracy and cost efficiency. Key results include:

  • Accuracy Gain: With $\alpha=1.0$, Avengers-Pro achieves 66.66% average accuracy, surpassing GPT-5-medium by 7%.
  • Cost Reduction: At accuracy comparable to GPT-5-medium, Avengers-Pro ($\alpha=0.53$) reduces cost by 27%; at ~90% of GPT-5-medium's accuracy, cost falls by 63%.
  • Pareto Frontier: For any fixed cost, Avengers-Pro delivers the highest accuracy among all evaluated models; for any fixed accuracy, it achieves the lowest cost (a mechanical check of this property is sketched after Figure 1).

Figure 1: Effect of the trade-off parameter $\alpha$ on performance and efficiency; increasing $\alpha$ prioritizes accuracy over cost, with two elbows indicating favorable trade-off regions.
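
The Pareto property can be checked mechanically: a (cost, accuracy) point lies on the frontier iff no other point is at least as cheap and at least as accurate, with strict improvement in one dimension. A minimal check, using placeholder numbers rather than the paper's measurements:

```python
def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by any other point.
    A point is dominated if another has cost <= and accuracy >=,
    strictly better in at least one of the two."""
    frontier = []
    for c, a in points:
        dominated = any(
            c2 <= c and a2 >= a and (c2 < c or a2 > a)
            for c2, a2 in points
        )
        if not dominated:
            frontier.append((c, a))
    return sorted(frontier)

# Placeholder sweep: one (relative cost, accuracy) point per alpha setting.
points = [(1.00, 0.60), (0.73, 0.60), (0.37, 0.54), (0.25, 0.45)]
print(pareto_frontier(points))  # -> [(0.25, 0.45), (0.37, 0.54), (0.73, 0.60)]
```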

The trade-off parameter $\alpha$ enables fine-grained control: low $\alpha$ values favor efficient models (Qwen3, Qwen3-thinking), while high $\alpha$ values route more queries to high-capacity models (GPT-5-medium, Gemini-2.5-pro).

Figure 2: Proportion of model usage as a function of $\alpha$; low $\alpha$ routes to Qwen3/Qwen3-thinking, while high $\alpha$ increases usage of GPT-5-medium and other high-capacity models.

Analysis and Implications

The empirical results demonstrate that test-time routing with performance–efficiency optimization is a robust strategy for deploying LLMs in production environments where both cost and accuracy are critical. The framework's ability to trace a Pareto frontier is particularly notable: among the evaluated models, it empirically yields the highest accuracy at any given cost and the lowest cost at any given accuracy.

The methodology is generalizable and can be extended to larger ensembles, more granular clustering, or alternative scoring functions. The use of semantic clustering and per-cluster profiling allows for nuanced routing decisions that adapt to query complexity and model specialization.
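
As one concrete instance of an alternative scoring function (our extrapolation, not something evaluated in the paper), latency could be folded in as a third normalized term alongside performance and cost:

```python
def score_with_latency(p_tilde, q_tilde, l_tilde, alpha=0.5, beta=0.3):
    """Three-way trade-off: alpha weights performance, beta weights cost,
    and the remainder weights latency (all inputs min-max normalized)."""
    assert 0.0 <= alpha + beta <= 1.0
    return alpha * p_tilde + beta * (1 - q_tilde) + (1 - alpha - beta) * (1 - l_tilde)
```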

From a practical perspective, Avengers-Pro enables organizations to deploy LLMs with predictable cost–performance characteristics, dynamically adjusting to workload requirements and budget constraints. The framework is compatible with existing model APIs and can be integrated into inference pipelines with minimal overhead.

Future Directions

Potential avenues for future research include:

  • Adaptive Clustering: Dynamic adjustment of cluster granularity based on query distribution.
  • Multi-model Routing: Extending routing to allow multi-model ensembles per query (e.g., voting or fusion).
  • Latency Optimization: Incorporating latency as an additional efficiency metric.
  • Online Learning: Updating model profiles and cluster assignments in real-time as new data arrives.
  • Generalization to Other Modalities: Applying the routing framework to multi-modal LLMs and agentic systems.

Conclusion

Avengers-Pro establishes a principled framework for optimizing the performance–efficiency trade-off in LLM inference via test-time routing. By leveraging semantic clustering and per-cluster model profiling, it consistently outperforms single-model baselines, achieving superior accuracy and cost efficiency. The approach is scalable, adaptable, and directly applicable to real-world LLM deployment scenarios, with significant implications for both research and industry.
