
Avengers-Pro LLM Routing Framework

Updated 24 August 2025
  • Avengers-Pro is a test-time routing framework that ensembles diverse large language models using semantic embeddings and unsupervised clustering.
  • It employs a tunable trade-off parameter (α) that weights normalized accuracy against normalized cost, letting deployments choose an operating point on the accuracy–cost spectrum.
  • By consistently tracing the Pareto frontier, the framework outperforms single-model baselines with up to a 7% accuracy boost and substantial cost reductions.

The Avengers-Pro framework is a test-time routing system centered on ensembling LLMs with heterogeneous capacities and computational costs. Designed to provide a unified, parameterizable solution to the accuracy-versus-efficiency dilemma in LLM deployment, Avengers-Pro utilizes semantic query embedding, unsupervised clustering, and explicit routing based on a performance–efficiency trade-off score. Empirical results across challenging benchmarks demonstrate that Avengers-Pro can surpass the strongest single LLM in accuracy, match its performance while requiring substantially lower computational cost, and consistently trace the Pareto frontier among all competitive methods.

1. System Architecture and Design Principles

The core architecture of Avengers-Pro generalizes prior routing frameworks by supporting ensembling of any number of diverse LLMs, including models from distinct families such as Google Gemini-2.5, Anthropic Claude, OpenAI GPT-5, and Qwen. Unlike frameworks that route queries between just two models, Avengers-Pro employs query embedding followed by k-means clustering (k = 60), yielding clusters that approximate semantically coherent task types. For each LLM, performance (accuracy) and efficiency (cost) profiles are calibrated offline for each cluster. At inference, incoming queries are mapped to clusters via their embeddings, and model selection is governed by a tunable trade-off parameter α that directly weights normalized accuracy against normalized cost.

The system thus enables parameterized deployment, allowing practitioners to dynamically select points on the accuracy–cost spectrum by adjusting α according to operational requirements or resource constraints.
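To make the offline side of this design concrete, the sketch below shows one way the per-cluster calibration could be set up with scikit-learn's k-means. The embed helper, data layout, and variable names are illustrative assumptions for this summary, not the released implementation.

```python
# Minimal offline-calibration sketch (illustrative, not the authors' code).
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    """Placeholder for a fixed embedding model (e.g., Qwen3-embedding, 4096-d).

    Replace with a real encoder; it must return an (n, d) NumPy array.
    """
    raise NotImplementedError

def calibrate(queries, accuracy, cost, k=60, seed=0):
    """Cluster validation queries and record per-cluster model profiles.

    queries  : list of validation query strings
    accuracy : dict mapping model name -> per-query correctness array (0/1)
    cost     : dict mapping model name -> per-query cost array
    Returns the fitted k-means object and {cluster: {model: (acc, cost)}}.
    """
    X = embed(queries)                                   # (n, d) semantic vectors
    km = KMeans(n_clusters=k, random_state=seed).fit(X)

    profiles = {}
    for j in range(k):
        mask = km.labels_ == j                           # queries in cluster j
        profiles[j] = {
            m: (accuracy[m][mask].mean(), cost[m][mask].mean())
            for m in accuracy
        }
    return km, profiles
```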

2. Routing and Model Selection Methodology

The Avengers-Pro routing mechanism proceeds in distinct sequential steps:

  1. Text Embedding: Each incoming query d is mapped to a high-dimensional semantic vector using a fixed embedding model (e.g., Qwen3-embedding, 4096 dimensions).
  2. Clustering: Embedded queries are grouped into k clusters by k-means. Each cluster is designed to capture semantically related queries, supporting differential routing based on historical model performance and cost.
  3. Performance–Efficiency Profiling: For each model i in cluster j, accuracy p_j^i and cost q_j^i are measured on validation data. These metrics are min–max normalized within each cluster:

$$\tilde{p}_j^i = \frac{p_j^i - p_j^{\min}}{p_j^{\max} - p_j^{\min}}, \qquad \tilde{q}_j^i = \frac{q_j^i - q_j^{\min}}{q_j^{\max} - q_j^{\min}}$$

  4. Routing Decision: Given trade-off parameter α ∈ [0, 1], the score for model i in cluster j is calculated as

$$x_j^i = \alpha \cdot \tilde{p}_j^i + (1 - \alpha) \cdot (1 - \tilde{q}_j^i)$$

For each query, the top-p nearest clusters are identified in embedding space (p = 4 in reported experiments), and aggregated scores are used to select the optimal model.

  5. Inference: The selected model generates the final output.

This design supports real-time routing across an ensemble whose composition and routing logic are determined solely by k, p, and α.
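The routing logic above fits in a few lines of code. The sketch below is a schematic reading of the normalization and scoring formulas in this section; embed, km, and profiles refer to the hypothetical calibration sketch earlier and are assumptions, not the public codebase.

```python
# Illustrative routing sketch following x_j^i = alpha * p_tilde + (1 - alpha) * (1 - q_tilde).
import numpy as np

def route(query, km, profiles, models, alpha=0.6, p=4):
    """Pick a model by aggregating scores over the p nearest clusters."""
    x = embed([query])[0]                       # embed() from the calibration sketch
    # Distance from the query embedding to every cluster centroid.
    dists = np.linalg.norm(km.cluster_centers_ - x, axis=1)
    nearest = np.argsort(dists)[:p]

    scores = dict.fromkeys(models, 0.0)
    for j in nearest:
        accs = np.array([profiles[j][m][0] for m in models])
        costs = np.array([profiles[j][m][1] for m in models])
        # Per-cluster min-max normalization of accuracy and cost.
        p_tilde = (accs - accs.min()) / (accs.max() - accs.min() + 1e-12)
        q_tilde = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)
        cluster_scores = alpha * p_tilde + (1 - alpha) * (1 - q_tilde)
        for m, s in zip(models, cluster_scores):
            scores[m] += s                      # aggregate over the p clusters

    return max(scores, key=scores.get)          # model with the highest total score
```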

3. Benchmark Results and Quantitative Performance

Avengers-Pro has been validated on six challenging benchmarks: GPQA-Diamond, Humanity's Last Exam, ARC-AGI, SimpleQA, LiveCodeBench, and τ²-bench. Across these tasks, eight LLMs were included, with GPT-5-medium, the strongest single-model baseline, averaging 62.25% accuracy.

Key empirical findings include:

Configuration | Accuracy (%) | Cost Reduction (%)
--- | --- | ---
Avengers-Pro (best α) | 66.66 | —
Avengers-Pro (match GPT-5-medium) | 62.25 | 27
Avengers-Pro (~90% of GPT-5-medium performance) | ~56.0 | 63
Avengers-Pro (match Gemini-2.5-pro) | (match) | 81

The framework can outperform the strongest single model by approximately 7% in average accuracy, match leading performance at substantially lower cost, and maintain ~90% of the top baseline's accuracy at less than half its cost.

Illustrative “elbows” in accuracy–cost curves (α ≈ 0.4, 0.6) identify notable trade-off regions where incremental cost yields maximal performance gains.

4. Pareto Frontier Analysis

A central property of Avengers-Pro is its ability to consistently realize the Pareto frontier in accuracy–cost space: for any specified compute budget, no single LLM in the pool yields higher accuracy, and for any accuracy threshold, an appropriate choice of α reaches that accuracy at lower cost than any static single-model deployment.

This distinctive capability underscores the framework’s suitability for high-stakes, resource-sensitive applications where operational efficiency, cost containment, and performance excellence are simultaneous priorities.
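As a hypothetical illustration of how such a frontier can be traced in practice, one can sweep α over a validation set and keep only the non-dominated (accuracy, cost) points. The evaluate helper below is an assumed function that runs the router end-to-end and returns mean accuracy and mean cost for a given α; none of this is taken from the paper's code.

```python
# Hypothetical alpha sweep for tracing an accuracy-cost frontier.
import numpy as np

def sweep_alpha(evaluate, alphas=None):
    """evaluate(alpha) -> (mean_accuracy, mean_cost) on a validation set."""
    alphas = np.linspace(0.0, 1.0, 11) if alphas is None else alphas
    points = [(a, *evaluate(a)) for a in alphas]         # (alpha, accuracy, cost)

    # Keep only Pareto-optimal points: no other point is at least as cheap
    # and at least as accurate while being strictly better on one axis.
    frontier = [
        (a, acc, cost) for a, acc, cost in points
        if not any(
            c2 <= cost and acc2 >= acc and (c2 < cost or acc2 > acc)
            for _, acc2, c2 in points
        )
    ]
    return sorted(frontier, key=lambda t: t[2])           # cheapest first
```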

5. Generalization, Robustness, and Parameter Effects

Avengers-Pro demonstrates robust generalization and stability:

  • The clustering-based routing is adaptable; new domains or tasks are incorporated by recalibrating cluster profiles without retraining any neural router.
  • Accuracy remains stable across different embedding models and clustering algorithms, suggesting resilience to architectural or hyperparameter choices.
  • The number of clusters k is robust within a broad intermediate range; performance does not deteriorate sharply as k changes.
  • Model pool size can be scaled as needed; expansion enables finer control of Pareto optimality in the accuracy–cost trade-off.

A plausible implication is that even greater performance–efficiency flexibility could be achieved by further diversifying the LLM pool or developing adaptive clustering strategies.

6. Implementation and Deployment Considerations

All code and implementation details are openly available: https://github.com/ZhangYiqun018/AvengersPro. The framework leverages well-established machine learning libraries (e.g., scikit-learn for clustering), fixed pre-trained embedding models, and distributed inference (served on modern multi-GPU hardware).

No additional neural network training or prompt engineering is required. Offline calibration (embedding, clustering, profiling) is computationally light compared to fine-tuning or bespoke router training. The routing and selection processes are plug-and-play, supporting rapid deployment, evaluation, and reconfiguration.

For practitioners, the recommended workflow is:

  1. Calibrate cluster-wise accuracy and cost profiles for all candidate LLMs.
  2. Select k, p, and α based on target application requirements.
  3. Deploy the routing module and candidate models in a distributed configuration.
  4. Monitor accuracy/cost and adjust α\alpha as needs evolve.
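A configuration for steps 2–3 might look like the placeholder below; only k = 60 and p = 4 come from the reported experiments, while the model list, default α, and key names are illustrative assumptions.

```python
# Placeholder routing configuration; model names and alpha are illustrative.
ROUTER_CONFIG = {
    "embedding_model": "Qwen3-embedding",  # fixed encoder for query embeddings
    "k": 60,        # number of k-means clusters (value used in the paper)
    "p": 4,         # nearest clusters aggregated per query (value used in the paper)
    "alpha": 0.6,   # accuracy-vs-cost weight; raise for accuracy, lower for cost
    "models": [     # candidate pool (illustrative names only)
        "gpt-5-medium", "gemini-2.5-pro", "claude", "qwen",
    ],
}
```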

7. Prospective Research Directions

Several future avenues are outlined:

  • Expansion to chat-based and dialogue tasks, where cluster profiles might reflect conversational context or intent rather than just topical similarity.
  • Integration of LLMs with greater diversity in scale and domain, as well as exploration of even larger ensemble pools.
  • Investigation of dynamic ensemble strategies and routing algorithms that adapt p (the number of clusters or models considered) and α at runtime, possibly using meta-learning approaches.
  • Deeper analysis of the interplay between clustering granularity and effectiveness, suggesting work toward adaptive k-selection methods based on observed query distribution.

This suggests that Avengers-Pro may serve as a foundation for broader research in test-time model routing, multi-model collaboration, and resource-aware AI system design.
