Avengers-Pro LLM Routing Framework
- Avengers-Pro is a test-time routing framework that ensembles diverse large language models using semantic embeddings and unsupervised clustering.
- It employs a tunable trade-off parameter (α) to balance normalized accuracy and cost, letting practitioners select operating points anywhere on the accuracy–cost spectrum.
- By consistently tracing the Pareto frontier, the framework outperforms single-model baselines with up to a 7% accuracy boost and substantial cost reductions.
The Avengers-Pro framework is a test-time routing system centered on ensembling LLMs with heterogeneous capacities and computational costs. Designed to provide a unified, parameterizable solution to the accuracy-versus-efficiency dilemma in LLM deployment, Avengers-Pro utilizes semantic query embedding, unsupervised clustering, and explicit routing based on a performance–efficiency trade-off score. Empirical results across challenging benchmarks demonstrate that Avengers-Pro can surpass the strongest single LLM in accuracy, match its performance while requiring substantially lower computational cost, and consistently trace the Pareto frontier among all competitive methods.
1. System Architecture and Design Principles
The core architecture of Avengers-Pro generalizes prior routing frameworks by supporting ensembles of any number of diverse LLMs, including models from distinct families such as Google Gemini-2.5, Anthropic Claude, OpenAI GPT-5, and Qwen. Unlike frameworks that route queries between just two models, Avengers-Pro employs query embedding followed by k-means clustering, yielding clusters that approximate semantically coherent task types. For each LLM, performance (accuracy) and efficiency (cost) profiles are calibrated offline for each cluster. At inference, incoming queries are mapped to clusters via their embeddings, and model selection is governed by a tunable trade-off parameter $\alpha$ that directly weights normalized accuracy against normalized cost.
The system thus enables parameterized deployment, allowing practitioners to dynamically select points on the accuracy–cost spectrum by adjusting $\alpha$ according to operational requirements or resource constraints.
2. Routing and Model Selection Methodology
The Avengers-Pro routing mechanism proceeds in distinct sequential steps:
- Text Embedding: Each incoming query is mapped to a high-dimensional semantic vector using a fixed embedding model (e.g., Qwen3-embedding, $4096$ dimensions).
- Clustering: Embedded queries are grouped into clusters by k-means. Each cluster is designed to capture semantically related queries, supporting differential routing based on historical model performance and cost.
- Performance–Efficiency Profiling: For each model $m$ in cluster $c$, accuracy $a_{m,c}$ and cost $c_{m,c}$ are measured on validation data. These metrics are min–max normalized across models within each cluster:
$$\hat{a}_{m,c} = \frac{a_{m,c} - \min_{m'} a_{m',c}}{\max_{m'} a_{m',c} - \min_{m'} a_{m',c}}, \qquad \hat{c}_{m,c} = \frac{c_{m,c} - \min_{m'} c_{m',c}}{\max_{m'} c_{m',c} - \min_{m'} c_{m',c}}$$
- Routing Decision: Given trade-off parameter $\alpha \in [0, 1]$, the score for model $m$ in cluster $c$ is calculated as
$$s_{m,c} = \alpha\,\hat{a}_{m,c} - (1 - \alpha)\,\hat{c}_{m,c}.$$
For each query, the top-$k$ nearest clusters are identified in embedding space (with $k$ fixed in the reported experiments), and scores aggregated over these clusters are used to select the optimal model.
- Inference: The selected model generates the final output.
This design supports real-time routing across an ensemble whose composition and routing logic are determined solely by the trade-off parameter $\alpha$, the number of clusters, and the top-$k$ neighborhood size.
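The routing steps above can be sketched end to end. This is a minimal illustration, not the reference implementation: the embeddings are random stand-ins for a real embedding model, and the per-cluster accuracy/cost profiles are hypothetical values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# --- Offline phase ---
# Stand-in embeddings for a calibration query set; a real system would use
# a fixed embedding model (e.g., Qwen3-embedding, 4096 dimensions).
calib_embeddings = rng.normal(size=(200, 8))

n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
kmeans.fit(calib_embeddings)

# Per-cluster normalized accuracy and cost for each model (hypothetical).
models = ["model_a", "model_b", "model_c"]
acc = rng.uniform(size=(len(models), n_clusters))   # \hat{a}_{m,c}
cost = rng.uniform(size=(len(models), n_clusters))  # \hat{c}_{m,c}

# --- Online phase ---
def route(query_embedding, alpha=0.5, top_k=2):
    """Pick the model with the best aggregated trade-off score over the
    top_k clusters nearest to the query embedding."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - query_embedding, axis=1)
    nearest = np.argsort(dists)[:top_k]
    # s_{m,c} = alpha * acc - (1 - alpha) * cost, summed over nearest clusters
    scores = (alpha * acc[:, nearest] - (1 - alpha) * cost[:, nearest]).sum(axis=1)
    return models[int(np.argmax(scores))]

chosen = route(rng.normal(size=8), alpha=0.6)
print(chosen)
```

Raising `alpha` toward 1 makes the router favor accuracy regardless of cost; lowering it toward 0 makes it favor the cheapest model per cluster.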
3. Benchmark Results and Quantitative Performance
Avengers-Pro has been validated on six challenging benchmarks: GPQA-Diamond, Humanity's Last Exam, ARC-AGI, SimpleQA, LiveCodeBench, and τ-bench. Across these tasks, eight LLMs were included in the pool, with GPT-5-medium averaging 62.25% accuracy as the strongest single-model baseline.
Key empirical findings include:
| Configuration | Accuracy (%) | Cost reduction (%) |
|---|---|---|
| Avengers-Pro (best $\alpha$) | 66.66 | – |
| Avengers-Pro (matching GPT-5-medium) | 62.25 | 27 |
| Avengers-Pro (90% of GPT-5-medium) | 56.0 | 63 |
| Avengers-Pro (matching Gemini-2.5-pro) | (match) | 81 |
The framework can outperform the strongest single model by approximately 7% in average accuracy, match its performance at substantially lower cost, and retain 90% of the top baseline's accuracy at well under half the baseline's cost.
Illustrative “elbows” in accuracy–cost curves (α ≈ 0.4, 0.6) identify notable trade-off regions where incremental cost yields maximal performance gains.
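Such elbows can be located by sweeping α over a grid and recording the operating point each setting selects. The sketch below uses hypothetical normalized accuracy/cost profiles for four models; settings where the selected model changes are the trade-off transitions.

```python
import numpy as np

# Hypothetical normalized (accuracy, cost) profiles for four models,
# ordered from strongest/most expensive to weakest/cheapest.
acc = np.array([0.9, 0.7, 0.5, 0.3])
cost = np.array([1.0, 0.6, 0.3, 0.1])

def pick(alpha):
    """Index of the model maximizing alpha * acc - (1 - alpha) * cost."""
    return int(np.argmax(alpha * acc - (1 - alpha) * cost))

# Sweep alpha and record the operating point each setting selects.
curve = []
for alpha in np.linspace(0.0, 1.0, 11):
    m = pick(alpha)
    curve.append((round(float(alpha), 1), float(cost[m]), float(acc[m])))

for alpha, c, a in curve:
    print(f"alpha={alpha:.1f}  cost={c:.2f}  acc={a:.2f}")
```

At α = 0 the cheapest model is always chosen; at α = 1 the most accurate one is. The α values where the selection flips are the elbow regions where small extra cost buys the largest accuracy gain.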
4. Pareto Frontier Analysis
A central property of Avengers-Pro is its ability to consistently realize the Pareto frontier in accuracy–cost space. This implies that, for any specified compute budget, no single LLM yields higher accuracy; conversely, for any accuracy threshold, Avengers-Pro finds the configuration (via $\alpha$) that minimizes cost, beyond what any static single-model solution can deliver.
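Pareto optimality in this two-objective setting has a simple operational test: a (cost, accuracy) operating point is on the frontier exactly when no other point is at least as cheap and at least as accurate, with one of the two strictly better. A small sketch over hypothetical operating points:

```python
def pareto_front(points):
    """Return the (cost, accuracy) points not dominated by any other point.
    A point is dominated if another point has cost <= its cost and
    accuracy >= its accuracy, with at least one strict inequality."""
    front = []
    for i, (c_i, a_i) in enumerate(points):
        dominated = any(
            (c_j <= c_i and a_j >= a_i) and (c_j < c_i or a_j > a_i)
            for j, (c_j, a_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((c_i, a_i))
    return sorted(front)

# Hypothetical operating points: mix of single models and router settings.
points = [(1.0, 0.62), (0.73, 0.62), (0.37, 0.56), (0.2, 0.40), (0.9, 0.55)]
print(pareto_front(points))
```

Here (1.0, 0.62) is dominated by (0.73, 0.62), which matches its accuracy at lower cost; the claim in the text is that, across α settings, the router's operating points are the surviving (undominated) ones.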
This distinctive capability underscores the framework’s suitability for high-stakes, resource-sensitive applications where operational efficiency, cost containment, and performance excellence are simultaneous priorities.
5. Generalization, Robustness, and Parameter Effects
Avengers-Pro demonstrates robust generalization and stability:
- The clustering-based routing is adaptable; new domains or tasks are incorporated by recalibrating cluster profiles without retraining any neural router.
- Accuracy remains stable across different embedding models and clustering algorithms, suggesting resilience to architectural or hyperparameter choices.
- Performance is robust to the number of clusters across a broad intermediate range; it does not deteriorate sharply as the cluster count changes.
- Model pool size can be scaled as needed; expansion enables finer control of Pareto optimality in the accuracy–cost trade-off.
A plausible implication is that even greater performance–efficiency flexibility could be achieved by further diversifying the LLM pool or developing adaptive clustering strategies.
6. Implementation and Deployment Considerations
All code and implementation details are openly available: https://github.com/ZhangYiqun018/AvengersPro. The framework leverages well-established machine learning libraries (e.g., scikit-learn for clustering), fixed pre-trained embedding models, and distributed inference (served on modern multi-GPU hardware).
No additional neural network training or prompt engineering is required. Offline calibration (embedding, clustering, profiling) is computationally light compared to fine-tuning or bespoke router training. The routing and selection processes are plug-and-play, supporting rapid deployment, evaluation, and reconfiguration.
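The offline calibration pass (embed, cluster, profile) can be sketched as follows. All inputs here are synthetic stand-ins: real validation embeddings, per-query correctness labels, and logged dollar costs would replace the random arrays.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-ins: validation-query embeddings, plus raw per-query accuracy (0/1)
# and dollar cost for two hypothetical models. A real pipeline would obtain
# these from a fixed embedding model and logged model runs.
emb = rng.normal(size=(300, 16))
raw_acc = rng.integers(0, 2, size=(2, 300)).astype(float)
raw_cost = rng.uniform(0.01, 1.0, size=(2, 300))

n_clusters = 5
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit_predict(emb)

def minmax(x):
    """Min-max normalize across models; zeros if all values are equal."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# Per-cluster mean accuracy/cost for each model, normalized across models
# within each cluster, yielding the profiles the router scores against.
acc_profile = np.zeros((2, n_clusters))
cost_profile = np.zeros((2, n_clusters))
for c in range(n_clusters):
    mask = labels == c
    acc_profile[:, c] = minmax(raw_acc[:, mask].mean(axis=1))
    cost_profile[:, c] = minmax(raw_cost[:, mask].mean(axis=1))

print(acc_profile.shape, cost_profile.shape)
```

Recalibrating for a new domain is just a rerun of this pass over fresh validation data; no model weights change.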
For practitioners, the recommended workflow is:
- Calibrate cluster-wise accuracy and cost profiles for all candidate LLMs.
- Select the trade-off parameter $\alpha$, the number of clusters, and the top-$k$ neighborhood size based on target application requirements.
- Deploy the routing module and candidate models in a distributed configuration.
- Monitor accuracy/cost and adjust as needs evolve.
7. Prospective Research Directions
Several future avenues are outlined:
- Expansion to chat-based and dialogue tasks, where cluster profiles might reflect conversational context or intent rather than just topical similarity.
- Integration of LLMs with greater diversity in scale and domain, as well as exploration of even larger ensemble pools.
- Investigation of dynamic ensemble strategies and routing algorithms that adapt the number of clusters or models considered, and the trade-off parameter $\alpha$, at runtime, possibly using meta-learning approaches.
- Deeper analysis of the interplay between clustering granularity and effectiveness, suggesting work toward adaptive k-selection methods based on observed query distribution.
This suggests that Avengers-Pro may serve as a foundation for broader research in test-time model routing, multi-model collaboration, and resource-aware AI system design.