- The paper presents a multi-level fusion framework that integrates LLM routing data at query, thought, and model levels to optimize performance and cost.
- Empirical results demonstrate up to a 16% reward improvement from query-level routing and substantial gains on reasoning-intensive tasks from thought-level fusion.
- The study reveals practical trade-offs where query-level fusion offers cost-efficient deployment while model-level fusion faces overfitting risks on heterogeneous tasks.
Fusing LLM Capabilities with Routing Data: A Multi-Level Framework for LLM Integration
The paper "Fusing LLM Capabilities with Routing Data" (2507.10540) addresses the underutilization of large-scale LLM routing data in the context of multi-model integration. The authors introduce FusionBench, a comprehensive benchmark for LLM routing and fusion, and FusionFactory, a systematic framework for fusing LLM capabilities at three distinct levels: query, thought, and model. The work is grounded in the observation that LLM hosting platforms routinely collect rich routing data—user queries and corresponding responses from multiple LLMs—which encodes valuable information about the comparative strengths and weaknesses of different models across a wide range of tasks.
FusionBench: A Large-Scale Routing Benchmark
FusionBench is constructed to support research on LLM capability fusion by providing:
- Diverse Response Patterns: For each query, responses are collected from 20 open-source LLMs (ranging from 8B to 671B parameters) using both direct and reasoning-augmented prompting. This results in a dataset of 103M tokens covering 14 tasks across six domains (Math, Code, Commonsense Reasoning, World Knowledge, Reading Comprehension, and Popular Knowledge).
- Reusable Thought Templates: For each query, the top-k responses (by performance or LLM-judge score) are summarized into abstract thought templates, enabling mid-level fusion strategies.
- Comprehensive Evaluation Signals: Each response is annotated with task-specific metrics, token cost estimates, and LLM-judge scores that assess not only correctness but also reasoning quality and suitability for supervised fine-tuning.
This benchmark enables systematic exploration of LLM fusion strategies and provides a reproducible foundation for future research.
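For concreteness, a routing-data entry can be pictured as a structured record like the following Python sketch; the field names and types are illustrative assumptions, not FusionBench's released schema.

```python
# Illustrative layout of one routing-data record (field names are assumptions,
# not the benchmark's actual schema).
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LLMResponse:
    model_name: str        # one of the 20 open-source LLMs (8B-671B)
    prompt_style: str      # "direct" or "reasoning-augmented"
    text: str              # raw model output
    task_metric: float     # task-specific score (e.g., exact match, pass@1)
    token_cost: float      # estimated token cost of the call
    judge_score: float     # LLM-judge assessment of correctness and reasoning quality

@dataclass
class RoutingRecord:
    query: str
    task: str                                              # one of the 14 tasks
    domain: str                                             # Math, Code, ...
    responses: Dict[str, LLMResponse] = field(default_factory=dict)
    thought_template: str = ""                              # summary distilled from top-k responses
```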
FusionFactory: Multi-Level LLM Fusion Framework
FusionFactory operationalizes LLM fusion at three representative levels, each corresponding to a different stage in the LLM inference pipeline:
1. Query-Level Fusion (Early Fusion)
At this level, a router is trained to select the most appropriate LLM for each incoming query, optimizing a reward function that balances performance, cost, and LLM-judge quality. The router can be implemented using various architectures (e.g., KNN, SVM, MLP, BERT, or graph-based models like GraphRouter). The reward function is parameterized as:
Reward = α · Performance − β · Cost + γ · LLM-Judge
Empirical results show that advanced routers (notably GraphRouter) consistently outperform the best single LLM and even the largest LLMs, achieving up to a 16% improvement in reward and a 5.7% gain in LLM-judge score. Query-level fusion is also the most cost-efficient approach, requiring only lightweight router training.
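As a minimal illustration of this reward, the sketch below scores each candidate LLM and routes to the argmax; the weight values and per-model statistics are made up for the example and would in practice come from a trained router's predictions.

```python
# Minimal sketch of query-level routing with the composite reward
# Reward = alpha * performance - beta * cost + gamma * judge_score.
# Weights and candidate statistics below are illustrative assumptions.
def routing_reward(performance: float, cost: float, judge_score: float,
                   alpha: float = 1.0, beta: float = 0.1, gamma: float = 0.5) -> float:
    """Composite reward balancing quality, token cost, and LLM-judge score."""
    return alpha * performance - beta * cost + gamma * judge_score

def pick_llm(candidates: dict) -> str:
    """Route to the model with the highest predicted reward.

    `candidates` maps model name -> (performance, cost, judge_score) predictions."""
    return max(candidates, key=lambda name: routing_reward(*candidates[name]))

# Example: hypothetical predictions for two candidate models.
print(pick_llm({"small-8b": (0.62, 0.01, 0.70), "large-70b": (0.78, 0.09, 0.85)}))
```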
2. Thought-Level Fusion (Mid Fusion)
This level leverages the diversity of reasoning patterns across LLMs by constructing thought templates—generalized reasoning strategies distilled from top-performing responses. For a new query, similar past queries are retrieved via embedding similarity, and their thought templates are used as few-shot demonstrations to guide the LLM's response.
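A minimal sketch of this retrieval-and-prompting step is shown below; the embedding model and prompt wording are assumptions for illustration rather than the paper's exact configuration.

```python
# Thought-level fusion sketch: retrieve templates of similar past queries by
# embedding similarity and use them as few-shot demonstrations.
# The encoder choice and prompt format are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_templates(query: str, past_queries: list[str],
                       templates: list[str], k: int = 3) -> list[str]:
    """Return thought templates of the k most similar past queries."""
    q_vec = encoder.encode([query], normalize_embeddings=True)
    p_vecs = encoder.encode(past_queries, normalize_embeddings=True)
    sims = (p_vecs @ q_vec.T).ravel()        # cosine similarity (embeddings are normalized)
    top = np.argsort(-sims)[:k]
    return [templates[i] for i in top]

def build_prompt(query: str, demos: list[str]) -> str:
    """Assemble a few-shot prompt from the retrieved reasoning strategies."""
    shots = "\n\n".join(f"Reasoning strategy {i + 1}:\n{t}" for i, t in enumerate(demos))
    return f"{shots}\n\nApply a suitable strategy to the new problem.\nQuestion: {query}"
```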
Key findings include:
- Substantial Performance Gains: Thought-level fusion yields the largest improvements, especially on reasoning-intensive tasks (e.g., +21.3% in math for small models, +57.6% in code for large models).
- Hybrid Selection Strategy: Combining performance and LLM-judge criteria for selecting demonstration responses provides the best trade-off between accuracy and reasoning quality.
- Diminishing Returns: Increasing the size of the summarizer or the number of responses does not always yield further gains, but does increase computational cost.
3. Model-Level Fusion (Late Fusion)
Model-level fusion is implemented via supervised fine-tuning (SFT) of a base LLM on a dataset constructed from the top-k responses (by performance or LLM-judge score) for each query. This approach is analogous to knowledge distillation and imitation learning.
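The data-construction step can be sketched as follows, reusing the illustrative record layout from above; selecting by LLM-judge score and keeping a small k reflects the paper's findings rather than a prescribed recipe.

```python
# Sketch of building an SFT dataset from routing data: for each query, keep the
# top-k responses ranked by LLM-judge score and emit (prompt, completion) pairs.
def build_sft_dataset(records, k: int = 1):
    """Yield supervised fine-tuning pairs from routing records (illustrative schema)."""
    for rec in records:
        ranked = sorted(rec.responses.values(), key=lambda r: r.judge_score, reverse=True)
        for resp in ranked[:k]:          # larger k increases overfitting risk (see below)
            yield {"prompt": rec.query, "completion": resp.text}
```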
Empirical observations:
- Modest Average Improvements: Model-level fusion provides gains in 4 out of 6 domains but can underperform in code tasks due to domain mismatch and data scarcity.
- LLM-Judge Supervision: Using LLM-judge scores for response selection yields the best results, as it provides a more nuanced assessment than binary task metrics.
- Overfitting Risk: Training on multiple responses per query can lead to overfitting and reduced generalization, especially when fusing heterogeneous tasks.
Comparative Analysis and Practical Implications
A cross-level comparison reveals that:
- Thought-Level Fusion Achieves the Best Overall Performance: Especially when using hybrid selection and large summarizers, thought-level fusion outperforms both query- and model-level approaches across most domains.
- Query-Level Fusion Offers the Best Efficiency-Performance Trade-off: It is highly practical for deployment, requiring minimal additional computation beyond router training.
- Model-Level Fusion is Least Effective: It is susceptible to overfitting and struggles with heterogeneous task requirements.
Domain-specific analysis indicates that fusion is most challenging in World Knowledge and Math, where factual accuracy and strict logical consistency are paramount and multi-model collaboration can introduce noise.
Implementation Considerations
- Computational Requirements: FusionBench and FusionFactory require access to multiple LLMs and significant compute for data collection, template generation, and fine-tuning. However, query-level fusion can be deployed with minimal overhead.
- Router Training: For practical deployment, lightweight routers (e.g., MLPs or compact transformers) can be trained on routing logs, leveraging embeddings from models like MiniLM or Longformer (a minimal training sketch follows this list).
- Template Summarization: Thought templates can be generated using a strong LLM (e.g., LLaMA-3 70B) and stored for efficient retrieval during inference.
- Fine-Tuning: Model-level fusion should be approached with caution, using careful selection of training data and regularization to mitigate overfitting.
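A minimal version of such a lightweight router, assuming routing logs labeled with the best-performing LLM per query, might look like the following; the encoder, classifier, and hyperparameters are illustrative choices.

```python
# Sketch of a deployable lightweight router: embed logged queries with MiniLM and
# train a small MLP to predict the best LLM. All choices here are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

def train_router(logged_queries: list[str], best_llm_labels: list[str]):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(logged_queries)                  # query embeddings from routing logs
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
    clf.fit(X, best_llm_labels)                         # label = LLM with highest observed reward
    return encoder, clf

def route(encoder, clf, query: str) -> str:
    """Pick an LLM for a new query."""
    return clf.predict(encoder.encode([query]))[0]
```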
Broader Implications and Future Directions
This work demonstrates that large-scale routing data is a valuable resource for both model selection and capability fusion. The multi-level framework enables practitioners to flexibly integrate diverse LLMs, optimizing for performance, cost, and reasoning quality according to application requirements.
Potential future directions include:
- Trustworthiness and Bias Mitigation: Investigating how fusion strategies affect the reliability and fairness of LLM outputs, and developing safeguards against the propagation of biases or errors.
- Dynamic and Adaptive Fusion: Exploring online learning and reinforcement learning approaches for adaptive router and template selection in changing environments.
- Extension to Multimodal and Multilingual Settings: Applying the framework to settings involving vision-LLMs or cross-lingual tasks.
The release of FusionBench and FusionFactory provides a foundation for systematic research on LLM integration, with direct applicability to real-world LLM hosting platforms, multi-agent systems, and cost-sensitive AI deployments.