- The paper presents a multi-level fusion framework that integrates query, thought, and model-level strategies to harness LLM routing data.
- FusionBench, built from 103M tokens over 14 tasks and 20 models, underpins the methodology by providing diverse routing and reasoning-augmented responses.
- Empirical results demonstrate up to 16% improvement in routing, 21.3% in math, and 57.6% in code tasks, validating the framework’s practical benefits.
Fusing LLM Capabilities with Routing Data: A Multi-Level Fusion Framework
The paper "Fusing LLM Capabilities with Routing Data" (2507.10540) addresses the underutilization of large-scale LLM routing data in the context of multi-model integration. The authors introduce FusionBench, a comprehensive benchmark for LLM routing and fusion, and FusionFactory, a systematic framework for fusing LLM capabilities at three distinct levels: query, thought, and model. The work is grounded in the observation that LLM hosting platforms routinely generate rich routing data—mapping queries to responses from diverse models—yet this data is rarely leveraged for capability fusion beyond simple routing.
FusionBench: A Large-Scale Routing Benchmark
FusionBench is constructed from 103M tokens, covering 14 tasks across six domains (Math, Code, Commonsense Reasoning, World Knowledge, Reading Comprehension, and Popular Knowledge) and 20 open-source LLMs ranging from 8B to 671B parameters. For each query, FusionBench collects both direct (LLM-direct) and reasoning-augmented (LLM-think) responses, and introduces reusable "thought templates" distilled from top-performing LLMs. This design enables the study not only of which model to route a query to, but also of how to synthesize and transfer reasoning strategies across models.
Key features of FusionBench include:
- Multiple response patterns per query: Both direct and reasoning-based outputs are collected.
- Reusable thought templates: Abstracted reasoning steps are summarized from top-k LLM responses (a distillation sketch follows this list).
- Comprehensive coverage: 14 tasks, 20 LLMs, and a large token count enable robust, reproducible experiments.
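As referenced above, thought templates are distilled by abstracting the reasoning shared by the top-k responses to a query. A minimal sketch of such a distillation step is shown below; the distill_thought_template name and the prompt wording are illustrative assumptions, not the paper's exact procedure.

def distill_thought_template(query, top_k_responses, LLM):
    # Ask a strong LLM to abstract the reasoning shared by the best responses into a
    # short, reusable, query-agnostic template (prompt wording is illustrative).
    prompt = (
        "Summarize the reasoning strategy shared by the following solutions as a "
        "short, reusable step-by-step template.\n\n"
        f"Problem: {query}\n\n"
        + "\n\n".join(f"Solution {i + 1}:\n{r}" for i, r in enumerate(top_k_responses))
    )
    return LLM.generate(prompt)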
FusionFactory: Multi-Level LLM Fusion
FusionFactory operationalizes LLM fusion at three levels, each corresponding to a different stage in the reasoning and response pipeline:
1. Query-Level Fusion (Early Fusion)
A router is trained to select the most suitable LLM for each query, optimizing a reward function that balances performance, cost, and LLM-judge scores. The router can be implemented as a neural network (e.g., MLP, BERT-based classifier, or graph-based model like GraphRouter) that takes as input query/task embeddings and LLM features.
Implementation Example:

import numpy as np

def route_query(query, task, LLM_features, router_model):
    # Embed the query and task description (embed is a placeholder for any
    # sentence-embedding function)
    query_emb = embed(query)
    task_emb = embed(task)
    # Concatenate query, task, and candidate-LLM features into one input vector
    features = np.concatenate([query_emb, task_emb, LLM_features])
    # The trained router (e.g., MLP, BERT-based, or GraphRouter) predicts the best LLM
    LLM_idx = router_model.predict(features)
    return LLM_idx
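The reward the router optimizes is described as a balance of performance, cost, and LLM-judge score. A minimal sketch of one plausible linear form is shown below; the weights and the linear combination are illustrative assumptions, not the paper's exact formulation.

def routing_reward(performance, judge_score, cost, alpha=0.5, beta=0.5, lam=0.01):
    # Higher task performance and judge score are rewarded; inference cost is penalized.
    # alpha, beta, and lam are illustrative trade-off weights, not values from the paper.
    return alpha * performance + beta * judge_score - lam * cost

During router training, each (query, LLM) pair in the routing data can be labeled with such a reward, and the router is trained to predict the LLM with the highest expected reward.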
Empirical Results: GraphRouter achieves up to 16% improvement in reward over the best single LLM, and consistently outperforms static baselines across all scales.
2. Thought-Level Fusion (Mid Fusion)
This level leverages thought templates—generalized reasoning patterns distilled from top LLM responses to similar queries. For a new query, the system retrieves the most similar past queries (via embedding similarity), extracts their thought templates, and uses them as few-shot demonstrations to guide the response generation.
Implementation Example:

def generate_with_thoughts(query, thought_db, LLM):
    # Retrieve past queries most similar to the new query (by embedding similarity)
    similar_queries = retrieve_similar_queries(query, thought_db)
    # Look up the thought templates distilled from top-performing responses
    templates = [thought_db[q]['template'] for q in similar_queries]
    # Use the templates as few-shot demonstrations in the prompt
    prompt = build_prompt(query, templates)
    # Generate the final response with the chosen LLM
    response = LLM.generate(prompt)
    return response
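The retrieval step above is left abstract. A minimal sketch of retrieve_similar_queries is given below, assuming thought_db stores a precomputed embedding per past query; the 'embedding' field name and the embed helper (the same placeholder embedding function used earlier) are illustrative.

import numpy as np

def retrieve_similar_queries(query, thought_db, k=3):
    # Embed the new query with the same placeholder embedding function used above
    query_emb = embed(query)
    # Score each stored query by cosine similarity to the new query
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scored = [(q, cosine(query_emb, entry['embedding'])) for q, entry in thought_db.items()]
    # Return the k most similar past queries
    scored.sort(key=lambda item: item[1], reverse=True)
    return [q for q, _ in scored[:k]]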
Empirical Results: Thought-level fusion yields the highest average performance, with up to 21.3% improvement in math and 57.6% in code for small and large models, respectively. The hybrid selection strategy (combining performance and LLM-judge scores) provides the best trade-off between accuracy and response quality.
3. Model-Level Fusion (Late Fusion)
Here, the best responses (by performance or LLM-judge) from multiple LLMs are used as supervised fine-tuning data for a target model. This is analogous to knowledge distillation, where the student model learns from the diverse outputs of multiple teachers.
Implementation Example:

def distill_from_responses(train_data, base_LLM, k=1):
    # Build the SFT dataset: (query, top response) pairs, where the top responses per
    # query are selected by performance or LLM-judge score
    sft_data = [
        (q, r)
        for q, responses in train_data.items()
        for r in select_top_k_responses(responses, k)
    ]
    # Fine-tune the base model on the fused responses (distillation from multiple teachers)
    fine_tuned_LLM = supervised_finetune(base_LLM, sft_data)
    return fine_tuned_LLM
Empirical Results: Model-level fusion provides modest improvements in 4 out of 6 domains, but is less effective than thought-level fusion, especially in code tasks where domain mismatch and data scarcity are problematic. LLM-judge–based selection of responses for distillation outperforms strict metric-based selection.
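Consistent with that finding, the response-selection helper used in the sketch above can rank candidates by LLM-judge score rather than by the task metric alone. The sketch below assumes each candidate response is stored with 'judge_score', 'performance', and 'text' fields; these field names are illustrative.

def select_top_k_responses(responses, k=1, judge_weight=1.0, perf_weight=0.0):
    # Rank candidate responses from different LLMs; defaulting to pure judge-based
    # ranking mirrors the reported advantage of LLM-judge selection for distillation.
    scored = sorted(
        responses,
        key=lambda r: judge_weight * r['judge_score'] + perf_weight * r['performance'],
        reverse=True,
    )
    return [r['text'] for r in scored[:k]]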
Comparative Analysis and Practical Implications
A cross-level comparison reveals that:
- Thought-level fusion is most effective for tasks requiring complex reasoning (math, code, commonsense), but less so for factual recall (world knowledge).
- Query-level fusion is cost-efficient and practical for real-world deployment, requiring only a lightweight router and no model retraining.
- Model-level fusion is limited by overfitting and domain heterogeneity, especially when using a single student model for diverse tasks.
Notable empirical findings:
- FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks.
- The optimal fusion strategy varies by domain and task type.
- Gains from fusion are moderate in domains demanding strict factual accuracy or logical consistency.
Implementation Considerations
- Computational Requirements: Query-level fusion is lightweight; thought-level fusion requires retrieval and prompt construction; model-level fusion demands significant compute for fine-tuning.
- Data Management: FusionBench's tabular format (task, query, LLM, response, metrics) facilitates scalable experimentation and integration with retrieval systems (a minimal record sketch follows this list).
- Deployment: Query-level routers can be integrated into LLM serving platforms for dynamic model selection. Thought-level fusion can be implemented as a retrieval-augmented prompting layer. Model-level fusion is best suited for periodic model updates rather than online inference.
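As noted in the data-management item above, each FusionBench record is essentially a flat row. A minimal sketch of such a record as a Python dataclass is shown below; any structure beyond the fields named in the text (task, query, LLM, response, metrics) is an illustrative assumption.

from dataclasses import dataclass, field

@dataclass
class FusionBenchRecord:
    task: str            # e.g., math, code, world knowledge
    query: str           # the input query
    llm: str             # identifier of the responding model
    response: str        # LLM-direct or LLM-think response text
    metrics: dict = field(default_factory=dict)  # e.g., performance, cost, judge score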
Limitations and Future Directions
The paper notes that fusion may introduce noise or degrade performance in domains requiring deterministic answers (e.g., world knowledge, math). There is also a risk of overfitting or negative transfer in model-level fusion. Future work should address trustworthiness, bias, and calibration in the fusion process, and explore safeguards for robust and equitable outcomes.
Implications for AI Research and Practice
This work demonstrates that routing data is a valuable resource for capability fusion, not just model selection. The multi-level framework enables practitioners to tailor fusion strategies to specific application requirements, balancing performance, cost, and interpretability. The release of FusionBench and FusionFactory provides a foundation for systematic research on LLM integration, with direct applicability to multi-agent systems, ensemble methods, and retrieval-augmented generation.
Speculation: As LLM ecosystems continue to diversify, multi-level fusion frameworks leveraging routing data will become increasingly important for maximizing utility, efficiency, and robustness in real-world AI deployments. The modularity of FusionFactory suggests extensibility to other modalities (e.g., vision, speech) and integration with agent-based architectures.