- The paper presents a multi-level fusion framework that integrates LLM routing data at query, thought, and model levels to optimize performance and cost.
- Empirical results demonstrate up to a 16% reward improvement from query-level routing and substantial gains on reasoning-intensive tasks from thought-level fusion.
- The study reveals practical trade-offs where query-level fusion offers cost-efficient deployment while model-level fusion faces overfitting risks on heterogeneous tasks.
Fusing LLM Capabilities with Routing Data: A Multi-Level Framework for LLM Integration
The paper "Fusing LLM Capabilities with Routing Data" (2507.10540) addresses the underutilization of large-scale LLM routing data in the context of multi-model integration. The authors introduce FusionBench, a comprehensive benchmark for LLM routing and fusion, and FusionFactory, a systematic framework for fusing LLM capabilities at three distinct levels: query, thought, and model. The work is grounded in the observation that LLM hosting platforms routinely collect rich routing data—user queries and corresponding responses from multiple LLMs—which encodes valuable information about the comparative strengths and weaknesses of different models across a wide range of tasks.
FusionBench: A Large-Scale Routing Benchmark
FusionBench is constructed to support research on LLM capability fusion by providing:
- Diverse Response Patterns: For each query, responses are collected from 20 open-source LLMs (ranging from 8B to 671B parameters) using both direct and reasoning-augmented prompting. This results in a dataset of 103M tokens covering 14 tasks across six domains (Math, Code, Commonsense Reasoning, World Knowledge, Reading Comprehension, and Popular Knowledge).
- Reusable Thought Templates: For each query, the top-k responses (by performance or LLM-judge score) are summarized into abstract thought templates, enabling mid-level fusion strategies.
- Comprehensive Evaluation Signals: Each response is annotated with task-specific metrics, token cost estimates, and LLM-judge scores that assess not only correctness but also reasoning quality and suitability for supervised fine-tuning.
This benchmark enables systematic exploration of LLM fusion strategies and provides a reproducible foundation for future research.
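For concreteness, a routing-data entry can be pictured as a structured record like the following Python sketch; the field names and types are illustrative assumptions, not FusionBench's released schema.

```python
# Illustrative layout of one routing-data record (field names are assumptions,
# not the benchmark's actual schema).
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LLMResponse:
    model_name: str        # one of the 20 open-source LLMs (8B-671B)
    prompt_style: str      # "direct" or "reasoning-augmented"
    text: str              # raw model output
    task_metric: float     # task-specific score (e.g., exact match, pass@1)
    token_cost: float      # estimated token cost of the call
    judge_score: float     # LLM-judge assessment of correctness and reasoning quality

@dataclass
class RoutingRecord:
    query: str
    task: str                                              # one of the 14 tasks
    domain: str                                             # Math, Code, ...
    responses: Dict[str, LLMResponse] = field(default_factory=dict)
    thought_template: str = ""                              # summary distilled from top-k responses
```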
FusionFactory: Multi-Level LLM Fusion Framework
FusionFactory operationalizes LLM fusion at three representative levels, each corresponding to a different stage in the LLM inference pipeline:
1. Query-Level Fusion (Early Fusion)
At this level, a router is trained to select the most appropriate LLM for each incoming query, optimizing a reward function that balances performance, cost, and LLM-judge quality. The router can be implemented using various architectures (e.g., KNN, SVM, MLP, BERT, or graph-based models like GraphRouter). The reward function is parameterized as:
Reward = α · Performance − β · Cost + γ · LLM-Judge
Empirical results show that advanced routers (notably GraphRouter) consistently outperform the best single LLM and even the largest LLMs, achieving up to a 16% improvement in reward and a 5.7% gain in LLM-judge score. Query-level fusion is also the most cost-efficient approach, requiring only lightweight router training.
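As a minimal illustration of this reward, the sketch below scores each candidate LLM and routes to the argmax; the weight values and per-model statistics are made up for the example and would in practice come from a trained router's predictions.

```python
# Minimal sketch of query-level routing with the composite reward
# Reward = alpha * performance - beta * cost + gamma * judge_score.
# Weights and candidate statistics below are illustrative assumptions.
def routing_reward(performance: float, cost: float, judge_score: float,
                   alpha: float = 1.0, beta: float = 0.1, gamma: float = 0.5) -> float:
    """Composite reward balancing quality, token cost, and LLM-judge score."""
    return alpha * performance - beta * cost + gamma * judge_score

def pick_llm(candidates: dict) -> str:
    """Route to the model with the highest predicted reward.

    `candidates` maps model name -> (performance, cost, judge_score) predictions."""
    return max(candidates, key=lambda name: routing_reward(*candidates[name]))

# Example: hypothetical predictions for two candidate models.
print(pick_llm({"small-8b": (0.62, 0.01, 0.70), "large-70b": (0.78, 0.09, 0.85)}))
```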
2. Thought-Level Fusion (Mid Fusion)
This level leverages the diversity of reasoning patterns across LLMs by constructing thought templates—generalized reasoning strategies distilled from top-performing responses. For a new query, similar past queries are retrieved via embedding similarity, and their thought templates are used as few-shot demonstrations to guide the LLM's response.
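A minimal sketch of this retrieval-and-prompting step is shown below; the embedding model and prompt wording are assumptions for illustration rather than the paper's exact configuration.

```python
# Thought-level fusion sketch: retrieve templates of similar past queries by
# embedding similarity and use them as few-shot demonstrations.
# The encoder choice and prompt format are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_templates(query: str, past_queries: list[str],
                       templates: list[str], k: int = 3) -> list[str]:
    """Return thought templates of the k most similar past queries."""
    q_vec = encoder.encode([query], normalize_embeddings=True)
    p_vecs = encoder.encode(past_queries, normalize_embeddings=True)
    sims = (p_vecs @ q_vec.T).ravel()        # cosine similarity (embeddings are normalized)
    top = np.argsort(-sims)[:k]
    return [templates[i] for i in top]

def build_prompt(query: str, demos: list[str]) -> str:
    """Assemble a few-shot prompt from the retrieved reasoning strategies."""
    shots = "\n\n".join(f"Reasoning strategy {i + 1}:\n{t}" for i, t in enumerate(demos))
    return f"{shots}\n\nApply a suitable strategy to the new problem.\nQuestion: {query}"
```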
Key findings include:
- Substantial Performance Gains: Thought-level fusion yields the largest improvements, especially on reasoning-intensive tasks (e.g., +21.3% in math for small models, +57.6% in code for large models).
- Hybrid Selection Strategy: Combining performance and LLM-judge criteria for selecting demonstration responses provides the best trade-off between accuracy and reasoning quality.
- Diminishing Returns: Increasing the size of the summarizer or the number of responses does not always yield further gains, but does increase computational cost.
3. Model-Level Fusion (Late Fusion)
Model-level fusion is implemented via supervised fine-tuning (SFT) of a base LLM on a dataset constructed from the top-k responses (by performance or LLM-judge score) for each query. This approach is analogous to knowledge distillation and imitation learning.
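The data-construction step can be sketched as follows, reusing the illustrative record layout from above; selecting by LLM-judge score and keeping a small k reflects the paper's findings rather than a prescribed recipe.

```python
# Sketch of building an SFT dataset from routing data: for each query, keep the
# top-k responses ranked by LLM-judge score and emit (prompt, completion) pairs.
def build_sft_dataset(records, k: int = 1):
    """Yield supervised fine-tuning pairs from routing records (illustrative schema)."""
    for rec in records:
        ranked = sorted(rec.responses.values(), key=lambda r: r.judge_score, reverse=True)
        for resp in ranked[:k]:          # larger k increases overfitting risk (see below)
            yield {"prompt": rec.query, "completion": resp.text}
```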
Empirical observations:
- Modest Average Improvements: Model-level fusion provides gains in 4 out of 6 domains but can underperform in code tasks due to domain mismatch and data scarcity.
- LLM-Judge Supervision: Using LLM-judge scores for response selection yields the best results, as it provides a more nuanced assessment than binary task metrics.
- Overfitting Risk: Training on multiple responses per query can lead to overfitting and reduced generalization, especially when fusing heterogeneous tasks.
Comparative Analysis and Practical Implications
A cross-level comparison reveals that:
- Thought-Level Fusion Achieves the Best Overall Performance: Especially when using hybrid selection and large summarizers, thought-level fusion outperforms both query- and model-level approaches across most domains.
- Query-Level Fusion Offers the Best Efficiency-Performance Trade-off: It is highly practical for deployment, requiring minimal additional computation beyond router training.
- Model-Level Fusion is Least Effective: It is susceptible to overfitting and struggles with heterogeneous task requirements.
Domain-specific analysis indicates that fusion is most challenging in World Knowledge and Math, where factual accuracy and strict logical consistency are paramount and multi-model collaboration can introduce noise.
Implementation Considerations
- Computational Requirements: FusionBench and FusionFactory require access to multiple LLMs and significant compute for data collection, template generation, and fine-tuning. However, query-level fusion can be deployed with minimal overhead.
- Router Training: For practical deployment, lightweight routers (e.g., MLPs or compact transformers) can be trained on routing logs, leveraging embeddings from models like MiniLM or Longformer (a minimal training sketch follows this list).
- Template Summarization: Thought templates can be generated using a strong LLM (e.g., LLaMA-3 70B) and stored for efficient retrieval during inference.
- Fine-Tuning: Model-level fusion should be approached with caution, using careful selection of training data and regularization to mitigate overfitting.
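A minimal version of such a lightweight router, assuming routing logs labeled with the best-performing LLM per query, might look like the following; the encoder, classifier, and hyperparameters are illustrative choices.

```python
# Sketch of a deployable lightweight router: embed logged queries with MiniLM and
# train a small MLP to predict the best LLM. All choices here are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

def train_router(logged_queries: list[str], best_llm_labels: list[str]):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(logged_queries)                  # query embeddings from routing logs
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
    clf.fit(X, best_llm_labels)                         # label = LLM with highest observed reward
    return encoder, clf

def route(encoder, clf, query: str) -> str:
    """Pick an LLM for a new query."""
    return clf.predict(encoder.encode([query]))[0]
```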
Broader Implications and Future Directions
This work demonstrates that large-scale routing data is a valuable resource for both model selection and capability fusion. The multi-level framework enables practitioners to flexibly integrate diverse LLMs, optimizing for performance, cost, and reasoning quality according to application requirements.
Potential future directions include:
- Trustworthiness and Bias Mitigation: Investigating how fusion strategies affect the reliability and fairness of LLM outputs, and developing safeguards against the propagation of biases or errors.
- Dynamic and Adaptive Fusion: Exploring online learning and reinforcement learning approaches for adaptive router and template selection in changing environments.
- Extension to Multimodal and Multilingual Settings: Applying the framework to settings involving vision-LLMs or cross-lingual tasks.
The release of FusionBench and FusionFactory provides a foundation for systematic research on LLM integration, with direct applicability to real-world LLM hosting platforms, multi-agent systems, and cost-sensitive AI deployments.