- The paper presents a multi-level fusion framework that integrates query, thought, and model-level strategies to harness LLM routing data.
- FusionBench, built from 103M tokens over 14 tasks and 20 models, underpins the methodology by providing diverse routing and reasoning-augmented responses.
- Empirical results demonstrate up to 16% improvement in routing, 21.3% in math, and 57.6% in code tasks, validating the framework’s practical benefits.
Fusing LLM Capabilities with Routing Data: A Multi-Level Fusion Framework
The paper "Fusing LLM Capabilities with Routing Data" (2507.10540) addresses the underutilization of large-scale LLM routing data in the context of multi-model integration. The authors introduce FusionBench, a comprehensive benchmark for LLM routing and fusion, and FusionFactory, a systematic framework for fusing LLM capabilities at three distinct levels: query, thought, and model. The work is grounded in the observation that LLM hosting platforms routinely generate rich routing data—mapping queries to responses from diverse models—yet this data is rarely leveraged for capability fusion beyond simple routing.
FusionBench: A Large-Scale Routing Benchmark
FusionBench is constructed from 103M tokens, covering 14 tasks across six domains (Math, Code, Commonsense Reasoning, World Knowledge, Reading Comprehension, and Popular Knowledge) and 20 open-source LLMs ranging from 8B to 671B parameters. For each query, FusionBench collects both direct (LLM-direct) and reasoning-augmented (LLM-think) responses, and introduces reusable "thought templates" distilled from top-performing LLMs. This design enables the study not only of which model to route a query to, but also of how to synthesize and transfer reasoning strategies across models.
Key features of FusionBench include:
- Multiple response patterns per query: Both direct and reasoning-based outputs are collected.
- Reusable thought templates: Abstracted reasoning steps are summarized from top-k LLM responses (a distillation sketch follows this list).
- Comprehensive coverage: 14 tasks, 20 LLMs, and a large token count enable robust, reproducible experiments.
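As referenced above, thought templates are distilled by abstracting the reasoning shared by the top-k responses to a query. A minimal sketch of such a distillation step is shown below; the distill_thought_template name and the prompt wording are illustrative assumptions, not the paper's exact procedure.

def distill_thought_template(query, top_k_responses, LLM):
    # Ask a strong LLM to abstract the reasoning shared by the best responses into a
    # short, reusable, query-agnostic template (prompt wording is illustrative).
    prompt = (
        "Summarize the reasoning strategy shared by the following solutions as a "
        "short, reusable step-by-step template.\n\n"
        f"Problem: {query}\n\n"
        + "\n\n".join(f"Solution {i + 1}:\n{r}" for i, r in enumerate(top_k_responses))
    )
    return LLM.generate(prompt)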
FusionFactory: Multi-Level LLM Fusion
FusionFactory operationalizes LLM fusion at three levels, each corresponding to a different stage in the reasoning and response pipeline:
1. Query-Level Fusion (Early Fusion)
A router is trained to select the most suitable LLM for each query, optimizing a reward function that balances performance, cost, and LLM-judge scores. The router can be implemented as a neural network (e.g., MLP, BERT-based classifier, or graph-based model like GraphRouter) that takes as input query/task embeddings and LLM features.
Implementation Example:

import numpy as np

def route_query(query, task, LLM_features, router_model):
    # Embed the query and task description (embed is a placeholder for any
    # sentence-embedding function)
    query_emb = embed(query)
    task_emb = embed(task)
    # Concatenate query, task, and candidate-LLM features into one input vector
    features = np.concatenate([query_emb, task_emb, LLM_features])
    # The trained router (e.g., MLP, BERT-based, or GraphRouter) predicts the best LLM
    LLM_idx = router_model.predict(features)
    return LLM_idx
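The reward the router optimizes is described as a balance of performance, cost, and LLM-judge score. A minimal sketch of one plausible linear form is shown below; the weights and the linear combination are illustrative assumptions, not the paper's exact formulation.

def routing_reward(performance, judge_score, cost, alpha=0.5, beta=0.5, lam=0.01):
    # Higher task performance and judge score are rewarded; inference cost is penalized.
    # alpha, beta, and lam are illustrative trade-off weights, not values from the paper.
    return alpha * performance + beta * judge_score - lam * cost

During router training, each (query, LLM) pair in the routing data can be labeled with such a reward, and the router is trained to predict the LLM with the highest expected reward.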
Empirical Results: GraphRouter achieves up to 16% improvement in reward over the best single LLM, and consistently outperforms static baselines across all scales.
2. Thought-Level Fusion (Mid Fusion)
This level leverages thought templates—generalized reasoning patterns distilled from top LLM responses to similar queries. For a new query, the system retrieves the most similar past queries (via embedding similarity), extracts their thought templates, and uses them as few-shot demonstrations to guide the response generation.
Implementation Example:

def generate_with_thoughts(query, thought_db, LLM):
    # Retrieve past queries most similar to the new query (by embedding similarity)
    similar_queries = retrieve_similar_queries(query, thought_db)
    # Look up the thought templates distilled from top-performing responses
    templates = [thought_db[q]['template'] for q in similar_queries]
    # Use the templates as few-shot demonstrations in the prompt
    prompt = build_prompt(query, templates)
    # Generate the final response with the chosen LLM
    response = LLM.generate(prompt)
    return response
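The retrieval step above is left abstract. A minimal sketch of retrieve_similar_queries is given below, assuming thought_db stores a precomputed embedding per past query; the 'embedding' field name and the embed helper (the same placeholder embedding function used earlier) are illustrative.

import numpy as np

def retrieve_similar_queries(query, thought_db, k=3):
    # Embed the new query with the same placeholder embedding function used above
    query_emb = embed(query)
    # Score each stored query by cosine similarity to the new query
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scored = [(q, cosine(query_emb, entry['embedding'])) for q, entry in thought_db.items()]
    # Return the k most similar past queries
    scored.sort(key=lambda item: item[1], reverse=True)
    return [q for q, _ in scored[:k]]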
Empirical Results: Thought-level fusion yields the highest average performance, with up to 21.3% improvement in math and 57.6% in code for small and large models, respectively. The hybrid selection strategy (combining performance and LLM-judge scores) provides the best trade-off between accuracy and response quality.
3. Model-Level Fusion (Late Fusion)
Here, the best responses (by performance or LLM-judge) from multiple LLMs are used as supervised fine-tuning data for a target model. This is analogous to knowledge distillation, where the student model learns from the diverse outputs of multiple teachers.
Implementation Example:

def distill_from_responses(train_data, base_LLM, k=1):
    # Build the SFT dataset: (query, top response) pairs, where the top responses per
    # query are selected by performance or LLM-judge score
    sft_data = [
        (q, r)
        for q, responses in train_data.items()
        for r in select_top_k_responses(responses, k)
    ]
    # Fine-tune the base model on the fused responses (distillation from multiple teachers)
    fine_tuned_LLM = supervised_finetune(base_LLM, sft_data)
    return fine_tuned_LLM
Empirical Results: Model-level fusion provides modest improvements in 4 out of 6 domains, but is less effective than thought-level fusion, especially in code tasks where domain mismatch and data scarcity are problematic. LLM-judge–based selection of responses for distillation outperforms strict metric-based selection.
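Consistent with that finding, the response-selection helper used in the sketch above can rank candidates by LLM-judge score rather than by the task metric alone. The sketch below assumes each candidate response is stored with 'judge_score', 'performance', and 'text' fields; these field names are illustrative.

def select_top_k_responses(responses, k=1, judge_weight=1.0, perf_weight=0.0):
    # Rank candidate responses from different LLMs; defaulting to pure judge-based
    # ranking mirrors the reported advantage of LLM-judge selection for distillation.
    scored = sorted(
        responses,
        key=lambda r: judge_weight * r['judge_score'] + perf_weight * r['performance'],
        reverse=True,
    )
    return [r['text'] for r in scored[:k]]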
Comparative Analysis and Practical Implications
A cross-level comparison reveals that:
- Thought-level fusion is most effective for tasks requiring complex reasoning (math, code, commonsense), but less so for factual recall (world knowledge).
- Query-level fusion is cost-efficient and practical for real-world deployment, requiring only a lightweight router and no model retraining.
- Model-level fusion is limited by overfitting and domain heterogeneity, especially when using a single student model for diverse tasks.
Notable empirical findings:
- FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks.
- The optimal fusion strategy varies by domain and task type.
- Gains from fusion are moderate in domains demanding strict factual accuracy or logical consistency.
Implementation Considerations
- Computational Requirements: Query-level fusion is lightweight; thought-level fusion requires retrieval and prompt construction; model-level fusion demands significant compute for fine-tuning.
- Data Management: FusionBench's tabular format (task, query, LLM, response, metrics) facilitates scalable experimentation and integration with retrieval systems (a minimal record sketch follows this list).
- Deployment: Query-level routers can be integrated into LLM serving platforms for dynamic model selection. Thought-level fusion can be implemented as a retrieval-augmented prompting layer. Model-level fusion is best suited for periodic model updates rather than online inference.
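As noted in the data-management item above, each FusionBench record is essentially a flat row. A minimal sketch of such a record as a Python dataclass is shown below; any structure beyond the fields named in the text (task, query, LLM, response, metrics) is an illustrative assumption.

from dataclasses import dataclass, field

@dataclass
class FusionBenchRecord:
    task: str            # e.g., math, code, world knowledge
    query: str           # the input query
    llm: str             # identifier of the responding model
    response: str        # LLM-direct or LLM-think response text
    metrics: dict = field(default_factory=dict)  # e.g., performance, cost, judge score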
Limitations and Future Directions
The paper notes that fusion may introduce noise or degrade performance in domains requiring deterministic answers (e.g., world knowledge, math). There is also a risk of overfitting or negative transfer in model-level fusion. Future work should address trustworthiness, bias, and calibration in the fusion process, and explore safeguards for robust and equitable outcomes.
Implications for AI Research and Practice
This work demonstrates that routing data is a valuable resource for capability fusion, not just model selection. The multi-level framework enables practitioners to tailor fusion strategies to specific application requirements, balancing performance, cost, and interpretability. The release of FusionBench and FusionFactory provides a foundation for systematic research on LLM integration, with direct applicability to multi-agent systems, ensemble methods, and retrieval-augmented generation.
Speculation: As LLM ecosystems continue to diversify, multi-level fusion frameworks leveraging routing data will become increasingly important for maximizing utility, efficiency, and robustness in real-world AI deployments. The modularity of FusionFactory suggests extensibility to other modalities (e.g., vision, speech) and integration with agent-based architectures.