FuseChat-3.0: Efficient LLM Fusion
- FuseChat-3.0 is a suite of compact LLMs that fuses expertise from heterogeneous source models to deliver high performance across instruction, math, coding, and multilingual tasks.
- The methodology employs a two-stage training pipeline that combines supervised fine-tuning (SFT) and direct preference optimization (DPO) with reward-guided sampling and length normalization.
- Benchmark evaluations show significant gains in instruction following and in the overall average. Because the strengths of several models are merged implicitly into one, no multi-model inference is needed.
FuseChat-3.0 is a suite of LLMs constructed by compactly integrating the expertise of multiple heterogeneous source models into smaller, more efficient targets. Through a dual-stage training methodology that leverages high-quality multi-source data and advanced preference-guided optimization, FuseChat-3.0 achieves state-of-the-art results across a range of benchmarks encompassing instruction following, general knowledge, mathematics, and coding. The approach aims to eliminate the overhead of multi-model inference by “implicitly” fusing the capabilities of large, complementary pretrained models into a single, deployable architecture while retaining most of their diverse strengths (Yang et al., 6 Mar 2025).
1. Model Selection and Fusion Protocol
FuseChat-3.0 utilizes diverse, high-capacity source LLMs, each excelling in distinct domains:
- Sources (27–123B parameters):
- Gemma-2-27B-it
- Mistral-Large-Instruct-2407 (≈123B)
- Qwen-2.5-72B-Instruct
- Llama-3.1-70B-Instruct
- Targets (1–9B parameters):
- Llama-3.1-8B-Instruct
- Gemma-2-9B-it
- Qwen-2.5-7B-Instruct
- Llama-3.2-3B-Instruct
- Llama-3.2-1B-Instruct
The core fusion strategy involves:
- Generating multi-source outputs: For a curated set of prompts, each source model produces multiple candidate responses.
- Supervised Fine-Tuning (SFT): The target LLM is fine-tuned using the highest-reward response from the ensemble for each prompt.
- Direct Preference Optimization (DPO): Targets are further refined using intra-model best/worst response pairs, applying preference constraints to distill reward-aligned behaviors observed within each source.
This protocol achieves an “implicit” fusion: the complementary expertise of the sources (e.g., translation, code generation) is transferred into a single, compact model, so no source model has to be run at inference time (Yang et al., 6 Mar 2025).
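The SFT selection step described above can be sketched as a simple argmax over all scored candidates for a prompt. This is a minimal illustration, not the authors' implementation: the `Candidates` layout, the function name `select_sft_target`, and the example scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: candidates[source_name] = [(response_text, reward_score), ...]
Candidates = Dict[str, List[Tuple[str, float]]]


def select_sft_target(candidates: Candidates) -> Tuple[str, str, float]:
    """Return (source, response, score) with the highest reward across ALL sources."""
    best = None
    for source, scored in candidates.items():
        for response, score in scored:
            if best is None or score > best[2]:
                best = (source, response, score)
    if best is None:
        raise ValueError("no candidate responses for this prompt")
    return best


# Made-up scores for a single prompt, two sources, two samples each.
example = {
    "gemma-2-27b-it": [("answer A1", 0.12), ("answer A2", 0.18)],
    "qwen-2.5-72b-instruct": [("answer B1", 0.21), ("answer B2", 0.09)],
}

source, response, score = select_sft_target(example)
print(source, response, score)
```

Because the maximum is taken across sources, the SFT target for each prompt can come from a different source model, which is how complementary strengths end up in one training set.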
2. Data Construction and Preference Scheme
The data pipeline is built to cover a broad distribution of tasks and languages:
- Prompt Coverage (total ≈ 160K):
- Instruction following: 80,907 prompts (from UltraFeedback, Magpie-Pro-DPO, HelpSteer2, filtering code/math cases)
- Mathematics: ≈52K (from OpenMathInstruct-2)
- Coding: 16,005 (from LeetCode and Self-Oss-Instruct-SC2, validated by test cases)
- Chinese: ≈10K (from Alpaca-GPT4-Zh and Magpie-Qwen2-Pro-Zh, excluding code/math)
- Response Sampling:
- For each prompt-source pair: 5 samples for instruction-following and math prompts, 8 samples for coding prompts; Chinese prompts are answered by Qwen-2.5-72B-Instruct only.
- Sampling in vLLM: temperature 0.7–0.8, top-p 0.8–0.95, repetition penalty 1.05.
- Preference Pair Construction:
- Each response is scored by ArmoRM-LLaMA3-8B-v0.1 (instruction) or by rule-based correctness plus reward model for math/coding.
- SFT: The highest-scoring response per prompt across all sources.
- DPO: For each source, best/worst intra-model pairs with RM-score gap (0.01–0.1).
- Final dataset: 94,539 SFT samples, 64,128 DPO pairs.
This controlled, reward-guided data regime supports robust knowledge transfer, and restricting each preference pair to a single source limits reward bias arising from stylistic differences between models (Yang et al., 6 Mar 2025).
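The intra-model pairing rule can be sketched as follows. The sketch assumes that the 0.01–0.1 range is applied as a filter on the score difference between the best and worst response of one source; the source text does not spell out the exact filtering logic, and the function names and example scores are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Scored = List[Tuple[str, float]]  # hypothetical: [(response_text, rm_score), ...]


def build_dpo_pair(
    scored: Scored, min_gap: float = 0.01, max_gap: float = 0.1
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from ONE source, or None if the gap is out of range."""
    if len(scored) < 2:
        return None
    chosen = max(scored, key=lambda rs: rs[1])
    rejected = min(scored, key=lambda rs: rs[1])
    gap = chosen[1] - rejected[1]
    if min_gap <= gap <= max_gap:
        return chosen[0], rejected[0]
    return None


def build_dpo_pairs(candidates: Dict[str, Scored]) -> Dict[str, Tuple[str, str]]:
    """Build at most one pair per source; responses are never paired across sources."""
    pairs = {}
    for source, scored in candidates.items():
        pair = build_dpo_pair(scored)
        if pair is not None:
            pairs[source] = pair
    return pairs


# Made-up scores for one prompt.
example = {
    "source_a": [("a1", 0.12), ("a2", 0.18), ("a3", 0.15)],  # gap 0.06: kept
    "source_b": [("b1", 0.40), ("b2", 0.05)],                # gap 0.35: too large
    "source_c": [("c1", 0.200), ("c2", 0.205)],              # gap 0.005: too small
}

print(build_dpo_pairs(example))
```

Under this reading, very small gaps are dropped as uninformative and very large gaps are dropped as likely outliers, so the surviving pairs carry a moderate preference signal.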
3. Training Mechanism and Optimization
The pipeline comprises two major stages:
3.1. Supervised Fine-Tuning (SFT)
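Target models are first optimized toward high-quality outputs with a token-level causal LM objective:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

where $x$ is the prompt, $y$ is the selected highest-reward response, and $\pi_\theta$ is the target model.

Key hyperparameters (all targets): 3 epochs, batch size 128, max sequence length 2048, optimizer with cosine LR decay and 10% warmup, model-specific learning rates.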
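To make the schedule concrete, the sketch below computes the step count implied by these hyperparameters and a cosine schedule with 10% linear warmup. It assumes that all 94,539 SFT samples are used for one target without sequence packing, and `PEAK_LR` is a placeholder because the source only says learning rates are model-specific.

```python
import math

SFT_SAMPLES = 94_539   # SFT set size reported in the dataset summary
BATCH_SIZE = 128
EPOCHS = 3
WARMUP_RATIO = 0.10
PEAK_LR = 5e-6         # placeholder value, NOT taken from the source

steps_per_epoch = math.ceil(SFT_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at_step(step: int, total: int = total_steps,
               warmup: int = warmup_steps, peak: float = PEAK_LR) -> float:
    """Linear warmup from 0 to peak, then cosine decay from peak to 0."""
    if warmup > 0 and step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    progress = min(1.0, progress)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for s in (0, warmup_steps, total_steps // 2, total_steps):
    print(s, lr_at_step(s))
```

The learning rate peaks exactly when warmup ends and reaches zero at the final step; a real run would typically delegate this to the training framework's scheduler rather than hand-rolled code.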
3.2. Direct Preference Optimization (DPO)
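Preference optimization is applied using a Bradley–Terry policy loss, shaping behaviors through best/worst pairs:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{DPO}}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\pi_{\text{ref}}$ is the reference policy, $\sigma$ is the logistic sigmoid, and $\beta$ controls how far the policy may deviate from the reference.

A length-normalized DPO variant further corrects verbosity-induced biases:

$$\mathcal{L}_{\text{LN-DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{DPO}}} \left[ \log \sigma\!\left( \frac{\beta}{|y_w|} \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \frac{\beta}{|y_l|} \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

with $|y|$ denoting the number of tokens in response $y$.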
DPO runs for a single epoch with batch size 128, using the post-SFT model as the reference policy; β and the learning rate are set per model (Yang et al., 6 Mar 2025).
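For a single preference pair, the standard DPO loss and a length-normalized variant can be sketched in plain Python. The inputs are assumed to be summed sequence log-probabilities under the policy and the reference model; the function names, the default `beta`, and the example numbers are illustrative and not taken from the source.

```python
import math
from typing import Optional


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1,
             len_w: Optional[int] = None, len_l: Optional[int] = None) -> float:
    """DPO loss for one (chosen, rejected) pair.

    pol_* / ref_* are summed log-probs of the chosen (w) and rejected (l)
    responses under the policy and the reference model. If both token
    lengths are given, each log-ratio is divided by its response length.
    """
    ratio_w = pol_w - ref_w
    ratio_l = pol_l - ref_l
    if len_w is not None and len_l is not None:
        ratio_w = ratio_w / len_w
        ratio_l = ratio_l / len_l
    return -log_sigmoid(beta * (ratio_w - ratio_l))


# Made-up log-probs: the chosen response is long (100 tokens), the rejected one short (20).
vanilla = dpo_loss(-40.0, -30.0, -42.0, -29.0)
normalized = dpo_loss(-40.0, -30.0, -42.0, -29.0, len_w=100, len_l=20)
print(vanilla, normalized)
```

Dividing each log-ratio by its token count means a response cannot gain or lose margin simply by being longer, which is the verbosity effect the length-normalized variant targets. When the policy equals the reference, both forms reduce to log 2.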
4. Benchmarking and Empirical Analysis
Evaluation spans fourteen standard benchmarks. Selected results for the Llama-3.1-8B-Instruct target:
| Benchmark | Base | +SFT | FuseChat-3.0 | Gain vs Base |
|---|---|---|---|---|
| AlpacaEval-2 | 28.3 | 41.3 | 65.4 | +37.1 |
| Arena-Hard | 28.1 | 38.7 | 58.2 | +30.1 |
| GSM8K (CoT) | 85.9 | 87.0 | 88.0 | +2.1 |
| HumanEval | 69.5 | 69.5 | 71.3 | +1.8 |
| Average (14) | 40.5 | 43.2 | 47.3 | +6.8 |
FuseChat-3.0, when instantiated on Llama-3.1-8B-Instruct, improves the fourteen-benchmark average by 6.8 points over the base model, with pronounced gains in instruction following (+37.1 on AlpacaEval-2, +30.1 on Arena-Hard). Compared with Tülu 3 (Llama-3.1-8B), FuseChat-3.0 averages 47.3 vs 40.2 (+7.1), with the largest margin on instruction-following tasks (+11.4). Length-normalized DPO yields a further +1.1 average improvement over vanilla DPO (Yang et al., 6 Mar 2025).
5. Architectural and Implementation Considerations
- FuseChat-3.0 preserves the original transformer architectures of its target models, introducing no new layers or modules.
- Parameter counts coincide with targets: 8B, 9B, 7B, 3B, 1B.
- SFT leverages Llama-Factory; DPO employs the alignment-handbook, both via HuggingFace.
- Inference is performed using vLLM.
- Training was conducted on 8–16 A100 GPUs, with each stage requiring approximately one day.
All code, datasets, and model weights are accessible at https://github.com/SLIT-AI/FuseChat-3.0 (Yang et al., 6 Mar 2025).
6. Analysis, Limitations, and Future Directions
The effectiveness of FuseChat-3.0 is attributed to:
- SFT-based implicit fusion: Alignment to high-quality, multi-source outputs.
- Reward-model-guided DPO: Intra-model preference pairs reduce style-conformance bias/variance versus inter-model pairings.
- Length normalization: Corrects verbosity bias typical in DPO for long outputs.
Limitations and challenges include:
- Minor regression in code-generation for ultra-compact targets, likely due to lower coding data representation.
- Absence of a DPO phase for Chinese language tasks—future work may develop specialized reward models for improved multilingual preference optimization.
- Open questions pertain to scalability across larger numbers of sources, robust multilingual fusion, and generalized multi-task scenarios.
Plausible extensions comprise:
- Incorporation of human-annotated preferences, potentially hybridizing with RLHF protocols.
- Extension of the fusion paradigm to multi-modal domains, including vision and speech.
- Dynamic, reward-weighted source sampling in DPO to refine knowledge transfer efficacy.
A plausible implication is that this two-stage fusion protocol, leveraging carefully constructed multi-source data and intra-model preference feedback, provides a generalizable blueprint for scalable, efficient, and high-performing LLM distillation and fusion (Yang et al., 6 Mar 2025).