FuseChat-3.0: Efficient LLM Fusion

Updated 14 May 2026

FuseChat-3.0 is a suite of compact LLMs that fuses expertise from heterogeneous source models to deliver high performance across instruction, math, coding, and multilingual tasks.
The methodology uses two-stage training that combines supervised fine-tuning (SFT) and direct preference optimization (DPO) with reward-guided sampling and length normalization.
Benchmark evaluations show significant gains in instruction following and in average score, and merging the strengths of multiple models into one target removes the need to run several models at inference time.

FuseChat-3.0 is a suite of LLMs constructed by compactly integrating the expertise of multiple heterogeneous source models into smaller, more efficient targets. Through a dual-stage training methodology that leverages high-quality multi-source data and advanced preference-guided optimization, FuseChat-3.0 achieves state-of-the-art results across a range of benchmarks encompassing instruction following, general knowledge, mathematics, and coding. The approach aims to eliminate the overhead of multi-model inference by “implicitly” fusing the capabilities of large, complementary pretrained models into a single, deployable architecture while retaining most of their diverse strengths (Yang et al., 6 Mar 2025).

1. Model Selection and Fusion Protocol

FuseChat-3.0 utilizes diverse, high-capacity source LLMs, each excelling in distinct domains:

Sources (27–123B parameters):
- Gemma-2-27B-it
- Mistral-Large-Instruct-2407 (≈123B)
- Qwen-2.5-72B-Instruct
- Llama-3.1-70B-Instruct
Targets (1–9B parameters):
- Llama-3.1-8B-Instruct
- Gemma-2-9B-it
- Qwen-2.5-7B-Instruct
- Llama-3.2-3B-Instruct
- Llama-3.2-1B-Instruct

The core fusion strategy involves:

- Generating multi-source outputs: For a curated set of prompts, each source model produces multiple candidate responses.
- Supervised Fine-Tuning (SFT): The target LLM is fine-tuned using the highest-reward response from the ensemble for each prompt.
- Direct Preference Optimization (DPO): Targets are further refined using intra-model best/worst response pairs, applying preference constraints to distill reward-aligned behaviors observed within each source.

This protocol achieves an “implicit” fusion—transferring complementary expertise of the sources (e.g., translation, code generation) into a single, compact model, thereby preventing inference-time redundancy and computational overhead (Yang et al., 6 Mar 2025).
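The SFT selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Candidate` class, the `select_sft_target` helper, the source names, and the reward values are all hypothetical, and the rewards are assumed to come from a reward model that has already scored every response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source: str    # hypothetical name of the source model
    response: str  # text sampled from that source
    reward: float  # score from a reward model (made-up values below)


def select_sft_target(candidates):
    """Return the highest-reward candidate across all sources for one prompt."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda c: c.reward)


# Two hypothetical sources, two samples each, for a single prompt.
candidates = [
    Candidate("source_a", "answer A1", 0.12),
    Candidate("source_a", "answer A2", 0.18),
    Candidate("source_b", "answer B1", 0.21),
    Candidate("source_b", "answer B2", 0.09),
]

best = select_sft_target(candidates)
print(best.source, best.response)
```

Because the maximum is taken over the pooled candidates, different prompts can end up with SFT targets from different sources, which is one way to read the "implicit" fusion described here.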

2. Data Construction and Preference Scheme

The data pipeline is constructed to cover a broad distribution of tasks and languages:

Prompt Coverage (total ≈ 160K):
- Instruction following: 80,907 prompts (from UltraFeedback, Magpie-Pro-DPO, HelpSteer2, filtering code/math cases)
- Mathematics: ≈52K (from OpenMathInstruct-2)
- Coding: 16,005 (from LeetCode and Self-Oss-Instruct-SC2, validated by test cases)
- Chinese: ≈10K (from Alpaca-GPT4-Zh and Magpie-Qwen2-Pro-Zh, excluding code/math)
Response Sampling:
- For each prompt-source pair: 5 samples (instruction/math) or 8 samples (coding); Chinese prompts are answered by Qwen-2.5-72B-Instruct specifically.
- Sampling in vLLM: temperature 0.7–0.8, top-p 0.8–0.95, repetition penalty 1.05.
Preference Pair Construction:
- Each response is scored by ArmoRM-LLaMA3-8B-v0.1 (instruction) or by rule-based correctness plus reward model for math/coding.
- SFT: The highest-scoring response per prompt across all sources.
- DPO: For each source, the best and worst responses from that same model form a pair, with an RM-score gap between 0.01 and 0.1.
- Final dataset: 94,539 SFT samples, 64,128 DPO pairs.
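The intra-model pair construction in the list above can be sketched as below. This is an illustration under stated assumptions: the `build_dpo_pairs` helper and the sample data are hypothetical, and treating the 0.01–0.1 RM-score gap as an inclusive filter on each pair is an interpretation of the description, not a confirmed implementation detail.

```python
from collections import defaultdict


def build_dpo_pairs(candidates, min_gap=0.01, max_gap=0.1):
    """Build best/worst pairs within each source model for one prompt.

    candidates: iterable of (source, response, reward) tuples.
    A pair is kept only if its reward gap lies in [min_gap, max_gap]
    (assumed interpretation of the 0.01-0.1 gap).
    """
    by_source = defaultdict(list)
    for source, response, reward in candidates:
        by_source[source].append((response, reward))

    pairs = []
    for source, scored in by_source.items():
        if len(scored) < 2:
            continue  # a preference pair needs two distinct responses
        chosen = max(scored, key=lambda item: item[1])
        rejected = min(scored, key=lambda item: item[1])
        gap = chosen[1] - rejected[1]
        if min_gap <= gap <= max_gap:
            pairs.append({
                "source": source,
                "chosen": chosen[0],
                "rejected": rejected[0],
                "gap": gap,
            })
    return pairs


# Hypothetical scores: only source_a has a gap inside the allowed range.
candidates = [
    ("source_a", "A1", 0.12), ("source_a", "A2", 0.18),    # gap 0.06: kept
    ("source_b", "B1", 0.21), ("source_b", "B2", 0.05),    # gap 0.16: too large
    ("source_c", "C1", 0.150), ("source_c", "C2", 0.155),  # gap 0.005: too small
]

pairs = build_dpo_pairs(candidates)
print(pairs)
```

Pairing within a single source keeps the chosen and rejected responses stylistically similar, so the preference signal reflects quality rather than differences in style between models.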

This controlled, reward-guided data regime supports robust knowledge transfer and limits the reward bias that stylistic differences between source models would otherwise introduce (Yang et al., 6 Mar 2025).

3. Training Mechanism and Optimization

The pipeline comprises two major stages:

3.1. Supervised Fine-Tuning (SFT)

Target models are first optimized toward high-quality outputs with a token-level causal LM objective:

$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{SFT}}} \left[ \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x) \right]$

Key hyperparameters (all targets): 3 epochs, batch size 128, max sequence length 2048, optimizer with cosine LR decay and 10% warmup, model-specific learning rates.
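As a small worked instance of the SFT objective, the sketch below evaluates the negative log-likelihood for hand-picked token probabilities. The function names and the probabilities are illustrative assumptions. A real training run would obtain $p_\theta(y_t \mid y_{<t}, x)$ from the model's softmax output, and frameworks commonly average over tokens rather than sum.

```python
import math


def sft_loss(token_probs):
    """Negative log-likelihood of one response: -sum_t log p(y_t | y_<t, x)."""
    return -sum(math.log(p) for p in token_probs)


def batch_sft_loss(batch):
    """Monte Carlo estimate of the expectation: mean loss over (x, y) samples."""
    return sum(sft_loss(token_probs) for token_probs in batch) / len(batch)


# Hypothetical probabilities the model assigns to each target token.
probs = [0.5, 0.25, 0.5]
loss = sft_loss(probs)  # equals -log(0.5 * 0.25 * 0.5) = log(16)
batch_loss = batch_sft_loss([probs, [1.0, 1.0]])
print(loss, batch_loss)
```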

3.2. Direct Preference Optimization (DPO)

Preference optimization is applied using a Bradley–Terry policy loss, shaping behaviors through best/worst pairs:

$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}_{\mathrm{DPO}}} \left[ \log \sigma \left( s_\theta(x, y^+) - s_\theta(x, y^-) \right) \right]$

where $s_\theta(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]$.
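The per-pair DPO loss can be evaluated directly from sequence log-probabilities, as in the sketch below. The function names, the value `beta=0.1`, and the log-probabilities are illustrative assumptions. The source states only that β is set per model.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one (x, y+, y-) triple, from sequence log-probabilities."""
    s_pos = beta * (logp_pos - ref_logp_pos)  # s_theta(x, y+)
    s_neg = beta * (logp_neg - ref_logp_neg)  # s_theta(x, y-)
    return -log_sigmoid(s_pos - s_neg)


# When the policy equals the reference, both scores are zero and the loss is log 2.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the chosen response's log-probability (and lowering the rejected
# one) relative to the reference reduces the loss.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(at_init, improved)
```

Since the post-SFT model serves as the reference, training starts at the `at_init` case and the loss falls only as the policy separates chosen from rejected responses relative to that reference.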

A length-normalized DPO variant further corrects verbosity-induced biases:

$\mathcal{L}_{\mathrm{LN\text{-}DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-)} \left[ \log \sigma \left( \frac{\beta}{|y^+|}\Delta^+ - \frac{\beta}{|y^-|}\Delta^- \right) \right]$

with $\Delta^\pm = \log \pi_\theta(y^\pm \mid x) - \log \pi_{\mathrm{ref}}(y^\pm \mid x)$.
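The effect of length normalization can be seen in a small numeric sketch. The function names, `beta=1.0`, and the log-ratio values are illustrative assumptions, chosen so that both responses have the same per-token log-ratio (0.2) while the rejected response is ten times longer.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def vanilla_dpo_loss(delta_pos, delta_neg, beta):
    """Standard DPO: margin uses the summed log-ratios Delta+ and Delta-."""
    return -log_sigmoid(beta * delta_pos - beta * delta_neg)


def ln_dpo_loss(delta_pos, delta_neg, len_pos, len_neg, beta):
    """Length-normalized DPO: each log-ratio is divided by its token count."""
    margin = beta * delta_pos / len_pos - beta * delta_neg / len_neg
    return -log_sigmoid(margin)


# Chosen: 10 tokens, summed log-ratio 2.0. Rejected: 100 tokens, summed 20.0.
vanilla = vanilla_dpo_loss(2.0, 20.0, beta=1.0)
normalized = ln_dpo_loss(2.0, 20.0, len_pos=10, len_neg=100, beta=1.0)
print(vanilla, normalized)
```

In this example the summed log-ratio of the longer response dominates the vanilla margin purely because of its length, whereas the normalized margin is zero and the loss sits at log 2, so sequence length alone no longer drives the loss.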

DPO runs for a single epoch with batch size 128, uses the post-SFT model as the reference policy, and applies per-model β and learning-rate settings (Yang et al., 6 Mar 2025).

4. Benchmarking and Empirical Analysis

Evaluation spans fourteen standard benchmarks; representative results for the Llama-3.1-8B-Instruct target are shown below:

Benchmark	Base	+SFT	FuseChat-3.0	Gain vs Base
AlpacaEval-2	28.3	41.3	65.4	+37.1
Arena-Hard	28.1	38.7	58.2	+30.1
GSM8K (CoT)	85.9	87.0	88.0	+2.1
HumanEval	69.5	69.5	71.3	+1.8
Average (14)	40.5	43.2	47.3	+6.8

FuseChat-3.0, when instantiated on Llama-3.1-8B-Instruct, exhibits an average improvement of 6.8 points across all benchmarks, with pronounced gains in instruction following (+37.1 on AlpacaEval-2, +30.1 on Arena-Hard) over the base model. Compared to Tülu 3 (Llama-3.1-8B), FuseChat-3.0 achieves 47.3 vs 40.2 (+7.1 overall), especially excelling on instruction-following tasks (+11.4). Length-normalized DPO further yields a +1.1 average improvement over vanilla DPO (Yang et al., 6 Mar 2025).
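The gains quoted above follow from simple differences of the tabulated scores. The short sketch below recomputes them and also splits each gain into its SFT and DPO contributions. The variable and function names are illustrative, and the numbers are copied from the table.

```python
# (base, +SFT, FuseChat-3.0) scores copied from the benchmark table.
scores = {
    "AlpacaEval-2": (28.3, 41.3, 65.4),
    "Arena-Hard": (28.1, 38.7, 58.2),
    "GSM8K (CoT)": (85.9, 87.0, 88.0),
    "HumanEval": (69.5, 69.5, 71.3),
    "Average (14)": (40.5, 43.2, 47.3),
}


def diff(a, b):
    """Difference rounded to one decimal, matching the table's precision."""
    return round(a - b, 1)


total_gain = {name: diff(fused, base) for name, (base, _, fused) in scores.items()}
sft_gain = {name: diff(sft, base) for name, (base, sft, _) in scores.items()}
dpo_gain = {name: diff(fused, sft) for name, (_, sft, fused) in scores.items()}

for name in scores:
    print(f"{name}: total {total_gain[name]:+.1f} "
          f"(SFT {sft_gain[name]:+.1f}, DPO {dpo_gain[name]:+.1f})")
```

On these rows the DPO stage accounts for the larger share of the improvement on the two instruction-following benchmarks, while SFT alone leaves HumanEval unchanged.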

5. Architectural and Implementation Considerations

- FuseChat-3.0 preserves the original transformer architectures of its target models, introducing no new layers or modules.
- Parameter counts coincide with targets: 8B, 9B, 7B, 3B, 1B.
- SFT leverages Llama-Factory; DPO employs the alignment-handbook, both via HuggingFace.
- Inference is performed using vLLM.
- Training was conducted on 8–16 A100 GPUs, with each stage requiring approximately one day.

All code, datasets, and model weights are accessible at https://github.com/SLIT-AI/FuseChat-3.0 (Yang et al., 6 Mar 2025).

6. Analysis, Limitations, and Future Directions

The effectiveness of FuseChat-3.0 is attributed to:

- SFT-based implicit fusion: Alignment to high-quality, multi-source outputs.
- Reward-model-guided DPO: Intra-model preference pairs reduce style-conformance bias/variance versus inter-model pairings.
- Length normalization: Corrects verbosity bias typical in DPO for long outputs.

Limitations and challenges include:

- Minor regression in code generation for ultra-compact targets, likely due to the smaller share of coding data.
- Absence of a DPO phase for Chinese language tasks; future work may develop specialized reward models for improved multilingual preference optimization.
- Open questions remain about scalability to larger numbers of sources, robust multilingual fusion, and generalized multi-task scenarios.

Plausible extensions comprise:

- Incorporation of human-annotated preferences, potentially hybridizing with RLHF protocols.
- Extension of the fusion paradigm to multi-modal domains, including vision and speech.
- Dynamic, reward-weighted source sampling in DPO to refine knowledge transfer efficacy.

A plausible implication is that this two-stage fusion protocol, leveraging carefully constructed multi-source data and intra-model preference feedback, provides a generalizable blueprint for scalable, efficient, and high-performing LLM distillation and fusion (Yang et al., 6 Mar 2025).


References (1)

Yang et al. (2025). FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion.
