FuseChat-3.0: Efficient LLM Fusion

Updated 14 May 2026
  • FuseChat-3.0 is a suite of compact LLMs that fuses expertise from heterogeneous source models to deliver high performance across instruction, math, coding, and multilingual tasks.
  • The methodology uses two-stage training that combines supervised fine-tuning (SFT) and direct preference optimization (DPO), with reward-guided sampling and length normalization.
  • Benchmark evaluations show significant gains in instruction-following and overall metrics; fusing the strengths of multiple models into one also avoids the redundancy of multi-model inference.

FuseChat-3.0 is a suite of LLMs constructed by compactly integrating the expertise of multiple heterogeneous source models into smaller, more efficient targets. Through a dual-stage training methodology that leverages high-quality multi-source data and advanced preference-guided optimization, FuseChat-3.0 achieves state-of-the-art results across a range of benchmarks encompassing instruction following, general knowledge, mathematics, and coding. The approach aims to eliminate the overhead of multi-model inference by “implicitly” fusing the capabilities of large, complementary pretrained models into a single, deployable architecture while retaining most of their diverse strengths (Yang et al., 6 Mar 2025).

1. Model Selection and Fusion Protocol

FuseChat-3.0 draws on diverse, high-capacity source LLMs, each excelling in distinct domains.

The core fusion strategy involves:

  • Generating multi-source outputs: For a curated set of prompts, each source model produces multiple candidate responses.
  • Supervised Fine-Tuning (SFT): The target LLM is fine-tuned using the highest-reward response from the ensemble for each prompt.
  • Direct Preference Optimization (DPO): Targets are further refined using intra-model best/worst response pairs, applying preference constraints to distill reward-aligned behaviors observed within each source.

This protocol achieves an “implicit” fusion: it transfers the complementary expertise of the sources (e.g., translation, code generation) into a single, compact model, avoiding inference-time redundancy and computational overhead (Yang et al., 6 Mar 2025).
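The SFT selection step described above can be illustrated with a minimal sketch. The container layout, the function name `select_sft_targets`, the source names, and the toy rewards are all hypothetical and chosen only for illustration; the source does not specify how candidates are stored.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: prompt -> source model name -> list of (response, reward).
Candidates = Dict[str, Dict[str, List[Tuple[str, float]]]]


def select_sft_targets(candidates: Candidates) -> Dict[str, Tuple[str, str, float]]:
    """For each prompt, keep the single highest-reward response across all sources."""
    targets: Dict[str, Tuple[str, str, float]] = {}
    for prompt, by_source in candidates.items():
        best = None
        for source, scored in by_source.items():
            for response, reward in scored:
                # Track the best (source, response, reward) seen so far.
                if best is None or reward > best[2]:
                    best = (source, response, reward)
        if best is not None:
            targets[prompt] = best
    return targets


# Toy data with made-up rewards (not taken from the paper).
toy = {
    "Translate 'hello' to French.": {
        "source_a": [("Bonjour.", 0.82), ("Salut!", 0.74)],
        "source_b": [("Bonjour !", 0.88), ("Allo.", 0.41)],
    }
}
sft_targets = select_sft_targets(toy)
```

Because the maximum is taken across sources, a target trained on these responses may see a different "teacher" for each prompt, which is one way to read the "implicit" fusion described above.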

2. Data Construction and Preference Scheme

The data pipeline is built to cover a broad distribution of tasks and languages:

  • Prompt Coverage (total ≈ 160K):
    • Instruction following: 80,907 prompts (from UltraFeedback, Magpie-Pro-DPO, HelpSteer2, filtering code/math cases)
    • Mathematics: ≈52K (from OpenMathInstruct-2)
    • Coding: 16,005 (from LeetCode and Self-Oss-Instruct-SC2, validated by test cases)
    • Chinese: ≈10K (from Alpaca-GPT4-Zh and Magpie-Qwen2-Pro-Zh, excluding code/math)
  • Response Sampling:
    • For each prompt-source pair: 5 samples (instruction following and math) or 8 samples (coding); Chinese responses are sampled specifically from Qwen-2.5-72B-Instruct.
    • Sampling in vLLM: temperature 0.7–0.8, top-p 0.8–0.95, repetition penalty 1.05.
  • Preference Pair Construction:
    • Each response is scored by ArmoRM-LLaMA3-8B-v0.1 (instruction) or by rule-based correctness plus reward model for math/coding.
    • SFT: The highest-scoring response per prompt across all sources.
    • DPO: For each source, best/worst intra-model response pairs with a reward-model score gap in the range 0.01–0.1.
    • Final dataset: 94,539 SFT samples, 64,128 DPO pairs.

This controlled, reward-guided data regime supports robust knowledge transfer and suppresses negative reward-style bias (Yang et al., 6 Mar 2025).
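The intra-model pair construction can be sketched as follows. This is an illustration under stated assumptions: the source gives only the gap range (0.01–0.1), so the inclusive bounds, the choice of the single best and single worst sample, and the helper name `build_intra_model_pair` are assumptions, and the scores are invented.

```python
from typing import List, Optional, Tuple

# Gap bounds quoted in the text; treating them as inclusive is an assumption.
MIN_GAP, MAX_GAP = 0.01, 0.1


def build_intra_model_pair(
    scored: List[Tuple[str, float]],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from ONE source's samples for one prompt.

    Returns None when there are fewer than two samples or when the
    best/worst score gap falls outside [MIN_GAP, MAX_GAP].
    """
    if len(scored) < 2:
        return None
    best = max(scored, key=lambda item: item[1])
    worst = min(scored, key=lambda item: item[1])
    gap = best[1] - worst[1]
    if MIN_GAP <= gap <= MAX_GAP:
        return best[0], worst[0]
    return None


# Made-up reward-model scores for three scenarios.
kept = build_intra_model_pair([("a", 0.18), ("b", 0.12), ("c", 0.15)])
too_wide = build_intra_model_pair([("x", 0.30), ("y", 0.05)])
too_narrow = build_intra_model_pair([("p", 0.20), ("q", 0.20)])
```

Both responses in a pair come from the same source model, so the chosen and rejected sides share a style and differ mainly in reward.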

3. Training Mechanism and Optimization

The pipeline comprises two major stages:

3.1. Supervised Fine-Tuning (SFT)

Target models are first optimized toward high-quality outputs with a token-level causal LM objective:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{SFT}}} \left[ \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x) \right]$$

Key hyperparameters (all targets): 3 epochs, batch size 128, max sequence length 2048, optimizer with cosine LR decay and 10% warmup, model-specific learning rates.
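As a worked illustration of the SFT objective, the sketch below evaluates it on a toy batch in which each response is represented only by the probabilities the model assigns to its tokens. The function names and the probabilities are hypothetical. The formula sums over tokens within a response; practical training code often averages per token instead, which is not shown here.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of one response: -sum_t log p(y_t | y_<t, x)."""
    return -sum(math.log(p) for p in token_probs)


def sft_loss(batch: List[List[float]]) -> float:
    """Batch estimate of the expectation: mean per-sequence NLL."""
    return sum(sequence_nll(seq) for seq in batch) / len(batch)


# Toy batch: a two-token response and a one-token response (made-up probabilities).
batch = [[0.5, 0.25], [0.5]]
loss = sft_loss(batch)
# Sequence NLLs are 3*ln(2) and ln(2), so the mean is 2*ln(2).
```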

3.2. Direct Preference Optimization (DPO)

Preference optimization is applied using a Bradley–Terry policy loss, shaping behaviors through best/worst pairs:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}_{\mathrm{DPO}}} \left[ \log \sigma \left( s_\theta(x, y^+) - s_\theta(x, y^-) \right) \right]$$

where $s_\theta(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]$.
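The DPO loss for a single preference pair can be evaluated directly from sequence log-probabilities, as in this minimal sketch. The function names, the log-probability values, and β = 0.1 are assumptions for illustration; the source states only that β is set per model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """s_theta(x, y) = beta * [log pi_theta(y|x) - log pi_ref(y|x)]."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_pos: float, ref_pos: float,
             logp_neg: float, ref_neg: float, beta: float) -> float:
    """-log sigma(s(x, y+) - s(x, y-)) for one (x, y+, y-) triple."""
    margin = (implicit_reward(logp_pos, ref_pos, beta)
              - implicit_reward(logp_neg, ref_neg, beta))
    return -log_sigmoid(margin)


BETA = 0.1  # hypothetical value

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-12.0, -12.0, -14.0, -14.0, BETA)

# The policy has raised y+ and lowered y- relative to the reference.
loss_improved = dpo_loss(-10.0, -12.0, -15.0, -14.0, BETA)
```

When the policy and reference agree, the loss equals ln 2 regardless of β; it falls below ln 2 once the policy prefers the chosen response more strongly than the reference does.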

A length-normalized DPO variant further corrects verbosity-induced biases:

$$\mathcal{L}_{\mathrm{LN\text{-}DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-)} \left[ \log \sigma \left( \frac{\beta}{|y^+|}\Delta^+ - \frac{\beta}{|y^-|}\Delta^- \right) \right]$$

with $\Delta^\pm = \log \pi_\theta(y^\pm \mid x) - \log \pi_{\mathrm{ref}}(y^\pm \mid x)$.
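To show why the normalization matters, the sketch below compares the vanilla and length-normalized margins on one invented pair in which the chosen response is much longer than the rejected one. All numbers, the function names, and β = 1.0 are hypothetical.

```python
import math


def vanilla_margin(delta_pos: float, delta_neg: float, beta: float) -> float:
    """beta * (Delta+ - Delta-), with Delta = log pi_theta - log pi_ref."""
    return beta * (delta_pos - delta_neg)


def ln_margin(delta_pos: float, len_pos: int,
              delta_neg: float, len_neg: int, beta: float) -> float:
    """(beta / |y+|) * Delta+ - (beta / |y-|) * Delta-."""
    return beta * delta_pos / len_pos - beta * delta_neg / len_neg


def neg_log_sigmoid(margin: float) -> float:
    """Numerically stable -log(sigma(margin))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


BETA = 1.0  # hypothetical value
# Chosen response: 200 tokens, summed log-ratio 4.0 (0.02 per token).
# Rejected response: 20 tokens, summed log-ratio 1.0 (0.05 per token).
m_vanilla = vanilla_margin(4.0, 1.0, BETA)
m_ln = ln_margin(4.0, 200, 1.0, 20, BETA)

loss_vanilla = neg_log_sigmoid(m_vanilla)
loss_ln = neg_log_sigmoid(m_ln)
```

In this toy case the summed log-ratio favors the long response simply because it has more tokens, while the per-token view does not, so the length-normalized loss is the larger of the two.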

DPO runs for a single epoch with batch size 128, uses the post-SFT model as the reference policy, and sets β and the learning rate per target model (Yang et al., 6 Mar 2025).

4. Benchmarking and Empirical Analysis

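Evaluation spans fourteen standard benchmarks. Selected results and the 14-benchmark average:

| Benchmark    | Base | +SFT | FuseChat-3.0 | Gain vs Base |
|--------------|------|------|--------------|--------------|
| AlpacaEval-2 | 28.3 | 41.3 | 65.4         | +37.1        |
| Arena-Hard   | 28.1 | 38.7 | 58.2         | +30.1        |
| GSM8K (CoT)  | 85.9 | 87.0 | 88.0         | +2.1         |
| HumanEval    | 69.5 | 69.5 | 71.3         | +1.8         |
| Average (14) | 40.5 | 43.2 | 47.3         | +6.8         |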

FuseChat-3.0, when instantiated on Llama-3.1-8B-Instruct, improves the 14-benchmark average by 6.8 points over the base model, with pronounced gains in instruction following (+37.1 on AlpacaEval-2, +30.1 on Arena-Hard). Compared to Tülu 3 (Llama-3.1-8B), FuseChat-3.0 scores 47.3 vs 40.2 (+7.1 overall) and does especially well on instruction-following tasks (+11.4). Length-normalized DPO yields a further +1.1 average improvement over vanilla DPO (Yang et al., 6 Mar 2025).
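The reported gains follow directly from the table entries. The short check below recomputes them and also splits the AlpacaEval-2 gain by training stage; the variable names are illustrative, and the numbers are copied from the table above.

```python
# (Base, +SFT, FuseChat-3.0) scores copied from the table above.
rows = {
    "AlpacaEval-2": (28.3, 41.3, 65.4),
    "Arena-Hard": (28.1, 38.7, 58.2),
    "GSM8K (CoT)": (85.9, 87.0, 88.0),
    "HumanEval": (69.5, 69.5, 71.3),
    "Average (14)": (40.5, 43.2, 47.3),
}

# Total gain over the base model, rounded to one decimal as in the table.
gains = {name: round(fused - base, 1) for name, (base, _sft, fused) in rows.items()}

# Split the AlpacaEval-2 gain into the SFT stage and the DPO stage.
base, sft, fused = rows["AlpacaEval-2"]
sft_stage_gain = round(sft - base, 1)
dpo_stage_gain = round(fused - sft, 1)
```

On AlpacaEval-2 the split is 13.0 points from SFT and 24.1 points from the DPO stage, so in this table most of the instruction-following gain arrives in the second stage.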

5. Architectural and Implementation Considerations

  • FuseChat-3.0 preserves the original transformer architectures of its target models, introducing no new layers or modules.
  • Parameter counts coincide with targets: 8B, 9B, 7B, 3B, 1B.
  • SFT leverages Llama-Factory; DPO employs the alignment-handbook, both via HuggingFace.
  • Inference is performed using vLLM.
  • Training was conducted on 8–16 A100 GPUs, with each stage requiring approximately one day.

All code, datasets, and model weights are accessible at https://github.com/SLIT-AI/FuseChat-3.0 (Yang et al., 6 Mar 2025).

6. Analysis, Limitations, and Future Directions

The effectiveness of FuseChat-3.0 is attributed to:

  • SFT-based implicit fusion: Alignment to high-quality, multi-source outputs.
  • Reward-model-guided DPO: Intra-model preference pairs reduce style-conformance bias/variance versus inter-model pairings.
  • Length normalization: Corrects verbosity bias typical in DPO for long outputs.

Limitations and challenges include:

  • Minor regression in code generation for ultra-compact targets, likely due to the smaller share of coding data.
  • Absence of a DPO phase for Chinese-language tasks; future work may develop specialized reward models for improved multilingual preference optimization.
  • Open questions pertain to scalability across larger numbers of sources, robust multilingual fusion, and generalized multi-task scenarios.

Plausible extensions comprise:

  • Incorporation of human-annotated preferences, potentially hybridizing with RLHF protocols.
  • Extension of the fusion paradigm to multi-modal domains, including vision and speech.
  • Dynamic, reward-weighted source sampling in DPO to refine knowledge transfer efficacy.

A plausible implication is that this two-stage fusion protocol, leveraging carefully constructed multi-source data and intra-model preference feedback, provides a generalizable blueprint for scalable, efficient, and high-performing LLM distillation and fusion (Yang et al., 6 Mar 2025).

References (1)
