
FusionBench: Deep Model Fusion Benchmark

Updated 14 February 2026
  • FusionBench is a comprehensive platform that standardizes evaluation protocols for deep model fusion, integrating diverse models and tasks in vision and NLP.
  • It implements 16 distinct fusion strategies—including ensemble, merging, and mixing methods—providing systematic comparisons with robust metrics.
  • The platform enables fair assessment by unifying tasks, models, and fine-tuning protocols, revealing performance gains and limitations such as negative transfer.

FusionBench is a comprehensive benchmarking platform for evaluating deep model fusion techniques, providing standardized, systematic, and extensible infrastructure for measuring the performance, generalization, and robustness of algorithms that combine multiple neural network models. FusionBench covers a broad spectrum of tasks in computer vision and natural language processing, implements a diverse pool of fusion strategies, and offers reproducible protocols and tooling for the rigorous comparison of fusion approaches (Tang et al., 2024).

1. Definition and Motivation

Deep model fusion comprises techniques that combine several pre-trained or fine-tuned neural networks into a single, unified model, exploiting the respective strengths of the component models. Fusion can occur at the level of model predictions ("ensemble") or parameters ("merging" or "mixing"). The formal abstraction is: given $N$ single-task models $\{f_1, \ldots, f_N\}$, a fusion algorithm $\mathcal{A}$ produces a composite predictor $y = \mathcal{A}(x; f_1, \ldots, f_N; w)$, with fusion-specific parameters $w$.
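
As a concrete instance of this abstraction, the minimal sketch below (assuming PyTorch models with a shared output space; the function and parameter names are illustrative, not FusionBench API) realizes $\mathcal{A}$ as a weighted combination of component outputs:

```python
import torch

def fuse_predictions(x, models, w):
    """Composite predictor y = A(x; f_1, ..., f_N; w).

    Here the fusion algorithm A is instantiated as a weighted combination of
    the component outputs; `models` holds the N single-task predictors and
    `w` holds the fusion-specific parameters (one weight per model).
    """
    outputs = torch.stack([f(x) for f in models])  # (N, batch, num_classes)
    weights = torch.softmax(w, dim=0)              # normalize the fusion parameters
    return (weights.view(-1, 1, 1) * outputs).sum(dim=0)
```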

The need for FusionBench arises from the heterogeneity and inconsistency in prior evaluations of fusion algorithms, which often confound factors (task diversity, model class, fine-tuning practices) and prevent rigorous, fair comparison. FusionBench establishes a standardized suite of tasks, models, and methodology, enabling apples-to-apples benchmarking across the field (Tang et al., 2024).

2. Task Suite and Organization

FusionBench's evaluation suite is partitioned into three major categories spanning vision and NLP domains:

  • Open-Vocabulary Image Classification (8 tasks): Uses datasets such as SUN397, Stanford Cars, RESISC45, EuroSAT, SVHN, GTSRB, MNIST, and DTD. Tasks require recognition across large and open label sets, with robustness measured via eight visual corruption types (e.g., motion blur, impulse noise, JPEG artifacts).
  • Text Classification (7 tasks): Subset of GLUE, including CoLA, MNLI, MRPC, QNLI, QQP, RTE, and SST-2. These tasks evaluate standard sentence-level and sentence-pair classification, measured by accuracy.
  • Text-to-Text Generation (8 tasks): GLUE subset plus STSB. Here, models generate free-form text outputs; metrics include exact-match accuracy for classification tasks (cast as generation) and Spearman's $\rho$ for STSB.

In total, FusionBench covers 26 distinct tasks with 74 fine-tuned models, ensuring broad and balanced coverage of fusion scenarios (Tang et al., 2024).

3. Model Pool and Fine-Tuning Protocols

FusionBench employs a carefully controlled set of input models across multiple architectures, model sizes, and adaptation strategies:

  • Vision Models:
    • CLIP-ViT-B/32 and CLIP-ViT-L/14, with OpenCLIP variants, fine-tuned via full-encoder updates using Adam (learning rate $1\times10^{-5}$, 4k steps, batch size 32).
    • ResNet-50 for NYUv2 scene tasks, fine-tuned with SGD/Adam (learning rate $1\times10^{-4}$, 40 epochs, learning rate reduced every 10 epochs).
  • NLP Models:
    • GPT-2 (small): independently fine-tuned for each text classification task (learning rate $5\times10^{-5}$, 3 epochs).
    • Flan-T5-Base and Flan-T5-Large: equipped with LoRA adapters for text-to-text tasks; LoRA adapters are "unloaded" after fine-tuning to yield full-parameter models.

By leveraging single-task fine-tuned models as fusion inputs, FusionBench isolates fusion gain from confounding factors in the underlying adaptation (Tang et al., 2024).
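
The LoRA "unloading" step can be reproduced with the Hugging Face peft library; the sketch below is a general illustration under that assumption (checkpoint paths are placeholders, not FusionBench's released artifacts):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Load the pre-trained backbone and attach a task-specific LoRA adapter
# (the adapter path is a placeholder for a fine-tuned checkpoint).
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
adapted = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Merge the low-rank updates into the backbone weights and drop the adapter
# wrappers, yielding a full-parameter model that can serve as a fusion input.
full_model = adapted.merge_and_unload()
full_model.save_pretrained("flan-t5-base-task")
```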

4. Fusion Algorithms: Taxonomy and Formalization

FusionBench implements and formalizes 16 fusion techniques, grouped by operational paradigm:

Ensemble Methods

  1. Simple Ensemble: $y = \frac{1}{N}\sum_i f_i(x)$
  2. Weighted Ensemble: $y = \sum_i w_i f_i(x)$, with $\sum_i w_i = 1$ and $w$ optimized by validation search
  3. Max-Model Predictor: $y = f_k(x)$ for $k = \arg\max_i \mathrm{score}_i(x)$
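
These three combination rules differ only in how component outputs are aggregated. A minimal PyTorch sketch (assuming classifiers that return logits; the max-softmax confidence used as the per-model score is our assumption, since the score function is left abstract above) is:

```python
import torch

def simple_ensemble(x, models):
    # y = (1/N) * sum_i f_i(x)
    return torch.stack([f(x) for f in models]).mean(dim=0)

def weighted_ensemble(x, models, w):
    # y = sum_i w_i f_i(x), with weights normalized to sum to 1
    w = torch.softmax(w, dim=0)
    return sum(w_i * f(x) for w_i, f in zip(w, models))

def max_model_predictor(x, models):
    # y = f_k(x), where k maximizes a per-model confidence score; here the
    # score is the batch-mean max-softmax probability (an assumption).
    outputs = [f(x) for f in models]
    confidence = [out.softmax(dim=-1).max(dim=-1).values.mean() for out in outputs]
    k = int(torch.stack(confidence).argmax())
    return outputs[k]
```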

Model Merging Methods (parameter-level for isomorphic networks)

  1. Simple Average ("Model Soups"): $\hat\theta = \frac{1}{N}\sum_i \theta_i$
  2. Weighted Average: $\hat\theta = \sum_i w_i \theta_i$, with $\sum_i w_i = 1$
  3. Fisher Merging: $w_i \propto F_i$, where $F_i$ is the Fisher information for $\theta_i$; $\hat\theta = \sum_i (F_i / \sum_j F_j)\,\theta_i$
  4. RegMean: $w_i \propto 1/(F_i + \lambda)$, normalized, with regularizer $\lambda > 0$
  5. Task Arithmetic: $\hat\theta = \theta_0 + \lambda\sum_{i=1}^N (\theta_i - \theta_0)$
  6. Ties-Merging: $\hat\theta = \theta_0 + \sum_i \lambda_i (\theta_i - \theta_0)$, with grid-searched $\lambda_i$
  7. Task-Wise AdaMerging: adapt $\lambda_i$ per task at test time with in-domain data
  8. Layer-Wise AdaMerging: learn $\lambda_{i,\ell}$ per task and layer by test-time adaptation
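
Most of the merging rules above are linear combinations in parameter space. The sketch below (assuming PyTorch state dicts with floating-point parameters; the coefficient value is illustrative, not benchmark-fixed) implements simple averaging and task arithmetic:

```python
import torch

def simple_average(state_dicts):
    # Model Soups: element-wise mean of all parameters.
    return {
        k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
        for k in state_dicts[0]
    }

def task_arithmetic(base_sd, finetuned_sds, lam=0.3):
    # theta_hat = theta_0 + lambda * sum_i (theta_i - theta_0)
    merged = {}
    for k, theta_0 in base_sd.items():
        task_vector_sum = sum(sd[k] - theta_0 for sd in finetuned_sds)
        merged[k] = theta_0 + lam * task_vector_sum
    return merged

# Usage (hypothetical): fused.load_state_dict(
#     task_arithmetic(base.state_dict(), [m.state_dict() for m in finetuned_models]))
```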

Model Mixing Methods (structural modification)

  1. Depth Upscaling: Insert random layers between existing layers, followed by joint fine-tuning
  2. MoE-Based Upscaling: Replace feed-forward blocks with sparse MoE, using experts from each $\theta_i$
  3. MoE-Based Merging: Merge experts from input models into a single MoE, retraining the gating
  4. Weight-Ensemble MoE (WEMoE): Ensemble layer weights, per-task gating via adaptation
  5. Model Recombination: Rewire submodules (e.g., attention heads), then fine-tune
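
As a rough illustration of the weight-ensembling idea behind WEMoE, the following minimal sketch applies an input-dependent mixture of task vectors to a single linear layer; the router design and batch-level gating are our simplifications, not the paper's implementation:

```python
import torch
import torch.nn as nn

class WeightEnsemblingLinear(nn.Module):
    """Weight-ensembling sketch for one linear layer: the effective weight is
    the pre-trained weight plus an input-dependent mixture of task vectors."""

    def __init__(self, base: nn.Linear, experts: list):
        super().__init__()
        self.base = base
        # Task vectors: differences between each fine-tuned expert and the base.
        self.task_vectors = nn.ParameterList(
            [nn.Parameter(e.weight.data - base.weight.data, requires_grad=False)
             for e in experts]
        )
        # Router producing one mixing coefficient per expert (design assumed here).
        self.router = nn.Linear(base.in_features, len(experts))

    def forward(self, x):
        # Batch-level gating for simplicity; finer-grained routing is possible.
        coeffs = torch.softmax(self.router(x).mean(dim=0), dim=-1)
        weight = self.base.weight + sum(c * tv for c, tv in zip(coeffs, self.task_vectors))
        return nn.functional.linear(x, weight, self.base.bias)
```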

These methods span output-level, parameter-level, and structural fusion strategies, enabling systematic benchmarking of fusion paradigms (Tang et al., 2024).

5. Evaluation Protocols and Key Findings

FusionBench executes all experiments on a single NVIDIA RTX 3090 GPU (24 GB), fusing $N$ single-task models and evaluating on each task's test set. Robustness is quantified under eight types of corruptions as per Hendrycks & Dietterich (2019).

Metrics:

  • Image classification: accuracy (%)
  • Segmentation: mean Intersection-over-Union (mIoU), pixel accuracy
  • Depth: absolute and relative error
  • Normals: mean angular error
  • Text classification: accuracy (%)
  • Text generation: exact-match accuracy (%), Spearman's $\rho$
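
These metrics can be computed with standard tooling; the short sketch below (using scikit-learn and SciPy; the exact-match normalization is our assumption) illustrates the classification and STSB metrics:

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score

def exact_match_accuracy(predictions, references):
    # Text-to-text classification: a generated answer counts as correct only
    # if it matches the reference label exactly (after stripping whitespace).
    matches = [p.strip() == r.strip() for p, r in zip(predictions, references)]
    return 100.0 * sum(matches) / len(matches)

# Classification accuracy (%) for the image and text classification tasks.
acc = 100.0 * accuracy_score([1, 0, 1], [1, 0, 0])

# Spearman's rho for the STSB regression task.
rho, _ = spearmanr([0.1, 0.5, 0.9], [0.2, 0.4, 0.8])
```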

Key Results:

  • All fusion methods significantly outperform base, unadapted pre-trained models.
  • RegMean and Fisher Merging consistently improve over naive averaging for merging; layer-wise AdaMerging yields highest improvements among merging techniques.
  • WEMoE (a model mixing strategy) matches or surpasses merging, reaching $\sim 89\%$ average accuracy on CLIP-ViT-B/32.
  • Large-scale multi-task learning (MTL) still outperforms fusion methods on larger models, indicating a current upper bound for fusion strategies.
  • FusionBench reveals generalization failures, especially with negative transfer on out-of-distribution tasks (e.g., RESISC45).
  • Fused models degrade sharply under strong corruptions, highlighting current robustness limitations and overfitting risk for adaptive approaches.

These results clarify performance gaps and robustness issues, guiding future fusion research (Tang et al., 2024).

6. Platform, Reproducibility, and Community Resources

FusionBench provides:

  • A modular, open-source codebase with unified CLI, YAML-based experiment configuration, and pre-packaged scripts for full reproducibility.
  • Three core modules: Algorithm, Model-Pool, Task-Pool.
  • Default hyperparameter settings for each architecture and adaptation protocol.
  • Prompt templates for all text tasks and detailed documentation, code examples, and tutorials at https://tanganke.github.io/fusion_bench/
  • Facilities to easily plug in custom fusion algorithms, models, or tasks for consistent, controlled comparison.
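
To illustrate how a custom method could slot into the Algorithm / Model-Pool / Task-Pool split, the following hypothetical sketch defines a plug-in fusion algorithm and an experiment driver; the class and method names are placeholders rather than the actual FusionBench API (see the documentation linked above for the real interfaces):

```python
class WeightedAverageAlgorithm:
    """Hypothetical fusion-algorithm plug-in. The class and method names are
    placeholders, not the real FusionBench interfaces."""

    def __init__(self, weights):
        self.weights = weights

    def run(self, model_pool):
        # Merge a pool of models by weighted parameter averaging
        # (assumes the pool is iterable and the models share an architecture).
        models = list(model_pool)
        state_dicts = [m.state_dict() for m in models]
        merged = {
            k: sum(w * sd[k] for w, sd in zip(self.weights, state_dicts))
            for k in state_dicts[0]
        }
        fused = models[0]
        fused.load_state_dict(merged)
        return fused


def run_experiment(algorithm, model_pool, task_pool):
    """Wire the three modules together: fuse the model pool once, then
    evaluate the fused model on every task in the task pool."""
    fused = algorithm.run(model_pool)
    return {task.name: task.evaluate(fused) for task in task_pool}
```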

Guidelines on fine-tuning defaults and experiment management enable transparent benchmarking and reproducibility across the field (Tang et al., 2024).


FusionBench represents a rigorously engineered, extensible foundation for evaluating deep model fusion, providing clear insight into trade-offs, generalization, and robustness in multi-model learning scenarios. Its protocols and findings have already motivated new layer-adaptive fusion techniques reviewed in the subsequent literature, such as LARV (Wang et al., 10 Feb 2026), and remain pivotal in structuring comparative research on model combination and integration.
