
Model Merging Benchmark

Updated 7 December 2025
  • Model merging benchmarks are standardized evaluation suites designed to quantify how effectively merging algorithms integrate multiple specialized models into a single multitask model without additional gradient-based training.
  • They span various domains such as language, vision, graphs, and multimodal tasks, employing fixed data splits, uniform metrics, and rigorous protocols to ensure comparability.
  • These benchmarks drive research by enabling detailed assessment of methods—including coefficient-based, sparsification, SVD-based, and optimization techniques—to enhance model robustness and efficiency.


Model merging benchmarks are standardized evaluation suites that measure the effectiveness of algorithms for integrating the knowledge or capabilities of multiple models into a single multitask model, typically without additional gradient-based training. These benchmarks have emerged as crucial instruments for the empirical study and optimization of merging methods, enabling rigorous, reproducible, and comparable assessment across architectures, domains, and objectives. They span language, vision, graph, and multimodal domains, enforce consistent data splits, metrics, and protocols, and increasingly drive the field toward robust, application-relevant model composition.

1. Scope and Core Principles

A model merging benchmark is designed to systematically quantify the ability of merging algorithms to produce a generalist model capable of serving multiple downstream tasks. Benchmarks typically address:

  • Domains: Language (e.g., reasoning, instruction following, coding, safety), vision (classification, segmentation, OCR), graphs, and multimodal (VQA, audio, video).
  • Models: Fine-tuned, parameter-efficient (e.g., LoRA, IA³), Mixture-of-Experts, or checkpoint-averaged LLMs.
  • Performance metrics: Multi-task accuracy, normalized performance (relative to per-task finetuning), knowledge retention (catastrophic forgetting), Pareto efficiency, and runtime efficiency.
  • Protocols: Standardized data splits, evaluation of both in-domain and generalization to unseen tasks, and inclusion of robustness and ablation analyses.

A key design principle is ensuring comparability by fixing base architectures, task splits, and evaluation pipelines, as realized in MergeBench (2505.10833), FusionBench (Tang et al., 5 Jun 2024), and the DELLA-Merging benchmark (Deep et al., 17 Jun 2024).
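
For concreteness, the following is a minimal sketch (not taken from any benchmark's codebase) of the normalized-performance metric listed above: merged-model accuracy on each task divided by the accuracy of that task's individually fine-tuned expert, averaged across tasks. The task names and accuracy values are hypothetical.

```python
def normalized_multitask_score(merged_acc: dict, expert_acc: dict) -> float:
    """Average of per-task (merged / expert) accuracy ratios, in percent.
    100% means the merged model matches per-task fine-tuning on average."""
    ratios = [merged_acc[t] / expert_acc[t] for t in expert_acc]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical per-task accuracies, for illustration only.
expert = {"math": 0.62, "code": 0.48, "safety": 0.91}
merged = {"math": 0.57, "code": 0.44, "safety": 0.89}
print(f"Normalized score: {normalized_multitask_score(merged, expert):.1f}%")
```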

2. Representative Benchmarks: Suites, Protocols, Evaluation

Benchmarks generally specify the following components:

Expert Model Suite: Domain-specialized models fine-tuned from a shared base (e.g., LLaMA/Gemma, CLIP, GPT-2, T5). For example, DELLA-Merging (Deep et al., 17 Jun 2024) uses WizardLM (LM), WizardMath (math), and LLaMA-2-13b-code-alpaca (code).

Tasks and Datasets:

| Benchmark Suite | Models/Backbone | Domains | Representative Tasks | Metric |
| --- | --- | --- | --- | --- |
| DELLA | LLaMA-2-13B | LM, Math, Code | AlpacaEval, GSM8K, MBPP | GPT-4 win-rate, Pass@1 |
| MergeBench | Llama/Gemma 2B–9B | IF, Math, Multi., Code, Safety | IFEval, MMLU, Magicoder, Aya, GRPO | Normalized accuracy |
| SMM-Bench | Shisa-Gamma, WizardMath | Reasoning | gsm8k-ja, MGSM (JA) | Accuracy |
| FusionBench | CLIP, GPT-2, T5 | Vision, Text | SUN397, GLUE, NYUv2, TextVQA | Top-1, mIoU, Acc. |

Evaluation protocols enforce uniform batching, dropout, and inference settings, such as greedy decoding and maximum token limits (e.g., DELLA-Merging, max tokens=512).
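
As an illustration of such a fixed inference configuration, the sketch below uses the Hugging Face transformers API with greedy decoding and a 512-token generation cap; the checkpoint path and prompt are placeholders, and the exact settings of any given benchmark may differ.

```python
# Hedged sketch of a fixed inference configuration (greedy decoding,
# capped generation length); not any benchmark's official harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/merged-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: If a train travels 60 km in 45 minutes, what is its speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,     # greedy decoding
    max_new_tokens=512,  # fixed token budget, as in DELLA-style protocols
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```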

3. Merging Methods: Baselines and Advanced Algorithms

Benchmarks comprehensively assess a spectrum of model merging algorithms, enforcing standardized settings:

  • Coefficient-based: Simple average (“Model Soup”), Task Arithmetic [TA]: $\theta_{\text{merge}} = \theta_0 + \lambda \sum_i \tau_i$ (2505.10833).
  • Sparsification-based: TIES (top-K pruning by magnitude + sign consensus), DARE (random drop, rescale), DELLA (magnitude-based probabilistic drop with rescaling) (Deep et al., 17 Jun 2024).
  • Subspace/SVD-based: RegMean, Fisher Merging, Task Singular Vector Merging (TSV), Iso-C, KnOTS (LoRA update alignment via joint SVD) (Stoica et al., 25 Oct 2024).
  • Multi-objective/Preference-aware: Pareto Merging (Chen et al., 22 Aug 2024) directly produces a Pareto set of solutions for different application tradeoffs.
  • Optimization-based: WUDI, OptMerge (per-layer loss-minimization over the task-vector subspace) (Wei et al., 26 May 2025), as well as automated layerwise/search-based fusion (Su et al., 6 Feb 2025).

Recent benchmarks also include methods tailored to specific settings, e.g. Twin-Merging for modular input-adaptive merges (Lu et al., 17 Jun 2024), GNNMerge for non-shared initialization GNNs (Garg et al., 5 Mar 2025), and FlexMerge for block-wise progressive merging with accuracy–size trade-off (Dhasade et al., 29 May 2025).
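
A minimal reference sketch of two of the families above, assuming all experts were fine-tuned from a shared base and are available as PyTorch state_dicts of floating-point tensors (this is not any paper's official implementation): Task Arithmetic, and a DARE-style random drop-and-rescale applied to the task vectors before merging.

```python
import torch

def task_vectors(base, experts):
    """tau_i = theta_i - theta_0 for each expert state_dict."""
    return [{k: expert[k] - base[k] for k in base} for expert in experts]

def dare(tau, drop_rate=0.9):
    """DARE-style sparsification: randomly drop a fraction of task-vector
    entries and rescale the survivors by 1/(1 - drop_rate), so the merge
    preserves the task vector in expectation."""
    return {
        k: v * (torch.rand_like(v) > drop_rate) / (1.0 - drop_rate)
        for k, v in tau.items()
    }

def task_arithmetic(base, taus, lam=1.0):
    """theta_merge = theta_0 + lambda * sum_i tau_i."""
    merged = {k: v.clone() for k, v in base.items()}
    for tau in taus:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged

# Usage (state_dict loading omitted):
# taus = [dare(t, drop_rate=0.9) for t in task_vectors(base_sd, expert_sds)]
# merged_sd = task_arithmetic(base_sd, taus, lam=1.0)
```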

4. Hyperparameter Optimization and Surrogate Benchmarks

Many merging methods display significant sensitivity to hyperparameters (e.g., the drop rate $p$ in DARE/DELLA, the scaling $\lambda$ in TA, the TIES pruning ratio $k$). Manual grid search is standard in smaller benchmarks (e.g., DELLA (Deep et al., 17 Jun 2024); grid $p \in \{0.1, 0.3, \dots\}$, $\lambda \in [0.5, 1.5]$).
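
A sketch of such a grid search over the drop rate $p$ and scaling $\lambda$ follows; `merge_and_evaluate` is a hypothetical stand-in for a benchmark's merge-then-score pipeline (replaced here by a dummy function so the snippet runs).

```python
import itertools

def merge_and_evaluate(drop_rate: float, scaling: float) -> float:
    # Hypothetical stand-in: a real run would build the merged checkpoint with
    # these hyperparameters and return held-out multi-task accuracy.
    return 1.0 - abs(drop_rate - 0.7) - abs(scaling - 1.0)  # dummy landscape

p_grid = [0.1, 0.3, 0.5, 0.7, 0.9]
lam_grid = [0.5, 0.75, 1.0, 1.25, 1.5]

best_cfg, best_score = None, float("-inf")
for p, lam in itertools.product(p_grid, lam_grid):
    score = merge_and_evaluate(drop_rate=p, scaling=lam)
    if score > best_score:
        best_cfg, best_score = (p, lam), score
print("best (p, lambda):", best_cfg, "score:", round(best_score, 3))
```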

SMM-Bench (Akizuki et al., 2 Sep 2025) introduces formal continuous and mixed search spaces for hyperparameter tuning in model merging, including

  • Parameter Space (PS): 64-dimensional continuous layerwise weights for task arithmetic.
  • Data-Flow Space (DFS): 32 categorical + 63 continuous dimensions (layer insertions + scaling).

Paired datasets of merging configurations and test accuracies enable the construction of high-fidelity LightGBM surrogates (average $R^2 > 0.92$, Kendall $\tau > 0.79$), accurately simulating the performance landscape for hyperparameter optimization algorithms. This facilitates rapid evaluation of HPO methods (CMA-ES, TPE, random search) at orders-of-magnitude lower compute.
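
A minimal sketch of this surrogate idea, assuming a table of (merging configuration, measured accuracy) pairs is already available (synthetic data stands in for it below): fit a LightGBM regressor as the surrogate, check its fidelity, and run cheap random search against it. This is not SMM-Bench's official tooling.

```python
import numpy as np
from lightgbm import LGBMRegressor
from scipy.stats import kendalltau
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in for paired data: 64-dim layerwise merging weights -> accuracy.
X = rng.uniform(0.0, 1.0, size=(2000, 64))
y = 0.55 + 0.35 * X.mean(axis=1) + rng.normal(0.0, 0.01, size=2000)

train, test = slice(0, 1600), slice(1600, None)
surrogate = LGBMRegressor(n_estimators=300, learning_rate=0.05)
surrogate.fit(X[train], y[train])

pred = surrogate.predict(X[test])
print("R^2:", round(r2_score(y[test], pred), 3),
      "Kendall tau:", round(kendalltau(y[test], pred)[0], 3))

# Random-search HPO against the surrogate instead of costly merge+eval runs.
candidates = rng.uniform(0.0, 1.0, size=(10_000, 64))
best_config = candidates[np.argmax(surrogate.predict(candidates))]
```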

5. Quantitative Results and Comparative Analysis

Benchmarks report absolute and relative performance with respect to (a) the best individual expert ("oracle"), (b) the base pretrained model, and (c) the best baseline merging method. Key summary results include:

| Method | Setting | Multi-task Score (%) | Forgetting (%) | Efficiency |
| --- | --- | --- | --- | --- |
| DELLA | LLaMA-2-13B | 58.2 (avg) | N/A | Greedy |
| TA (no pruning) | LLaMA-2-13B | 46.5 (avg) | N/A | Greedy |
| TIES | LLaMA-2-13B | 54.1 (avg) | N/A | Greedy |
| Fisher | CLIP-ViT-B/32 | 70.6 | 60.6* | Fastest |
| RegMean | CLIP-ViT-B/32 | 80.5 | 65.7* | Moderate |
| Localize-Stitch | MergeBench | 90–105 (normalized) | >100% retained | Moderate |
| Layerwise Ada | CLIP/T5 | 82.6–88.5 | N/A | Slower |
| KnOTS-TIES | ViT/LLaMA-3 | 68.0–92.9 | N/A | SVD-based |
| Twin-Merging | Qwen, GLUE | Up to 102.4 | Retains FT/SOTA | Adaptive |
| Pareto Merging | ViT | 85.2 (best) | Transfer improves | One-shot |
| OptMerge | InternVL/Qwen | 56.8+ | N/A | 5–100× faster than SFT |

*Unseen tasks

Several findings have emerged:

  • Task Arithmetic remains the only method that reliably yields constructive interference in general LLM settings (Hitit et al., 26 Nov 2025).
  • Methods employing magnitude pruning with scaling, e.g. DELLA (Deep et al., 17 Jun 2024), deliver substantial gains (+11.1 points over TA, +3.6 over TIES, +1.2 over DARE), especially on disjoint multi-domain merges.
  • SVD-based KnOTS improves LoRA merging by up to 4.3% and enables generalization to joint-task settings (Stoica et al., 25 Oct 2024).
  • FlexMerge demonstrates that modest increases in model size over the 1× baseline (e.g., 2–3×) recover near-finetuned accuracy over 30 domains (Dhasade et al., 29 May 2025).
  • Multi-objective benchmarks (Pareto Merging (Chen et al., 22 Aug 2024)) enable user-preference trade-offs and strong transfer to unseen tasks (e.g., +4.3% over AdaMerging in transfer).
  • Optimization-augmented merges (OptMerge, WUDI) demonstrate robust capability unification for multi-modal LLMs (Wei et al., 26 May 2025).

6. Specialized Benchmarks: Reasoning, Multimodal, Graph

Recent model merging benchmarks expand the settings:

  • Tunable Reasoning: “Thinking Spectrum” (Lan et al., 26 Sep 2025) quantifies the continuum of accuracy–efficiency trade-offs by varying the merge strength α between "direct" and "thinking" models, demonstrating instances of Pareto improvement where merged models are both more accurate and more efficient than any parent (see the interpolation sketch after this list).
  • Multimodal and Cross-domain: OptMerge’s MLLM-Merging Benchmark (Wei et al., 26 May 2025) combines vision, audio, and video modalities into a single LLM, using both LoRA- and full-tuning experts; the merged models match or exceed multi-task fine-tuning performance while being over 5–100× faster to create.
  • Graph Neural Networks: GNNMerge (Garg et al., 5 Mar 2025) proposes the first task-agnostic graph model merging benchmark, with analytical closed-form solutions for embedding alignment, yielding up to +24% accuracy improvement over competing baselines and >100× speedup.
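
One simple way the merge-strength knob in the tunable-reasoning setting above can be realized is linear interpolation of the two checkpoints' parameters; the sketch below illustrates this under the assumption of a shared architecture, and the paper's exact recipe may differ. The state_dicts here are toy stand-ins.

```python
import torch

def interpolate(direct_sd, thinking_sd, alpha: float):
    """theta(alpha) = (1 - alpha) * theta_direct + alpha * theta_thinking."""
    return {k: (1.0 - alpha) * direct_sd[k] + alpha * thinking_sd[k]
            for k in direct_sd}

# Toy stand-ins for the "direct" and "thinking" checkpoints.
direct_sd = {"w": torch.tensor([1.0, 0.0]), "b": torch.tensor([0.5])}
thinking_sd = {"w": torch.tensor([0.0, 1.0]), "b": torch.tensor([0.1])}

# Sweeping alpha traces the accuracy-efficiency spectrum described above.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    merged = interpolate(direct_sd, thinking_sd, alpha)
    print(alpha, merged["w"].tolist(), merged["b"].tolist())
```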

7. Current Limitations and Directions

Model merging benchmarks are essential for practical evaluation but have several outstanding limitations:

  • Computational scalability: Although faster than retraining, some methods (e.g., Fisher, mask training) remain costly on large models (2505.10833).
  • Heterogeneous/federated merging: Extending benchmarks beyond isomorphic architectures or across private models (heterogeneous, decentralized/federated scenarios) remains challenging (Zhang et al., 17 Oct 2024).
  • Robustness and alignment: Current benchmarks expand to multi-objective and alignment tasks—in particular, RESM achieves state-of-the-art 3H (Helpfulness, Honesty, Harmlessness) tradeoffs via outlier-aware, rank-adaptive SVD merges (Yang et al., 8 Feb 2025).
  • Theory–practice gap: Empirical results show many subspace/interference-aware methods degrade on large LLMs due to violation of low-rank/orthogonality assumptions (Hitit et al., 26 Nov 2025). The field is moving towards fine-grained automated search spaces (layerwise fusion, dynamic routing (Su et al., 6 Feb 2025, Lu et al., 17 Jun 2024)), surrogate-based optimization (Akizuki et al., 2 Sep 2025), and Pareto-front delivery (Chen et al., 22 Aug 2024).

Benchmark Availability

Code and reproducibility artifacts are available for most major benchmarks.

References

These resources collectively define the state-of-the-art in model merging evaluation and provide the core reference point for comparative empirical research in the field.
