
Model Merging Benchmark

Updated 7 December 2025
  • Model merging benchmarks are standardized evaluation suites designed to quantify how effectively merging algorithms integrate multiple specialized models into a single multitask model without additional gradient-based training.
  • They span various domains such as language, vision, graphs, and multimodal tasks, employing fixed data splits, uniform metrics, and rigorous protocols to ensure comparability.
  • These benchmarks drive research by enabling detailed assessment of methods—including coefficient-based, sparsification, SVD-based, and optimization techniques—to enhance model robustness and efficiency.


Model merging benchmarks are standardized evaluation suites that measure the effectiveness of algorithms for integrating the knowledge or capabilities of multiple models into a single multitask model, typically without additional gradient-based training. These benchmarks have emerged as crucial instruments for the empirical study and optimization of merging methods, enabling rigorous, reproducible, and comparable assessment across architectures, domains, and objectives. They span language, vision, graph, and multimodal domains, enforce consistent data splits, metrics, and protocols, and increasingly drive the field toward robust, application-relevant model composition.

1. Scope and Core Principles

A model merging benchmark is designed to systematically quantify the ability of merging algorithms to produce a generalist model capable of serving multiple downstream tasks. Benchmarks typically address:

  • Domains: Language (e.g., reasoning, instruction following, coding, safety), vision (classification, segmentation, OCR), graphs, and multimodal (VQA, audio, video).
  • Models: Fine-tuned, parameter-efficient (e.g., LoRA, IA³), Mixture-of-Experts, or checkpoint-averaged LLMs.
  • Performance metrics: Multi-task accuracy, normalized performance (relative to per-task finetuning), knowledge retention (catastrophic forgetting), Pareto efficiency, and runtime efficiency.
  • Protocols: Standardized data splits, evaluation of both in-domain and generalization to unseen tasks, and inclusion of robustness and ablation analyses.

A key design principle is ensuring comparability by fixing base architectures, task splits, and evaluation pipelines, as realized in MergeBench (2505.10833), FusionBench (Tang et al., 5 Jun 2024), and the DELLA-Merging benchmark (Deep et al., 17 Jun 2024).
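
For concreteness, the following is a minimal sketch (not taken from any benchmark's codebase) of the normalized-performance metric listed above: merged-model accuracy on each task divided by the accuracy of that task's individually fine-tuned expert, averaged across tasks. The task names and accuracy values are hypothetical.

```python
def normalized_multitask_score(merged_acc: dict, expert_acc: dict) -> float:
    """Average of per-task (merged / expert) accuracy ratios, in percent.
    100% means the merged model matches per-task fine-tuning on average."""
    ratios = [merged_acc[t] / expert_acc[t] for t in expert_acc]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical per-task accuracies, for illustration only.
expert = {"math": 0.62, "code": 0.48, "safety": 0.91}
merged = {"math": 0.57, "code": 0.44, "safety": 0.89}
print(f"Normalized score: {normalized_multitask_score(merged, expert):.1f}%")
```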

2. Representative Benchmarks: Suites, Protocols, Evaluation

Benchmarks generally specify the following components:

Expert Model Suite: Domain-specialized models fine-tuned from a shared base (e.g., LLaMA/Gemma, CLIP, GPT-2, T5). For example, DELLA-Merging (Deep et al., 17 Jun 2024) uses WizardLM (LM), WizardMath (math), and LLaMA-2-13b-code-alpaca (code).

Tasks and Datasets:

| Benchmark Suite | Models/Backbone | Domains | Representative Tasks | Metric |
| --- | --- | --- | --- | --- |
| DELLA | LLaMA-2-13B | LM, Math, Code | AlpacaEval, GSM8K, MBPP | GPT-4 win-rate, Pass@1 |
| MergeBench | Llama/Gemma 2B–9B | IF, Math, Multi., Code, Safety | IFEval, MMLU, Magicoder, Aya, GRPO | Normalized accuracy |
| SMM-Bench | Shisa-Gamma, WizardMath | Reasoning | gsm8k-ja, MGSM (JA) | Accuracy |
| FusionBench | CLIP, GPT-2, T5 | Vision, Text | SUN397, GLUE, NYUv2, TextVQA | Top-1, mIoU, Acc. |

Evaluation protocols enforce uniform batching, dropout, and inference settings, such as greedy decoding and maximum token limits (e.g., DELLA-Merging, max tokens=512).
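
As an illustration of such a fixed inference configuration, the sketch below uses the Hugging Face transformers API with greedy decoding and a 512-token generation cap; the checkpoint path and prompt are placeholders, and the exact settings of any given benchmark may differ.

```python
# Hedged sketch of a fixed inference configuration (greedy decoding,
# capped generation length); not any benchmark's official harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/merged-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: If a train travels 60 km in 45 minutes, what is its speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,     # greedy decoding
    max_new_tokens=512,  # fixed token budget, as in DELLA-style protocols
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```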

3. Merging Methods: Baselines and Advanced Algorithms

Benchmarks comprehensively assess a spectrum of model merging algorithms, enforcing standardized settings:

  • Coefficient-based: Simple average (“Model Soup”), Task Arithmetic [TA]: $\theta_{\text{merge}} = \theta_0 + \lambda \sum_i \tau_i$ (2505.10833).
  • Sparsification-based: TIES (top-K pruning by magnitude + sign consensus), DARE (random drop, rescale), DELLA (magnitude-based probabilistic drop with rescaling) (Deep et al., 17 Jun 2024).
  • Subspace/SVD-based: RegMean, Fisher Merging, Task Singular Vector Merging (TSV), Iso-C, KnOTS (LoRA update alignment via joint SVD) (Stoica et al., 25 Oct 2024).
  • Multi-objective/Preference-aware: Pareto Merging (Chen et al., 22 Aug 2024) directly produces a Pareto set of solutions for different application tradeoffs.
  • Optimization-based: WUDI, OptMerge (per-layer loss-minimization over the task-vector subspace) (Wei et al., 26 May 2025), as well as automated layerwise/search-based fusion (Su et al., 6 Feb 2025).

Recent benchmarks also include methods tailored to specific settings, e.g. Twin-Merging for modular input-adaptive merges (Lu et al., 17 Jun 2024), GNNMerge for non-shared initialization GNNs (Garg et al., 5 Mar 2025), and FlexMerge for block-wise progressive merging with accuracy–size trade-off (Dhasade et al., 29 May 2025).
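
A minimal reference sketch of two of the families above, assuming all experts were fine-tuned from a shared base and are available as PyTorch state_dicts of floating-point tensors (this is not any paper's official implementation): Task Arithmetic, and a DARE-style random drop-and-rescale applied to the task vectors before merging.

```python
import torch

def task_vectors(base, experts):
    """tau_i = theta_i - theta_0 for each expert state_dict."""
    return [{k: expert[k] - base[k] for k in base} for expert in experts]

def dare(tau, drop_rate=0.9):
    """DARE-style sparsification: randomly drop a fraction of task-vector
    entries and rescale the survivors by 1/(1 - drop_rate), so the merge
    preserves the task vector in expectation."""
    return {
        k: v * (torch.rand_like(v) > drop_rate) / (1.0 - drop_rate)
        for k, v in tau.items()
    }

def task_arithmetic(base, taus, lam=1.0):
    """theta_merge = theta_0 + lambda * sum_i tau_i."""
    merged = {k: v.clone() for k, v in base.items()}
    for tau in taus:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged

# Usage (state_dict loading omitted):
# taus = [dare(t, drop_rate=0.9) for t in task_vectors(base_sd, expert_sds)]
# merged_sd = task_arithmetic(base_sd, taus, lam=1.0)
```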

4. Hyperparameter Optimization and Surrogate Benchmarks

Many merging methods display significant sensitivity to hyperparameters (e.g., the drop rate $p$ in DARE/DELLA, the scaling $\lambda$ in TA, the TIES pruning ratio $k$). Manual grid search is standard in smaller benchmarks (e.g., DELLA (Deep et al., 17 Jun 2024); grid $p \in \{0.1, 0.3, \dots\}$, $\lambda \in [0.5, 1.5]$).
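
A sketch of such a grid search over the drop rate $p$ and scaling $\lambda$ follows; `merge_and_evaluate` is a hypothetical stand-in for a benchmark's merge-then-score pipeline (replaced here by a dummy function so the snippet runs).

```python
import itertools

def merge_and_evaluate(drop_rate: float, scaling: float) -> float:
    # Hypothetical stand-in: a real run would build the merged checkpoint with
    # these hyperparameters and return held-out multi-task accuracy.
    return 1.0 - abs(drop_rate - 0.7) - abs(scaling - 1.0)  # dummy landscape

p_grid = [0.1, 0.3, 0.5, 0.7, 0.9]
lam_grid = [0.5, 0.75, 1.0, 1.25, 1.5]

best_cfg, best_score = None, float("-inf")
for p, lam in itertools.product(p_grid, lam_grid):
    score = merge_and_evaluate(drop_rate=p, scaling=lam)
    if score > best_score:
        best_cfg, best_score = (p, lam), score
print("best (p, lambda):", best_cfg, "score:", round(best_score, 3))
```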

SMM-Bench (Akizuki et al., 2 Sep 2025) introduces formal continuous and mixed search spaces for hyperparameter tuning in model merging, including

  • Parameter Space (PS): 64-dimensional continuous layerwise weights for task arithmetic.
  • Data-Flow Space (DFS): 32 categorical + 63 continuous dimensions (layer insertions + scaling).

Paired datasets of merging configurations and test accuracies enable the construction of high-fidelity LightGBM surrogates (average $R^2 > 0.92$, Kendall $\tau > 0.79$), accurately simulating the performance landscape for hyperparameter optimization algorithms. This facilitates rapid evaluation of HPO methods (CMA-ES, TPE, random search) at orders-of-magnitude lower compute.
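
A minimal sketch of this surrogate idea, assuming a table of (merging configuration, measured accuracy) pairs is already available (synthetic data stands in for it below): fit a LightGBM regressor as the surrogate, check its fidelity, and run cheap random search against it. This is not SMM-Bench's official tooling.

```python
import numpy as np
from lightgbm import LGBMRegressor
from scipy.stats import kendalltau
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in for paired data: 64-dim layerwise merging weights -> accuracy.
X = rng.uniform(0.0, 1.0, size=(2000, 64))
y = 0.55 + 0.35 * X.mean(axis=1) + rng.normal(0.0, 0.01, size=2000)

train, test = slice(0, 1600), slice(1600, None)
surrogate = LGBMRegressor(n_estimators=300, learning_rate=0.05)
surrogate.fit(X[train], y[train])

pred = surrogate.predict(X[test])
print("R^2:", round(r2_score(y[test], pred), 3),
      "Kendall tau:", round(kendalltau(y[test], pred)[0], 3))

# Random-search HPO against the surrogate instead of costly merge+eval runs.
candidates = rng.uniform(0.0, 1.0, size=(10_000, 64))
best_config = candidates[np.argmax(surrogate.predict(candidates))]
```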

5. Quantitative Results and Comparative Analysis

Benchmarks report absolute and relative performance with respect to (a) the best individual expert ("oracle"), (b) the base pretrained model, and (c) the best baseline merging method. Key summary results include:

| Method | Setting | Multi-task Score (%) | Forgetting (%) | Efficiency |
| --- | --- | --- | --- | --- |
| DELLA | LLaMA-2-13B | 58.2 (avg) | N/A | Greedy |
| TA (no pruning) | LLaMA-2-13B | 46.5 (avg) | N/A | Greedy |
| TIES | LLaMA-2-13B | 54.1 (avg) | N/A | Greedy |
| Fisher | CLIP-ViT-B/32 | 70.6 | 60.6* | Fastest |
| RegMean | CLIP-ViT-B/32 | 80.5 | 65.7* | Moderate |
| Localize-Stitch | MergeBench | 90–105 (normalized) | >100% retained | Moderate |
| Layerwise Ada | CLIP/T5 | 82.6–88.5 | N/A | Slower |
| KnOTS-TIES | ViT/LLaMA-3 | 68.0–92.9 | N/A | SVD-based |
| Twin-Merging | Qwen, GLUE | Up to 102.4 | Retains FT/SOTA | Adaptive |
| Pareto Merging | ViT | 85.2 (best) | Transfer improves | One-shot |
| OptMerge | InternVL/Qwen | 56.8+ | N/A | 5–100× faster than SFT |

*Unseen tasks

Several findings have emerged:

  • Task Arithmetic remains the only method that reliably yields constructive interference in general LLM settings (Hitit et al., 26 Nov 2025).
  • Methods employing magnitude pruning with scaling, e.g. DELLA (Deep et al., 17 Jun 2024), deliver substantial gains (+11.1 points over TA, +3.6 over TIES, +1.2 over DARE), especially on disjoint multi-domain merges.
  • SVD-based KnOTS improves LoRA merging by up to 4.3% and enables generalization to joint-task settings (Stoica et al., 25 Oct 2024).
  • FlexMerge demonstrates that modest increases in model size over the 1× baseline (e.g., 2–3×) recover near-finetuned accuracy over 30 domains (Dhasade et al., 29 May 2025).
  • Multi-objective benchmarks (Pareto Merging (Chen et al., 22 Aug 2024)) enable user-preference trade-offs and strong transfer to unseen tasks (e.g., +4.3% over AdaMerging in transfer).
  • Optimization-augmented merges (OptMerge, WUDI) demonstrate robust capability unification for multi-modal LLMs (Wei et al., 26 May 2025).

6. Specialized Benchmarks: Reasoning, Multimodal, Graph

Recent model merging benchmarks expand the settings:

  • Tunable Reasoning: “Thinking Spectrum” (Lan et al., 26 Sep 2025) quantifies the continuum of accuracy–efficiency trade-offs by varying the merge strength α between "direct" and "thinking" models, demonstrating instances of Pareto improvement where merged models are both more accurate and more efficient than any parent (see the interpolation sketch after this list).
  • Multimodal and Cross-domain: OptMerge’s MLLM-Merging Benchmark (Wei et al., 26 May 2025) combines vision, audio, and video modalities into a single LLM, using both LoRA- and full-tuning experts; the merged models match or exceed multi-task fine-tuning performance while being over 5–100× faster to create.
  • Graph Neural Networks: GNNMerge (Garg et al., 5 Mar 2025) proposes the first task-agnostic graph model merging benchmark, with analytical closed-form solutions for embedding alignment, yielding up to +24% accuracy improvement over competing baselines and >100× speedup.
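
One simple way the merge-strength knob in the tunable-reasoning setting above can be realized is linear interpolation of the two checkpoints' parameters; the sketch below illustrates this under the assumption of a shared architecture, and the paper's exact recipe may differ. The state_dicts here are toy stand-ins.

```python
import torch

def interpolate(direct_sd, thinking_sd, alpha: float):
    """theta(alpha) = (1 - alpha) * theta_direct + alpha * theta_thinking."""
    return {k: (1.0 - alpha) * direct_sd[k] + alpha * thinking_sd[k]
            for k in direct_sd}

# Toy stand-ins for the "direct" and "thinking" checkpoints.
direct_sd = {"w": torch.tensor([1.0, 0.0]), "b": torch.tensor([0.5])}
thinking_sd = {"w": torch.tensor([0.0, 1.0]), "b": torch.tensor([0.1])}

# Sweeping alpha traces the accuracy-efficiency spectrum described above.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    merged = interpolate(direct_sd, thinking_sd, alpha)
    print(alpha, merged["w"].tolist(), merged["b"].tolist())
```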

7. Current Limitations and Directions

Model merging benchmarks are essential for practical evaluation but have several outstanding limitations:

  • Computational scalability: Although faster than retraining, some methods (e.g., Fisher, mask training) remain costly on large models (2505.10833).
  • Heterogeneous/federated merging: Extending benchmarks beyond isomorphic architectures or across private models (heterogeneous, decentralized/federated scenarios) remains challenging (Zhang et al., 17 Oct 2024).
  • Robustness and alignment: Current benchmarks expand to multi-objective and alignment tasks—in particular, RESM achieves state-of-the-art 3H (Helpfulness, Honesty, Harmlessness) tradeoffs via outlier-aware, rank-adaptive SVD merges (Yang et al., 8 Feb 2025).
  • Theory–practice gap: Empirical results show many subspace/interference-aware methods degrade on large LLMs due to violation of low-rank/orthogonality assumptions (Hitit et al., 26 Nov 2025). The field is moving towards fine-grained automated search spaces (layerwise fusion, dynamic routing (Su et al., 6 Feb 2025, Lu et al., 17 Jun 2024)), surrogate-based optimization (Akizuki et al., 2 Sep 2025), and Pareto-front delivery (Chen et al., 22 Aug 2024).

Benchmark Availability

Code and reproducibility artifacts are available for most major benchmarks.

References

These resources collectively define the state-of-the-art in model merging evaluation and provide the core reference point for comparative empirical research in the field.
