Model Merging in Neural Networks
- Model merging is a technique that combines multiple fine-tuned neural networks into a unified multi-task model without retraining, enabling cross-domain expertise.
- It leverages convex quadratic programming, task vector arithmetic, and geometry-aware methods to optimize merge weights and ensure robust integration.
- The approach reduces storage and inference costs while supporting scalable deployment across vision, language, and multimodal applications.
Model merging is the process of combining several fine-tuned neural network models—each specialized on different tasks or domains—into a single multi-task model, typically without requiring retraining on the original data. The primary objective is to aggregate expertise efficiently, reduce storage and inference costs compared to ensembling, and enable cross-domain generalization. Model merging underpins a broad range of scenarios, from multi-task LLMs to multimodal foundation models in vision, language, and beyond.
1. Mathematical Foundations and Theoretical Guarantees
Model merging typically assumes a shared pre-trained backbone with parameters $\theta_0$ and $K$ fine-tuned variants with parameters $\theta_1, \dots, \theta_K$. The merging problem is formalized as searching for a set of merge weights $\alpha_1, \dots, \alpha_K$ such that the merged model matches each expert $\theta_k$'s outputs as closely as possible, with the mismatch on each input measured by a pointwise residual in output space (Evans et al., 27 May 2026).
A key advance is casting merge-weight selection as a convex quadratic program (QP) over the residuals. The squared-output calibration objective sums the squared residuals over a calibration set $\mathcal{D}$ (labels optional). Because the objective is convex, any minimizer $\alpha^\star$ is global, and it is unique when the quadratic form is strictly convex. This QP subsumes and generalizes many heuristic merges, such as model soups (uniform averaging), task arithmetic, TIES, and DARE, as special cases (Evans et al., 27 May 2026).
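A minimal sketch of merge-weight selection as a least-squares QP follows. It rests on a simplifying assumption that is not taken from the cited paper: the merged model's output change on each calibration input is treated as a linear combination of the experts' output changes relative to the backbone. The function name, the array layout, and the `ridge` parameter are hypothetical.

```python
import numpy as np

def solve_merge_weights(deltas, task_ids, ridge=1e-8):
    """Solve a small convex QP for merge weights (illustrative assumption only).

    deltas:   (K, N, D) output change of each of K experts, relative to the
              backbone, on N calibration inputs with D-dimensional outputs.
    task_ids: (N,) index of the expert whose output is the target for input n.
    Returns alpha (K,) minimizing
        sum_n || sum_j alpha_j * deltas[j, n] - deltas[task_ids[n], n] ||^2.
    """
    deltas = np.asarray(deltas, dtype=float)
    task_ids = np.asarray(task_ids)
    K, N, D = deltas.shape
    targets = deltas[task_ids, np.arange(N)]   # (N, D): the matching expert's output change
    A = deltas.reshape(K, N * D).T             # (N*D, K) design matrix
    t = targets.reshape(N * D)
    G = A.T @ A + ridge * np.eye(K)            # small ridge keeps the QP strictly convex
    b = A.T @ t
    return np.linalg.solve(G, b)               # stationary point of the quadratic
```

The ridge term makes the Gram matrix positive definite, so the solution is unique even when two experts produce nearly collinear output changes on the calibration set.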
Further theoretical tools include projection diagnostics. The fraction of residual energy captured by a chosen basis, measured against the total residual-energy matrix, quantifies how much of the multi-task variation is accessible to the merge and robustly predicts merge success (Evans et al., 27 May 2026).
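One way to read this diagnostic is sketched below. It assumes that residuals are stacked as rows of a matrix and that the basis is given as full-column-rank columns spanning a subspace of output space. Both conventions, and the function name, are hypothetical rather than taken from the cited paper.

```python
import numpy as np

def energy_capture_ratio(residuals, basis):
    """Fraction of residual energy lying in span(basis).

    residuals: (N, D) matrix whose rows are residual vectors.
    basis:     (D, m) matrix with linearly independent columns (m <= D).
    Computes trace(P S) / trace(S), where S = residuals.T @ residuals is the
    total residual-energy matrix and P projects orthogonally onto span(basis).
    """
    residuals = np.asarray(residuals, dtype=float)
    Q, _ = np.linalg.qr(np.asarray(basis, dtype=float))  # orthonormal columns spanning the basis
    projected = residuals @ Q                            # coordinates inside the subspace
    return float(np.sum(projected ** 2) / np.sum(residuals ** 2))
```

A ratio near 1 means almost all residual energy is reachable within the chosen subspace, while a ratio near 0 means the basis misses most of the variation.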
2. Merging Algorithms: Taxonomy and Methodological Advances
A diverse algorithmic ecosystem has crystallized, with methods grouped as follows:
- Weight-Space Averaging: Uniform or greedy averaging of checkpoints (“model soups”) is effective when all models lie in the same contiguous basin of the loss landscape but degrades if fine-tuned directions diverge (Song et al., 10 Mar 2026).
- Task Vector Arithmetic: Merges are performed by summing and scaling task vectors $\tau_k = \theta_k - \theta_0$ (fine-tuned parameters minus pre-trained parameters), supporting operations such as addition, negation, and scaling (Song et al., 10 Mar 2026, Evans et al., 27 May 2026).
- Sparsification-Enhanced Merging: Methods like TIES trim low-magnitude task-vector entries and resolve sign conflicts to promote consensus, while DARE randomly drops task-vector entries and rescales the survivors, often improving interference control (Evans et al., 27 May 2026, Song et al., 10 Mar 2026).
- Optimal Transport and Alignment: Permutation matching across neurons or optimal transport over parameter alignments remedies architectural symmetries and parameter entanglement (Silva et al., 29 Apr 2026).
- Geometric and Manifold Approaches: Fréchet averaging defines merges via geodesic means on a Riemannian or quotient manifold, yielding algorithms invariant to architectural symmetries (e.g., LoRA adapters as points on a gauge space) (Silva et al., 29 Apr 2026).
- Stochastic and Search-Based Approaches: Mixup Model Merge (M$^3$) draws random interpolation ratios from a Beta distribution, exploring nontrivial merges unattainable with a fixed interpolation ratio (Zhou et al., 21 Feb 2025). RL-based and evolutionary methods (e.g., Reinforced Model Merging) treat merge configuration search as a Markov decision process (Han et al., 27 Mar 2025).
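Several of the methods listed above reduce to a few lines of array arithmetic once parameters are flattened into vectors. The sketch below assumes flat NumPy parameter vectors and uses hypothetical function names. The DARE and Mixup variants are simplified illustrations of drop-and-rescale and Beta-sampled interpolation, not the reference implementations.

```python
import numpy as np

def task_vector(theta_ft, theta_base):
    """Task vector: fine-tuned parameters minus pre-trained parameters."""
    return theta_ft - theta_base

def task_arithmetic(theta_base, task_vectors, scale=1.0):
    """Add the scaled sum of task vectors back onto the base parameters."""
    return theta_base + scale * np.sum(task_vectors, axis=0)

def dare(tau, drop_prob, rng):
    """Randomly zero entries with probability drop_prob and rescale the
    survivors by 1 / (1 - drop_prob), preserving the expected task vector."""
    keep = rng.random(tau.shape) >= drop_prob
    return np.where(keep, tau / (1.0 - drop_prob), 0.0)

def mixup_merge(theta_base, tau_a, tau_b, beta_param, rng):
    """Merge two task vectors with a ratio drawn from Beta(beta_param, beta_param)."""
    lam = rng.beta(beta_param, beta_param)
    return theta_base + lam * tau_a + (1.0 - lam) * tau_b, lam
```

Negation falls out of the same helper: passing `[-tau]` to `task_arithmetic` removes a task's contribution from a model that already contains it.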
The selection of merging method and its optimization hyperparameters is commonly tied to assumptions about the geometry of the loss landscape (linear mode connectivity, basin widths), the independence or correlation among fine-tuned updates, and the nature of the task vectors (Song et al., 10 Mar 2026, Evans et al., 27 May 2026, Rahamim et al., 10 Jan 2026).
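Linear mode connectivity, one of the assumptions mentioned above, can be probed by measuring the loss barrier along the straight line between two parameter vectors. The sketch below assumes a user-supplied scalar `loss_fn` over flat parameter vectors. The function name and the grid size are hypothetical.

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of the loss on the linear path between theta_a and
    theta_b over the linear interpolation of the two endpoint losses.
    A barrier near zero suggests both models sit in one connected basin."""
    ts = np.linspace(0.0, 1.0, num_points)
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    gaps = [
        loss_fn((1.0 - t) * theta_a + t * theta_b) - ((1.0 - t) * loss_a + t * loss_b)
        for t in ts
    ]
    return max(gaps)
```

For a convex loss the barrier is zero, whereas a double-well loss with one model in each well yields a positive barrier. The positive case is where plain weight averaging is expected to degrade.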
3. Extensions: Multi-Layer, Modular, and Dynamic Merging Strategies
Layer-wise and modular merging workflows extend merging flexibility and performance:
- Sequential Layer-wise Merging: For deep architectures, independent QPs are solved per layer in a greedy sequence. At each layer $\ell$, compute the residuals for that layer, solve the local QP, and update only layer $\ell$ (Evans et al., 27 May 2026). This approach addresses non-convex interactions across layers and is highly effective when only the final few layers are fine-tuned.
- Component-Wise and Modular Recombinations: Fine-grained merging decomposes models into submodules—e.g., attention, MLP, normalization layers—and searches for optimal groupings or expert recombination patterns, often using Pareto-front optimization to trade off performance vs. storage (Qiu et al., 6 Feb 2026).
- Dynamic and Input-Conditional Methods: Methods such as SE-Merging adapt merge weights dynamically for each test sample using representation similarity, yielding per-input merged models that exhibit both task separation and instance-wise adaptation without further training (Chen et al., 22 Jun 2025). Slim dynamic frameworks (e.g., DiDi-Merging) leverage differentiable rank allocation in low-rank modules, balancing shared and expert parameters for aggressive storage reduction (Du et al., 17 May 2026).
Task heterogeneity and module-level differences in mergeability necessitate modular or dynamic approaches for efficient and scalable multi-task deployment (Qiu et al., 6 Feb 2026, Du et al., 17 May 2026, Hackmann, 2024).
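The input-conditional idea can be illustrated with a small sketch in which merge weights for one sample come from a softmax over cosine similarities between the sample's representation and one prototype representation per task. This is an assumed, simplified scheme for illustration only and is not the SE-Merging procedure. The prototypes, the temperature, and all function names are hypothetical.

```python
import numpy as np

def per_input_merge_weights(sample_repr, task_prototypes, temperature=0.1):
    """Softmax over cosine similarities between one sample representation
    (shape (D,)) and K task prototypes (shape (K, D))."""
    s = sample_repr / np.linalg.norm(sample_repr)
    protos = task_prototypes / np.linalg.norm(task_prototypes, axis=1, keepdims=True)
    logits = (protos @ s) / temperature
    logits = logits - logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def merge_for_input(theta_base, task_vectors, weights):
    """Per-input merged parameters: base plus the weighted sum of task vectors.
    task_vectors has shape (K, P); weights has shape (K,)."""
    return theta_base + np.tensordot(weights, task_vectors, axes=1)
```

A low temperature pushes the weights toward a single expert, giving task separation. A higher temperature blends experts more evenly for ambiguous inputs.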
4. Theoretical and Empirical Limits of Mergeability
Rigorous analyses have identified concrete limits on mergeability:
- Upper Bound on Experts: The number of experts that can be meaningfully merged is bounded by the effective parameter space and the experts' mutual correlation. For $k$ experts with pairwise correlation $\rho$, variance reduction under uniform merging saturates at a level set by $\rho$ as $k \to \infty$; thus, marginal benefits diminish strictly as a function of Gaussian width (Wang et al., 27 May 2025).
- Diminishing Returns: Performance gains from adding experts are concave due to geometric constraints of the loss basin, and heavy correlation among task vectors rapidly saturates improvement (Wang et al., 27 May 2025).
- Accuracy-Aware Weighted Merging: Mergeability correlates strongly with the base model’s prior knowledge; knowledge that is easily accessible to the base merges more robustly. Weights should be modulated based on task familiarity to prevent rare or weak tasks from being overwhelmed in the merged model (Rahamim et al., 10 Jan 2026).
Overaggressive merging risks performance collapse due to interference, and monitoring metrics such as the marginal reduction in variance or energy-capture ratio is necessary to identify the optimal merging point (Wang et al., 27 May 2025, Evans et al., 27 May 2026).
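The saturation effect can be made concrete with the standard identity for the variance of an average of equicorrelated quantities. The sketch assumes every expert contributes equal variance `sigma2` and that all pairs share the same correlation `rho`. This is a textbook simplification and not necessarily the exact model analyzed by Wang et al.; the function names are hypothetical.

```python
def merged_variance(k, rho, sigma2=1.0):
    """Variance of the uniform average of k equal-variance quantities with
    common pairwise correlation rho: sigma2 * (rho + (1 - rho) / k)."""
    return sigma2 * (rho + (1.0 - rho) / k)

def marginal_gain(k, rho, sigma2=1.0):
    """Variance removed by going from k to k + 1 experts."""
    return merged_variance(k, rho, sigma2) - merged_variance(k + 1, rho, sigma2)
```

Under these assumptions the variance can never drop below `rho * sigma2`, and the marginal gain shrinks as `(1 - rho) * sigma2 / (k * (k + 1))`, so strongly correlated experts stop helping after only a few additions.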
5. Geometry, Symmetry, and Invariance in Model Merging
Naïve parameter averaging fails in the presence of architectural symmetries (e.g., neuron permutation, LoRA gauge). Fréchet averaging on manifolds provides a symmetry-invariant, geometry-aware solution (Silva et al., 29 Apr 2026). For low-rank adapters, alignment and averaging must respect the quotient geometry induced by invertible gauge groups, necessitating specialized algorithms (e.g., GeoMerge with Stiefel and SPD metrics).
The choice of geometry (Euclidean, Fisher–Rao, product manifolds) fundamentally shapes the feasible merge space and determines whether statistical or architectural pathologies can be avoided (Silva et al., 29 Apr 2026). Empirically, symmetry-aware merges reliably outperform parameter-space heuristics, especially in highly adapted or large-scale domains.
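The LoRA gauge problem is easy to reproduce numerically. For a low-rank update $BA$, any invertible $G$ gives another factorization $(BG)(G^{-1}A)$ of the same update, yet averaging the factors separately is not invariant to that choice. The sketch below only demonstrates the failure mode and a gauge-invariant Euclidean baseline (averaging the products). It does not implement GeoMerge or a Fréchet mean, and the function names are hypothetical.

```python
import numpy as np

def regauge(B, A, G):
    """Return an equivalent factorization (B G, G^{-1} A) of the update B @ A."""
    return B @ G, np.linalg.inv(G) @ A

def naive_factor_average(adapters):
    """Average the B factors and the A factors separately, then multiply."""
    B_mean = np.mean([B for B, _ in adapters], axis=0)
    A_mean = np.mean([A for _, A in adapters], axis=0)
    return B_mean @ A_mean

def product_average(adapters):
    """Average the full updates B @ A, which does not depend on the gauge."""
    return np.mean([B @ A for B, A in adapters], axis=0)
```

With `G = -I` the two factorizations describe the identical update, but the factor-wise average collapses to the zero matrix while the product average returns the update unchanged.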
6. Practical Considerations, Scalability, and Deployment
Model merging is attractive due to computational and practical efficiencies. Modern methods can:
- Avoid retraining or require only minimal unlabeled calibration data (e.g., a 100-shot set).
- Be implemented in a training-free (data-free) fashion by analytically estimating covariance via difference matrices (Hameed et al., 1 Apr 2026) or leveraging statistical alignment between activations and weight updates (Li et al., 13 May 2026).
- Scale to tens of models and hundreds of millions of parameters with favorable computational costs, both in per-layer compute for data-free covariance approaches and in memory for Frank–Wolfe scaling (Hameed et al., 1 Apr 2026, Chen et al., 16 Mar 2025).
- Integrate preference-aware and multi-objective optimization to present users with a Pareto set of trade-off solutions accommodating application-specific priorities (e.g., high accuracy on selected tasks, bounded storage) (Chen et al., 2024).
Dynamic routing, evolutionary search, and Bayesian hyperparameter coordination are common tools for practical hyperparameter-free deployment, as is open-source tool support (e.g., MergeKit) (Song et al., 10 Mar 2026).
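Presenting users with a Pareto set, as described above, requires filtering candidate merges down to the non-dominated ones. The sketch below is a generic post-hoc filter over per-task scores where higher is better. It is not the Pareto Merging algorithm itself, and the function name and score layout are hypothetical.

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated candidates.

    scores: (n_candidates, n_objectives) array, higher is better.
    A candidate is dominated if some other candidate is at least as good on
    every objective and strictly better on at least one.
    """
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, s in enumerate(scores):
        at_least_as_good = np.all(scores >= s, axis=1)
        strictly_better = np.any(scores > s, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    return keep
```

For four made-up candidates scored on two tasks as `[[0.9, 0.6], [0.8, 0.8], [0.7, 0.7], [0.6, 0.9]]`, the third is dominated by the second, so the filter keeps indices 0, 1, and 3.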
7. Benchmarks and Empirical Performance
State-of-the-art methods are benchmarked across vision and language domains:
- Vision: On ViT-B/32, the QP-based merge matches or exceeds the performance of all competitors, especially in the challenging multi-task regime (Evans et al., 27 May 2026, Li et al., 13 May 2026).
- Language: LLaMA-based fusions for instruction-following, coding, and math show that output-space QP, BMM, and geometry-aware methods yield the largest accuracy and robustness improvements; M$^3$ enhances OOD and adversarial performance across all baseline merges (Zhou et al., 21 Feb 2025, Li et al., 13 May 2026).
- Efficiency: Modern merging frameworks often achieve accuracy retention of 98–99% of individual expert models with as little as 1.24x parameter overhead—including in dynamic and storage-constrained settings (Du et al., 17 May 2026).
- Multi-objective: Pareto Merging methods deliver user-controllable trade-offs across all tasks, with a single run generating the spectrum of compromise solutions (Chen et al., 2024).
The collective findings demonstrate that principled model merging—rooted in convex optimization, geometric invariance, and efficient algorithmic design—enables multi-expert composition previously only practical by computationally costly or data-prohibitive methods.
References:
- "Model Merging by Output-Space Projection" (Evans et al., 27 May 2026)
- "Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation" (Zhou et al., 21 Feb 2025)
- "Generalizing the Geometry of Model Merging Through Frechet Averages" (Silva et al., 29 Apr 2026)
- "Why Do More Experts Fail? A Theoretical Analysis of Model Merging" (Wang et al., 27 May 2025)
- "Will it Merge? On The Causes of Model Mergeability" (Rahamim et al., 10 Jan 2026)
- "Bayesian Model Merging" (Li et al., 13 May 2026)
- "Dynamic Model Merging Made Slim" (Du et al., 17 May 2026)
- "SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging" (Chen et al., 22 Jun 2025)
- "Fine-Grained Model Merging via Modular Expert Recombination" (Qiu et al., 6 Feb 2026)
- "Pareto Merging: Multi-Objective Optimization for Preference-Aware Model Merging" (Chen et al., 2024)
- "Model Merging in the Era of LLMs: Methods, Applications, and Future Directions" (Song et al., 10 Mar 2026)
- "Model Merging via Data-Free Covariance Estimation" (Hameed et al., 1 Apr 2026)
- "FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization" (Chen et al., 16 Mar 2025)
- "Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking" (Chaichana et al., 29 May 2025)
- "From Task-Specific Models to Unified Systems: A Review of Model Merging Approaches" (Ruan et al., 12 Mar 2025)