WSM: Model Merging in Weight Space

Updated 11 June 2026

Model merging in weight space (WSM) is a technique that integrates multiple fine-tuned models into a single multitask model by directly combining their parameter vectors without the need for retraining.
WSM leverages geometric principles like the Fisher–Rao manifold and weighted Karcher means to preserve performance and prevent activation collapse when blending various models.
This approach is applied in multi-task LLMs, federated learning, and adapter fusion, balancing computational efficiency, scalability, and mergeability diagnostics.

Model merging in the weight space (abbreviated as WSM) denotes a family of techniques for integrating multiple fine-tuned or specialized neural network models into a single multitask model by direct manipulation and combination of their parameter vectors, without joint retraining or access to original training data. WSM methods have become pivotal in scenarios such as LLM development, domain specialization, multi-task transfer, and federated learning, owing to their computational efficiency and ability to synthesize capabilities from heterogeneous sources.

1. Core Principles and Geometric Foundations

WSM is formalized over parameter sets $\{\theta^{(i)}\}_{i=1}^N$, all sharing a common architecture or a compatible alignment. The primary goal is to construct a merged parameter vector $\theta_{\mathrm{merged}}$ such that $f(x; \theta_{\mathrm{merged}})$ retains, and ideally extends, the predictive behaviors of the constituent models on their respective tasks. A central theme in modern WSM is the move from naive linear operations in parameter space to geometric or functional formulations that reflect distances in predictive distributions.

A key development is the formulation of model merging as minimization on the Fisher–Rao (FR) manifold, where the FR distance between models $\theta, \theta'$ is locally equivalent to the KL-divergence between their predictive distributions, i.e.,

$d^2_{\mathrm{FR}}(\theta, \theta') \approx (\theta-\theta')^\top F(\theta)\, (\theta-\theta') \approx 2\,\mathrm{KL}(p_\theta \,\|\, p_{\theta'}),$

where $F(\theta)$ is the Fisher information matrix. The optimal merge is a weighted Karcher (Fréchet) mean on the manifold:

$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \alpha^{(i)} d_{\mathrm{FR}}(\theta, \theta^{(i)})^2$

subject to $\sum_i \alpha^{(i)} = 1$ and $\alpha^{(i)} \ge 0$ (Wang et al., 5 Mar 2026).
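The local relation between the Fisher quadratic form and the KL divergence can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a single categorical distribution parameterized by logits (not a neural network), for which the Fisher matrix has the closed form $\mathrm{diag}(p) - pp^\top$. The helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fisher_categorical(logits):
    """Fisher information of a categorical distribution w.r.t. its logits."""
    p = softmax(logits)
    return np.diag(p) - np.outer(p, p)

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
delta = 1e-2 * rng.normal(size=5)          # small perturbation: the relation is only local

quad = float(delta @ fisher_categorical(theta) @ delta)      # (theta - theta')^T F (theta - theta')
two_kl = 2.0 * kl(softmax(theta), softmax(theta + delta))    # 2 KL(p_theta || p_theta')
ratio = quad / two_kl                                        # close to 1 for small delta
```

The ratio approaches 1 as the perturbation shrinks and drifts away from 1 for large perturbations, which is why the equivalence is stated only locally.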

2. Major Algorithmic Paradigms

The landscape of WSM encompasses several algorithmic classes:

a. Euclidean and Heuristic Approaches

Early WSM methods operate directly in Euclidean parameter space using linear averaging, task vector arithmetic, and uniform weighting. The general merge equation is:

$\theta_{\mathrm{merged}} = \sum_{i=1}^{N} \alpha^{(i)} \theta^{(i)}.$

Variants include per-layer or per-block weights, e.g., as optimized in evolutionary frameworks or via CMA-ES to maximize validation performance (Zhang et al., 2024). More advanced pruning-based heuristics (e.g., TIES, DARE) mask magnitude-insignificant or sign-conflicting parameters to mitigate destructive interference.
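A minimal sketch of these Euclidean baselines on flat parameter vectors follows. The function names are hypothetical, and `sign_consensus_merge` is a simplified, TIES-inspired illustration (trim by magnitude, elect a sign per coordinate, average the agreeing entries); it is an assumption-laden toy, not the reference TIES or DARE procedure.

```python
import numpy as np

def linear_merge(thetas, alphas):
    """Weighted average of parameter vectors: sum_i alpha_i * theta_i."""
    return np.tensordot(np.asarray(alphas, dtype=float), np.stack(thetas), axes=1)

def task_arithmetic(theta_base, thetas, scale=1.0):
    """Add the (scaled) sum of task vectors theta_i - theta_base to the base model."""
    task_vectors = np.stack(thetas) - theta_base
    return theta_base + scale * task_vectors.sum(axis=0)

def sign_consensus_merge(theta_base, thetas, keep_frac=0.5):
    """Simplified TIES-style merge of task vectors."""
    tv = np.stack(thetas) - theta_base                       # shape (N, d)
    k = max(1, int(round(keep_frac * tv.shape[1])))
    # Per-model magnitude threshold: keep the k largest-magnitude entries.
    thresh = np.sort(np.abs(tv), axis=1)[:, -k][:, None]
    trimmed = np.where(np.abs(tv) >= thresh, tv, 0.0)
    # Elect one sign per coordinate from the summed trimmed updates.
    elected = np.sign(trimmed.sum(axis=0))
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    # Average only the entries that agree with the elected sign.
    return theta_base + (trimmed * agree).sum(axis=0) / counts

base = np.zeros(4)
a = np.array([1.0, -2.0, 0.1, 0.0])
b = np.array([1.0, 3.0, -0.1, 0.5])

avg = linear_merge([a, b], [0.5, 0.5])
ta = task_arithmetic(base, [a, b])
ties_like = sign_consensus_merge(base, [a, b], keep_frac=0.5)
```

In this toy case the second coordinate carries a sign conflict (-2 versus 3). Plain averaging cancels most of it, while the sign-consensus variant keeps only the update that agrees with the elected sign.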

b. Geometric and Information-Geometric Approaches

To overcome the representation collapse seen with Euclidean methods, geometric approaches operate on the manifold defined by the models' predictive functions. Weighted Karcher means on the sphere (used as a proxy for the Fisher–Rao manifold) give norm-preserving updates and prevent shrinkage of activation variance or effective matrix rank in deep layers. In practice this is implemented as blockwise updates on normalized parameter directions, followed by re-scaling to the source norms (Wang et al., 5 Mar 2026).
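The spherical proxy can be sketched for a single parameter block as a fixed-point iteration with the sphere's log and exp maps. This is a minimal sketch under assumptions: the function names are hypothetical, the merged norm is taken to be the weighted mean of the source norms (one possible reading of "re-scaling to source norms"), and the directions are assumed not to be antipodal.

```python
import numpy as np

def log_map(mu, x):
    """Tangent vector at unit vector mu pointing toward unit vector x."""
    cos = np.clip(mu @ x, -1.0, 1.0)
    perp = x - cos * mu
    perp_norm = np.linalg.norm(perp)
    if perp_norm < 1e-12:            # same (or antipodal) direction: no usable tangent
        return np.zeros_like(mu)
    return np.arccos(cos) * perp / perp_norm

def exp_map(mu, v):
    """Move from unit vector mu along tangent vector v on the sphere."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return mu
    return np.cos(norm) * mu + np.sin(norm) * v / norm

def karcher_merge(blocks, alphas, iters=50, tol=1e-10):
    """Weighted Karcher mean of block directions, rescaled to the mean source norm."""
    alphas = np.asarray(alphas, dtype=float)
    norms = np.array([np.linalg.norm(b) for b in blocks])
    dirs = [b / n for b, n in zip(blocks, norms)]
    mu = sum(a * d for a, d in zip(alphas, dirs))      # initialize at normalized linear mean
    mu = mu / np.linalg.norm(mu)
    for _ in range(iters):
        step = sum(a * log_map(mu, d) for a, d in zip(alphas, dirs))
        mu = exp_map(mu, step)
        mu = mu / np.linalg.norm(mu)                   # guard against numerical drift
        if np.linalg.norm(step) < tol:
            break
    return (alphas @ norms) * mu

a = np.array([2.0, 0.0, 0.0])
b = np.array([0.0, 2.0, 0.0])
merged = karcher_merge([a, b], [0.5, 0.5])
linear = 0.5 * a + 0.5 * b
```

For these two orthogonal blocks of norm 2, the linear average has norm of about 1.41, whereas the spherical merge keeps norm 2. This is the shrinkage effect that norm-preserving merges are meant to avoid.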

c. Optimization and Data-Assisted Techniques

These approaches learn per-layer or per-parameter weights by minimizing multitask validation loss:

$\alpha^* = \arg\min_{\alpha} \sum_{t} \mathcal{L}^{\mathrm{val}}_t\big(\theta_{\mathrm{merged}}(\alpha)\big)$

with gradient-based updates to the merge weights $\alpha$ (Yang et al., 2024), or parameter-wise interpolation coefficients optimized in the presence of small supervised validation sets (Camacho et al., 2024). Convex quadratic programming (QP) over residual updates, using calibration data, yields globally optimal weights that minimize squared-output errors and generalize task arithmetic and soup approaches (Evans et al., 27 May 2026).
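As a toy illustration of data-assisted weighting, the sketch below fits merge coefficients for linear models by least squares on a small calibration set. Everything here is an assumption made for illustration: the models are linear maps, the calibration targets are synthetic, and unconstrained least squares stands in as the simplest convex instance of an output-space fit. It is not the procedure of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cal, n_experts = 8, 40, 3

theta_base = rng.normal(size=d)
task_vectors = 0.5 * rng.normal(size=(n_experts, d))    # residual updates theta_i - theta_base
X_cal = rng.normal(size=(n_cal, d))                      # calibration inputs

# Synthetic targets: each calibration row is labeled by one (hypothetical) owning expert.
owner = rng.integers(0, n_experts, size=n_cal)
y_cal = np.einsum("nd,nd->n", X_cal, theta_base + task_vectors[owner])

def merged(alpha):
    """theta_merged(alpha) = theta_base + sum_i alpha_i * task_vector_i."""
    return theta_base + alpha @ task_vectors

def cal_loss(alpha):
    """Mean squared output error of the merged model on the calibration set."""
    r = X_cal @ merged(alpha) - y_cal
    return float(r @ r) / n_cal

# Squared output error is quadratic in alpha, so the minimizer solves a least-squares problem.
A = X_cal @ task_vectors.T                # output contribution of each residual update
b = y_cal - X_cal @ theta_base            # what the base model leaves unexplained
alpha_star, *_ = np.linalg.lstsq(A, b, rcond=None)

alpha_uniform = np.full(n_experts, 1.0 / n_experts)
```

Because the objective is convex in the coefficients, `alpha_star` cannot do worse on the calibration set than uniform weights. Whether it generalizes beyond the calibration data is a separate question that this sketch does not address.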

d. Dynamic and Modular Recombination

Modern frameworks introduce modularization and dynamic input-aware routers. Components are decomposed into shared and task-exclusive modules, compressed (e.g., via SVD), and a lightweight router predicts the optimal weighting for each input, yielding input-conditioned composite models (Lu et al., 2024, Qiu et al., 6 Feb 2026). Component-wise and Pareto-efficient recombination balances storage and performance via multi-objective search.
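The modular pattern described above can be sketched for one linear layer: task-specific deltas are stored as truncated SVD factors, and a router produces per-input mixing weights. This is a hedged toy, not any cited system: the router here is a softmax over dot products with fixed, hypothetical prototype vectors rather than a trained network, and all names are illustrative.

```python
import numpy as np

def compress(delta, rank):
    """Truncated SVD factors (A, B) with A @ B approximating delta."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def route(x, prototypes, temperature=1.0):
    """Softmax mixing weights from similarity between the input and task prototypes."""
    scores = prototypes @ x / temperature
    scores = scores - scores.max()
    w = np.exp(scores)
    return w / w.sum()

def forward(x, W_shared, experts, prototypes):
    """Apply the shared weight plus an input-conditioned mix of compressed task deltas."""
    w = route(x, prototypes)
    W = W_shared + sum(wi * (A @ B) for wi, (A, B) in zip(w, experts))
    return W @ x, w

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(4, 6))
# Rank-1 task deltas, so rank-1 compression is lossless in this toy setup.
deltas = [np.outer(rng.normal(size=4), rng.normal(size=6)) for _ in range(2)]
experts = [compress(delta, rank=1) for delta in deltas]
prototypes = rng.normal(size=(2, 6))

x = rng.normal(size=6)
y, w = forward(x, W_shared, experts, prototypes)
```

Storing only the factors costs `rank * (out + in)` numbers per task instead of `out * in`, which is the storage side of the storage/performance trade-off mentioned above.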

e. Bayesian and Covariance-Aware Methods

Bayesian model merging leverages anchor priors and activation-based Bayesian regression, with bi-level optimization (inner: module-wise closed-form MAP estimation; outer: global Bayesian optimization for hyperparameters) (Li et al., 13 May 2026). Data-free variants estimate required Gram matrices from weight differences, obviating the need for calibration data (Hameed et al., 1 Apr 2026).
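A module-wise closed-form MAP step can be illustrated with a generic Gaussian-prior regression. The objective below is an assumption chosen for illustration, not the cited formulation: it matches each expert's layer outputs on that expert's activations while penalizing distance from an anchor weight, $\min_W \sum_i \|X_i W - X_i W_i\|_F^2 + \lambda \|W - W_{\mathrm{anchor}}\|_F^2$. The outer hyperparameter search over $\lambda$ is not shown.

```python
import numpy as np

def map_merge(grams, weights, W_anchor, lam):
    """Closed-form minimizer: (sum_i G_i + lam I) W = sum_i G_i W_i + lam W_anchor."""
    d = W_anchor.shape[0]
    lhs = sum(grams) + lam * np.eye(d)
    rhs = sum(G @ W for G, W in zip(grams, weights)) + lam * W_anchor
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
d_in, d_out = 5, 3
W_anchor = rng.normal(size=(d_in, d_out))                    # e.g. the base-model weight
experts = [W_anchor + 0.1 * rng.normal(size=(d_in, d_out)) for _ in range(2)]
acts = [rng.normal(size=(20, d_in)) for _ in range(2)]       # synthetic layer inputs per expert
grams = [X.T @ X for X in acts]                              # Gram matrices G_i = X_i^T X_i

W_map = map_merge(grams, experts, W_anchor, lam=1.0)
```

A very large `lam` pulls the solution to the anchor, and a very small `lam` with a single expert recovers that expert. Only the Gram matrices enter the solve, which is why data-free variants can work by estimating them instead of collecting activations.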

f. Specialized and Heterogeneous Fusion

Merging non-identical architectures relies on output distribution alignment, vocabulary mapping, and probabilistic fusion, allowing distributional behaviors to be merged without direct parameter combination (Zhang et al., 2024). Model Assembly Learning extends WSM to model-zoo-style layer-wise assembly with permutation-padded alignment between heterogeneous models (Zhang et al., 27 Mar 2025).
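A toy version of output-level fusion for one next-token prediction: two models with different vocabularies are aligned on the union of their token strings and mixed with fixed weights. The exact-string alignment and the plain weighted mixture are simplifying assumptions for illustration. Real vocabulary mapping between tokenizers is considerably more involved.

```python
def fuse_distributions(dists, weights):
    """Weighted mixture of token distributions over the union vocabulary."""
    vocab = sorted(set().union(*dists))
    fused = {
        tok: sum(w * d.get(tok, 0.0) for w, d in zip(weights, dists))
        for tok in vocab
    }
    total = sum(fused.values())           # renormalize in case weights do not sum to 1
    return {tok: p / total for tok, p in fused.items()}

model_a = {"a": 0.7, "b": 0.3}            # hypothetical next-token distribution
model_b = {"a": 0.2, "c": 0.8}            # different vocabulary: has "c", lacks "b"
fused = fuse_distributions([model_a, model_b], [0.5, 0.5])
```

Tokens missing from a model's vocabulary are treated as probability zero for that model, so the fused distribution keeps mass on tokens that only one source can emit.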

The table below summarizes key classes:

Paradigm	Methodological Principle	Notable Example(s)
Euclidean/Averaging	Linear/interpolative in $\theta$	Model Soup, Task Arithmetic
Geometric/FR	Karcher mean on FR manifold	Spherical proxy fixed-point
Optimization-based	(Bi-level) supervised/validation tuning	SuperMerge, Output-Space QP
Dynamic/Modular	Component-wise, router-based	Twin-Merging, MERGE
Bayesian	MAP with anchor priors	Bayesian Model Merging
Covariance-aware	Layer-wise interference minimization	ACTMat
Distributional/heterogeneous	Output alignment/fusion	Unconstrained Model Merging

3. Empirical Performance and Scalability

State-of-the-art studies demonstrate several critical empirical properties:

Collapse Avoidance: Geometry-aware or norm-preserving methods (e.g., Fisher–Rao Karcher mean) preserve activation variance and effective rank, preventing the rapid degradation observed in naive Euclidean blends as the number and heterogeneity of experts increase. On Qwen2.5-14B, Karcher merging yields an average task performance of 0.610, versus 0.542 (LERP) and 0.239 (Multi-SLERP) at the same expert count (Wang et al., 5 Mar 2026).
Combinatorial Reasoning: Layer-wise or distribution-based fusion can produce emergent abilities, outperforming even the best individual source expert. For instance, merging math and code experts produces superior results on code-solving-math tasks, exceeding the performance of both original models (Zhang et al., 2024).
Scalability Constraints: Theoretical analysis predicts a sharp upper bound on the number of mergeable experts, with diminishing returns due to parameter space saturation (Gaussian width analysis) (Wang et al., 27 May 2025). Empirically, expert count saturation typically occurs at 4–6 for vanilla merges, but can be extended with heavy-tailed parameter reparameterization and modularization.

4. Practical Algorithms, Implementation Guidelines, and Diagnostics

Efficient WSM necessitates careful trade-offs in computational, memory, and data requirements:

Block-wise/parallel merging exploits tensor structure for both memory efficiency and parallelization. Merging is practical within a limited iteration budget and for tens of models (Wang et al., 5 Mar 2026).
Memory management: Hierarchical merging (e.g., tree-based merging of small subsets) bounds peak memory, enabling scaling to larger collections of models (Yang et al., 2024).
Minimal validation requirements: Many optimization-based methods require only tens of examples per task for effective parameter estimation.
Performance diagnostics: Variance, effective-rank preservation, and residual energy (fraction captured by output-space basis) act as forward diagnostics for merge reliability (Wang et al., 5 Mar 2026, Evans et al., 27 May 2026).
Tunable objectives: Supervised merging weights, router architectures, SVD ranks, and merge duration (for checkpoint aggregation in pre-training) provide levers for balancing capacity, generalization, and compute.
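Two of the diagnostics above can be sketched for a single weight matrix. The definitions are assumptions, since the text does not fix them: effective rank is taken as the exponential of the entropy of the normalized singular values, and activation variance as the variance of the layer outputs on probe inputs. The conflicting-expert setup is synthetic.

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """exp(entropy) of the normalized singular-value distribution of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                        # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

def activation_variance(W, X):
    """Variance of the layer outputs X @ W.T over all probe inputs and units."""
    return float((X @ W.T).var())

def variance_ratio(W_merged, sources, X):
    """Merged activation variance relative to the mean source variance."""
    source_var = np.mean([activation_variance(W, X) for W in sources])
    return activation_variance(W_merged, X) / source_var

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))             # probe inputs
W1 = rng.normal(size=(4, 4))
W2 = -0.8 * W1                            # a strongly conflicting expert
W_avg = 0.5 * (W1 + W2)                   # naive average: mostly cancels

ratio = variance_ratio(W_avg, [W1, W2], X)
rank_identity = effective_rank(np.eye(4))
rank_one = effective_rank(np.outer(rng.normal(size=4), rng.normal(size=4)))
```

A variance ratio far below 1, or an effective rank well below that of the sources, flags a merge that has collapsed before any task evaluation is run.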

5. Applications, Ecosystem, and Limitations

WSM is now widely used in:

Multi-task and instruction-following LLMs: Unifying domain-specific LLMs for math, code, translation, and safety alignment without retraining (Zhang et al., 2024, Song et al., 10 Mar 2026).
Federated and privacy-preserving learning: Decentralized (disjoint, private) experts are merged without central data exposure (Camacho et al., 2024).
Model compression and adapter fusion: Low-rank LoRA modules or SVD-compressed experts enable memory-efficient storage and reversible merging (Alipour et al., 15 Oct 2025).
Continual and decentralized LLM ecosystems: Community-driven merging enables composition of public/private experts into next-generation models (Zhang et al., 2024).
Dynamic routing and input-aware specialization: Modular expert libraries with input-conditioned routers support efficient batch inference and on-demand adaptation (Lu et al., 2024, Qiu et al., 6 Feb 2026).

Key toolkit and benchmark initiatives include MergeKit, FusionBench, and standardized multi-task evaluation suites.

Limitations and open challenges include:

Mergeability prediction: Despite advances, a general theory predicting which fine-tuned models will merge successfully is incomplete. Empirical diagnostics such as base-model prior accuracy and empirical mergeability scores are currently the most reliable predictors (Rahamim et al., 10 Jan 2026).
Beyond architecture homogeneity: Most methods assume identical, aligned architectures, although heterogeneous-architecture strategies exist via output distribution fusion (Zhang et al., 2024) and padded/permuted assembly (Zhang et al., 27 Mar 2025).
Scaling to larger numbers of experts: Theoretical and empirical saturation is a fundamental issue; only advanced approaches with dynamic, modular, or heavy-tailed augmentation scale further (Wang et al., 27 May 2025).
Cross-modal and generative scenario extension remains a frontier.

6. Theoretical Bounds, Mergeability, and Failure Modes

Weight-space merging is inherently constrained by the dimensionality and effective geometry of the parameter space:

Mode connectivity and loss landscape geometry are preconditions; interpolation is effective only when solutions reside within the same basin.
Mergeability is highly non-uniform across fine-tuned updates; base model prior knowledge is the dominant predictor of a weight update’s survival probability under merging (Rahamim et al., 10 Jan 2026).
The upper bound on the number of merged experts is dictated by the residual variance: in uniform-merge scenarios with correlated experts, the advantage over the base model decays as the number of experts $N$ grows.
Gaussian width analysis shows marginal gain per added expert decays strictly concavely: fast initial benefit, followed by diminishing returns and statistical saturation (Wang et al., 27 May 2025).
Failure modes and mitigation: Severe performance collapse arises from norm shrinkage, rank reduction, and parameter-space conflicts, especially with interfering, highly heterogeneous, or poorly aligned experts. Geometry-aware merging, weighting by inverse base-model accuracy, sparsification, and modularization can alleviate these effects.
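The same-basin precondition can be illustrated by measuring the loss barrier along a linear interpolation path. The example is a deliberately simple assumption: a one-dimensional double-well loss with minima at -1 and +1, not a neural network loss. The barrier is defined here as the largest excess of the path loss over the straight line between the endpoint losses.

```python
import numpy as np

def loss(theta):
    """Toy double-well loss with two basins, centered at -1 and +1."""
    return (theta ** 2 - 1.0) ** 2

def barrier(theta_a, theta_b, n=101):
    """Max over the path of loss(interpolant) minus the interpolated endpoint losses."""
    t = np.linspace(0.0, 1.0, n)
    path = (1.0 - t) * theta_a + t * theta_b
    chord = (1.0 - t) * loss(theta_a) + t * loss(theta_b)
    return float((loss(path) - chord).max())

same_basin = barrier(0.9, 1.1)       # both solutions sit in the basin around +1
cross_basin = barrier(-1.0, 1.0)     # solutions sit in different basins
```

The same-basin pair shows essentially no barrier, while the cross-basin pair shows a barrier of 1 at the midpoint, where the averaged parameters land on the ridge between the two minima.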

7. Outlook and Research Frontiers

Model merging in weight space remains a fast-evolving research area. Near-term trajectories include the development of:

Dynamic, multi-granular, and adaptive merging strategies integrating modular decomposition, input-aware routing, and context-dependent fusion.
Unified frameworks for cross-architecture, cross-modal, and multi-format expert merging.
Predictive models of mergeability and automated optimization of merge strategies leveraging geometric diagnostics and meta-learning.
Integration with safety, robustness, and continual adaptation protocols, spanning both LLMs and multimodal systems.

WSM stands as a foundational tool for programmatically composing, extending, and distilling collective intelligence from distributed neural models across the modern AI landscape (Wang et al., 5 Mar 2026, Zhang et al., 2024, Yang et al., 2024, Lu et al., 2024, Evans et al., 27 May 2026, Qiu et al., 6 Feb 2026, Alipour et al., 15 Oct 2025, Tian et al., 23 Jul 2025, Rahamim et al., 10 Jan 2026, Wang et al., 27 May 2025, Hameed et al., 1 Apr 2026, Song et al., 10 Mar 2026, Ruan et al., 12 Mar 2025).