Weight-Space Merging Techniques

Updated 14 September 2025
  • Weight-space merging is the process of integrating multiple specialized neural network models by directly combining their parameters, typically via convex or weighted interpolation.
  • Advanced methods such as Fisher weighting, dual-space constraints, and mixture-of-experts strategies address issues like permutation symmetry and task interference during merging.
  • This technique enables scalable multi-task and multi-domain applications, with practical use cases in federated learning, robotics, and modular model assembly.

Weight-space merging refers to the integration of multiple neural network models by combining their parameters directly in the space of model weights. Rather than retraining from scratch or relying on distillation, this paradigm enables the formation of multi-task, multi-domain, or generalist models by interpolating, averaging, or otherwise constructing new sets of weights from pre-existing, specialized models. Theoretical and algorithmic advances in weight-space merging address issues of task interference, permutation symmetry, architectural heterogeneity, loss landscape connectivity, and the need for training-free composability. Research has extended weight-space merging from basic arithmetic averaging to sophisticated dual-space, subspace-boosted, Bayesian, disentangled, and mixture-of-experts strategies. Its applications span transformers, CNNs, federated and continual learning, and model zoo assembly, offering an alternative to joint training for scalable, robust, and democratized multi-capable models.

1. Foundations and Basic Principles

Weight-space merging fundamentally exploits the fact that many neural network solutions trained from a common initialization reside in regions of parameter space connected by low-loss paths, an observation closely related to linear mode connectivity and the geometry of neural loss landscapes. The core procedure for the simplest merging is the convex interpolation of two or more sets of model weights, such as

$$\theta_m = (1 - \lambda)\,\theta_A + \lambda\,\theta_B$$

for models $\theta_A$ and $\theta_B$ and interpolation factor $\lambda$ (often $\lambda = 0.5$ for averaging) (Lawson et al., 2023). When models are fine-tuned from a shared pre-trained initialization $\theta_{pre}$, task-specific updates can be represented as "task vectors" $\tau_x = \theta_x - \theta_{pre}$, with merging interpreted as arithmetic on these vectors. This underlies so-called "task arithmetic" methods, for which the merged model is

$$\theta_m = \theta_{pre} + \frac{1}{N} \sum_{i=1}^N \tau_i,$$

which in practice centers the merging around a known initialization, reducing the likelihood of destructive interference.
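
As a concrete illustration of task-vector merging, the following is a minimal sketch assuming PyTorch state_dicts with identical keys; the checkpoint paths and the `scale` knob are illustrative placeholders rather than part of any cited method.

```python
import torch

def merge_task_vectors(pretrained_sd, finetuned_sds, scale=1.0):
    """Average task vectors (fine-tuned minus pre-trained weights) and add
    the scaled result back onto the shared pre-trained initialization."""
    merged = {}
    n = len(finetuned_sds)
    for name, theta_pre in pretrained_sd.items():
        if not torch.is_floating_point(theta_pre):
            merged[name] = theta_pre.clone()  # leave integer buffers as-is
            continue
        # tau_i = theta_i - theta_pre for each fine-tuned expert
        tau_sum = sum(sd[name] - theta_pre for sd in finetuned_sds)
        merged[name] = theta_pre + scale * tau_sum / n
    return merged

# Usage with hypothetical checkpoint files:
# pre = torch.load("pretrained.pt")                         # shared initialization
# experts = [torch.load(p) for p in ("task_a.pt", "task_b.pt")]
# model.load_state_dict(merge_task_vectors(pre, experts))
```

With a single expert and scale 1 this reduces to plain fine-tuning; with several experts, the scale factor plays the role of the interpolation coefficient above and is typically tuned on held-out data.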

Permutation symmetries, especially in multi-layer neural nets, create additional complications. Neurons in each hidden layer (except for output) can be arbitrarily permuted, exposing the need for weight alignment—optimally matching neurons or filters prior to merging (Navon et al., 2023). Without alignment, naïve averaging can land the merged model outside well-connected loss basins.
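
To make the alignment step concrete, here is a minimal sketch for a one-hidden-layer MLP; the similarity score (inner products of incoming and outgoing weights) is an illustrative choice, not the specific cost used in the cited work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(W1_a, W2_a, W1_b, W2_b):
    """Permute model B's hidden units to best match model A's.
    Shapes: W1 is (hidden, in), W2 is (out, hidden)."""
    # Similarity between hidden unit i of A and unit j of B, combining the
    # incoming rows of W1 and the outgoing columns of W2.
    sim = W1_a @ W1_b.T + W2_a.T @ W2_b
    _, perm = linear_sum_assignment(-sim)  # maximize total matched similarity
    return W1_b[perm], W2_b[:, perm]       # re-ordered copy of model B

def average_after_alignment(W1_a, W2_a, W1_b, W2_b):
    W1_b, W2_b = align_hidden_units(W1_a, W2_a, W1_b, W2_b)
    return 0.5 * (W1_a + W1_b), 0.5 * (W2_a + W2_b)
```

Biases, if present, would be permuted with the same index; skipping this step and averaging directly can mix functionally unrelated units, which is exactly the failure mode described above.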

2. Advanced Methodologies

Recent advances have systematically addressed the principal limitations of basic weight averaging—degraded performance with diverse or task-distant models, inability to handle permutation and scale mismatches, and interference among tasks:

  • Fisher Information Weighting: Parameters are merged with weights proportional to their task-specific Fisher information, giving more influence to parameters critical to each original task (Lawson et al., 2023). For parameter $j$,

$$\theta_m^j = \frac{\sum_i F_i^j\,\theta_i^j}{\sum_i F_i^j},$$

where $F_i^j$ is the Fisher information for parameter $j$ in task $i$; a minimal implementation sketch follows this list.

  • Dual-space Constraints: MuDSC introduces merging objectives that jointly maximize the similarity of matched units in both weight and activation space, using a blended similarity matrix:

$$\arg\max_{\mathbb{P}_l} \sum_l \left\langle \alpha\,\mathbb{C}(Z_l) + (1-\alpha)\,\mathbb{C}(A_l),\, \mathbb{P}_l \right\rangle,$$

where $\mathbb{P}_l$ is a layer-wise merging matrix and $\alpha$ balances weight versus activation similarity (Xu et al., 4 Mar 2024).

  • Mixture-of-Experts (MoE) Merging: Critical (high-variance) modules, as determined by measuring parameter sensitivity to tuning, are merged via dynamic MoE routing, while non-critical layers use static arithmetic. This approach (WEMoE, E-WEMoE) adaptively routes input features through a learned combination of task-specific modules, significantly reducing task interference (Shen et al., 29 Oct 2024).
  • Low-Rank and Subspace Boosting: Centered task vectors (subtracting the average rather than the pre-trained initialization) and low-rank approximations are used to minimize unnecessary cross-task interference. SVD is used to identify and preserve principal task-specific subspaces, while "subspace boosting" counters the tendency toward rank collapse (i.e., the merged representation spanning only a low-dimensional subspace) as more models are merged (Choi et al., 11 Dec 2024, Skorobogat et al., 19 Jun 2025).
  • Adaptive Weight Disentanglement: Task interference is further reduced by adaptively extracting and subtracting shared, redundant components from task vectors, enforcing approximate orthogonality and minimizing overlap in their contributions. This is achieved by optimizing an orthogonality penalty with a norm constraint on the redundant vector (Xiong et al., 27 Nov 2024).
  • Bayesian and Optimization-based Merging: Rather than naive averaging, Bayesian formulations merge variational posterior approximations of each task and allow for flexible reweighting via surrogate likelihoods or Hessian-weighted combinations. Bayesian optimization is also used to search for optimal interpolation weights when merging checkpoints in LLM pretraining (Maldonado et al., 11 Dec 2024, Liu et al., 28 Mar 2024).
  • Renormalized SVD Alignment: The Decom-Renorm-Merge (DRM) framework performs joint SVD across weight updates, followed by a critical renormalization step to ensure alignment and comparability of features, then entry-wise merging in the decomposed space (Chaichana et al., 29 May 2025).
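
Returning to the Fisher Information Weighting bullet, the following is a minimal sketch, assuming classification models with identical architectures and per-task data loaders yielding (inputs, labels) batches; the diagonal Fisher is approximated by averaged squared gradients of the log-likelihood, and the `eps` guard is an implementation convenience rather than part of the cited formulation.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, n_batches=10):
    """Approximate the diagonal Fisher by averaging squared gradients of the
    log-likelihood over a few batches."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    seen = 0
    for x, y in data_loader:
        if seen >= n_batches:
            break
        model.zero_grad()
        loss = F.nll_loss(F.log_softmax(model(x), dim=-1), y)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += 1
    return {n: f / max(seen, 1) for n, f in fisher.items()}

def fisher_merge(models, fishers, eps=1e-8):
    """Per-parameter weighted average: theta_m^j = sum_i F_i^j theta_i^j / sum_i F_i^j."""
    param_dicts = [dict(m.named_parameters()) for m in models]
    merged = {}
    for name in param_dicts[0]:
        num = sum(f[name] * p[name].detach() for p, f in zip(param_dicts, fishers))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den
    return merged
```

The merged dictionary covers parameters only, so it would be loaded into a copy of the architecture with `load_state_dict(merged, strict=False)`, taking any buffers (e.g., batch-norm statistics) from one of the experts.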

3. Weight-Space Merging in Heterogeneous and Modular Architectures

Merging models with architectural differences requires new strategies:

  • Layer Alignment and Elastic Zipping: For depth heterogeneity, the deeper model is partitioned into segments whose representations are aligned with those of the shallower model using similarity metrics, and the segments are merged layer-wise (a simplified layer-matching sketch follows this list). For width heterogeneity, "elastic neuron zipping" projects weights from differently sized layers into a common space and merges similar neurons, lifting the constraint of matched layer dimensions (Xu et al., 29 Dec 2024).
  • Model Assembly Learning (MAL): Merging can be performed at the granularity of individual layers, with generalized (and, in the dimension-mismatched case, bidirectional) permutations applied to align layers of different sizes. MAL selects and aligns layers across models in a model zoo, allowing "modular" construction of novel architectures with layer-wise linear mode connectivity (Zhang et al., 27 Mar 2025).
  • Metric Space Merging via Directed Graphs: In highly modular systems, interactions between components can be formalized using metric spaces whose distances are merged over a directed graph, enabling product-space integration where the metric structure aligns with task or module relationships (Can et al., 9 May 2025).
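
As a simplified illustration of similarity-driven layer alignment (not the exact procedure of the cited papers), the sketch below matches each layer of a shallower model to its most similar layer in a deeper one using linear CKA on activations collected from shared probe inputs.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / (den + 1e-12)

def match_layers(shallow_acts, deep_acts):
    """Match each shallow-model layer to its most similar deep-model layer.
    Both arguments are lists of (n_probes, width) activation matrices."""
    return [(i, int(np.argmax([linear_cka(xs, yd) for yd in deep_acts])))
            for i, xs in enumerate(shallow_acts)]
```

In a depth-heterogeneous merge of the kind described above, such matches would define the segment boundaries whose weights are then combined layer-wise; this sketch covers only the matching step.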

4. Empirical Assessment and Loss Landscape Analysis

Experiments across vision, language, and robotic control tasks reveal:

  • Scalability: With approaches that emphasize alignment, subspace preservation, or dynamic routing, weight-space merging scales to the integration of 10–20 or more expert models with minimal performance loss (Ye et al., 2023, Shen et al., 29 Oct 2024, Skorobogat et al., 19 Jun 2025).
  • Connectivity and Low-Loss Basins: Loss landscape visualizations (e.g., interpolating between model weights or plotting merged model positions under PCA) confirm that advanced merging techniques, such as dual-space or subspace-boosted methods, land the merged solution in regions with consistently low loss for all tasks, whereas naïve averaging may be trapped near the basin of only one task (Xu et al., 4 Mar 2024); a minimal interpolation probe is sketched after this list.
  • Task Interference and Orthogonality: Methods that enforce or approximate orthogonality between task vectors (either directly with penalties (Xiong et al., 27 Nov 2024) or via SVD-like procedures (Skorobogat et al., 19 Jun 2025)) exhibit reduced task interference, as measured by the gap between per-task and merged-task performance.
  • Diminishing Returns and Rank Collapse: Merging more experts without compensation leads to diminishing performance improvements, explained by "rank collapse"—energy in the merged task vector becomes restricted to a few principal directions.
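
The connectivity check mentioned above can be run with a few lines of code. This is a minimal sketch, assuming two full PyTorch state_dicts with identical keys and a user-supplied `evaluate(model, loader)` that returns a scalar loss.

```python
import copy
import torch

def interpolation_curve(model, sd_a, sd_b, evaluate, loader, n_points=11):
    """Evaluate the loss along theta(lam) = (1 - lam) * theta_A + lam * theta_B."""
    losses = []
    for i in range(n_points):
        lam = i / (n_points - 1)
        blended = {k: (1 - lam) * sd_a[k] + lam * sd_b[k]
                      if torch.is_floating_point(sd_a[k]) else sd_a[k]
                   for k in sd_a}
        probe = copy.deepcopy(model)        # leave the original model untouched
        probe.load_state_dict(blended)
        losses.append((lam, evaluate(probe, loader)))
    return losses  # a flat, low curve indicates linear mode connectivity
```

A pronounced bump in the returned curve corresponds to an interpolation barrier, i.e. the merged model falling outside a shared low-loss basin.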

5. Practical Applications and Implications

Weight-space merging profoundly impacts several application domains:

  • Multi-task and Multi-domain Deployment: By integrating individually fine-tuned models without retraining, organizations can deploy single models with broad, robust task coverage while minimizing engineering, storage, and privacy challenges (Ye et al., 2023, Wang et al., 14 Nov 2024).
  • Federated and Continual Learning: Merging supports decentralized model improvement and flexible addition of new tasks (via sequential orthogonal projection or memory-efficient schemes), with constant memory cost and minimal interference (Xu et al., 22 Aug 2024, Tang et al., 16 Jan 2025).
  • Robotics and Control: For sequential decision transformers, merging enables the decentralization of policy specialization and collaborative policy formation without requiring full data centralization or expensive retraining (Lawson et al., 2023).
  • Bayesian Optimization in LLM Pretraining: Checkpoint merging combined with Bayesian optimization can increase performance and generalization without extra training, providing an efficient way to select or compose model "soups" (Liu et al., 28 Mar 2024); see the sketch after this list.
  • Heterogeneous Modular Assembly: Via model assembly and elastic zipping, practitioners construct adaptable, modular neural systems by fusing task experts of divergent size and shape (Zhang et al., 27 Mar 2025, Xu et al., 29 Dec 2024).
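
As a hedged sketch of the checkpoint-merging idea above (not the procedure of the cited paper), the snippet below hands the mixing weights of a checkpoint soup to an off-the-shelf Bayesian optimizer; it assumes the optional scikit-optimize package and a user-supplied `evaluate` that returns a float validation loss.

```python
import numpy as np
import torch
from skopt import gp_minimize  # assumes scikit-optimize is installed

def soup(state_dicts, weights):
    """Weighted average of several checkpoints (weights assumed to sum to 1)."""
    merged = {}
    for k in state_dicts[0]:
        if torch.is_floating_point(state_dicts[0][k]):
            merged[k] = sum(w * sd[k] for w, sd in zip(weights, state_dicts))
        else:
            merged[k] = state_dicts[0][k]
    return merged

def search_soup_weights(model, state_dicts, evaluate, val_loader, n_calls=25):
    """Let a Bayesian optimizer pick the mixing weights (softmax of raw scores)."""
    def objective(raw):
        w = np.exp(raw) / np.exp(raw).sum()
        model.load_state_dict(soup(state_dicts, w.tolist()))
        return evaluate(model, val_loader)  # validation loss, lower is better
    dims = [(-3.0, 3.0)] * len(state_dicts)
    result = gp_minimize(objective, dims, n_calls=n_calls, random_state=0)
    best = np.exp(result.x) / np.exp(result.x).sum()
    return best, result.fun
```

The softmax reparameterization keeps the search unconstrained while guaranteeing a convex combination of checkpoints.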

6. Algorithmic and Theoretical Considerations

Weight-space merging methods face structural and theoretical constraints:

  • Permutation Symmetry: Permutation invariance of neuron ordering must be resolved for successful merging; methods range from combinatorial optimization to learning-based fast alignment (Navon et al., 2023).
  • Subspace Geometry and Spectral Properties: Merging efficacy is dictated by the singular value spectrum of the task vector space. Techniques such as SVD, subspace boosting, and HO-GSVD quantify and preserve diversity, facilitating interpretable and effective merging (Skorobogat et al., 19 Jun 2025); a simple spectrum diagnostic is sketched after this list.
  • Loss Function Geometry and Interpolation Barriers: Regularization (Fisher weighting, weight scope alignment) and orthogonality-based projections explicitly address connectedness or barriers in the loss landscape, ensuring that merging does not induce catastrophic performance drops between tasks (Xu et al., 22 Aug 2024).
  • Regularization and Distribution Matching: Weight scope alignment and related methods control for scale and mean mismatches across models, regularizing towards a shared statistical target and facilitating more effective averaging (Xu et al., 22 Aug 2024, Choi et al., 11 Dec 2024).
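
To make the spectral point concrete, the sketch below stacks flattened task vectors into a matrix and reports its singular values together with an entropy-based effective rank; this is a generic diagnostic, not the HO-GSVD or subspace-boosting machinery of the cited work.

```python
import numpy as np

def task_vector_spectrum(task_vectors):
    """task_vectors: list of 1-D arrays, each a flattened theta_i - theta_pre."""
    M = np.stack(task_vectors)               # (num_tasks, num_params)
    s = np.linalg.svd(M, compute_uv=False)   # singular values, descending
    p = s / s.sum()                          # normalized spectrum
    effective_rank = float(np.exp(-(p * np.log(p + 1e-12)).sum()))
    return s, effective_rank

# An effective rank far below the number of merged experts means most of the
# merged update lives in a few shared directions, i.e. the rank collapse
# discussed in Section 4.
```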

7. Limitations and Ongoing Directions

Despite impressive empirical results, weight-space merging is constrained by:

  • Task Similarity and Interference: High task similarity (non-orthogonal data or objectives) can lead to destructive interference; methods based on explicit orthogonality, careful task vector selection (via generalized SVD or alignment matrices), or task disambiguation are critical.
  • Model Heterogeneity: Non-matching architectures historically posed major barriers, but recent advances have ameliorated this by working at the level of segments, elastic projections, or modular selection.
  • Over-regularization and Flexibility: Excessive constraint (e.g., too-strong regularization or pruning) can underfit, while under-constrained merging may incur uncontrolled drift.
  • Scalability: As the number or diversity of models increases, maintaining effective rank, stability, and reasonable computational efficiency becomes paramount. Subspace-boosted and memory-efficient merging are active research topics.

A plausible implication is that future research will further integrate spectral and geometric insights, modular approaches, and data-free or privacy-preserving merging strategies, making weight-space merging a practical default in both academic and production multitask settings.
