Parameter-Space Model Fusion

Updated 4 December 2025

Parameter-space model fusion is a method that directly combines trained neural network parameters into a single model for one-pass inference.
It employs techniques such as mode connectivity, alignment, weight averaging, and Bayesian optimization to overcome high-dimensional challenges.
This approach enhances generalization, multi-task performance, and federated learning efficiency while reducing deployment complexity.

Parameter-space model fusion is a family of techniques that merge multiple sets of trained model parameters (typically weight vectors or tensors) directly in weight space, producing a single network capable of one-pass inference. Unlike classical output-space ensembles (e.g., voting or averaging predictions), parameter-space fusion eliminates the need to store and deploy multiple models, and it aims to inherit the competencies of each fused parent. This approach underpins advances in deep learning, multi-source data integration, efficient distributed inference, and statistical calibration, yet it presents distinct algorithmic, statistical, and computational challenges, particularly in the high-dimensional settings characteristic of modern large-scale neural networks (Li et al., 2023).

1. Conceptual Foundation and Problem Definition

Parameter-space fusion seeks a new parameter vector $\theta^* \in \mathbb{R}^d$ that synthesizes information from a set of pre-trained models $\{\theta_k\}_{k=1}^K$, yielding a single, self-contained network. Formally, let $f(x; \theta)$ denote a model parameterized by $\theta$, and suppose $K$ trained models $\{f(x; \theta_k)\}_{k=1}^K$ exist, possibly trained on different datasets, on different tasks, or from different initializations. The fusion process produces $\theta^*$ such that $f(x; \theta^*)$ incorporates knowledge and skills from multiple $\theta_k$ while supporting standard inference (Li et al., 2023).

Parameter-space fusion is distinguished from output-space (ensemble) fusion that aggregates predictions; here, the merge is performed directly in the parameter (weight) space, allowing for computational efficiency and seamless deployment. Primary motivations include improved generalization, robustness, multi-task performance, and federated learning scalability.

2. Core Methodological Families

Parameter-space model fusion spans a spectrum of algorithms, each addressing the intrinsic challenges of aligning, combining, and optimizing high-dimensional parameter sets.

2.1 Mode Connectivity

Mode connectivity exploits the empirical observation that independently trained minima of deep networks, $\theta_A$ and $\theta_B$, can often be joined by a continuous path $\theta(t)$, $t\in[0,1]$, along which the loss remains low. The canonical mode connectivity problem is:

$\min_{\theta(\cdot)} \max_{t\in[0,1]} L(\theta(t)) \qquad \text{subject to}~\theta(0)=\theta_A,~\theta(1)=\theta_B.$

Variants include linear paths (linear mode connectivity, LMC), piecewise-linear paths, quadratic Bézier curves, subspace connectors, and robust mode connectivity (RMC), which incorporates adversarial perturbations for resilience (Li et al., 2023). These algorithms search for regions of the loss landscape that support effective fusion, and points along the resulting paths can serve as initializations for further training or fusion.
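The min-max objective above can be made concrete on a toy problem. The following is a minimal sketch, not a method from the cited work: the ring-shaped loss, the two endpoints, and the hand-picked Bézier control point `theta_c` are all assumptions chosen for illustration (in practice the control point would be learned). It evaluates $\max_t L(\theta(t))$ on a grid for a straight line and for a quadratic Bézier curve.

```python
import math


def loss(theta):
    """Toy loss with a ring-shaped valley: zero everywhere on the unit circle."""
    return (math.hypot(theta[0], theta[1]) - 1.0) ** 2


def linear_path(theta_a, theta_b, t):
    """Straight line from theta_a (t=0) to theta_b (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def bezier_path(theta_a, theta_c, theta_b, t):
    """Quadratic Bezier curve with control point theta_c."""
    return [(1 - t) ** 2 * a + 2 * t * (1 - t) * c + t ** 2 * b
            for a, c, b in zip(theta_a, theta_c, theta_b)]


def max_loss_on_path(path, steps=100):
    """Grid approximation of max over t in [0, 1] of L(theta(t))."""
    return max(loss(path(i / steps)) for i in range(steps + 1))


theta_a, theta_b = [1.0, 0.0], [-1.0, 0.0]   # two minima on the ring
theta_c = [0.0, 2.0]                         # hypothetical control point

linear_max = max_loss_on_path(lambda t: linear_path(theta_a, theta_b, t))
bezier_max = max_loss_on_path(lambda t: bezier_path(theta_a, theta_c, theta_b, t))
print(f"linear path max loss: {linear_max:.3f}")
print(f"bezier path max loss: {bezier_max:.3f}")
```

In this toy landscape the straight line crosses the high-loss centre of the ring, while the curved path stays close to the valley, which is the situation that curve-based connectors are designed to handle.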

2.2 Alignment and Matching

Owing to the permutation symmetry of neural network architectures (especially ReLU and transformer-based models), directly averaging weights at corresponding positions often fails: the parameters at a given index may represent functionally different features or neurons in each model. Alignment (sometimes called “matching”) seeks a discrete or continuous transformation (e.g., a permutation matrix $P$ or an orthogonal rotation $R$) that reorders or rotates the weights and biases so that units with the same function are matched before averaging (Li et al., 2023, Zhang et al., 1 Feb 2025). This involves combinatorial optimization (e.g., the Hungarian algorithm or Sinkhorn relaxation), activation-based cost metrics, and, in transformers, rotation-symmetry alignment that exploits the continuous orthogonal group to obtain closed-form SVD-based matching.
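The effect of permutation symmetry on averaging can be seen in a very small example. The sketch below is illustrative only: the one-hidden-layer network, its weights, and the helper names are assumptions, and an exhaustive search over permutations stands in for the Hungarian algorithm, which would be used at realistic widths. Model B is a unit-shuffled copy of model A, so both compute the same function.

```python
from itertools import permutations


def forward(model, x):
    """One-hidden-layer ReLU network: y = w2 . relu(W1 x + b1)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(model["W1"], model["b1"])]
    return sum(w * h for w, h in zip(model["w2"], hidden))


def permute_hidden(model, perm):
    """Reorder hidden units: unit perm[i] of the input becomes unit i."""
    return {"W1": [model["W1"][j] for j in perm],
            "b1": [model["b1"][j] for j in perm],
            "w2": [model["w2"][j] for j in perm]}


def unit_features(model, j):
    """All parameters attached to hidden unit j, as one flat list."""
    return model["W1"][j] + [model["b1"][j], model["w2"][j]]


def align(reference, other):
    """Brute-force the permutation of `other` closest to `reference` (weight-space cost)."""
    def cost(perm):
        return sum((a - b) ** 2
                   for i, j in enumerate(perm)
                   for a, b in zip(unit_features(reference, i), unit_features(other, j)))
    best = min(permutations(range(len(reference["b1"]))), key=cost)
    return permute_hidden(other, best)


def average(m1, m2):
    """Element-wise mean of two models with identical shapes."""
    return {"W1": [[(a + b) / 2 for a, b in zip(r1, r2)]
                   for r1, r2 in zip(m1["W1"], m2["W1"])],
            "b1": [(a + b) / 2 for a, b in zip(m1["b1"], m2["b1"])],
            "w2": [(a + b) / 2 for a, b in zip(m1["w2"], m2["w2"])]}


model_a = {"W1": [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]],
           "b1": [0.0, 0.5, -0.5],
           "w2": [1.0, -2.0, 3.0]}
model_b = permute_hidden(model_a, (2, 0, 1))  # same function, shuffled units

x = [0.3, 1.2]
naive = average(model_a, model_b)
aligned = average(model_a, align(model_a, model_b))
print(forward(model_a, x), forward(naive, x), forward(aligned, x))
```

Here the naive average changes the output (from about -1.9 to about 0.725 on this input), whereas averaging after alignment recovers the original function. With independently trained models the match is only approximate, and activation-based costs are a common alternative to the weight-space cost used here.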

2.3 Weight Averaging

Simple weight averaging takes the arithmetic mean of the parameters:

$\theta_{\text{avg}} = \frac{1}{K} \sum_{k=1}^K \theta_k,$

or, more generally, a weighted sum $\theta_{\text{weighted}} = \sum_{k=1}^K \lambda_k \theta_k$ with $\lambda_k \ge 0$ and $\sum_{k=1}^K \lambda_k = 1$. Stochastic Weight Averaging (SWA) averages checkpoints along a single model’s training trajectory, while model soups average across related fine-tuned solutions (Li et al., 2023). Bayesian Model Averaging takes a probabilistic view, seeking $\theta^*$ that maximizes the aggregated posterior under Gaussian/Fisher approximations.
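The averaging formulas above translate almost directly into code. The sketch below assumes each model is a dictionary mapping a parameter name to a flat list of floats (a simplified stand-in for a framework state dictionary); the function name and the example values are hypothetical.

```python
def weighted_average(models, coeffs=None):
    """Convex combination of parameter dicts that share keys and shapes.

    With coeffs=None this is the uniform average (1/K) * sum_k theta_k.
    """
    k = len(models)
    if coeffs is None:
        coeffs = [1.0 / k] * k
    if len(coeffs) != k:
        raise ValueError("need exactly one coefficient per model")
    if any(c < 0 for c in coeffs) or abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")

    fused = {}
    for name in models[0]:
        # zip(*...) walks the same parameter position across all K models.
        columns = zip(*(m[name] for m in models))
        fused[name] = [sum(c * v for c, v in zip(coeffs, col)) for col in columns]
    return fused


models = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}, {"w": [5.0, 1.0]}]
uniform = weighted_average(models)
weighted = weighted_average(models, [0.5, 0.25, 0.25])
print(uniform, weighted)
```

The explicit check on the coefficients keeps the result inside the convex hull of the parent parameters, which is the setting assumed by the weighted-sum formula.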

2.4 Multi-objective and Bayesian Optimization-guided Fusion

Recent advances introduce multi-objective Bayesian optimization (MOBO) to parameter-space fusion, optimizing fusion coefficients over model checkpoints to jointly minimize validation loss and maximize task-specific metrics. The BOMF framework formalizes this as a constrained black-box MOBO problem over the simplex of convex coefficients, utilizing GP surrogates and hypervolume-based Pareto optimization (Jang et al., 11 Nov 2024).
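The structure of this search problem (coefficients on the simplex, two competing objectives, a Pareto set of trade-offs) can be sketched without the surrogate machinery. The example below is not BOMF: plain random sampling replaces the GP surrogate and hypervolume-based acquisition, and the checkpoints and both objective functions are hypothetical toy quantities standing in for a validation loss and a task-metric error.

```python
import random


def sample_simplex(k, rng):
    """Uniform draw from the (k-1)-simplex via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def fuse(checkpoints, coeffs):
    """theta = sum_k lambda_k * theta_k, applied per coordinate."""
    return [sum(c * p for c, p in zip(coeffs, params)) for params in zip(*checkpoints)]


def objectives(theta):
    """Two toy objectives to minimize (assumed, for illustration only)."""
    val_loss = sum((t - a) ** 2 for t, a in zip(theta, (1.0, 0.0)))
    metric_err = sum((t - a) ** 2 for t, a in zip(theta, (0.0, 1.0)))
    return (val_loss, metric_err)


def dominates(p, q):
    """p dominates q if it is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(candidates):
    """Keep the (coefficients, objectives) pairs that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o[1], c[1]) for o in candidates)]


rng = random.Random(0)
checkpoints = [[1.2, -0.1], [-0.2, 0.9], [0.8, 0.8]]  # hypothetical checkpoints
candidates = []
for _ in range(200):
    lam = sample_simplex(len(checkpoints), rng)
    candidates.append((lam, objectives(fuse(checkpoints, lam))))

front = pareto_front(candidates)
print(f"{len(front)} non-dominated coefficient vectors out of {len(candidates)}")
```

A Bayesian optimizer would choose each new coefficient vector from a surrogate model rather than at random, which matters when every evaluation requires a full validation pass over a large model.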

3. Parameter-Space Fusion in Broader Inference, Filtering, and Statistical Integration

Parameter-space fusion transcends standard deep learning and is foundational to coherent multi-source inference in statistical and sensor fusion contexts.

Monte Carlo Fusion introduces exact methods for fusing independent sub-posteriors $p_c(\theta)$, yielding fused posteriors $p(\theta) \propto \prod_c p_c(\theta)$. This is accomplished by rejection sampling in an extended parameter space defined via Brownian or Ornstein-Uhlenbeck bridge transitions, recovering product-of-experts targets without approximation bias (Dai et al., 2019).
Multiple Particle Filtering (MPF) with Parameter Fusion supports Bayesian filtering in separable high-dimensional state-space models. Each partition estimates a marginal for a shared parameter $\theta_g$; the global fused marginal is $p(\theta_g|y_{1:t}) \propto \prod_{k=1}^K \pi_{k,t}(\theta_g) / \rho_{t-1}(\theta_g)^{K-1}$, leveraging independence for efficient joint inference (Zhao et al., 31 Oct 2024).
Markov Random Field (MRF)-based Fusion in sensor networks replaces intractable joint parameter likelihoods with accurate, pairwise separable pseudo-likelihoods, enabling scalable particle-based belief propagation on the fusion MRF (Uney et al., 2017).
Nonparametric Fusion via Depth Confidence Distributions constructs inference on fused parameters by combining centrality functions derived from study-specific bootstrapped depth-CDs, achieving uniform confidence guarantees and Bahadur efficiency without parametric constraints (Liu et al., 2020).
Mean Structure Fusion with GMM Penalties exploits a quadratic inference framework and group fusion penalties to identify homogeneous parameter blocks across multi-site studies, optimizing via ADMM for efficient pooled inference (Hector, 2022).
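The two product formulas above, $p(\theta) \propto \prod_c p_c(\theta)$ and $\prod_k \pi_{k,t}(\theta_g) / \rho_{t-1}(\theta_g)^{K-1}$, have closed forms when every factor is a one-dimensional Gaussian, which makes them easy to check numerically. The sketch below covers only that special case; it is not the rejection sampler of Monte Carlo Fusion or a particle filter. It also assumes that $\rho_{t-1}$ is a shared density already contained in each of the $K$ marginals, so dividing by its $(K-1)$-th power removes the over-counting. Function names and numbers are hypothetical.

```python
def fuse_gaussians(sub_posteriors):
    """Product of 1-D Gaussians (mean, var): precisions add, means are precision-weighted."""
    precision = sum(1.0 / var for _, var in sub_posteriors)
    mean = sum(m / var for m, var in sub_posteriors) / precision
    return mean, 1.0 / precision


def fuse_with_shared_prior(marginals, prior):
    """Product of K Gaussian marginals divided by the shared Gaussian prior^(K-1)."""
    k = len(marginals)
    prior_mean, prior_var = prior
    precision = sum(1.0 / v for _, v in marginals) - (k - 1) / prior_var
    if precision <= 0:
        raise ValueError("fused density is not normalizable")
    eta = sum(m / v for m, v in marginals) - (k - 1) * prior_mean / prior_var
    return eta / precision, 1.0 / precision


# Plain product of two sub-posteriors N(0, 1) and N(2, 1).
product_mean, product_var = fuse_gaussians([(0.0, 1.0), (2.0, 1.0)])

# Two partitions share the prior N(0, 4); each saw one unit-variance observation
# (1.0 and 3.0), giving the local marginals N(0.8, 0.8) and N(2.4, 0.8).
prior = (0.0, 4.0)
marginals = [(0.8, 0.8), (2.4, 0.8)]
fused_mean, fused_var = fuse_with_shared_prior(marginals, prior)

# Reference: condition on both observations at once (prior counted a single time).
joint_mean, joint_var = fuse_gaussians([prior, (1.0, 1.0), (3.0, 1.0)])
print(fused_mean, joint_mean)
```

In this example the corrected product agrees with conditioning on both observations jointly, whereas multiplying the two marginals without the correction would count the prior twice.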

4. Architectural and Methodological Innovations

Specialized frameworks and extensions address fusion in heterogeneous, high-dimensional, or parameter-efficient contexts.

Rotation Symmetry in Transformers generalizes discrete permutation symmetry to continuous rotations of the self-attention projections. This admits a closed-form, globally optimal alignment and greatly enlarges the class of functionally equivalent parameterizations, thereby reducing both the loss barrier between fused models and the alignment cost (Zhang et al., 1 Feb 2025).
Dynamic Permutation/Fusion (AutoFusion) dispenses with hand-crafted heuristics by learning layerwise (soft) neuron permutations via the Sinkhorn operator in an unsupervised, end-to-end setup, combining weight-space alignment and pseudo-label retention losses for broad multi-task adaptability (Tian et al., 8 Oct 2024).
Partial Linearization for Adapter Fusion in parameter-efficient fine-tuning linearizes only the adapter modules (“L-LoRA”) before fusion, preserving task disentanglement and outperforming naïve addition or full tangent-space approximations within the constraints of PEFT (Tang et al., 2023).
Heterogeneous Multi-source Fusion via Input Mapping and LVGP handles incompatible parameter spaces across data sources by mapping all sources into a reference space via calibrated affine transformations, then fusing with a latent-variable Gaussian process over both quantitative input and qualitative source embeddings (Comlek et al., 15 Jul 2024).
Multi-fidelity Gaussian Process Fusion with Parameter Reduction employs linear active subspaces or nonlinear invertible transforms to reduce intrinsic parameter dimensionality prior to fusion, enabling data-efficient high-fidelity surrogate modeling in computational engineering regimes (Romor et al., 2021).
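The rotation symmetry mentioned in the first item of this list can be checked numerically. The sketch below assumes a single attention head with head dimension 2 and small hypothetical projection matrices; it shows that rotating the query and key projections by the same orthogonal matrix $R$ leaves the pre-softmax scores $(XW_Q)(XW_K)^\top$ unchanged, because $RR^\top = I$.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rotation(angle):
    """2 x 2 rotation matrix, an element of the orthogonal group."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def attention_scores(x, w_q, w_k):
    """Pre-softmax scores (X W_Q)(X W_K)^T for one head."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    return matmul(q, transpose(k))


x = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]          # 2 tokens, model dimension 3
w_q = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]     # 3 x 2 query projection
w_k = [[0.1, 0.7], [-0.3, 0.2], [0.6, -0.4]]     # 3 x 2 key projection

r = rotation(0.9)
original = attention_scores(x, w_q, w_k)
rotated = attention_scores(x, matmul(w_q, r), matmul(w_k, r))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(original, rotated)
               for a, b in zip(row_a, row_b))
print(f"largest score difference after rotation: {max_diff:.2e}")
```

Because every such $R$ yields the same function, two trained models may differ by an arbitrary rotation in these projections, which is why rotating one model toward the other before averaging can lower the loss barrier.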

5. Practical Applications, Comparative Performance, and Empirical Insights

Parameter-space fusion achieves demonstrable gains in diverse settings:

Language and Vision Model Fusion: Rotation symmetry-based alignment in transformers yields F1 and accuracy improvements across emotion and NER tasks, and large boosts in ViT-based classification, compared to (un)matched simple/Fisher/OT-fusion baselines (Zhang et al., 1 Feb 2025).
Multi-task and Adapter Fusion: Partial linearization (L-LoRA) achieves higher normalized multi-task performance as more tasks are fused, outperforming standard LoRA/adapter merging, and maintains orthogonality and disentanglement among task vectors (Tang et al., 2023).
Autonomous Engineering Design and Data Fusion: Heterogeneous mapping plus LVGP fusion reduces normalized RMSE by 10–68% over non-fused baselines in three engineering tasks and provides quantifiable measures of inter-source similarity (Comlek et al., 15 Jul 2024), while multi-fidelity GP fusion leveraging active subspaces increases surrogate accuracy with substantial cost savings (Romor et al., 2021).
Sensor Network Calibration and Statistical Integration: MPF with fusion accelerates mean-square error convergence in state and parameter estimation compared to prior filtering methods (Zhao et al., 31 Oct 2024), and nonparametric depth-CD aggregation attains high-order accuracy and robust, interpretable error estimates in heterogeneous or incomplete studies (Liu et al., 2020).

6. Limitations, Open Challenges, and Future Research Directions

Unresolved obstacles span optimization, scalability, heterogeneity, and robustness:

Scalability in Extreme Dimensionality: Efficient, interpretable path-finding and alignment algorithms must be developed for billion-scale parameter tensors (Li et al., 2023).
Model Interference: Negative interference persists when fusing semantically divergent or highly specialized models; penalty-based and meta-learned mechanisms represent active directions to suppress such conflicts (Li et al., 2023).
Heterogeneous Architectures: Adapting permutation and rotation-based alignment to architectures with varying widths, skip connections, adapters, MoE blocks, or differing attention head structures remains largely unsolved (Li et al., 2023).
Automatic Hyperparameter and Weight Selection: Optimal fusion coefficient selection and meta-learning to predict fusibility are open areas, especially for large $K$ and highly nonconvex loss landscapes (Jang et al., 11 Nov 2024).
Integration with Privacy and Compression: Layered and sparse fusion schemes targeting federated and edge settings, with privacy and memory/resource efficiency constraints, warrant further exploration (Li et al., 2023).
Theoretical Guarantees: While some methods (e.g., Monte Carlo Fusion, depth-CD aggregation, rotation-symmetry alignment) offer rigorous optimality or consistency guarantees, general theory for empirical fusion in deep nets lags, especially in the presence of non-orthogonal or non-identically distributed models.

7. Schematic Comparison of Parameter-Space Fusion Families

Fusion Family	Strengths	Limitations
Mode Connectivity	Finds global low-loss paths, interpretable	High compute, path dimensionality, scaling issues
Alignment/Rotation/Permutation	Neuron matching, robust functional alignment	Combinatorial/matrix cost, architecture constraints
Weight Averaging	Simple, efficient, no retraining	Sensitive to basin overlap, may degrade accuracy
Bayesian Optimization/MOBO	Multi-metric, data-driven fusion weighting	Expensive, surrogate model limitations
Nonparametric/Depth-CD	Model-agnostic, high-order accuracy, heterogeneous	Computational intensity, dimension scalability
MPF/Filtering/MRF-based	Scalability for decomposable models	Requires separability, local independence

In sum, parameter-space model fusion comprises a rich, evolving field blending deep learning, statistical inference, optimization, and practical engineering. It continues to generate new theoretical, computational, and empirical insights, driving progress in multi-model, multi-source learning, robust adaptation, and resource-efficient deployment (Li et al., 2023, Zhang et al., 1 Feb 2025, Jang et al., 11 Nov 2024, Tang et al., 2023, Comlek et al., 15 Jul 2024, Zhao et al., 31 Oct 2024, Liu et al., 2020, Romor et al., 2021, Uney et al., 2017, Hector, 2022, Tian et al., 8 Oct 2024).