Parameter-Space Merging Techniques
- Parameter-space merging is a training-free technique that integrates specialized models by combining their parameters directly to form a unified network capable of multi-domain competence.
- It utilizes methods such as linear interpolation, geometry-aware merging, and sparse adapter strategies to balance retention of source-task performance with interference reduction.
- Practical implementations like MergePipe demonstrate improved scalability and efficiency by minimizing I/O overhead and maintaining inference cost parity with standard models.
Parameter-space merging refers to a family of training-free techniques that combine the parameters of multiple specialized models (often fine-tuned from a common base) into a single network capable of multi-domain or multi-task competence. These approaches operate directly in the weight space, contrasting with ensembling (which averages predictions at inference) or classical fine-tuning (which adapts parameters via gradient descent on new data). Parameter-space merging spans simple linear interpolation, functionally-aware geometric approaches, block/adapter-based constructions, and scalable systems for efficient model fusion.
1. Foundations and Motivation
The primary objective of parameter-space merging is to efficiently integrate specialized capabilities from multiple expert models into a unified backbone, avoiding the computational and data costs of joint multi-task retraining. For LLMs and vision transformers, this permits rapid incorporation of domain experts, seamless extension to new tasks, and significant inference-time efficiency: a merged model runs at the cost of a single network, whereas an ensemble must evaluate every member (Wang et al., 5 Feb 2026, Lin et al., 12 Feb 2026, Junhao et al., 8 Mar 2025).
A central technical challenge is to balance retention of source-task performance with suppression of destructive interference. Simple linear averaging works only when experts are “close” in parameter space and can easily fail for models from distinct loss basins or under architectural symmetries (Junhao et al., 8 Mar 2025, Silva et al., 29 Apr 2026). Consequently, parameter-space merging research has evolved diverse methodology classes, from classic arithmetic to advanced function-geometry-aware and data-guided techniques.
2. Linear, Arithmetic, and Sparse Parameter-Space Merging
The most direct class of parameter-space mergers includes linear interpolation, “task arithmetic,” and their sparse/pruned variants:
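- Linear Interpolation: The merged model is constructed as $\theta_{\text{merged}} = \alpha\,\theta_A + (1-\alpha)\,\theta_B$ with scalar weight $\alpha \in [0, 1]$. Task arithmetic generalizes this to sum the difference vectors from a common base, i.e., $\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_i (\theta_i - \theta_{\text{base}})$ (Lu et al., 30 Jan 2026, Cao et al., 28 Feb 2026, Junhao et al., 8 Mar 2025).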
- Sparse and Pruned Methods: To address interference and redundancy, sparsification heuristics are applied to the task vector so that only a subset of "delta" parameters is kept. TIES (Lu et al., 30 Jan 2026) trims small-magnitude components and resolves sign conflicts between experts, while DARE (Lu et al., 30 Jan 2026, Zhou et al., 21 Feb 2025) randomly drops delta parameters and rescales the survivors. Mixup Model Merge (M³) introduces randomized linear interpolation using a coefficient sampled from a Beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, improving robustness and diversity (Zhou et al., 21 Feb 2025).
- Sensitivity-Guided Weighting: Sens-Merging (Liu et al., 18 Feb 2025) augments arithmetic approaches with parameter-wise scaling based on task-specific importance (measured by gradient or activation sensitivity) and cross-task transferability, allowing layer- and expert-dependent contribution ratios.
- Sparse Complementary Fusion (SCF-RKL): This advanced variant uses reverse KL divergence between output distributions as an information-theoretic saliency signal, applying sparse updates only where functional divergence is largest, thus reducing interference and preserving dominant predictive modes (Lin et al., 12 Feb 2026).
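The two basic rules in the first bullet, linear interpolation and task arithmetic, can be sketched in a few lines of plain Python. This is a minimal illustration, not any paper's implementation: the `StateDict` type (parameter name mapped to a flat list of floats) and the function names `interpolate` and `task_arithmetic` are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def interpolate(theta_a: StateDict, theta_b: StateDict, alpha: float) -> StateDict:
    """Linear interpolation: alpha * theta_a + (1 - alpha) * theta_b, per parameter."""
    return {
        name: [alpha * a + (1.0 - alpha) * b for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def task_arithmetic(base: StateDict, experts: List[StateDict], lam: float) -> StateDict:
    """Task arithmetic: base + lam * sum_i (expert_i - base), per parameter."""
    merged = {}
    for name, base_vals in base.items():
        merged[name] = [
            b + lam * sum(expert[name][j] - b for expert in experts)
            for j, b in enumerate(base_vals)
        ]
    return merged


base = {"w": [1.0, 2.0]}
expert_a = {"w": [2.0, 2.0]}  # fine-tuning moved only the first coordinate
expert_b = {"w": [1.0, 4.0]}  # fine-tuning moved only the second coordinate

midpoint = interpolate(expert_a, expert_b, alpha=0.5)
summed = task_arithmetic(base, [expert_a, expert_b], lam=1.0)
```

In this toy case the two task vectors touch disjoint coordinates, so task arithmetic with `lam=1.0` keeps both edits in full, while the midpoint halves each of them.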
Table: Example Merge Formulas
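| Method | Equation | Core Principle |
|---|---|---|
| Linear | $\theta_{\text{merged}} = \alpha\,\theta_A + (1-\alpha)\,\theta_B$ | Equal or tuned averaging |
| Task Arithmetic | $\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_i (\theta_i - \theta_{\text{base}})$ | Baseline-referenced update |
| Sens-Merging | $\theta_{\text{merged}} = \theta_{\text{base}} + \sum_i \sigma_i\,(\theta_i - \theta_{\text{base}})$, with sensitivity-derived $\sigma_i$ per layer | Sensitivity-scaled layerwise |
| SCF-RKL | Sparse update gated by $M$, where $M$ = RKL-saliency mask | Sparse, distribution-aware |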
While simple arithmetic merges suffice under limited heterogeneity, they are highly sensitive to expert distance, local loss geometry, and effective parameter subspace—increasing the risk of interference as the number of merged experts grows (Wang et al., 27 May 2025, Junhao et al., 8 Mar 2025).
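Two sparsification heuristics of the kind referenced above can be sketched as follows: a magnitude-based trim (a simplified version of the trimming step used by TIES-style methods, without sign election) and a random drop-and-rescale step in the spirit of DARE. The function names and the toy task vector are assumptions for illustration only.

```python
import random
from typing import List


def magnitude_trim(delta: List[float], keep_fraction: float) -> List[float]:
    """Keep the largest-magnitude entries of a task vector and zero the rest."""
    k = max(1, round(keep_fraction * len(delta)))
    # Indices of the k entries with the largest absolute value.
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(ranked[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def drop_and_rescale(delta: List[float], drop_prob: float, rng: random.Random) -> List[float]:
    """Zero each entry with probability drop_prob; rescale survivors by 1 / (1 - drop_prob)."""
    scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else d * scale for d in delta]


# Toy task vector (expert weights minus base weights).
delta = [0.9, -0.05, 0.02, -0.7, 0.01, 0.3]

trimmed = magnitude_trim(delta, keep_fraction=0.5)

rng = random.Random(0)  # seeded so the random mask is reproducible
dropped = drop_and_rescale(delta, drop_prob=0.5, rng=rng)
```

The rescaling in `drop_and_rescale` keeps the expected value of each entry unchanged, which is why a large share of the delta can be dropped before the sparse task vectors are summed.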
3. Geometry- and Function-Aware Merging
For greater robustness across diverse experts, merging methods leverage the intrinsic geometry of models or the function-space distance between their output distributions.
- Fisher–Rao Geometry: The Fisher–Rao metric equips parameter space with a Riemannian structure such that geodesic distance approximates the expected KL divergence between output distributions. Merging as the Karcher (Fréchet) mean on this manifold, typically solved via a fixed-point or gradient-descent iteration using spherical proxies, minimizes average KL divergence to all experts. This approach preserves norm and functional variance, directly mitigates activation collapse, and generalizes to $N$-way merges without bespoke scheduling (Wang et al., 5 Mar 2026, Silva et al., 29 Apr 2026).
- Fréchet/Aggregation on Quotient Manifolds: For models with architectural symmetries (e.g., LoRA low-rank adapters), naive Euclidean averaging fails to respect gauge symmetry. Merging on quotient manifolds factors out these symmetries, aligning expert orbits via Procrustes analysis and then averaging in the geometry-invariant subspace. This yields symmetry-consistent, data-free, and geometry-aware merges (Silva et al., 29 Apr 2026).
- Directional-Consistent Merging (DC-Merge): Merges are performed in the subspace aligned to the dominant singular vectors (knowledge components) of each expert, smoothing the energy distribution (singular values) and preserving the directional geometry. This approach strictly maintains the representational axes underlying functional knowledge transfer and consistently outperforms prior approaches in high-task-count vision and vision-language fusions (Zhang et al., 6 Mar 2026).
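The gauge symmetry mentioned for LoRA adapters can be made concrete with a rank-1 toy case. The sketch below is not the quotient-manifold method itself; it only illustrates, under assumed toy values, why averaging the low-rank factors directly is ill-posed: two factorizations of the same update average to a different update.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def average(x: Matrix, y: Matrix) -> Matrix:
    """Elementwise mean of two equally shaped matrices."""
    return [[(a + b) / 2.0 for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Rank-1 adapter: delta_W = B @ A with B of shape (2, 1) and A of shape (1, 2).
B1, A1 = [[1.0], [2.0]], [[3.0, 4.0]]

# Gauge-equivalent copy: scale B by g and A by 1/g; the product is unchanged.
g = 4.0
B2 = [[b * g for b in row] for row in B1]
A2 = [[a / g for a in row] for row in A1]

delta_1 = matmul(B1, A1)
delta_2 = matmul(B2, A2)

# Averaging the factors does not commute with taking the product.
naive = matmul(average(B1, B2), average(A1, A2))
# Averaging the products (a gauge-invariant quantity) recovers the shared update.
invariant = average(delta_1, delta_2)
```

Here `delta_1` and `delta_2` are identical, yet `naive` differs from both, whereas `invariant` equals the shared update. Symmetry-aware methods address the same problem for general invertible gauge transforms rather than a single scalar.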
These geometry-aware techniques demonstrate greater stability, reduced variance collapse, and preserved functional capacity, even as task heterogeneity and the number of merged models increase.
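The fixed-point iteration for a Karcher mean on a spherical proxy can be sketched as below. This is a minimal illustration under strong assumptions: each expert is reduced to a unit-norm direction, the unit sphere stands in for the Fisher–Rao manifold, and all function names are hypothetical. It is not the published algorithm.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_map(p: Vector, q: Vector) -> Vector:
    """Tangent vector at p pointing toward q, with length equal to the geodesic distance."""
    cos_t = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    # Component of q orthogonal to p.
    ortho = [b - cos_t * a for a, b in zip(p, q)]
    ortho_norm = math.sqrt(sum(x * x for x in ortho))
    if ortho_norm < 1e-12:  # coincident or antipodal points: no unique direction
        return [0.0] * len(p)
    theta = math.acos(cos_t)
    return [theta * x / ortho_norm for x in ortho]


def exp_map(p: Vector, v: Vector) -> Vector:
    """Walk from p along the great circle in direction v for a distance |v|."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-12:
        return list(p)
    return [math.cos(norm) * a + math.sin(norm) * x / norm for a, x in zip(p, v)]


def karcher_mean(points: List[Vector], iters: int = 50, tol: float = 1e-10) -> Vector:
    """Fixed-point iteration: average the log maps, step along the exp map, repeat."""
    # Warm start from the normalized Euclidean mean.
    mean = normalize([sum(col) / len(points) for col in zip(*points)])
    for _ in range(iters):
        logs = [log_map(mean, q) for q in points]
        step = [sum(col) / len(points) for col in zip(*logs)]
        mean = normalize(exp_map(mean, step))
        if math.sqrt(sum(x * x for x in step)) < tol:
            break
    return mean


# Three toy "expert directions" on the unit circle, at angles 0.0, 0.2 and 1.0 rad.
experts = [[math.cos(a), math.sin(a)] for a in (0.0, 0.2, 1.0)]

merged_direction = karcher_mean(experts)
merged_angle = math.atan2(merged_direction[1], merged_direction[0])
euclidean_norm = math.sqrt(sum((sum(col) / len(experts)) ** 2 for col in zip(*experts)))
```

On a circle the Karcher mean of nearby points is the point at the mean angle, so `merged_angle` lands at 0.4 rad with unit norm, while the plain Euclidean average of the same vectors has norm below 1. That shrinkage is the kind of norm loss the geometric mean avoids.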
4. Algorithmic, Block, and Adapter-Level Techniques
Parameter-space merging extends beyond layerwise or all-parameter rules to encompass multi-granular and adapter-specific strategies:
- Block-Wise and Adapter Merging: Techniques such as Core Space (Panariello et al., 22 Sep 2025, Cao et al., 28 Feb 2026) and dynamic core-space MoE (CoMoL) (Cao et al., 28 Feb 2026) reframe low-rank adaptation and mixture-of-experts (MoE) architectures within a small core subspace. This permits efficient and accurate merging using SVD-aligned bases, combining multiple LoRA experts with minimal computational and storage overhead and scaling to models of 8B parameters and beyond.
- Sparse Adapters: Highly sparse, SNIP- or connection-sensitivity-selected adapters enable the merging of many experts (up to 20) via elementwise averaging, achieving superior in-distribution performance to LoRA or full fine-tuning, and competitive held-out task performance (Arnob et al., 9 Jul 2025).
- Block-Wise Search and AutoMerge: When models span multiple structural domains (e.g., CNN, transformer, MLP), per-block hyperparameter search (using Bayesian optimization) yields robust merges that greatly outperform whole-model schemes, especially beyond LLMs (e.g., in CCT for vision, Interfuser for autonomous driving) (Lu et al., 30 Jan 2026).
- Representation-Level Correction: Once parameter-space merging is complete, closed-form linear transformations at the representation layer can be used for Pareto-optimal trade-off adjustment. These can rapidly adapt the merged model to new preference vectors with linear-in-task complexity (Wu et al., 14 Nov 2025).
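The idea of per-block coefficient search can be sketched with a toy objective. Note the swap: the text describes Bayesian optimization, whereas this sketch uses a simple coordinate-wise grid search to stay dependency-free. The block names, the `loss` function, and the `target` values are all hypothetical placeholders for a real validation metric.

```python
from typing import Callable, Dict, List

Blocks = Dict[str, List[float]]


def merge_blocks(a: Blocks, b: Blocks, alphas: Dict[str, float]) -> Blocks:
    """Interpolate each block with its own coefficient alphas[name]."""
    return {
        name: [alphas[name] * x + (1.0 - alphas[name]) * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def blockwise_search(a: Blocks, b: Blocks, loss: Callable[[Blocks], float],
                     grid: List[float], sweeps: int = 2) -> Dict[str, float]:
    """Coordinate-wise grid search: tune one block's alpha while the others stay fixed."""
    alphas = {name: 0.5 for name in a}
    for _ in range(sweeps):
        for name in a:
            alphas[name] = min(
                grid, key=lambda c: loss(merge_blocks(a, b, {**alphas, name: c}))
            )
    return alphas


# Toy experts with two structural blocks; expert_a is strong on "conv", expert_b on "attn".
expert_a = {"conv": [1.0, 1.0], "attn": [0.0, 0.0]}
expert_b = {"conv": [0.0, 0.0], "attn": [1.0, 1.0]}
target = {"conv": [1.0, 1.0], "attn": [0.25, 0.25]}  # hypothetical ideal merged weights


def loss(merged: Blocks) -> float:
    """Stand-in for a validation loss: squared distance to the hypothetical target."""
    return sum((m - t) ** 2 for name in merged for m, t in zip(merged[name], target[name]))


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
alphas = blockwise_search(expert_a, expert_b, loss, grid)
blockwise_loss = loss(merge_blocks(expert_a, expert_b, alphas))
uniform_loss = min(
    loss(merge_blocks(expert_a, expert_b, {"conv": c, "attn": c})) for c in grid
)
```

In this toy setup no single shared coefficient fits both blocks, so `uniform_loss` stays above zero while the per-block search reaches the target exactly.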
5. Scalability, System Design, and Budget-Aware Merging
Scalability to high expert count and large-model sizes introduces practical bottlenecks due to disk throughput, memory, redundancy, and task interference. Recent advances have reframed model merging as a systems problem:
- MergePipe (LLM-Specific): MergePipe treats parameter-space merging as a catalog-driven data management problem. Each checkpoint is partitioned into fixed-size blocks that are tracked in a transactional catalog. Its cost-aware planner and streaming execution engine explicitly model expert block I/O costs, enforce a user-specified I/O budget, and greedily select which blocks to touch based on conflict and cost signals. MergePipe avoids the I/O blowup that stateless pipelines incur as the number of experts increases, with reported reductions of up to 10–20× in expert-read I/O, an order-of-magnitude cut in total I/O, and end-to-end speedups of up to 11× compared to stateless pipelines (Wang et al., 5 Feb 2026).
Table: MergePipe Cost Components
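| Component | Definition |
|---|---|
| Base read | Cost of reading the base model (fixed) |
| Output write | Cost of writing the merged output (fixed) |
| Metadata | Catalog and metadata cost (fixed/bounded) |
| Expert read | Cost of reading expert blocks; controlled, the major scaling cost and the target for budgeting |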
Through transactional guarantees and block-level planning, MergePipe ensures auditability, robustness, and efficiency at LLM scales and is operator/merge-algorithm agnostic.
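Budgeted block selection of this kind can be sketched as a greedy, knapsack-style heuristic. This is an illustration only: the source does not specify MergePipe's planner in detail, so the `ExpertBlock` fields, the conflict-per-cost ranking, and the `plan_reads` function are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExpertBlock:
    expert: str      # which expert checkpoint the block belongs to
    block_id: int    # index of the fixed-size block within that checkpoint
    read_cost: int   # hypothetical I/O cost of reading the block (arbitrary units)
    conflict: float  # hypothetical score: how much this block matters to the merge


def plan_reads(blocks: List[ExpertBlock], budget: int) -> List[ExpertBlock]:
    """Greedily pick blocks by conflict per unit cost until the read budget is spent."""
    ranked = sorted(blocks, key=lambda b: b.conflict / b.read_cost, reverse=True)
    plan: List[ExpertBlock] = []
    spent = 0
    for block in ranked:
        if spent + block.read_cost <= budget:
            plan.append(block)
            spent += block.read_cost
    return plan


catalog = [
    ExpertBlock("math", 0, read_cost=4, conflict=0.9),
    ExpertBlock("math", 1, read_cost=4, conflict=0.1),
    ExpertBlock("code", 0, read_cost=4, conflict=0.6),
    ExpertBlock("code", 1, read_cost=8, conflict=0.4),
]

plan = plan_reads(catalog, budget=8)
selected = [(b.expert, b.block_id) for b in plan]
```

Blocks left out of the plan are never read from disk, which is how a fixed budget keeps expert-read I/O bounded as more experts are added to the catalog.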
6. Theoretical Guarantees, Limitations, and Scaling Laws
Parameter-space merging is subject to theoretical limitations on scalability, interference, and informational capacity:
- Subspace Saturation and Diminishing Returns: As the number of merged experts increases, the effective parameter space (as measured by ellipsoidal Gaussian width or statistical dimension) rapidly saturates, with the marginal benefit of each new expert diminishing strictly concavely (Wang et al., 27 May 2025). There exists a unique threshold, determined by the loss landscape and model dimension, beyond which additional experts cannot be accommodated without performance collapse.
- Function-space and Ensemble Equivalence: Under smooth loss and bounded Hessian, the excess risk of parameter-space merging is provably controlled by that of prediction-level ensembling, with a quadratic dependence on the offset between models (Li et al., 3 Mar 2025). Data-dependent, per-example coefficient selection networks (e.g., NeuLig) can nearly close the merge–ensemble performance gap.
- Unified Generalization Theory: Stability bounds for merging under heterogeneous fine-tuning show that generalization error depends on both the stability (controlled by learning rate, batch size, and step count) and task heterogeneity. Actionable hyperparameter guidance emerges: small learning rates, moderate fine-tuning steps, large batches, and homogeneous tasks maximize merge-friendliness (Li et al., 29 Jan 2026).
- Limits and Defenses: Parameter-level merging can be proactively defended against (PaRaMS) by rearranging or scaling parameters while functionally preserving standalone accuracy; this pushes the protected model out of the shared basin and collapses post-merge performance for untrusted merges (Junhao et al., 8 Mar 2025).
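The symmetry that such defenses exploit can be shown on a tiny network. The sketch below is not the PaRaMS algorithm; it only demonstrates, with assumed toy weights, that permuting hidden units leaves a network's function unchanged while breaking naive parameter averaging with the original.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
MLP = Tuple[Matrix, List[float]]  # (hidden weights W1: hidden x input, output weights w2)


def forward(model: MLP, x: List[float]) -> float:
    """One-hidden-layer tanh network with a scalar output."""
    w1, w2 = model
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(h * o for h, o in zip(hidden, w2))


def permute_hidden(model: MLP, order: List[int]) -> MLP:
    """Reorder hidden units; rows of W1 and entries of w2 move together."""
    w1, w2 = model
    return [w1[i] for i in order], [w2[i] for i in order]


def average(a: MLP, b: MLP) -> MLP:
    """Naive elementwise parameter average of two networks."""
    w1 = [[(p + q) / 2.0 for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    w2 = [(p + q) / 2.0 for p, q in zip(a[1], b[1])]
    return w1, w2


model: MLP = ([[1.0, -2.0], [0.5, 3.0]], [2.0, -1.0])
protected = permute_hidden(model, [1, 0])  # same function, different coordinates
x = [0.3, -0.7]

# Standalone behaviour is preserved by the permutation...
same_function = abs(forward(model, x) - forward(protected, x))
# ...but averaging the original with its permuted copy changes the output.
merge_gap = abs(forward(model, x) - forward(average(model, protected), x))
```

Both networks compute the same output, yet their average does not, even though it merges a model with a functionally identical copy of itself. Merging methods that ignore such symmetries are exposed to this failure mode.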
7. Future Directions and Open Challenges
Persistent challenges include:
- Heterogeneous Architecture Merging: Existing methods predominantly target models with identical backbones; merging cross-architecture or cross-modal experts remains largely unsolved (Lu et al., 30 Jan 2026).
- Merge-Friendly Fine-Tuning: Theoretical work motivates fine-tuning protocols explicitly designed for stable downstream merging, but practical recipes for large-scale cross-domain settings are in early stages (Li et al., 29 Jan 2026).
- Functionally Guided and Representation-Aware Merging: Data- and embedding-driven approaches (e.g., ES-Merging) suggest avenues to further decrease interference and increase specialization by leveraging input- and representation-level signals rather than parameter-only heuristics (Lee et al., 15 Mar 2026).
- Scalable, Budgeted, and Auditable Systems: As the scale and diversity of expert models grows, demand for integrated data management systems, I/O budgeting, and transactional execution will intensify, exemplified by designs like MergePipe (Wang et al., 5 Feb 2026).
The field continues to evolve toward merging frameworks that balance theoretical guarantees, functional capacity, information-geometric principles, and practical tractability at scale.