Neural Merger: Integrating Neural Networks

Updated 23 December 2025
  • Neural merger is a systematic framework that fuses independently trained models to consolidate diverse knowledge while minimizing redundancy.
  • It employs various alignment techniques, including canonical correlation analysis and neuron-level adjustments, to address feature misalignment and interference.
  • Advanced implementations achieve near-ensemble performance with efficiency gains across CNNs, transformers, and MLPs in multi-task and multimodal settings.

A neural merger is a systematic methodology or framework for combining two or more independently trained neural network models into a single network that integrates their functionalities or knowledge representations, while typically minimizing redundancy, inference cost, or task interference. The process is applicable across a range of architectures—including CNNs, transformers, and multilayer perceptrons—and encompasses a spectrum of strategies from direct parameter alignment and averaging, to sophisticated neuron-wise, layer-wise, and subspace-based techniques. Modern neural merger methods address both the technical challenges of feature misalignment and destructive interference, as well as practical constraints related to efficiency, reliability, and scalability.

1. Foundations: Rationale, Loss Landscapes, and Challenges

Neural model fusion is motivated by the desire to capture the performance and knowledge diversity of model ensembles within a single, efficient architecture. Simple parameter averaging frequently fails due to non-convex, high-dimensional loss landscapes: distinct models typically reside in isolated parameter minima separated by high barriers, making weight interpolation suboptimal. Neural networks also exhibit permutation invariance within layers, complicating direct parameter-wise merges (Horoi et al., 7 Jul 2024).
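
To make the permutation-invariance point concrete, here is a minimal NumPy sketch with a hypothetical two-layer ReLU MLP (all sizes illustrative): relabeling hidden units, with the matching column permutation applied to the next layer, changes the parameter vector but not the function, which is why naive parameter-wise averaging of unaligned models can fail.

```python
import numpy as np

# Hypothetical two-layer ReLU MLP with randomly drawn parameters (illustrative sizes).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # layer 1: 8 -> 16
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)    # layer 2: 16 -> 4

def forward(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)        # hidden activations
    return W2 @ h + b2

P = rng.permutation(16)                     # relabel the 16 hidden units
x = rng.normal(size=8)

# Permuting layer 1's output rows and layer 2's input columns gives a different
# parameter vector but exactly the same function.
assert np.allclose(forward(x, W1, b1, W2, b2),
                   forward(x, W1[P], b1[P], W2[:, P], b2))
```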

Aligning models to facilitate merging thus requires explicit consideration of neuron correspondences and transformations. Recent findings show that features learned across independently trained networks often inhabit similar function spaces but are linearly entangled, necessitating alignment protocols beyond one-to-one permutation.

2. Alignment and Fusion Methodologies

Neural merging methodologies can be categorized by the granularity and flexibility of alignment.

a. Canonical Correlation Analysis (CCA) Merge

CCA Merge employs canonical correlation analysis to align neuron subspaces across models by maximizing the correlations of linear projections of activations, rather than enforcing strict one-to-one neuron matching. For each fusion layer, projection matrices are determined via singular value decomposition of scatter matrices; alignment is then performed by invertible linear transforms. The layer is merged via transformed averaging: $W_i^{\text{merged}} = \tfrac{1}{2}\left(W_i^A + T_i W_i^B T_{i-1}^{-1}\right)$, where $T_i$ aligns activations from model B to model A's subspace. This approach enables many-to-many integration of distributed features and is extended to more than two models by iterative all-to-one fusion (Horoi et al., 7 Jul 2024).
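
The condensed sketch below illustrates this style of CCA-based alignment together with the merging rule above. It is not the reference implementation: the regularized whitening (`eps`), the two-model restriction, and the bare least-squares-free SVD recipe are simplifying assumptions.

```python
import numpy as np

def cca_alignment(Za, Zb, eps=1e-4):
    """Return T so that T @ h_B approximates h_A for hidden activations (column
    vectors). Za, Zb are (samples x neurons) activation matrices of the two
    models on a shared input batch. Simplified sketch: regularized whitening + SVD."""
    Za, Zb = Za - Za.mean(0), Zb - Zb.mean(0)
    n = len(Za)

    def inv_sqrt(S):  # symmetric inverse square root of a covariance matrix
        vals, vecs = np.linalg.eigh(S + eps * np.eye(len(S)))
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Sa, Sb = inv_sqrt(Za.T @ Za / n), inv_sqrt(Zb.T @ Zb / n)
    U, _, Vt = np.linalg.svd(Sa @ (Za.T @ Zb / n) @ Sb)   # canonical directions
    Pa, Pb = Sa @ U, Sb @ Vt.T                            # projections to canonical coordinates
    return (Pb @ np.linalg.inv(Pa)).T                     # map B's features into A's frame

def merge_layer(W_A, W_B, T_out, T_in):
    """Transformed averaging as in the rule above: 0.5 * (W_A + T_out @ W_B @ inv(T_in))."""
    return 0.5 * (W_A + T_out @ W_B @ np.linalg.inv(T_in))
```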

b. Neuron-level and Subspace-based Merging

Methods such as Locate-then-Merge and NeuroMerging integrate models by decomposing parameter differences or task vectors at the neuron level. In Locate-then-Merge (Yu et al., 22 May 2025), critical neurons (those with large parameter deltas post-fine-tuning) are either fully reinstated or rescaled in the merged model, while widespread small perturbations are suppressed, mitigating catastrophic forgetting and interference.
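
A hedged sketch of this locate-then-suppress idea for plain weight tensors follows; the function name, the row-as-neuron convention, and the `keep_ratio`/`scale` knobs are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def locate_then_merge(base, finetuned, keep_ratio=0.05, scale=1.0):
    """Keep (or rescale) only the neurons with the largest fine-tuning deltas and
    suppress the widespread small perturbations. `base` and `finetuned` are state
    dicts with matching keys; rows of each tensor are treated as neurons
    (a simplifying convention for this sketch)."""
    merged = {}
    for name, w_base in base.items():
        delta = finetuned[name] - w_base
        # Per-neuron importance: norm of each neuron's parameter delta.
        if delta.dim() > 1:
            importance = delta.flatten(1).norm(dim=1)
        else:
            importance = delta.abs()
        k = max(1, int(keep_ratio * importance.numel()))
        mask = torch.zeros_like(importance, dtype=torch.bool)
        mask[torch.topk(importance, k).indices] = True
        # Broadcast the neuron mask over the remaining dimensions.
        shape = (-1,) + (1,) * (delta.dim() - 1)
        merged[name] = w_base + scale * delta * mask.view(shape)
    return merged
```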

NeuroMerging (Fang et al., 7 Mar 2025) decomposes each fine-tuned neuron's update into sensitivity (parallel to the pre-trained direction) and adaptability (orthogonal) subspaces, merging these separately. This reduces destructive task interference typical in naïve arithmetic averaging by consolidating novel features in the orthogonal subspace and preserving shared structure.
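
The decomposition itself is simple to state. The sketch below splits each neuron's task vector into parallel and orthogonal components for a single weight matrix, with a plain average across tasks standing in for the paper's actual consolidation rule (an assumption made for brevity).

```python
import torch

def decompose_task_vector(w_pre, w_ft, eps=1e-8):
    """Split each neuron's update (a row of the weight matrix) into a component
    parallel to its pre-trained direction (sensitivity) and the orthogonal
    residual (adaptability)."""
    delta = w_ft - w_pre
    direction = w_pre / (w_pre.norm(dim=1, keepdim=True) + eps)
    parallel = (delta * direction).sum(dim=1, keepdim=True) * direction
    return parallel, delta - parallel

def neuro_merge(w_pre, task_weights, lam=1.0):
    """Combine several fine-tuned weight matrices by merging the two subspaces
    separately (plain means here; the published consolidation rule may differ)."""
    parts = [decompose_task_vector(w_pre, w) for w in task_weights]
    parallel = torch.stack([p for p, _ in parts]).mean(0)
    orthogonal = torch.stack([o for _, o in parts]).mean(0)
    return w_pre + parallel + lam * orthogonal
```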

c. Layer-wise and Chain-of-Merges Protocols

Methods such as Chain of Merges (CoM) (Buzzega et al., 29 Aug 2025) address inter-layer dependencies by recursively merging layers in an auto-regressive manner. At each layer, the merged parameters are optimized given the merged activations from previous layers, thereby mitigating distributional mismatches (merging covariate shift) that arise when independent layer merges are performed without accounting for changing input statistics.
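
A stripped-down sketch of the auto-regressive idea for a stack of linear+ReLU layers follows; the architecture, the use of least squares as the per-layer fit, and the averaged target activations are simplifying assumptions, not the published CoM procedure.

```python
import numpy as np

def chain_of_merges(layers_A, layers_B, X_calib):
    """Auto-regressive layer-wise merging for a stack of linear+ReLU layers.
    At layer i, the merged weight is fit by least squares so that, given the
    *already-merged* activations H, it reproduces the average of the two models'
    own pre-activations. `layers_*` are lists of (out x in) matrices and
    `X_calib` is a small (samples x features) calibration batch."""
    H, Ha, Hb = X_calib, X_calib, X_calib
    merged = []
    for W_A, W_B in zip(layers_A, layers_B):
        Ya, Yb = Ha @ W_A.T, Hb @ W_B.T                   # each model's own pre-activations
        target = 0.5 * (Ya + Yb)                          # desired merged pre-activations
        W_m, *_ = np.linalg.lstsq(H, target, rcond=None)  # regress on merged inputs, not originals
        merged.append(W_m.T)
        H = np.maximum(H @ W_m, 0.0)                      # merged activations feed the next step
        Ha, Hb = np.maximum(Ya, 0.0), np.maximum(Yb, 0.0)
    return merged
```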

3. Practical Neural Merger Algorithms and Technical Considerations

| Method | Alignment Principle | Key Feature |
|---|---|---|
| CCA Merge | CCA subspace alignment | Many-to-many feature matching |
| Locate-then-Merge | Neuron importance (delta magnitude) | Preserves high-importance neurons, suppresses small deltas |
| NeuroMerging | Subspace decomposition (parallel/orthogonal) | Task interference mitigation via subspace decomposition |
| CoM | Auto-regressive layer-wise regression | Mitigates merging covariate shift by re-fitting each layer on merged activations |

Detailed protocol selection depends on architecture, data availability, and whether the models originate from common or independent initializations. For CCA Merge, activation computation and SVD dominate cost, though overall runtime is modest (e.g., ~34s for ResNet20×8 on CIFAR-100) (Horoi et al., 7 Jul 2024). CoM requires only a small activation calibration set (50–100 in-domain samples per model) (Buzzega et al., 29 Aug 2025).

In neuron-centric methods, neuron importance may be estimated via delta magnitude or gradient-based attribution scores (e.g., SNIP (Ma et al., 24 Feb 2025)), with suppression or masking applied to prevent cross-task interference. Regularization hyperparameters and alignment datasets are necessary in all advanced merger protocols, especially those relying on activation alignment.
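
For concreteness, a SNIP-style saliency of the form |w x dL/dw| can be computed as in the sketch below; `model`, `loss_fn`, and `batch` are placeholders for the caller's own objects, and aggregating the scores per neuron before masking is left to the surrounding protocol.

```python
import torch

def snip_importance(model, loss_fn, batch):
    """Connection saliency |w * dL/dw| for every parameter; per-neuron scores can
    be obtained by summing over each output row before masking. `model`, `loss_fn`
    and `batch` are placeholders for the caller's own objects."""
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            scores[name] = (p.detach() * p.grad).abs()
    model.zero_grad()
    return scores
```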

4. Extensions: Multi-model, Heterogeneous, and Efficient Merging

Neural merger frameworks are extended to support:

  • Multiple models: CCA Merge and CoM support all-to-one or auto-regressive fusion of $p \gg 2$ models, with CCA Merge showing less than 4% accuracy drop when merging 20 models versus >15% for permutation-based methods (Horoi et al., 7 Jul 2024); a minimal sketch of the all-to-one loop follows this list.
  • Task and architecture heterogeneity: Advanced approaches such as Foldable SuperNet Merge (FS-Merge) (Kinderman et al., 2 Oct 2024) optimize merge-specific parameterizations (merge/unmerge matrices) using feature reconstruction loss, enabling merger of transformers trained from distinct random inits and on different tasks; previous methods catastrophically fail in this regime.
  • Partial, selective, and router-based merges: Algorithms like LED-Merging (Ma et al., 24 Feb 2025) introduce explicit isolation of task-specific parameter slots through location-election-disjoint rules, resolving safety-utility conflicts in multi-specialist LLM merging. Router-based approaches such as MIN-Merging (Liang, 18 Oct 2025) dynamically select and merge only core neurons or layers of each expert per input, further minimizing destructive interference.
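
Below is a minimal sketch of the all-to-one loop referenced in the first bullet: each new model is aligned to the running merge and folded in with a running-average weight. `align_fn` and `merge_fn` are placeholders (for example, a CCA alignment and weighted parameter averaging); nothing here is specific to any one published method.

```python
def all_to_one_fusion(models, align_fn, merge_fn):
    """Fold p models into one reference, one at a time. `align_fn(ref, model)`
    expresses `model` in the reference's coordinates (e.g., via CCA transforms);
    `merge_fn(ref, aligned, w)` averages parameters with weight `w` on the
    reference. Both callables are placeholders."""
    reference = models[0]
    for k, model in enumerate(models[1:], start=1):
        aligned = align_fn(reference, model)
        # Running average: the reference already represents k models.
        reference = merge_fn(reference, aligned, k / (k + 1))
    return reference
```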

5. Empirical Performance and Evaluation

Empirical benchmarks across vision, language, and multimodal domains demonstrate that modern neural merging methods (CCA Merge, CoM, Foldable SuperNet Merge, Neuron-Fusion) consistently close the gap to ensemble performance. For example, in two-model VGG11×1 tests, CCA Merge achieves 82.7% accuracy vs. direct averaging (often <60%) and ensembles (89.6%). With 20 models, accuracy drops less than 4% with CCA Merge compared to over 15% for permutation matching (Horoi et al., 7 Jul 2024). For multimodal LLMs, Neuron-Fusion recovers language ability while retaining vision skills post-merge, outperforming prior baselines on both language and vision tasks (Yu et al., 22 May 2025).

Router-based and selective neuron approaches (MIN-Merging, LED-Merging) show enhanced in-domain performance (e.g., 83.3% on NLP GLUE merges, exceeding fine-tuned upper bounds) while retaining out-of-domain generalization (Liang, 18 Oct 2025). Selective masking and gradient-based neuron identification enable robust merging without catastrophic task interference or safety degradation (Ma et al., 24 Feb 2025).

6. Limitations, Open Questions, and Future Directions

Despite significant advances, open technical challenges remain:

  • Data requirements: Most merging techniques require auxiliary datasets (activations or calibration samples) for alignment or regression, which may not be available in federated or private settings (Horoi et al., 7 Jul 2024, Buzzega et al., 29 Aug 2025).
  • Extension to truly heterogeneous architectures: While some methods (e.g., Foldable SuperNets) allow width mismatch, merging across fundamentally different depths or model families (e.g., ViT and ResNet) is not yet fully supported (Kinderman et al., 2 Oct 2024).
  • Covariate shift and theoretical foundations: Deeper theoretical understanding of merging covariate shift and linear mode connectivity (beyond permutations or subspace projections) is needed (Horoi et al., 7 Jul 2024, Buzzega et al., 29 Aug 2025).
  • Task interference and safety: Guaranteeing non-interference (especially safety-utility conflicts) remains an ongoing research area, with structured masking and election stages providing partial solutions (Ma et al., 24 Feb 2025).
  • Scalability and efficiency: While methods such as CoM and MIN-Merging are designed for scaling to many experts or tasks, further reductions in memory and compute cost (e.g., via low-rank or hierarchical strategies) are anticipated in upcoming research.

In conclusion, neural merger technologies have evolved from simple parameter-space heuristics to sophisticated, data-efficient alignment protocols that combine models at the subspace, neuron, or layer level, with principled control over interference and feature preservation. These approaches enable unified multi-task models with much of the performance and diversity of ensembles, but with the computational and memory footprint of single-model inference (Horoi et al., 7 Jul 2024, Buzzega et al., 29 Aug 2025, Yu et al., 22 May 2025, Liang, 18 Oct 2025, Ma et al., 24 Feb 2025).
