Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis
Abstract: Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the different features learned by each model; however, it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters, reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex, and the minima found through learning are typically separated by high loss barriers. Numerous recent works have focused on finding permutations matching one network's features to those of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the models' features. We show that our alignment method leads to better performance than past methods when averaging models trained on the same or differing data splits. We also extend this analysis to the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge
Explain it Like I'm 14
What is this paper about?
This paper looks at a smarter way to combine several trained neural networks into one single network. Instead of running many models at the same time (an “ensemble”), which is accurate but slow and memory-hungry, the authors want to “merge” the models’ parameters into one model that is fast and cheap to use—while keeping as much accuracy as possible.
Their new method is called CCA Merge. It uses a math tool named Canonical Correlation Analysis (CCA) to line up what different networks have learned so their “knowledge” can be averaged together more safely.
What questions are they trying to answer?
The paper focuses on simple, practical questions:
- Can we create one good model by merging several trained models, instead of ensembling them?
- Why do common merging methods fail, and how can we fix that?
- Can a smarter alignment method (based on CCA) help us merge not just two models, but many, without losing much accuracy?
- Will this work across different neural network types and datasets?
How did they do it?
Think of each neural network layer as a group of “neurons,” each detecting certain patterns (like edges, colors, or shapes in images). If you try to average two networks layer-by-layer without lining up these neurons first, you get a messy result and the model stops working. So the key step is alignment: making sure the “right” features are matched before averaging.
The problem with simple merging
- Simple averaging fails because the networks’ “brains” aren’t arranged the same way. Even if they learned similar features, they may store them in different neurons or spread one idea across several neurons.
- Older methods try to “permute” (reorder) neurons so that neuron A1 matches neuron B1, A2 matches B2, etc. But that assumes each feature has a one-to-one match, which often isn’t true.
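The permutation idea above can be made concrete with a toy numpy sketch: match each neuron of model A to exactly one neuron of model B by maximizing the total correlation between their activations. All names and sizes here are illustrative; real permutation-matching methods solve this assignment with the Hungarian algorithm over full layers, whereas this sketch brute-forces a 4-neuron toy layer.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# Toy activations: n samples x d neurons for two "models".
n, d = 200, 4
A = rng.standard_normal((n, d))
perm_true = (2, 0, 3, 1)                      # how B's neurons shuffle A's
B = A[:, perm_true] + 0.01 * rng.standard_normal((n, d))

# Cross-correlation between every neuron of A and every neuron of B.
Az = (A - A.mean(0)) / A.std(0)
Bz = (B - B.mean(0)) / B.std(0)
C = Az.T @ Bz / n                             # C[i, j] = corr(A_i, B_j)

# One-to-one matching that maximizes total correlation. Real implementations
# use the Hungarian algorithm; brute force suffices for a 4-neuron layer.
best = max(permutations(range(d)),
           key=lambda p: sum(C[p[j], j] for j in range(d)))
assert best == perm_true                      # the shuffle is recovered
```

The sketch works because B really is a shuffled copy of A; the paper's point is that in real networks no such perfect one-to-one shuffle exists, which is where this approach breaks down.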
A gentler way to line up features: CCA
- Canonical Correlation Analysis (CCA) is like finding a shared “language” between two groups.
- Instead of forcing a one-to-one match, CCA looks for the best way to mix and match linear combinations of neurons from each model so their outputs are as similar (correlated) as possible.
- In everyday terms: if one model spreads a concept across three neurons and another uses just one neuron for that same concept, CCA can still match them by blending those three into one.
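This "blending" can be sketched in a few lines of numpy using the standard whitening-plus-SVD derivation of CCA. The mixing matrix and noise level below are made up for illustration: model B's features are an invertible linear mix of model A's, so no one-to-one neuron match exists, yet CCA finds near-perfectly correlated linear combinations.

```python
import numpy as np

rng = np.random.default_rng(1)

def cca(X, Y, eps=1e-8):
    """Canonical correlations and projections via whitening + SVD."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(S):  # inverse matrix square root of a covariance
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    # Px, Py map each model's features into a shared, maximally correlated
    # space; s holds the canonical correlations, sorted in descending order.
    return Wx @ U, Wy @ Vt.T, s

n, d = 500, 3
X = rng.standard_normal((n, d))                # "model A" activations
M = np.array([[1.0, 2.0, 0.0],                 # made-up invertible mixing
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
Y = X @ M + 0.01 * rng.standard_normal((n, d)) # "model B" spreads A's concepts
Px, Py, corrs = cca(X, Y)
assert corrs.min() > 0.99    # all canonical correlations are near 1
```

A pure permutation could not score this well here, since each of B's neurons mixes several of A's; CCA's linear combinations absorb the mixing.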
How CCA Merge works, step by step
Here’s the idea in a few simple steps:
- Run both models on the same set of inputs to collect their internal activations (what each neuron “fires” at each layer).
- Use CCA to find two projection steps—like two translators—that map both models’ features into a common space where they look as similar as possible.
- Use these projections to compute a transformation that gently re-expresses Model B’s features to line up with Model A’s features.
- Average the aligned weights of the two models layer by layer to create a single merged model.
This is done at selected “merging layers” in the network (you don’t always need to do it after every layer).
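The weight-averaging mechanics in the last two steps can be illustrated with a tiny two-layer MLP in numpy. CCA Merge estimates the alignment transform T from activations; in this made-up example model B is just a neuron-permuted copy of model A, so the ground-truth T (the inverse permutation) is used instead, and the merged network reproduces the original function exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out = 4, 5, 3
relu = lambda z: np.maximum(z, 0)
forward = lambda W1, W2, x: W2 @ relu(W1 @ x)

# Model A: a tiny two-layer MLP (weights made up for illustration).
W1a = rng.standard_normal((d_h, d_in))
W2a = rng.standard_normal((d_out, d_h))

# Model B computes the same function with its hidden neurons permuted.
P = np.eye(d_h)[rng.permutation(d_h)]
W1b, W2b = P @ W1a, W2a @ P.T

# Alignment transform T maps B's hidden features onto A's. CCA Merge would
# estimate T from activations; here the ground truth T = P^-1 is used.
T = P.T                                   # inverse of a permutation matrix

# Merge: apply T to B's layer-1 output weights and T^-1 to the next layer's
# input weights (so the function is unchanged), then average with A.
W1m = 0.5 * (W1a + T @ W1b)
W2m = 0.5 * (W2a + W2b @ np.linalg.inv(T))

x = rng.standard_normal(d_in)
assert np.allclose(forward(W1m, W2m, x), forward(W1a, W2a, x))
```

The key design point is the paired update: whatever transform re-expresses a layer's outputs must be undone on the next layer's inputs, otherwise alignment would change what each model computes before the averaging even happens.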
Merging more than two models
To merge many models, they use a simple “all-to-one” approach: pick one model as the reference, align every other model to it using CCA, and then average all the aligned weights together.
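A minimal numpy sketch of this all-to-one scheme, under made-up conditions: each extra "model" is a row-shuffled copy of the reference layer's weights, and each transform T_i (which CCA Merge would estimate from activations, but is the exact inverse permutation here) maps that model's neurons back onto the reference before averaging.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W0 = rng.standard_normal((d, d))            # reference model's layer weights

# Other "models": the same weights with neurons (rows) shuffled differently.
perms = [rng.permutation(d) for _ in range(3)]
others = [np.eye(d)[p] @ W0 for p in perms]

# All-to-one: transform T_i aligns each model's neurons to the reference.
Ts = [np.eye(d)[p].T for p in perms]        # exact inverses of the shuffles

aligned = [W0] + [T @ W for T, W in zip(Ts, others)]
merged = sum(aligned) / len(aligned)
assert np.allclose(merged, W0)   # all copies line up, so averaging recovers W0
```

With real, independently trained models the aligned weights only approximately agree, so the average is a genuine compromise rather than an exact recovery; the point of the sketch is only the pick-a-reference-then-align structure.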
What did they find?
Across several tests, their method performed better than popular alternatives.
They tried multiple neural network types and datasets: VGG on CIFAR-10, ResNet on CIFAR-100, and ResNet on ImageNet. Competing methods included:
- Direct averaging (no alignment)
- Permute-based alignment
- Optimal Transport (OT) methods
- Matching Weights (a stronger permutation approach)
- ZipIt! (a newer method that also merges neurons within the same model)
Key takeaways:
- CCA Merge consistently produced merged models with higher accuracy than other merging methods, sometimes by a large margin.
- It narrowed the gap between merging and ensembling. While ensembling still gave the best accuracy, CCA Merge came closer than the others—without needing to run multiple models at once.
- It worked better when merging many models, not just two. Other methods tended to lose a lot of accuracy as the number of models increased. CCA Merge stayed relatively stable.
- It handled tougher cases where models learned from different data splits. When the models saw different parts of the dataset (but the same classes), CCA Merge often matched or beat the average performance of the individual models—something most other methods didn’t achieve.
- The authors also showed why one-to-one matching (permutations) is limited: in real networks, a feature in one model often corresponds to several neurons in another. CCA’s “mixing” handles this better.
Why does this matter?
- Faster, cheaper, and simpler deployment: You can get ensemble-like benefits from a single merged model, saving memory and computation at test time.
- Better model collaboration: In settings like federated or distributed learning—where many models are trained separately—CCA Merge makes it easier to combine them into a strong global model.
- More robust tools: Because CCA doesn’t assume a strict one-to-one neuron match, it is more flexible and reliable across different models and training runs.
In short, CCA Merge is a practical step toward making model merging actually work in real life, helping us reuse and combine models more effectively without paying the high cost of ensembling.