Linear Mode Connectivity in Neural Networks
- Linear mode connectivity is the phenomenon where independently trained neural network solutions, after symmetry alignment, can be connected by a straight line in weight space with nearly constant loss.
- Demonstrating it relies on techniques like weight and activation matching to align symmetric parameters, eliminating barriers in non-convex loss landscapes.
- This property improves model merging, ensembling, and continual learning by revealing a smoother, more navigable loss geometry in deep networks.
Linear mode connectivity (LMC) is the empirical and theoretical phenomenon that two independently trained solutions (modes) of a neural network—often obtained via stochastic gradient descent (SGD) from different random seeds or data permutations—can, after proper symmetry alignment, be connected by a straight line segment in weight space along which the training or test loss remains essentially constant, with no intervening barrier. This structure, which appears surprising in highly non-convex loss landscapes, has significant implications for model merging, ensembling, optimization stability, architecture analysis, and our understanding of the geometry of deep-learning loss landscapes.
1. Formal Definition and Fundamental Characterization
Let $\theta_A, \theta_B \in \mathbb{R}^d$ represent two trained parameter vectors of a neural network and let $\mathcal{L}(\theta)$ denote the empirical risk or loss. The straight-line interpolation is
$$\theta(\lambda) = (1-\lambda)\,\theta_A + \lambda\,\theta_B, \qquad \lambda \in [0,1].$$
The pointwise interpolation loss is $\mathcal{L}(\theta(\lambda))$, and the barrier is
$$B(\theta_A, \theta_B) = \max_{\lambda \in [0,1]} \mathcal{L}(\theta(\lambda)) - \tfrac{1}{2}\big[\mathcal{L}(\theta_A) + \mathcal{L}(\theta_B)\big].$$
$\theta_A$ and $\theta_B$ are linearly mode connected if $B(\theta_A, \theta_B) \approx 0$ (typically, a near-zero increase above endpoint losses), i.e., the loss remains flat or has a negligible hump between the modes. In practice, one also measures the deviation from the convex combination of the endpoint losses, i.e., $\max_{\lambda \in [0,1]} \big[\mathcal{L}(\theta(\lambda)) - (1-\lambda)\mathcal{L}(\theta_A) - \lambda\,\mathcal{L}(\theta_B)\big]$, but these are functionally equivalent in the overparameterized setting (Theus et al., 28 Jun 2025, Altintas et al., 2023, Entezari et al., 2021).
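The barrier can be measured directly by sampling the loss along the segment. A minimal NumPy sketch, using a toy convex quadratic as a stand-in for a real network loss (all function and variable names here are illustrative):

```python
import numpy as np

def interpolation_losses(loss_fn, theta_a, theta_b, n_points=25):
    """Evaluate the loss along the straight line between two parameter vectors."""
    lambdas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - lam) * theta_a + lam * theta_b)
                       for lam in lambdas])
    return lambdas, losses

def barrier(loss_fn, theta_a, theta_b, n_points=25):
    """Barrier B: peak interpolation loss minus the mean of the endpoint losses."""
    _, losses = interpolation_losses(loss_fn, theta_a, theta_b, n_points)
    return float(losses.max() - 0.5 * (losses[0] + losses[-1]))

# Toy stand-in for a network loss: a convex quadratic, which can never
# have a positive barrier between two equal-loss points.
quad = lambda theta: float(theta @ theta)
theta_a, theta_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(barrier(quad, theta_a, theta_b))  # → 0.0
```

For a real network, `loss_fn` would flatten and unflatten the model's parameters and evaluate the empirical risk on a held-out batch.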
2. Symmetry Structure and Alignment Mechanisms
Neural networks possess extensive parameter-space symmetries that render many minima functionally equivalent yet parametrically distinct. These symmetries arise from:
- Permutations: Hard one-to-one neuron/channel reorderings, $W \mapsto P W$ (with the following layer transformed as $W' \mapsto W' P^{\top}$) for a permutation matrix $P$, common in MLPs (hidden units) and CNNs (channels).
- Semi-permutations: Sub-stochastic, sparse matrices mapping one-to-many or many-to-one, particularly in residual or attention head blocks in deep transformers.
- Orthogonal Transformations: $W \mapsto Q W$ with $Q^{\top} Q = I$, relevant to normalization or residual subspaces (e.g., RMSNorm).
- Invertible Linear Maps (GL): Full-rank weight matrices in key modules, such as QK/OV projections in attention circuits.
Let $\mathcal{G}$ be a symmetry group acting on parameter space. We seek
$$g^{\star} = \arg\min_{g \in \mathcal{G}} \; \max_{\lambda \in [0,1]} \mathcal{L}\big((1-\lambda)\,\theta_A + \lambda\, g(\theta_B)\big),$$
that is, the optimal symmetry (composition of permutations, semi-permutations, orthogonal maps, invertibles) to align $\theta_B$ so that linear interpolation with $\theta_A$ avoids loss barriers (Theus et al., 28 Jun 2025, Zhan et al., 8 Mar 2025).
Algorithmic approaches:
- Weight matching: Bilinear assignment (Hungarian algorithm) layerwise, possibly extended with Procrustes for orthogonal matching. Attention heads employ cost matrices based on Frobenius distances. (Theus et al., 28 Jun 2025, Ito et al., 2024)
- Activation matching: Data-driven, matches neuron activations empirically to align functional subspaces.
- Learned matching (end-to-end): Unconstrained alignments parametrized and optimized jointly with projections to the relevant symmetry classes, enabling gradient-based refinement and fully exploiting the symmetry group $\mathcal{G}$ (Theus et al., 28 Jun 2025).
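Weight matching over hidden-unit permutations can be sketched in a few lines. The example below uses brute-force search over permutations of a tiny two-layer network, rather than the Hungarian algorithm used in practice; sizes and names are illustrative:

```python
import itertools
import numpy as np

def match_hidden_units(W1_a, W1_b, W2_a, W2_b):
    """Find the hidden-unit permutation of model B minimizing the weight
    distance to model A. Brute force for clarity; real implementations use
    the Hungarian algorithm (e.g. scipy's linear_sum_assignment).
    Permuting rows of W1 and columns of W2 together leaves the network
    function unchanged, so any such permutation is a valid symmetry."""
    n = W1_a.shape[0]
    best_perm, best_cost = None, np.inf
    for candidate in itertools.permutations(range(n)):
        p = list(candidate)
        cost = (np.linalg.norm(W1_a - W1_b[p]) ** 2
                + np.linalg.norm(W2_a - W2_b[:, p]) ** 2)
        if cost < best_cost:
            best_perm, best_cost = p, cost
    return best_perm

# Sanity check: model B is model A with its hidden units shuffled.
rng = np.random.default_rng(0)
W1_a, W2_a = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
shuffle = [2, 0, 3, 1]
W1_b, W2_b = W1_a[shuffle], W2_a[:, shuffle]
perm = match_hidden_units(W1_a, W1_b, W2_a, W2_b)
print(np.allclose(W1_b[perm], W1_a))  # → True (the shuffle is undone)
```

Activation matching follows the same assignment structure, but builds the cost matrix from empirical activation correlations rather than weight distances.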
3. Theoretical Foundations and Empirical Findings
3.1. Emergence of LMC in Practice and Theory
Early empirical studies identified large barriers when interpolating between independently trained networks, but barrier-free interpolation emerges after symmetry alignment (permutation, etc.) (Entezari et al., 2021, Akash et al., 2022). The phenomenon is theoretically underpinned in wide two-layer networks via optimal transport and in multi-layer settings via recursively controlled neuron alignments. The attainable barrier is rate-limited by width and by intrinsic layerwise dimension: for two-layer ReLU networks the barrier vanishes as width grows (Zhan et al., 8 Mar 2025), and general bounds decay with width at a rate governed by the dimension of the layerwise weight supports (Ferbach et al., 2023).
3.2. Architectural and Optimization Dependence
LMC strongly depends on:
- Width: Sufficiently overparameterized (wide) networks are a prerequisite; double descent in the LMC barrier is observed as width increases (Zhan et al., 8 Mar 2025).
- Symmetry complexity: Transformers require richer symmetry classes than permutations alone: semi-permutations, orthogonal maps, and invertible linear maps (Theus et al., 28 Jun 2025).
- Optimization regime: SGD at appropriate learning rates, batch sizes, and scheduling can promote or destroy LMC; adaptive optimizers (Adam) may break LMC by pushing iterates out of shared basins unless careful warmup is provided (Altintas et al., 2023).
- Dataset complexity: Harder classification tasks and deep architectures fragment the loss landscape, breaking LMC in the absence of massive overparameterization or symmetry exploitation (Altintas et al., 2023).
- Initialization and training phase: Models become LMC-stable only after early training; initializations rarely yield stability, and the timing of the SGD trajectory fork is crucial (Frankle et al., 2019, Singh et al., 2024).
3.3. Quantitative Results
Representative experiments on state-of-the-art vision transformers (ViT) and GPT-2 demonstrate that, for independently trained pairs:
| Method | LMC Barrier (CIFAR-10) |
|---|---|
| Vanilla averaging | 1.69 |
| Activation matching (AM) | 1.27 |
| Weight matching (WM) | 0.36 |
| Learned symmetries (perm only) | 0.45 |
| Learned all symmetries (full $\mathcal{G}$) | 0.00 |
Zero-barrier interpolation is achieved only by accounting for the full hierarchy of symmetries in transformers. Weight or activation matching alone reduces, but does not entirely eliminate, the barrier. Similar findings apply to GPT-2: learned full-symmetry matching yields a barrier of 0.41 compared to 4.3 for vanilla and 1.58 for permutation/WM alone (Theus et al., 28 Jun 2025).
4. Generalizations and Special Cases
4.1. Layer-wise Connectivity
Even when joint LMC fails, interpolating only one layer at a time rarely induces a barrier. In deep linear networks, layerwise interpolation implies convexity of the loss along that direction (Adilova et al., 2023); in non-linear nets, empirical heatmaps show essentially no layer-wise barrier, especially in early and late layers, with possible exceptions in mid-blocks.
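Layer-wise connectivity can be measured by mixing a single layer while freezing the rest at one endpoint. A minimal sketch using the chord-based barrier and a deep linear network, whose loss is a convex quadratic in any single layer, so the one-layer-at-a-time barrier is provably non-positive (all names are illustrative):

```python
import numpy as np

def layerwise_barrier(loss_fn, layers_a, layers_b, layer_idx, n_points=25):
    """Interpolate a single layer between models A and B (all other layers
    frozen at A) and return the chord-based barrier along that path."""
    lams = np.linspace(0.0, 1.0, n_points)
    def mixed_loss(lam):
        layers = list(layers_a)
        layers[layer_idx] = ((1 - lam) * layers_a[layer_idx]
                             + lam * layers_b[layer_idx])
        return loss_fn(layers)
    losses = np.array([mixed_loss(lam) for lam in lams])
    chord = (1 - lams) * losses[0] + lams * losses[-1]
    return float((losses - chord).max())

# Deep linear network: squared error is convex in each layer separately,
# so interpolating one layer at a time cannot create a barrier.
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(3, 8)), rng.normal(size=(2, 8))
def linear_net_loss(layers):
    W1, W2 = layers
    return float(np.sum((W2 @ W1 @ X - Y) ** 2))
A = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
B = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
print(layerwise_barrier(linear_net_loss, A, B, layer_idx=0) <= 1e-9)  # → True
```

Repeating the measurement per layer yields the heatmaps referenced above; in non-linear networks the same code applies, but the non-positivity guarantee does not.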
4.2. Mixture-of-Experts, Tree Ensembles, and Other Models
LMC extends beyond standard MLPs/CNNs:
- Mixture-of-Experts (MoE): Mode connectivity is preserved up to permutation of experts and gating functions; matching algorithms efficiently align functional components (Tran et al., 14 Sep 2025).
- Differentiable tree ensembles: Require accounting for subtree flip and split order invariance in addition to tree-permutation, or, for decision-list variants, only tree permutation (Kanoh et al., 2024).
- Sparse Networks: Synthetic-distilled subnetworks exhibit stable, flat, linearly connected basins after pruning, unlike standard sparse or dense networks (McDermott et al., 2023).
4.3. Star-Shaped and Two-Piece Linear Connectivity
For multiple minima, there exists (in overparameterized teacher-student and linear regimes) a center mode such that all minima are two-piece linearly connected to it, rendering the landscape nearly convex (normalized geodesic distance close to 1) (Lin et al., 2024).
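The two-piece construction can be tested with the same barrier measurement applied to the path through the center mode. A minimal sketch on a toy quadratic, where the origin trivially plays the role of the center mode (all names are illustrative):

```python
import numpy as np

def two_piece_barrier(loss_fn, theta_a, theta_center, theta_b, n_points=25):
    """Peak loss along the piecewise-linear path A -> center -> B,
    minus the mean of the two endpoint losses."""
    lams = np.linspace(0.0, 1.0, n_points)
    seg1 = [loss_fn((1 - l) * theta_a + l * theta_center) for l in lams]
    seg2 = [loss_fn((1 - l) * theta_center + l * theta_b) for l in lams]
    return float(max(seg1 + seg2) - 0.5 * (seg1[0] + seg2[-1]))

# Toy landscape: two equal-loss points joined through the global minimum
# at the origin, giving a barrier of zero along the two-piece path.
loss = lambda th: float(th @ th)
theta_a, theta_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
center = np.zeros(2)
print(two_piece_barrier(loss, theta_a, center, theta_b))  # → 0.0
```

In the star-shaped setting, the interesting case is when the direct segment between two minima has a barrier but both segments through the center mode do not.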
5. Implications for Optimization, Model Fusion, and Ensembling
LMC reveals that—even in ostensibly rugged landscapes—practical minima from stochastic optimization are not separated by insurmountable energy ridges but are connected (after symmetry alignment) by flat directions in parameter space. This permits:
- Model merging: Linear interpolation after symmetry matching yields intermediary models with loss indistinguishable from the original endpoints (Akash et al., 2022, Ito et al., 2024).
- Fine-tuning and continual learning: Sequentially found minima can be constrained to lie on the same (possibly linear) path as multitask solutions, reducing catastrophic forgetting (Mirzadeh et al., 2020).
- Ensembling and federated learning: Layer- or block-wise averaging enables efficient fusion without loss barriers (Adilova et al., 2023).
- Analysis of risk and generalization: Flat, connected basins correlate with improved generalization; LMC serves as an indicator of robustness to training noise (Frankle et al., 2019, Hepburn et al., 6 Nov 2025).
6. Theoretical Analyses and Open Problems
- Barriers and Hessians: Second-order approximations predict the loss barrier: $B(\lambda) \approx -\frac{\lambda(1-\lambda)}{2}\,\Delta\theta^{\top} H\,\Delta\theta$, with $\Delta\theta = \theta_B - \theta_A$ and $H$ a (path-averaged) Hessian; a positive barrier thus requires negative curvature along $\Delta\theta$ (Singh et al., 2024).
- Symmetry group topology: The structure and action of $\mathcal{G}$ determine the number and connectivity of minima; skip connections reduce disconnectedness (Zhao et al., 29 May 2025).
- Failure modes: Unaligned scales or incomplete symmetry removal can induce arbitrarily large barriers; continuous scaling symmetries in homogeneous nets must be factored out (Zhao et al., 29 May 2025).
- Extensions to further architectures: Ongoing work is extending LMC and symmetry analysis to transformers with richer symmetries (beyond permutations), to MoEs, and to differentiable trees, as well as to applications in multitask and federated settings (Theus et al., 28 Jun 2025, Tran et al., 14 Sep 2025, Kanoh et al., 2024).
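The second-order barrier prediction is exact when the loss is itself quadratic, which makes it easy to sanity-check numerically. A hypothetical toy check (the Hessian, endpoints, and all names here are invented purely for illustration):

```python
import numpy as np

# Check the second-order barrier formula
#   B(lam) ≈ -lam*(1-lam)/2 * dtheta^T H dtheta
# on a quadratic loss, where it holds exactly.
H = np.diag([1.0, -2.0])                        # one negative-curvature direction
loss = lambda th: 0.5 * float(th @ H @ th)
theta_a, theta_b = np.array([1.0, -1.0]), np.array([1.0, 1.0])
dtheta = theta_b - theta_a

lams = np.linspace(0.0, 1.0, 101)
path = np.array([loss((1 - l) * theta_a + l * theta_b) for l in lams])
chord = (1 - lams) * loss(theta_a) + lams * loss(theta_b)
measured = float((path - chord).max())           # peak occurs at lam = 1/2
predicted = -0.125 * float(dtheta @ H @ dtheta)  # formula evaluated at lam = 1/2
print(measured, predicted)  # → 1.0 1.0
```

Because $\Delta\theta$ here points along the negative-curvature direction of $H$, the interpolation path rises above the chord, reproducing the predicted barrier.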
7. Future Research Directions
Promising avenues include:
- Scaling symmetry-aware alignment to larger and more diverse models (LLaMA, multi-task transformers).
- Automated identification and exploitation of soft and continuous symmetries (e.g., via Sinkhorn relaxations).
- Analysis of triangle inequalities and transitivity of symmetries for multi-model fusion.
- Detailed study of optimizer dynamics, initialization choices, and their effect on basin selection and LMC (Theus et al., 28 Jun 2025, Altintas et al., 2023).
- Generalization to sequence and tree architectures as well as practical ensembling in the presence of data and domain shifts.
In summary, linear mode connectivity—once elusive and relegated to small or idealized settings—has been rigorously characterized, both empirically and theoretically, across diverse architectures and learning paradigms. Proper consideration of latent parameter symmetries is essential; with this, the loss geometry of modern networks is revealed to be unexpectedly benign, with broad, connected valleys supporting practical innovations in merging, ensembling, and continual learning (Theus et al., 28 Jun 2025).