Linear Mode Connectivity in Deep Learning
- Linear Mode Connectivity (LMC) is the observation that independent deep learning model solutions can be connected via a linear path in parameter space while maintaining nearly unchanged loss.
- It demonstrates that symmetry-aware alignment of neural network weights removes barriers in non-convex loss landscapes, challenging traditional training intuitions.
- LMC has practical applications in model fusion, federated and continual learning, and pruning, offering actionable insights for both theoretical and applied deep learning research.
Linear Mode Connectivity (LMC) is the empirical and theoretical observation that independently optimized solutions of overparameterized machine learning models—most notably deep neural networks—can, after accounting for inherent symmetries, be connected via a straight-line path in parameter space along which the loss remains nearly constant. LMC challenges classical intuitions about highly non-convex loss landscapes, suggesting unexpectedly rich structural properties exploitable for model fusion, continual learning, pruning, and theoretical understanding of deep learning success.
1. Definition and Mathematical Criteria
Let $\theta_A$ and $\theta_B$ denote two independently trained parameter vectors (from the same initialization but different SGD noise, data shuffling, or augmentation). Define the linear interpolation as

$$\theta_\alpha = (1-\alpha)\,\theta_A + \alpha\,\theta_B, \qquad \alpha \in [0, 1].$$

A pair $(\theta_A, \theta_B)$ exhibits Linear Mode Connectivity if

$$\mathcal{L}(\theta_\alpha) \le \max\{\mathcal{L}(\theta_A), \mathcal{L}(\theta_B)\} + \epsilon \quad \text{for all } \alpha \in [0, 1],$$

meaning the loss remains virtually constant along the path. In applied settings, the instability (or barrier) is quantified as

$$B(\theta_A, \theta_B) = \max_{\alpha \in [0,1]} \mathcal{L}(\theta_\alpha) - \tfrac{1}{2}\left[\mathcal{L}(\theta_A) + \mathcal{L}(\theta_B)\right],$$

and connectivity is declared when the instability falls below a small threshold (e.g., 2%) (Frankle et al., 2019). Generalizations exist: for $d$-dimensional parameters, element-wise convex combinations are also considered (Csiszárik et al., 2023). In practice, the presence of symmetries—particularly permutation invariance of hidden units—necessitates aligning models prior to interpolation.
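The interpolation path and its instability can be evaluated numerically. Below is a minimal NumPy sketch; the loss functions and parameter vectors are toy assumptions for illustration, not from the cited papers:

```python
import numpy as np

def linear_path_instability(loss_fn, theta_a, theta_b, n_points=25):
    """Maximum rise of loss_fn along the straight segment from theta_a to
    theta_b, relative to the linear interpolation of the endpoint losses
    (the instability/barrier of Frankle et al., up to normalization)."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path_loss = np.array(
        [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]
    )
    baseline = (1 - alphas) * loss_fn(theta_a) + alphas * loss_fn(theta_b)
    return float(np.max(path_loss - baseline))

# Convex quadratic loss: no barrier along any straight path.
convex = lambda t: float(t @ t)
flat = linear_path_instability(convex, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Double well (||t||^2 - 1)^2: the straight path between the two minima
# crosses the hump at the origin, giving a barrier of 1 at the midpoint.
double_well = lambda t: float((t @ t - 1.0) ** 2)
bumpy = linear_path_instability(double_well, np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```

On the convex loss the instability is 0, while the double well yields a barrier of 1 at the midpoint.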
2. Geometric and Symmetry-Based Explanations
Permutation Invariance and Generalized Symmetries
Neural networks possess discrete symmetries: permuting the order of hidden units (in layers with elementwise activations) does not affect the computed function. When unaccounted for, naive interpolation between two solutions results in high loss barriers, but after optimal permutation alignment, the barriers can vanish—even for wide, deep networks (Entezari et al., 2021, Zhan et al., 8 Mar 2025). In modern architectures such as Transformers, the relevant symmetry classes include not only permutations but also semi-permutations, orthogonal transformations (for normalization and residual streams), and general invertible maps (Theus et al., 28 Jun 2025). These are formally nested:

$$\{\text{permutations}\} \subset \{\text{semi-permutations}\} \subset \{\text{invertible maps}\}, \qquad \{\text{permutations}\} \subset \{\text{orthogonal maps}\} \subset \{\text{invertible maps}\}.$$

Formally, for consecutive weight matrices $(W_\ell, W_{\ell+1})$ in a feedforward block, permutation symmetry is captured by

$$(W_\ell, W_{\ell+1}) \mapsto (P W_\ell,\; W_{\ell+1} P^\top)$$

for a permutation matrix $P$; for semi-permutations, $P$ is replaced by a sparse stochastic matrix. These reparameterizations preserve the network’s function, enabling LMC after symmetry-aware alignment (Theus et al., 28 Jun 2025).
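The function-preservation claim is easy to verify numerically. Here is a small sketch for a two-layer ReLU block; the shapes and random values are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))   # first-layer weights
W2 = rng.normal(size=(d_out, d_hidden))  # second-layer weights
x = rng.normal(size=(d_in,))

relu = lambda z: np.maximum(z, 0.0)

# A random permutation of the hidden units, as a permutation matrix P.
P = np.eye(d_hidden)[rng.permutation(d_hidden)]

# (W1, W2) -> (P W1, W2 P^T) relabels hidden units without changing the function:
y_original = W2 @ relu(W1 @ x)
y_permuted = (W2 @ P.T) @ relu((P @ W1) @ x)
assert np.allclose(y_original, y_permuted)
```

Aligning one model to another before interpolation amounts to searching over such matrices $P$ (e.g., by matching weights or activations).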
Topological and Curvature-Based Perspectives
Parameter space symmetry groups (e.g., products of general linear groups acting on $L$-layer linear networks) induce orbits of loss minima. The number of connected components in the minimum set is determined by the topology of the symmetry group—e.g., $2^{L-1}$ components for a product of $L-1$ general linear groups, each of which has two connected components (Zhao et al., 29 May 2025). Skip connections reduce this number, making mode connectivity more widespread.
While symmetry implies the existence of connecting curves of constant loss, linear interpolation between arbitrary minima may nonetheless encounter a barrier—especially when rescaling or stretching symmetries are present. A sufficient condition for approximate LMC is that the minimum set is sufficiently flat (i.e., has low curvature), so that the straight segment between two minima stays close to the minimum set:

$$\max_{\alpha \in [0,1]} \mathcal{L}(\theta_\alpha) - \mathcal{L}^{*} \;\lesssim\; \kappa\, \lVert \theta_B - \theta_A \rVert^{2},$$

with $\kappa$ the maximum curvature of the minimum set and $\mathcal{L}^{*}$ the loss attained on it (Zhao et al., 29 May 2025).
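A toy numerical illustration of this flatness condition (my own construction, not from the paper): take a loss whose minimum set is a circle of radius $R$, so its curvature is $\kappa = 1/R$. For a chord of fixed length between two minima, the barrier along the chord shrinks as the valley gets flatter (larger $R$):

```python
import numpy as np

def chord_barrier(R, chord):
    """Max loss along the straight segment between two minima of
    L(theta) = (||theta|| - R)^2, whose minimum set is a circle of
    radius R (curvature kappa = 1/R)."""
    phi = np.arcsin(chord / (2.0 * R))  # half-angle subtended by the chord
    a = R * np.array([np.cos(phi),  np.sin(phi)])
    b = R * np.array([np.cos(phi), -np.sin(phi)])
    alphas = np.linspace(0.0, 1.0, 101)
    pts = np.outer(1 - alphas, a) + np.outer(alphas, b)
    return float(np.max((np.linalg.norm(pts, axis=1) - R) ** 2))

b_sharp = chord_barrier(R=1.0, chord=0.6)   # kappa = 1.0
b_flat = chord_barrier(R=10.0, chord=0.6)   # kappa = 0.1
```

For the same parameter distance, the flatter valley yields a barrier roughly two orders of magnitude smaller, consistent with the curvature-based sufficient condition.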
3. Empirical Manifestations and Algorithmic Consequences
Stability Under Noise and Early Training
Empirical evidence across vision models and large language models shows that once a model enters certain regions of parameter space—typically early in training—different runs with varying SGD noise or data order converge to minima connected by linear paths (Frankle et al., 2019). For small-scale models (e.g., LeNet on MNIST), this stability is present at initialization; for large models (e.g., ResNet-50, Inception-v3 on ImageNet), it emerges after 10–20% of training (Frankle et al., 2019). This “stable regime” is essential for the lottery ticket hypothesis: subnetworks selected by iterative magnitude pruning (IMP) achieve full accuracy only when found after training has entered a linearly connected region (Frankle et al., 2019).
Layerwise and Featurewise Connectivity
Beyond global parameter interpolation, barrier-free connectivity often emerges at the level of individual layers (“layer-wise linear mode connectivity”) (Adilova et al., 2023), and—where LMC holds—at the level of internal network features (“layerwise linear feature connectivity,” LLFC) (Zhou et al., 2023). LLFC typically requires two conditions: (1) weak additivity of activations (e.g., ReLU), and (2) (approximate) commutativity of cross-model transformations at each layer. If these are met, not only outputs but all intermediate hidden states behave linearly under interpolation, explaining the surprising ease of “model stitching.”
Mode Combinability and Transitivity
After permutation alignment, LMC generalizes to “mode combinability”: not only does the line segment connecting two solutions yield low loss, but entire hypercubes (element-wise convex combinations of parameters) yield low-loss models (Csiszárik et al., 2023). Robustness to small misalignments in permutation, and transitivity of connectivity via a shared base model, are empirically established, making LMC a practical property for multi-model aggregation.
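A sketch of how mode combinability can be probed empirically (the function names and the toy quadratic loss are my assumptions): sample random element-wise convex combinations of two aligned parameter vectors and check that the loss stays low across the hypercube they span:

```python
import numpy as np

rng = np.random.default_rng(42)

def hypercube_sample(theta_a, theta_b, rng):
    """Element-wise convex combination: each coordinate i gets its own
    mixing weight lambda_i in [0, 1], i.e., a random point of the
    axis-aligned hypercube spanned by the two parameter vectors."""
    lam = rng.uniform(0.0, 1.0, size=theta_a.shape)
    return lam * theta_a + (1.0 - lam) * theta_b

# Toy check with a simple quadratic loss and two nearby low-loss solutions.
loss = lambda t: float(np.sum(t ** 2))
theta_a = np.array([0.10, -0.10, 0.05])
theta_b = np.array([-0.05, 0.10, 0.10])

worst = max(loss(hypercube_sample(theta_a, theta_b, rng)) for _ in range(100))
# Each coordinate stays between the two endpoints, so the loss is bounded
# by the coordinate-wise worst case:
bound = float(np.sum(np.maximum(np.abs(theta_a), np.abs(theta_b)) ** 2))
assert worst <= bound + 1e-12
```

For real networks the same loop would evaluate validation loss at sampled hypercube points after permutation alignment.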
Model Fusion and Federated Learning
LMC underpins layerwise model fusion (Akash et al., 2022) and the design of algorithms such as Wasserstein barycenter fusion, which for neural nets (including recurrent and convolutional architectures) aligns and averages weights using transport-based couplings. In federated learning with heterogeneous clients, methods that promote or enforce “group connectivity” (e.g., FedGuCci and FedGuCci+, which use fixed anchor models and connectivity regularization) yield substantial improvement in the generalization of aggregated models (Li et al., 29 Feb 2024). Similarly, training-time neuron alignment (e.g., TNA-PFN) fixes a subset of weights, breaking permutation symmetry early for efficient, scalable model fusion in wide, deep, or decentralized settings (Li et al., 2 Feb 2024).
4. Theoretical Characterizations and Predictors
The occurrence (or failure) of LMC can often be predicted by local loss surface geometry. If the loss along the interpolation path is well-approximated by its second-order Taylor expansion around the midpoint $\theta_m = \tfrac{1}{2}(\theta_A + \theta_B)$, then the barrier at the midpoint can be written as

$$B(\theta_A, \theta_B) \;\approx\; -\tfrac{1}{8}\,(\theta_B - \theta_A)^\top H(\theta_m)\,(\theta_B - \theta_A),$$

where $H(\theta_m)$ is the Hessian of the loss; a positive barrier thus corresponds to negative curvature along the connecting direction. Larger Hessian-weighted distances or curvatures correspond to higher barriers (Singh et al., 24 Jun 2024). Empirical findings confirm this approximation even for moderate distances between solutions.
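As a sanity check of this second-order picture (a 1-D toy construction of mine, not from the paper), compare the true barrier of a double well with a midpoint-Hessian estimate of the form $-\tfrac{1}{8}\,\Delta^\top H(\theta_m)\,\Delta$:

```python
import numpy as np

def loss(theta):
    # 1-D double well: minima at theta = +/-1, local maximum at theta = 0.
    return (theta ** 2 - 1.0) ** 2

theta_a, theta_b = -1.0, 1.0
alphas = np.linspace(0.0, 1.0, 201)
path = (1 - alphas) * theta_a + alphas * theta_b
true_barrier = float(np.max(loss(path)))  # endpoints have zero loss

# Finite-difference second derivative at the midpoint theta_m = 0,
# where the curvature along the path is negative.
mid, eps = 0.5 * (theta_a + theta_b), 1e-4
hess = (loss(mid + eps) - 2 * loss(mid) + loss(mid - eps)) / eps ** 2  # ~ -4

delta = theta_b - theta_a
estimate = -0.125 * delta * hess * delta  # ~ 2.0: correct sign and scale,
# but it overestimates the true barrier (1.0) because the quartic has
# large higher-order terms; the second-order form is only locally accurate.
```

The estimate captures the sign and order of magnitude of the barrier; in higher dimensions the quadratic form weights the parameter displacement by the Hessian, matching the “weighted distance” intuition.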
Moreover, in two-layer ReLU models, explicit rates for barrier decay as a function of width are established: with permutation alignment, the loss barrier vanishes on the order of $m^{-1/2}$ in the width $m$, a rate that does not suffer from the curse of dimensionality (Zhan et al., 8 Mar 2025). The presence of sparsity induced by higher learning rates further reduces barriers, as non-contributing (zeroed-out) neurons do not impact the interpolation (Zhan et al., 8 Mar 2025).
5. Generalizations and Extensions
Beyond Neuron Permutations
While neuron permutations suffice in many settings, richer architecture classes—such as Transformers—exhibit additional symmetries, including semi-permutations (block reweightings), orthogonal alignments (for normalized residual streams), and global invertible linear maps (for attention components) (Theus et al., 28 Jun 2025). End-to-end learned matching over these groups enables, for the first time, barrier-free LMC in Vision Transformers and GPT-2 (Theus et al., 28 Jun 2025).
Loss Landscape Topography
An influential “mountainside and ridge” metaphor models the topography underlying LMC: when training branches/forks early, different SGD trajectories quickly diverge to opposite sides of a ridge, yielding a wide barrier; later forks (when training has descended a common slope) produce child models on the same side of a ridge, thus only separated by low “bumps” and easily connected linearly (Singh et al., 24 Jun 2024). Visualization of forked training runs empirically corroborates this analysis, and the associated second-order Taylor expansion accurately predicts observed barriers.
LMC in Non-Neural Architectures
Soft tree ensembles (differentiable versions of decision trees and forests) also exhibit LMC, but only after accounting for architecture-specific invariances: subtree flip invariance (via sign-symmetric functions), splitting order invariance (in oblivious trees), and tree permutation (Kanoh et al., 23 May 2024). By designing architectures such as decision lists (eliminating these additional symmetries by construction), LMC can be achieved with minimal matching overhead.
6. Limitations, Contingencies, and Open Problems
LMC is not universal. For example, in NLP finetuning, models started from the same pretrained weights but trained with different seeds may land in disconnected basins, corresponding to distinct generalization strategies (e.g., syntax-aware vs. bag-of-words heuristics), as revealed by significant linear barriers (Juneja et al., 2022). The geometry of the loss surface, the optimizer (e.g., SGD vs. Adam), data complexity, architecture (e.g., weight sharing in CNNs), initial training phase, and algorithmic choices all affect the realization of LMC (Altintas et al., 2023). Additionally, approximate mode connectivity is possible even when exact linear connectivity fails, provided the curvature of connecting curves induced by innate symmetries remains small (Zhao et al., 29 May 2025).
7. Impact and Practical Relevance
LMC has direct implications for:
- Model merging and federated learning: Enabling aggregation and fusion of separately trained models with minimal degradation, especially after appropriate alignment (Akash et al., 2022, Li et al., 2 Feb 2024, Li et al., 29 Feb 2024).
- Continual and multitask learning: Providing geometric guidance to mitigate catastrophic forgetting and develop new algorithms (e.g., MC-SGD) (Mirzadeh et al., 2020).
- Pruning and lottery tickets: Explaining why sparse subnetworks found via IMP match full-model performance only in regimes where LMC (i.e., noise stability) is achieved (Frankle et al., 2019).
- Theoretical characterization: Deepening understanding of loss landscape geometry, symmetry-induced connectivity, and the impact of overparameterization.
LMC thus shapes the development and analysis of modern deep learning methods, bringing together algebraic, geometric, algorithmic, and practical perspectives on model space connectivity.