Linear Mode Connectivity in Deep Learning
- Linear Mode Connectivity (LMC) is the observation that independently trained deep learning models can be connected by a linear path in parameter space along which the loss remains nearly unchanged.
- It demonstrates that symmetry-aware alignment of neural network weights removes barriers in non-convex loss landscapes, challenging traditional training intuitions.
- LMC has practical applications in model fusion, federated and continual learning, and pruning, offering actionable insights for both theoretical and applied deep learning research.
Linear Mode Connectivity (LMC) is the empirical and theoretical observation that independently optimized solutions of overparameterized machine learning models—most notably deep neural networks—can, after accounting for inherent symmetries, be connected via a straight-line path in parameter space along which the loss remains nearly constant. LMC challenges classical intuitions about highly non-convex loss landscapes, suggesting unexpectedly rich structural properties exploitable for model fusion, continual learning, pruning, and theoretical understanding of deep learning success.
1. Definition and Mathematical Criteria
Let $\theta_1$ and $\theta_2$ denote two independently trained parameter vectors (from the same initialization but different SGD noise, data shuffling, or augmentation). Define the linear interpolation as

$$\theta_\alpha = (1-\alpha)\,\theta_1 + \alpha\,\theta_2, \qquad \alpha \in [0,1].$$

A pair $(\theta_1, \theta_2)$ exhibits Linear Mode Connectivity if

$$\mathcal{L}(\theta_\alpha) \approx \mathcal{L}(\theta_1) \approx \mathcal{L}(\theta_2) \quad \text{for all } \alpha \in [0,1],$$

meaning the loss remains virtually constant along the path. In applied settings, the instability (or barrier) is quantified as

$$B(\theta_1, \theta_2) = \max_{\alpha\in[0,1]} \mathcal{L}(\theta_\alpha) - \tfrac{1}{2}\big[\mathcal{L}(\theta_1) + \mathcal{L}(\theta_2)\big],$$

and connectivity is declared when the instability falls below a small threshold (e.g., 2%) (1912.05671). Generalizations exist: for $d$-dimensional parameter vectors, element-wise convex combinations are also considered (2308.11511). In practice, the presence of symmetries—particularly permutation invariance of hidden units—necessitates aligning models prior to interpolation.
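The interpolation sweep and instability measure are simple to compute. Below is a minimal NumPy sketch on a toy double-well loss; the α-grid resolution, the toy loss, and the helper name `instability` are illustrative choices rather than anything taken from the cited papers.

```python
import numpy as np

def instability(loss_fn, theta_a, theta_b, num_alphas=25):
    """Barrier along the straight line between two parameter vectors:
    max over alpha of L(theta_alpha) minus the mean of the endpoint losses."""
    alphas = np.linspace(0.0, 1.0, num_alphas)
    path = [loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    return max(path) - 0.5 * (loss_fn(theta_a) + loss_fn(theta_b))

# Toy double-well loss with minima wherever every coordinate equals +-1.
loss = lambda theta: float(np.sum((theta**2 - 1.0) ** 2))
within_basin = instability(loss, np.array([1.0, 1.0]), np.array([1.0, 1.0]))
across_basins = instability(loss, np.array([1.0, 1.0]), np.array([-1.0, 1.0]))
print(within_basin, across_basins)   # ~0 within a basin, ~1 across the barrier
print(across_basins < 0.02)          # illustrative threshold check (papers use ~2% error rise)
```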
2. Geometric and Symmetry-Based Explanations
Permutation Invariance and Generalized Symmetries
Neural networks possess discrete symmetries: permuting the order of hidden units (in layers with elementwise activations) does not affect functionality. When unaccounted for, naive interpolation between two solutions results in high loss barriers, but after optimal permutation alignment the barriers can vanish—even for wide, deep networks (2110.06296, 2503.06001). In modern architectures such as Transformers, symmetry classes include not only permutations but also semi-permutations, orthogonal transformations (for normalization and residual streams), and general invertible maps (2506.22712). These classes are formally nested, with permutations as the most restrictive class, semi-permutations and orthogonal transformations as intermediate generalizations, and general invertible maps as the most permissive.
Formally, for the weight matrices $W_1, W_2$ of a feedforward block $x \mapsto W_2\,\sigma(W_1 x)$, permutation symmetry is captured by the reparameterization

$$(W_1,\, W_2) \;\mapsto\; (P\,W_1,\; W_2\,P^{\top})$$

for a permutation matrix $P$; for semi-permutations, $P$ is replaced by a sparse stochastic matrix. These reparameterizations preserve the network's function, enabling LMC after symmetry-aware alignment (2506.22712).
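This function preservation is easy to verify numerically. The following is a minimal NumPy check for a single ReLU block; the dimensions, random weights, and the inclusion of a bias term are illustrative choices, not tied to any of the cited papers' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))

def block(W1, b1, W2, x):
    """One ReLU feedforward block: x -> W2 @ relu(W1 @ x + b1)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

# Permute the hidden units consistently in both layers (and the bias):
# (W1, b1, W2) -> (P W1, P b1, W2 P^T) leaves the computed function unchanged.
P = np.eye(d_hidden)[rng.permutation(d_hidden)]
x = rng.normal(size=d_in)
assert np.allclose(block(W1, b1, W2, x), block(P @ W1, P @ b1, W2 @ P.T, x))
```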
Topological and Curvature-Based Perspectives
Parameter-space symmetry groups (e.g., in $L$-layer linear nets) induce orbits of loss minima. The number of connected components of the minimum set is determined by the topology of the symmetry group—e.g., $2^{k}$ components for a product of $k$ general linear groups, each of which has two connected components (positive and negative determinant) (2505.23681). Skip connections reduce this number, making mode connectivity more widespread.
While symmetry implies the existence of connecting curves of constant loss, linear interpolation between arbitrary minima may nonetheless encounter a barrier—especially when rescaling or stretching symmetries are present. A sufficient condition for approximate LMC is that the minimum set $\mathcal{M}$ is sufficiently flat (i.e., has low curvature), so that the straight segment between two minima stays close to the minimum set:

$$\max_{\alpha\in[0,1]} \operatorname{dist}\big(\theta_\alpha,\ \mathcal{M}\big) \;\lesssim\; \kappa\,\lVert\theta_1 - \theta_2\rVert^{2},$$

with $\kappa$ as the maximum curvature of the connecting curve inside $\mathcal{M}$ (2505.23681).
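To see where a bound of this shape comes from, one can use a simple geometric heuristic (not the formal theorem of 2505.23681): model the constant-loss curve joining $\theta_1$ and $\theta_2$ inside $\mathcal{M}$ as a short circular arc of radius $1/\kappa$, the extremal shape under a curvature bound of $\kappa$. The straight segment then deviates from the arc by at most the sagitta:

```latex
\[
  \max_{\alpha\in[0,1]} \operatorname{dist}(\theta_\alpha,\mathcal{M})
  \;\le\; \frac{1}{\kappa}-\sqrt{\frac{1}{\kappa^{2}}-\frac{d^{2}}{4}}
  \;\approx\; \frac{\kappa\, d^{2}}{8},
  \qquad d=\lVert\theta_1-\theta_2\rVert \ll \frac{1}{\kappa},
\]
% so a loss that is locally Lipschitz around the minimum set rises by at most
% O(kappa * d^2) anywhere on the straight segment between the two minima.
```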
3. Empirical Manifestations and Algorithmic Consequences
Stability Under Noise and Early Training
Empirical evidence across vision models and LLMs shows that once a model enters certain regions of parameter space—typically early in training—different runs with varying SGD noise or data order converge to minima connected by linear paths (1912.05671). For small-scale models (e.g., LeNet on MNIST), this stability is present from initialization; for large models (e.g., ResNet-50, Inception-v3 on ImageNet), it emerges after 10–20% of training (1912.05671). This “stable regime” is essential for the lottery ticket hypothesis: subnetworks selected by iterative magnitude pruning (IMP) achieve full accuracy only when they are identified after training has entered a linearly connected region (1912.05671).
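A minimal NumPy sketch of this fork-and-compare protocol, in the spirit of the instability analysis of 1912.05671: train a shared prefix up to a fork step k, train two children with different data order, then measure the midpoint barrier between the children. The toy least-squares problem, hyperparameters, and seeds are illustrative only; being convex, it shows essentially no barrier at any fork point, whereas the interesting k-dependence appears in deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=512)
loss = lambda w: float(np.mean((X @ w - y) ** 2))

def sgd(w0, seed, steps, lr=0.01, batch=32):
    """Plain minibatch SGD; the seed controls the data order (the 'SGD noise')."""
    r, w = np.random.default_rng(seed), w0.copy()
    for _ in range(steps):
        idx = r.choice(len(X), size=batch, replace=False)
        w -= lr * (2.0 / batch) * X[idx].T @ (X[idx] @ w - y[idx])
    return w

for k in (0, 50, 200):                                    # fork step
    parent = sgd(np.zeros(10), seed=0, steps=k)           # shared prefix of training
    child_a = sgd(parent, seed=1, steps=500)              # different data order
    child_b = sgd(parent, seed=2, steps=500)              #   after the fork
    barrier = loss(0.5 * (child_a + child_b)) - 0.5 * (loss(child_a) + loss(child_b))
    print(f"fork at step {k}: midpoint barrier = {barrier:.4f}")
```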
Layerwise and Featurewise Connectivity
Beyond global parameter interpolation, barrier-free connectivity often emerges at the level of individual layers (“layer-wise linear mode connectivity”) (2307.06966), and—where LMC holds—at the level of internal network features (“layerwise linear feature connectivity,” LLFC) (2307.08286). LLFC typically requires two conditions: (1) weak additivity of activations (e.g., ReLU), and (2) (approximate) commutativity of cross-model transformations at each layer. If these are met, not only outputs but all intermediate hidden states behave linearly under interpolation, explaining the surprising ease of “model stitching.”
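A minimal sketch of how one might probe LLFC at a single hidden layer: compare the features of the α-interpolated weights with the α-interpolation of each model's features. The cosine-based gap, the single-layer setting, and the function name `llfc_gap` are illustrative simplifications rather than the exact metric of 2307.08286, and the two layers would need to come from trained, permutation-aligned models for the gap to actually be small.

```python
import numpy as np

def llfc_gap(W_a, W_b, X, alpha=0.5):
    """1 - cosine similarity between features of the interpolated layer and the
    interpolation of each layer's features, averaged over a batch X (rows)."""
    relu = lambda Z: np.maximum(Z, 0.0)
    feats_of_interp = relu(X @ ((1 - alpha) * W_a + alpha * W_b).T)
    interp_of_feats = (1 - alpha) * relu(X @ W_a.T) + alpha * relu(X @ W_b.T)
    num = np.sum(feats_of_interp * interp_of_feats, axis=1)
    den = (np.linalg.norm(feats_of_interp, axis=1)
           * np.linalg.norm(interp_of_feats, axis=1) + 1e-12)
    return float(np.mean(1.0 - num / den))

# Example call with random (i.e., untrained, unaligned) layers of width 64:
rng = np.random.default_rng(0)
W_a, W_b = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
X = rng.normal(size=(8, 16))
print(llfc_gap(W_a, W_b, X))   # not small here; near zero when LLFC holds
```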
Mode Combinability and Transitivity
After permutation alignment, LMC generalizes to “mode combinability”: not only does the line segment connecting two solutions yield low loss, but entire hypercubes (element-wise convex combinations of parameters) yield low-loss models (2308.11511). Robustness to small misalignments in the permutations and transitivity of connectivity via a shared base model are empirically established, making LMC a practical property for multi-model aggregation.
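A minimal sketch of sampling one model from the element-wise convex hull of k permutation-aligned parameter vectors; the per-coordinate Dirichlet weighting and the helper name `sample_combined_model` are one illustrative way to sample such combinations, not the procedure of 2308.11511.

```python
import numpy as np

def sample_combined_model(aligned, rng):
    """aligned: array of shape (k, n_params) of permutation-aligned flat parameter
    vectors. Returns one element-wise convex combination: each coordinate gets
    its own convex weights over the k models."""
    k, n_params = aligned.shape
    weights = rng.dirichlet(np.ones(k), size=n_params)   # (n_params, k), rows sum to 1
    return np.einsum("pk,kp->p", weights, aligned)

rng = np.random.default_rng(0)
aligned = rng.normal(size=(3, 1000))          # e.g., three aligned models, 1000 params
combined = sample_combined_model(aligned, rng)
# Each coordinate of the sample lies between the per-coordinate min and max.
assert np.all(combined <= aligned.max(axis=0)) and np.all(combined >= aligned.min(axis=0))
```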
Model Fusion and Federated Learning
LMC underpins layerwise model fusion (2210.06671) and the design of algorithms such as Wasserstein barycenter fusion, which for neural nets (including recurrent and convolutional architectures) aligns and averages weights using transport-based couplings. In federated learning with heterogeneous clients, methods that promote or enforce “group connectivity” (e.g., FedGuCci and FedGuCci+, which use fixed anchor models and connectivity regularization) yield substantial improvement in the generalization of aggregated models (2402.18949). Similarly, training-time neuron alignment (e.g., TNA-PFN) fixes a subset of weights, breaking permutation symmetry early for efficient, scalable model fusion in wide, deep, or decentralized settings (2402.01342).
4. Theoretical Characterizations and Predictors
The occurrence (or failure) of LMC can often be predicted by local loss-surface geometry. If the loss along the interpolation path can be well-approximated by a quadratic form then, writing $\Delta = \theta_2 - \theta_1$, the barrier can be written as

$$B(\theta_1, \theta_2) \;\approx\; \max_{\alpha\in[0,1]}\, -\frac{\alpha(1-\alpha)}{2}\,\Delta^{\top} H\,\Delta \;=\; -\frac{1}{8}\,\Delta^{\top} H\,\Delta \qquad (\Delta^{\top} H\,\Delta < 0),$$

where $H$ is the Hessian of the loss along the path (the barrier is zero when the curvature along $\Delta$ is non-negative). Larger Hessian-weighted distances between solutions and stronger negative curvature along the chord correspond to higher barriers (2406.16300). Empirical findings confirm this approximation even for moderate distances between solutions.
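A minimal NumPy sketch comparing this second-order prediction with the actual midpoint barrier on a toy double-well loss; the loss, gradient, finite-difference Hessian-vector product, and the helper name `quadratic_barrier` are illustrative, not from 2406.16300.

```python
import numpy as np

def quadratic_barrier(grad_fn, theta_a, theta_b, eps=1e-4):
    """Second-order prediction -(1/8) d^T H d of the midpoint barrier, with the
    Hessian-vector product H d at the midpoint taken by finite differences."""
    d = theta_b - theta_a
    mid = 0.5 * (theta_a + theta_b)
    hvp = (grad_fn(mid + eps * d) - grad_fn(mid - eps * d)) / (2.0 * eps)
    return -0.125 * float(d @ hvp)   # positive iff curvature along d is negative

# Double-well loss with minima at (+-1, 0); the segment between them crosses a ridge.
loss = lambda t: float((t[0] ** 2 - 1.0) ** 2 + t[1] ** 2)
grad = lambda t: np.array([4.0 * t[0] * (t[0] ** 2 - 1.0), 2.0 * t[1]])
a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

predicted = quadratic_barrier(grad, a, b)
actual = loss(0.5 * (a + b)) - 0.5 * (loss(a) + loss(b))
print(predicted, actual)   # the quadratic model overestimates here (quartic loss)
```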
Moreover, in two-layer ReLU models, explicit rates for barrier decay as a function of width have been established: with permutation alignment, the loss barrier vanishes at an explicit polynomial rate in the network width, and this rate does not suffer from the curse of dimensionality (2503.06001). The sparsity induced by higher learning rates further reduces barriers, as non-contributing (zeroed-out) neurons do not impact the interpolation (2503.06001).
5. Generalizations and Extensions
Beyond Neuron Permutations
While neuron permutations suffice in many settings, richer architecture classes—such as Transformers—exhibit additional symmetries, including semi-permutations (block reweightings), orthogonal alignments (for normalized residual streams), and global invertible linear maps (for attention components) (2506.22712). End-to-end learned matching over these groups enables, for the first time, barrier-free LMC in Vision Transformers and GPT-2 (2506.22712).
Loss Landscape Topography
An influential “mountainside and ridge” metaphor models the topography underlying LMC: when training forks early, different SGD trajectories quickly diverge to opposite sides of a ridge, yielding a wide barrier; later forks (after training has descended a common slope) produce child models on the same side of the ridge, separated only by low “bumps” and therefore easily connected linearly (2406.16300). Visualizations of forked training runs empirically corroborate this picture, and the associated second-order Taylor expansion accurately predicts the observed barriers.
LMC in Non-Neural Architectures
Soft tree ensembles (differentiable versions of decision trees and forests) also exhibit LMC, but only after accounting for architecture-specific invariances: subtree flip invariance (via sign-symmetric functions), splitting order invariance (in oblivious trees), and tree permutation (2405.14596). By designing architectures such as decision lists (eliminating these additional symmetries by construction), LMC can be achieved with minimal matching overhead.
6. Limitations, Contingencies, and Open Problems
LMC is not universal. For example, in NLP fine-tuning, models started from the same pretrained weights but trained with different seeds may land in disconnected basins corresponding to distinct generalization strategies (e.g., syntax-aware vs. bag-of-words heuristics), as revealed by significant linear barriers (2205.12411). The geometry of the loss surface, the optimizer (e.g., SGD vs. Adam), data complexity, architecture (e.g., weight sharing in CNNs), the initial training phase, and other algorithmic choices all affect whether LMC is realized (2312.09832). Additionally, approximate mode connectivity is possible even when exact linear connectivity fails, provided the curvature of the connecting curves induced by innate symmetries remains small (2505.23681).
7. Impact and Practical Relevance
LMC has direct implications for:
- Model merging and federated learning: Enabling aggregation and fusion of separately trained models with minimal degradation, especially after appropriate alignment (2210.06671, 2402.01342, 2402.18949).
- Continual and multitask learning: Providing geometric guidance to mitigate catastrophic forgetting and develop new algorithms (e.g., MC-SGD) (2010.04495).
- Pruning and lottery tickets: Explaining why sparse subnetworks found via IMP match full-model performance only in regimes where LMC (i.e., noise stability) is achieved (1912.05671).
- Theoretical characterization: Deepening understanding of loss landscape geometry, symmetry-induced connectivity, and the impact of overparameterization.
LMC thus shapes the development and analysis of modern deep learning methods, bringing together algebraic, geometric, algorithmic, and practical perspectives on model space connectivity.