Feature Learning vs Lazy Regimes
- Feature learning and lazy regimes are distinct neural-network training phases: the lazy regime retains near-initial features, while the feature-learning regime adapts them to the data.
- Lazy regimes yield fixed-kernel behavior with stable, yet limited, generalization, whereas adaptive regimes enhance representation geometry and task-specific performance.
- Transitions between these regimes depend on scaling, dataset size, and initialization, influencing generalization, continual learning, and overall model robustness.
Feature learning and lazy regimes constitute two sharply distinct, theoretically tractable phases of learning dynamics in neural networks. These regimes emerge from different scalings of network width, initialization, and architecture, and manifest in distinct geometric, statistical, and generalization behaviors. Contemporary research has provided precise mathematical and statistical-mechanical characterizations, revealed collective and microscopic signatures, established transitions between the regimes, and connected these with core phenomena such as generalization, representation geometry, continual learning, and transfer.
1. Formal Definitions and Regime Boundaries
The “lazy” (or Gaussian-process/NTK) regime is characterized by network parameters that remain infinitesimally close to their initialization throughout training. In this limit (often realized as width $N \to \infty$ at fixed dataset size $P$, so the load $\alpha = P/N \to 0$), the feature activations experience only vanishingly small, width-suppressed fluctuations around their random initialization. The posterior predictive distribution converges to that of a Gaussian process determined by the neural tangent kernel (NTK), and parameter updates produce negligible movement: no nontrivial feature learning occurs (Corti et al., 28 Aug 2025, Dominé et al., 22 Sep 2024, Rubin et al., 5 Feb 2025, Karkada, 30 Apr 2024).
The “feature-learning” (proportional-width or mean-field) regime instead requires both $N \to \infty$ and $P \to \infty$ at fixed load $\alpha = P/N$. Now, finite, data-dependent weight updates induce corrections to both the hidden-layer feature correlations and the weight covariances. The posterior predictive remains a GP regression, but with a renormalized kernel $K_R$, determined by a nontrivial saddle-point equation that enforces the data dependence. The key boundary between regimes is therefore set by the load parameter $\alpha = P/N$; increasing $\alpha$ drives the system away from the lazy phase into rich feature adaptation (Corti et al., 28 Aug 2025, Dominé et al., 22 Sep 2024, Nam et al., 28 Feb 2025, Rubin et al., 5 Feb 2025).
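The effect of scaling on laziness can be made concrete with a small experiment. The sketch below is an illustration only (it is not the setup of any one cited paper): it trains a centered two-layer ReLU network whose output is multiplied by a scale factor `scale`, with the step size reduced as `1/scale**2` so that the function-space dynamics stay comparable across runs. As the scale grows, the relative movement of the first-layer weights shrinks toward zero (lazy behavior), while at order-one scale the weights move appreciably (feature learning). Data, width, and hyperparameters are arbitrary.

```python
import torch

torch.manual_seed(0)

# Toy regression data: P points in d dimensions, scalar targets.
P, d, N = 64, 16, 512
X = torch.randn(P, d)
y = torch.randn(P, 1)

def train(scale, steps=3000, eta=0.5):
    """Train the centered model f(x) = scale * (g_theta(x) - g_theta0(x)) by gradient
    descent with step size eta / scale**2.  Large `scale` pushes training lazy."""
    W = torch.randn(N, d, requires_grad=True)   # first-layer weights
    a = torch.randn(1, N, requires_grad=True)   # readout weights
    W0 = W.detach().clone()

    def g(W_, a_):
        return torch.relu(X @ W_.T / d**0.5) @ a_.T / N**0.5

    f0 = g(W0, a.detach()).detach()             # output at initialization, kept fixed
    for _ in range(steps):
        loss = ((scale * (g(W, a) - f0) - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for p in (W, a):
                p -= (eta / scale**2) * p.grad
                p.grad = None
    final_loss = ((scale * (g(W, a) - f0) - y) ** 2).mean().item()
    rel_move = ((W.detach() - W0).norm() / W0.norm()).item()
    return final_loss, rel_move

for scale in (1.0, 10.0, 100.0):
    loss, move = train(scale)
    print(f"scale={scale:6.1f}  train loss={loss:.4f}  relative change of W={move:.5f}")
```

Because the step size is rescaled with the output scale, all three runs reach comparable training loss; the shrinking weight movement therefore isolates the lazy effect rather than reflecting a failure to train.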
2. Mathematical Characterization: Macroscopic and Microscopic Signatures
Collective (Geometric) Signatures
The geometry of internal representations distinguishes the regimes. In a binary classification setup, the squared distance between class centroids in feature space, $d^2 = \lVert \mu_{+} - \mu_{-} \rVert^2$ (with $\mu_{\pm}$ the class-conditional means of the hidden-layer representation), has a posterior average that separates into a “prior” term (present in the lazy regime) and “feature-learning” corrections surviving at order one only in the proportional limit ($\alpha > 0$) (Corti et al., 28 Aug 2025). Notably, in the feature-learning regime $d^2$ displays a nonmonotonic dependence on the regularization/temperature $T$, peaking at an inversion temperature $T^{*}$, whereas in the lazy regime it is monotonic in the regularization and shows no such inversion. Geometric effects such as the sharpening and untangling of class manifolds thus provide observable macroscopic fingerprints of feature learning (Corti et al., 28 Aug 2025, Chou et al., 23 Mar 2025).
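As a concrete illustration of this observable, the snippet below computes the squared centroid distance for a batch of hidden-layer features. The “rich” features here are a crude stand-in (weights nudged toward the class-mean direction), not the posterior-averaged quantity analyzed in the cited work, so the numbers only indicate the direction of the effect.

```python
import numpy as np

def centroid_sq_distance(H, labels):
    """Squared distance between class centroids in feature space.
    H: (P, N) hidden activations; labels: (P,) array with entries in {+1, -1}."""
    mu_pos = H[labels == +1].mean(axis=0)
    mu_neg = H[labels == -1].mean(axis=0)
    return float(np.sum((mu_pos - mu_neg) ** 2))

# Toy binary data with a mean shift of +/- 0.5 along the direction u = ones/sqrt(d).
rng = np.random.default_rng(0)
P, d, N = 200, 20, 100
labels = rng.choice([-1, 1], size=P)
u = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(P, d)) + 0.5 * labels[:, None] * u

W_init = rng.normal(size=(N, d)) / np.sqrt(d)
H_lazy = np.maximum(X @ W_init.T, 0.0)        # random (lazy-like) ReLU features
W_adapted = W_init + 0.5 * u                  # every neuron partially aligned to the signal
H_rich = np.maximum(X @ W_adapted.T, 0.0)     # crude stand-in for learned features

print("lazy centroid distance^2:", centroid_sq_distance(H_lazy, labels))
print("rich centroid distance^2:", centroid_sq_distance(H_rich, labels))
```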
Microscopic (Weight) Signatures
The feature-learning regime induces pattern-dependent two-point correlations among weight parameters that are absent in lazy (NTK) networks. In the lazy limit ($\alpha \to 0$), first-layer weights remain independent across neurons, with covariances fixed at their prior values, $\langle w_{ia} w_{jb} \rangle \propto \delta_{ij}\delta_{ab}$; all cross-neuron correlations vanish. For $\alpha > 0$, nontrivial data-dependent corrections yield new, genuine weight–weight correlations. This microscopic displacement is a hallmark of feature learning, not present in kernel methods (Corti et al., 28 Aug 2025).
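A direct, if brute-force, way to probe this microscopic signature is to estimate cross-neuron weight correlations over an ensemble of networks (posterior samples or independently trained copies). The helper below is a generic diagnostic sketch, not the estimator of the cited paper: on an ensemble of i.i.d. initializations it returns only sampling noise (the lazy baseline), and the theory predicts genuinely nonzero values only in the proportional regime.

```python
import numpy as np

def cross_neuron_weight_correlation(W_samples):
    """Mean absolute correlation between distinct first-layer neurons, estimated over
    an ensemble.  W_samples: (S, N, d) array of first-layer weight matrices from
    S networks (posterior samples or independent training runs)."""
    S, N, d = W_samples.shape
    Wc = W_samples - W_samples.mean(axis=0, keepdims=True)   # center over the ensemble
    std = W_samples.std(axis=0) + 1e-12                      # per-parameter std, shape (N, d)
    total, pairs = 0.0, 0
    for i in range(N):
        for j in range(i + 1, N):
            # correlation of (w_{i,k}, w_{j,k}) across the ensemble, averaged over inputs k
            corr = (Wc[:, i, :] * Wc[:, j, :]).mean(axis=0) / (std[i] * std[j])
            total += np.abs(corr).mean()
            pairs += 1
    return total / pairs

# Lazy baseline: i.i.d. Gaussian initializations give only O(1/sqrt(S)) sampling noise.
rng = np.random.default_rng(0)
print(cross_neuron_weight_correlation(rng.normal(size=(200, 30, 10))))
```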
3. Regime Transitions, Scaling Laws, and Phase Diagrams
Regime transitions are controlled by global and relative scalings of network width, dataset size, and initialization, with critical points and continuous crossovers.
- In fully-connected and deep linear nets, a balancing parameter $\lambda$ (the difference in squared layer norms, conserved under gradient flow) interpolates between strictly lazy (large $|\lambda|$) and rich (balanced, $\lambda \approx 0$) regimes; exact closed-form Riccati solutions exhibit “sigmoidal” versus exponential singular-value dynamics in the two limits (see the scalar sketch after this list) (Dominé et al., 22 Sep 2024, Nam et al., 28 Feb 2025).
- In practice, the boundary between regimes is sharply controlled by the load $\alpha = P/N$ (or, in other conventions, by a “richness” parameter $\gamma$ that sets the scale of hidden-feature changes relative to the width $N$). The $\alpha \to 0$ (or $\gamma \to 0$) regime is lazy/NTK, while $\alpha = \Theta(1)$ (or $\gamma = \Theta(1)$) is rich/μP (Karkada, 30 Apr 2024).
- Empirical phase diagrams constructed in the (width, output/initialization scale) plane exhibit a “jamming” critical line between under- and over-parameterization, and, within the overparameterized phase, a diagonal boundary at an output scale that decreases as an inverse power of the width, separating lazy from feature-learning behavior (Geiger et al., 2020, Geiger et al., 2019).
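The sigmoidal-versus-exponential contrast from the first bullet can be reproduced in the smallest possible model: a scalar “deep linear network” f(x) = a·w·x fit to a target product beta = 1 by gradient descent on the population loss (whitened inputs assumed). The balance lambda = a² − w² is conserved under gradient flow (and approximately under small-step gradient descent); a balanced small initialization (lambda ≈ 0) gives logistic/sigmoidal growth of the product a·w, while a strongly unbalanced initialization (large |lambda|) gives essentially exponential, lazy-like relaxation. This is a toy sketch, not the full matrix Riccati analysis of the cited papers.

```python
import numpy as np

def run(a0, w0, beta=1.0, lr=1e-3, steps=8000, log_every=1000):
    """Gradient descent on L = 0.5*(a*w - beta)**2 for the scalar linear net f(x) = a*w*x."""
    a, w = a0, w0
    print(f"init: a={a0:g}, w={w0:g}, balance lambda = a^2 - w^2 = {a0**2 - w0**2:g}")
    for t in range(steps + 1):
        if t % log_every == 0:
            print(f"  step {t:5d}: product a*w = {a*w:8.5f}   lambda = {a**2 - w**2:8.5f}")
        r = a * w - beta                          # residual
        a, w = a - lr * w * r, w - lr * a * r     # simultaneous gradient updates
    return a, w

print("balanced, small init (rich: sigmoidal growth of a*w)")
run(a0=0.05, w0=0.05)

print("unbalanced init, large |lambda| (lazy-like: exponential relaxation of a*w)")
run(a0=5.0, w0=0.0005)
```

The printed trajectories show the characteristic plateau-then-rise of the balanced (rich) run against the immediate exponential convergence of the unbalanced (lazy-like) run, with lambda staying essentially constant in both.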
4. Connections to Generalization, Capacity, and Representation Geometry
The lazy regime strictly limits the generalization power: the induced predictor is equivalent to fixed-kernel (NTK or random-feature) regression. Notably, fully trained networks can generalize even in underparameterized settings (e.g., with $N \le d$ hidden neurons in $d$ input dimensions) by adapting to the top singular directions or principal components of the signal, a success impossible for purely lazy or random-feature models (Ghorbani et al., 2019). Generalization-error scaling and capacity are sensitive to the richness of the learned features, and tasks with strong anisotropy or low-rank structure can benefit dramatically from entering the rich phase (Chou et al., 23 Mar 2025, Corti et al., 28 Aug 2025).
Representational geometry provides quantitative measures that can distinguish regimes even when weight or NTK norms vary little. Capacity metrics (mean-field and simulated), effective manifold dimension, and manifold alignment parameters can reveal the emergence, subtype, and dynamics of feature adaptation. Rich regimes drive flattening and tightening of manifolds, multi-stage learning of structural features, and outlier eigenmodes in kernel spectra; lazy regimes preserve initialization geometry (Chou et al., 23 Mar 2025, Naveh et al., 2021).
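One widely used geometric summary mentioned above, the effective dimension of a representation, can be computed as the participation ratio of the feature covariance spectrum. The sketch below is a generic implementation (not the specific capacity estimators of the cited papers): rich training typically lowers this dimension as class manifolds flatten and compress, while lazy features keep it near its value at initialization.

```python
import numpy as np

def participation_ratio(H):
    """Effective dimension of a representation H (shape: samples x features),
    PR = (sum_i lambda_i)^2 / sum_i lambda_i^2 over the eigenvalues of the
    feature covariance matrix."""
    Hc = H - H.mean(axis=0, keepdims=True)
    cov = Hc.T @ Hc / (H.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)   # guard tiny negative eigenvalues
    return float(eig.sum() ** 2 / (eig ** 2).sum())

# Isotropic random features have a high effective dimension; features compressed onto
# a few directions (as rich training tends to produce) have a low one.
rng = np.random.default_rng(0)
H_iso = rng.normal(size=(500, 100))
H_low = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 100)) + 0.1 * H_iso
print("isotropic PR:", round(participation_ratio(H_iso), 1))   # high: close to the width
print("low-rank  PR:", round(participation_ratio(H_low), 1))   # low: close to the planted rank 3
```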
5. Regime Impact: Dynamics, Continual Learning, and Grokking
The regime choice deeply affects learning dynamics, robustness, and continual learning properties.
- Lazy networks train as pure kernel machines: convergence is exponential, representation is fixed, and generalization is fully described by the initial kernel (Nam et al., 28 Feb 2025, Karkada, 30 Apr 2024).
- In rich regimes, training is non-linear (“sigmoidal” in feature saliency), features evolve in staged fashion, and generalization adapts to task structure (Dominé et al., 22 Sep 2024).
- Feature learning can drive instability, especially in non-stationary settings such as continual learning, where it amplifies catastrophic forgetting. Scaling laws show that optimal performance arises at a critical “sweet spot” in feature learning: excessive laziness leads to high bias, while excessive feature learning amplifies instability and forgetting. The stability–plasticity trade-off is thus tunable, with a critical richness level that transfers across widths (Graldi et al., 20 Jun 2025).
- “Grokking”—a delayed generalization phenomenon—arises from a network first behaving lazily, fitting the train set with kernel regression, then transitioning to genuine feature adaptation and out-of-sample generalization. Grokking is controlled by initial laziness, kernel–task misalignment, and intermediate dataset size (Kumar et al., 2023).
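One simple way to monitor where a run sits between the regimes, and to observe the delayed feature adaptation behind grokking, is to log at training checkpoints the centered alignment between the hidden-layer Gram matrix and the label similarity matrix: it stays near its initialization value while the network behaves lazily and rises when genuine feature learning sets in. The helper below is a generic diagnostic sketch, not the specific measure used in the cited grokking analysis; `hidden_features` in the usage comment is a placeholder for however activations are extracted.

```python
import numpy as np

def centered_kernel_target_alignment(H, y):
    """Alignment in [-1, 1] between the Gram matrix of features H (samples x features)
    and the label similarity matrix y y^T, after double-centering both matrices."""
    P = H.shape[0]
    C = np.eye(P) - np.ones((P, P)) / P        # centering matrix
    K = C @ (H @ H.T) @ C
    Y = C @ np.outer(y, y) @ C
    return float(np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y)))

# Intended use: evaluate on a fixed probe batch at successive checkpoints, e.g.
#   align_t = centered_kernel_target_alignment(hidden_features(model_t, X_probe), y_probe)
# A flat curve indicates lazy dynamics; a late rise marks the feature-learning phase
# associated with delayed generalization.
```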
6. Theoretical Approaches and Predictive Frameworks
Advanced statistical-mechanical methods and exact mean-field theories have provided a unified framework spanning both regimes.
- Saddle-point and replica calculations yield closed-form descriptors of both macroscopic order parameters and microscopic correlations, tying together kernel adaptation, feature displacement, and global geometric change (Corti et al., 28 Aug 2025, Rubin et al., 5 Feb 2025, Göring et al., 16 Oct 2025).
- Self-consistent theories in finite-width networks (e.g., CNNs) display sharp, sometimes discontinuous, transitions between feature and lazy regimes as a function of architecture or width; kernel spectra exhibit BBP-like outlier emergence at these transitions (Naveh et al., 2021).
- Hierarchical and continual learning analyses reveal optimality at intermediate richness, with scaling laws and phase boundaries precisely demarcated (Graldi et al., 20 Jun 2025).
- The correspondence between the posterior in the feature-learning regime and kernel regression with a renormalized, data-dependent kernel parameter is established, bridging mean-field and kernel perspectives (Rubin et al., 5 Feb 2025).
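To make the kernel side of this correspondence concrete, the snippet below performs ordinary GP / kernel ridge regression twice: once with a fixed base kernel (the lazy predictor) and once with a rescaled kernel standing in for the renormalized, data-dependent kernel of the adaptive theory. The scalar rescaling `kappa` is a hand-picked placeholder; in the cited work the renormalization is determined self-consistently from the data rather than chosen by hand.

```python
import numpy as np

def kernel_ridge_predict(K_train, K_test_train, y, noise=1e-2):
    """GP posterior mean / kernel ridge prediction: k_* (K + noise*I)^{-1} y."""
    P = K_train.shape[0]
    coef = np.linalg.solve(K_train + noise * np.eye(P), y)
    return K_test_train @ coef

# Toy data and a base (lazy) RBF kernel.
rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(40, 5)), rng.normal(size=(10, 5))
ytr = np.sin(Xtr[:, 0])

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

K, Ks = rbf(Xtr, Xtr), rbf(Xte, Xtr)

pred_lazy = kernel_ridge_predict(K, Ks, ytr)            # fixed-kernel (lazy) predictor
kappa = 1.8                                             # placeholder renormalization factor
pred_rich = kernel_ridge_predict(kappa * K, kappa * Ks, ytr)   # same formula, renormalized kernel

print(np.round(pred_lazy[:5], 3))
print(np.round(pred_rich[:5], 3))
```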
7. Practical Implications and Open Directions
- Overparameterization alone ($N \to \infty$ at fixed $P$, i.e., $\alpha \to 0$) is not sufficient for rich feature learning; proportional scaling (finite $\alpha = P/N$) and careful tuning of the initialization or learning-rate structure are necessary (Corti et al., 28 Aug 2025, Kunin et al., 10 Jun 2024).
- Feature learning’s benefits are architecture- and data-dependent: convolutional nets with hierarchical inductive bias gain more from rich regimes than do fully-connected networks on globally smooth tasks, where lazy/NTK training can outperform (Geiger et al., 2020, Petrini et al., 2022).
- In multi-task and transfer learning, intermediate regimes (“nested feature-selection”) not fully covered by the classical lazy/rich dichotomy arise, characterized by a balance between feature reuse and sparsity (Lippl et al., 2023).
- Rigorous theory continues to refine understanding of symmetry breaking, coding schemes, and microscopic mechanisms in wide, deep, and stochastic networks (Göring et al., 16 Oct 2025, Meegen et al., 24 Jun 2024).
Key References (arXiv ids):
- (Corti et al., 28 Aug 2025) Microscopic and collective signatures of feature learning in neural networks
- (Dominé et al., 22 Sep 2024) From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks
- (Chou et al., 23 Mar 2025) Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry
- (Nam et al., 28 Feb 2025) Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)
- (Graldi et al., 20 Jun 2025) The Importance of Being Lazy: Scaling Limits of Continual Learning
- (Rubin et al., 5 Feb 2025) From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning
- (Göring et al., 16 Oct 2025) A simple mean field model of feature learning
- (Karkada, 30 Apr 2024) The lazy (NTK) and rich (μP) regimes: a gentle tutorial
- (Kunin et al., 10 Jun 2024) Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning
- (Naveh et al., 2021) A self consistent theory of Gaussian Processes captures feature learning effects in finite CNNs
- (Geiger et al., 2020) Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature Learning and Lazy Training
- (Geiger et al., 2019) Disentangling feature and lazy training in deep neural networks
- (Petrini et al., 2022) Learning sparse features can lead to overfitting in neural networks
- (George et al., 2022) Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty
- (Atanasov et al., 2022) The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes
- (Lippl et al., 2023) Inductive biases of multi-task learning and finetuning: multiple regimes of feature reuse
- (Ghorbani et al., 2019) Limitations of Lazy Training of Two-layers Neural Networks
These sources collectively provide a comprehensive technical foundation for understanding, quantifying, and exploiting the distinction between feature learning and lazy regimes in deep learning theory and practice.