Feature Learning and Lazy Regimes

Updated 18 April 2026

Feature learning and lazy regimes are two distinct training dynamics: in the former, internal representations evolve during training; in the latter, the tangent kernel stays nearly constant.
They are distinguished using metrics such as weight-change norm, representation alignment, and tangent-kernel alignment.
Adjusting network architecture, initialization, and scaling can interpolate between rapid feature adaptation and stable lazy dynamics, impacting generalization and transfer.

Feature learning and lazy regimes delineate two fundamentally distinct modes by which neural networks acquire representations during training. These regimes differ in the dynamics of internal representations, the degree of tangent-kernel evolution, and their consequences for generalization, transfer, catastrophic forgetting, and biological plausibility. The separation and interpolation between these regimes can be rigorously defined, quantified, and manipulated via architectural parameters, initialization statistics, and scaling limits. This article provides a comprehensive overview structured as follows: theoretical definitions, scaling and phase boundaries, geometric and statistical signatures, practical empirical findings, regime transitions and interpolations, and neuroscientific and algorithmic implications.

1. Theoretical Definitions and Regime Criteria

The lazy (kernel) regime is characterized by invariants in the network's learning trajectory. During training, the Neural Tangent Kernel (NTK), denoted $K_t$, remains essentially unchanged from its initialization value $K_0$, so that the evolution of the network function is described by a linearized expansion about the initial weights $w_0$. This situation arises in the infinite-width limit and, at finite width, for very large initialization norm or output scaling. It can be formalized as:

$f(x;w) \approx f(x;w_0) + \nabla_w f(x;w_0)\cdot(w-w_0)$

with the resulting dynamics

$\dot f(X) = -K_0\,(f(X) - y)\,,$

which yields kernel regression with fixed NTK (Liu et al., 2023, Geiger et al., 2020, Kumar et al., 2023).
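The linear dynamics above can be solved in closed form on the training set: $f_t(X) = y + e^{-K_0 t}\,(f_0(X) - y)$. The following is a minimal NumPy sketch of that solution, not taken from the cited works. The Jacobian `J` is a random stand-in for $\nabla_w f(X;w_0)$, and the sizes `n` and `p` are hypothetical.

```python
import numpy as np

def lazy_predictions(K0, f0, y, t):
    """Closed-form solution of df/dt = -K0 (f - y): f(t) = y + exp(-K0 t) (f0 - y)."""
    # K0 is symmetric positive semidefinite, so eigh gives a real eigendecomposition.
    evals, evecs = np.linalg.eigh(K0)
    # Express the initial residual in the eigenbasis, decay each mode, map back.
    coeffs = evecs.T @ (f0 - y)
    return y + evecs @ (np.exp(-evals * t) * coeffs)

rng = np.random.default_rng(0)
n, p = 5, 50                      # hypothetical: 5 training points, 50 parameters
J = rng.normal(size=(n, p))       # stand-in for the Jacobian of f at w_0 (assumption)
K0 = J @ J.T                      # tangent-kernel Gram matrix on the training set
f0 = rng.normal(size=n)           # network outputs at initialization
y = rng.normal(size=n)            # training targets

for t in (0.0, 0.01, 0.1, 10.0):
    residual = np.linalg.norm(lazy_predictions(K0, f0, y, t) - y)
    print(f"t={t:<5} residual={residual:.3e}")
```

Each eigen-direction of $K_0$ decays independently at a rate set by its eigenvalue, which is why the lazy regime learns the top kernel directions first and never changes the directions themselves.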

By contrast, in the rich (feature-learning) regime, the hidden representations and NTK $K_t$ undergo $\mathcal{O}(1)$ changes during training, leading to genuine feature evolution. The NTK evolves non-trivially and the network states and representations change substantially as a result of gradient descent (Liu et al., 2023, Karkada, 2024, Agazzi et al., 2019).

In practice, regimes can be classified by quantitative post-training metrics such as:

Weight-change norm: $\|\Delta W\|_F = \|W^{(f)} - W^{(0)}\|_F$ (small for lazy).
Representation alignment (RA):

$\mathrm{RA} = \frac{\operatorname{Tr}(R^{(f)}R^{(0)})}{\|R^{(f)}\|_F\|R^{(0)}\|_F}$

where $R = H^T H$ for hidden activations $H$.

Tangent-kernel alignment (KA):

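$\mathrm{KA} = \frac{\operatorname{Tr}(K^{(f)}K^{(0)})}{\|K^{(f)}\|_F\|K^{(0)}\|_F}$

where $K^{(0)}$ and $K^{(f)}$ are the tangent kernels at initialization and after training.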

(Liu et al., 2023).

These collectively enable precise empirical discrimination between regimes.
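As a rough sketch, the three metrics can be computed with NumPy as follows. Two assumptions are made here: `H` stores one column per input, so that $R = H^T H$ is a sample-by-sample Gram matrix, and the tangent-kernel alignment takes the same normalized-trace form as RA, with $K = J J^T$ built from a per-sample Jacobian `J`. All function names are illustrative.

```python
import numpy as np

def weight_change_norm(W_init, W_final):
    """Frobenius norm of the weight displacement, ||W_final - W_init||_F."""
    return np.linalg.norm(W_final - W_init, ord="fro")

def gram_alignment(A_final, A_init):
    """Tr(A_f A_0) / (||A_f||_F ||A_0||_F) for two symmetric Gram matrices."""
    num = np.trace(A_final @ A_init)
    den = np.linalg.norm(A_final, ord="fro") * np.linalg.norm(A_init, ord="fro")
    return num / den

def representation_alignment(H_init, H_final):
    # R = H^T H; H is assumed to have shape (units, samples).
    return gram_alignment(H_final.T @ H_final, H_init.T @ H_init)

def kernel_alignment(J_init, J_final):
    # K = J J^T; J is assumed to have shape (samples, parameters).
    return gram_alignment(J_final @ J_final.T, J_init @ J_init.T)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(32, 10))                 # hypothetical activations at init
H_lazy = H0 + 0.01 * rng.normal(size=H0.shape) # barely moved representation
H_rich = rng.normal(size=H0.shape)             # unrelated representation

print("RA, small change:", representation_alignment(H0, H_lazy))
print("RA, large change:", representation_alignment(H0, H_rich))
```

Both alignments are invariant to rescaling of the activations or Jacobian and equal 1 when the Gram matrix is unchanged up to scale, so values near 1 point toward the lazy regime and lower values toward feature learning.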

2. Scaling Laws, Phase Boundaries, and Effective Rank

The regime realized by a neural network is governed by both parameter scalings and the structure of the initial weight matrices:

Width and output scaling: In wide networks of width $h$ whose output is scaled as $\alpha\,[f(x;w) - f(x;w_0)]$, the lazy regime applies for $\alpha\sqrt{h} \gg 1$ and feature learning for $\alpha\sqrt{h} \ll 1$ (Geiger et al., 2020, Geiger et al., 2019).
Effective rank: The effective rank of a weight matrix (or its covariance) is commonly formulated as

$\rho = \exp\left(-\sum_i p_i \log p_i\right), \quad p_i = \frac{\sigma_i}{\sum_j \sigma_j}$

where $\sigma_i$ are singular values. High effective rank at initialization leads to lazier learning, whereas low-rank initialization biases toward rich, feature-learning dynamics. However, low-rank initialization aligned with the task can yield a lazy regime despite low rank (Liu et al., 2023).

λ-balance and layer imbalance: For two-layer linear networks with $W_2^T W_2 - W_1 W_1^T = \lambda I$, large $|\lambda|$ produces laziness by freezing one layer, while $\lambda \to 0$ yields strong interlayer feedback and rich feature learning (Dominé et al., 2024, Nam et al., 28 Feb 2025).
Critical points: The boundary (e.g., $\alpha\sqrt{h} \sim 1$ for output scale $\alpha$ and width $h$) delineates a crossover in training dynamics and can be empirically observed as sharp transitions in learning curves or kernel metrics (Geiger et al., 2019, Geiger et al., 2020, Kumar et al., 2023).
Proportional scaling of sample size and network width also governs phase transitions in ensemble and Bayesian settings (Corti et al., 28 Aug 2025, Rubin et al., 5 Feb 2025).
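To make the effective-rank criterion concrete, here is a small NumPy sketch. It assumes the entropy-based definition, the exponential of the Shannon entropy of the normalized singular values, which is one common convention and may differ in detail from the one used in the cited work. The matrices are hypothetical initializations.

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Exponential of the entropy of normalized singular values (assumed definition).

    W must be nonzero, otherwise the normalization divides by zero.
    """
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > eps]                      # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
n = 64
W_full = rng.normal(size=(n, n)) / np.sqrt(n)   # dense Gaussian initialization
U = rng.normal(size=(n, 2))
V = rng.normal(size=(n, 2))
W_low = U @ V.T                                  # rank-2 initialization

print("dense Gaussian :", effective_rank(W_full))
print("rank-2 product :", effective_rank(W_low))
print("identity       :", effective_rank(np.eye(n)))
```

Unlike the algebraic rank, this quantity varies smoothly with the singular-value spectrum, which makes it usable as a continuous knob when comparing initializations.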

3. Statistical-Mechanics, Geometric, and Dynamical Signatures

The distinction between regimes is visible in both the macroscopic geometry of learned representations and the microscopic statistics of the network parameters.

Manifold geometry and capacity:

Feature learning induces progressive untangling of task-relevant representational manifolds, measurable by manifold capacity, radius $R_M$, dimension $D_M$, and center/axis alignments (Chou et al., 23 Mar 2025).
In the rich regime, class manifolds become more linearly separable through radius shrinkage, dimension reduction, and decorrelation, as captured by the GLUE metrics (Chou et al., 23 Mar 2025).
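The geometric quantities above can be illustrated with much cruder proxies than the GLUE metrics themselves. The sketch below is an assumption-laden stand-in, not the cited method: it uses the root-mean-square spread around the class centroid relative to the centroid norm as a "radius", and the participation ratio of the within-class covariance spectrum as a "dimension".

```python
import numpy as np

def manifold_stats(Z):
    """Crude per-class geometry proxies (illustrative, not the GLUE definitions).

    Z: array of shape (samples, features) holding activations for one class.
    Returns (relative_radius, participation_ratio_dimension).
    Assumes the class centroid is nonzero and the points are not all identical.
    """
    center = Z.mean(axis=0)
    dev = Z - center
    # Eigenvalues of the within-class covariance, obtained from singular values.
    s = np.linalg.svd(dev, compute_uv=False)
    lam = s**2 / Z.shape[0]
    radius = np.sqrt(lam.sum()) / np.linalg.norm(center)
    dimension = lam.sum() ** 2 / np.sum(lam**2)
    return radius, dimension

# Hypothetical class manifold: six points spread along one axis around a centroid.
ks = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
center = np.array([5.0, 0.0, 0.0])
axis = np.array([0.0, 1.0, 0.0])
Z_wide = center + ks[:, None] * axis
Z_tight = center + 0.5 * ks[:, None] * axis      # same shape, half the spread

print("wide :", manifold_stats(Z_wide))
print("tight:", manifold_stats(Z_tight))
```

Tracking such proxies per class across training checkpoints gives a quick first look at whether radius shrinkage or dimension reduction is occurring, before investing in a full capacity analysis.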

Collective and microscopic feature-learning observables:

Manifold separation and weight–weight correlation matrices in finite-width Bayesian networks provide collective and microscopic fingerprints of nontrivial feature learning, even when the predictive distribution at outputs is still GP-like (Corti et al., 28 Aug 2025).

Dynamical feedback:

Layerwise linear models expose a dynamical feedback principle: in the rich regime, interlayer feedback amplifies feature evolution, producing nonlinear, stage-like training. In the lazy regime, layer imbalance or large global output scaling suppresses feature feedback, resulting in quasi-linear dynamics and feature stasis (Nam et al., 28 Feb 2025, Dominé et al., 2024, Kunin et al., 2024).

Transition timescale: In many systems, the timescale for significant feature learning scales as $t_1 \sim \alpha\sqrt{h}$ (for width $h$, output scale $\alpha$), marking the onset of departure from lazy training (Geiger et al., 2019, Kumar et al., 2023).
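A first-order estimate shows why a large output scale suppresses weight and feature movement. If the scaled output is linearized as $\alpha\,J\,\Delta w$ with Jacobian $J$ at initialization, then fitting targets $y$ requires $\alpha J \Delta w = y$, and the minimum-norm solution is $\Delta w = J^{+} y / \alpha$. The NumPy sketch below checks this $1/\alpha$ scaling on a random stand-in Jacobian. It is only a linearized estimate under these assumptions and does not capture the width dependence or any nonlinear training effects.

```python
import numpy as np

def min_norm_displacement(J, y, alpha):
    """Smallest weight change dw solving alpha * J @ dw = y in the linearized model."""
    return np.linalg.pinv(J) @ (y / alpha)

rng = np.random.default_rng(0)
n, p = 6, 40                        # hypothetical: 6 samples, 40 parameters
J = rng.normal(size=(n, p))         # stand-in Jacobian at initialization (assumption)
y = rng.normal(size=n)              # training targets

norms = {}
for alpha in (0.1, 1.0, 10.0, 100.0):
    dw = min_norm_displacement(J, y, alpha)
    norms[alpha] = np.linalg.norm(dw)
    print(f"alpha={alpha:<6} ||dw||={norms[alpha]:.4e}")
```

Because the required displacement shrinks as $1/\alpha$, a sufficiently large $\alpha$ lets the network fit the data while staying arbitrarily close to $w_0$, where the linearization and hence the fixed kernel remain accurate.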

4. Empirical Regime Manifestations and Practical Phenomena

Generalization and sample complexity:

In tasks where the underlying structure (e.g., low-dimensional manifolds, compositionality) is not well-captured by initial random features, lazy regimes yield suboptimal learning, and an unbounded gap in generalization appears unless the network is vastly overparameterized. By contrast, feature learning aligns internal representations with task-relevant directions and drastically reduces sample complexity (Ghorbani et al., 2019, Göring et al., 16 Oct 2025, Kunin et al., 2024).

Learning schedule and example difficulty:

Feature learning preferentially weights easy examples (high c-score, low label noise, or strong spurious correlations) in early stages, producing an implicit curriculum. The lazy regime assigns more uniform weights across example difficulty, lacking this dynamic prioritization (George et al., 2022).

Single-task versus multitask and transfer:

Rich regimes are necessary for the emergence of nested feature reuse and for transfer learning of new features on top of existing embeddings. Lazy or "structured-lazy" regimes that freeze early layers impede transfer when new outputs depend on modified hidden representations (Lippl et al., 2023, Dominé et al., 2024).

Catastrophic forgetting, stability, plasticity:

Rich learning increases catastrophic forgetting by incurring large representational drift on task switches; lazy regimes preserve stability but restrict adaptation (especially in non-stationary continual learning scenarios) (Liu et al., 2023, Graldi et al., 20 Jun 2025).
Critical levels of feature learning exist where accuracy is optimized and the trade-off between stability and plasticity is tuned to task similarity (Graldi et al., 20 Jun 2025).

Phase diagrams, scaling transitions, and "grokking":

Phase diagrams in (width, scaling) or (hyperparameter, initialization) space sharply separate regimes, revealing jamming transitions, double-descent phenomena, and learning phase boundaries (Geiger et al., 2020, Geiger et al., 2019, Yarotsky et al., 4 Feb 2026).
Grokking events are consistently associated with delayed transitions from lazy to feature-learning phases, controlled by output scaling and kernel–target alignment (Kumar et al., 2023, Kunin et al., 2024).

5. Theoretical Synthesis and Regime Interpolation

Recent advances have formalized the continuum between lazy and rich regimes:

There exists a one-dimensional “richness scale” (Editor’s term) parameterized by a single degree of freedom in initialization variance / learning rate / output scaling, continuously tuning the magnitude of hidden-layer update per step, and thus the degree of kernel drift and feature learning (Karkada, 2024).
Intermediate values on this scale interpolate between the exactly tractable NTK endpoint and the fully nonlinear mean-field limit (Karkada, 2024).
Exact solutions for minimal and deep linear models reveal that the "λ-balance" or inter-layer variance ratio completely determines the point on this spectrum and the resulting dynamical properties of feature learning (Dominé et al., 2024, Kunin et al., 2024).
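One reason the inter-layer balance can fix a network's position on the spectrum is that it does not change during training: for a two-layer linear network $x \mapsto W_2 W_1 x$ under gradient flow on a squared loss, the matrix $W_2^T W_2 - W_1 W_1^T$ is conserved. The sketch below verifies this numerically by evaluating the time derivative of that matrix. The layer sizes and data are hypothetical, and taking this matrix as the balance quantity is an assumption about the convention used in the cited works.

```python
import numpy as np

def balance(W1, W2):
    """Balance matrix W2^T W2 - W1 W1^T of the linear network x -> W2 @ W1 @ x."""
    return W2.T @ W2 - W1 @ W1.T

def balance_rate(W1, W2, X, Y):
    """Time derivative of balance(W1, W2) under gradient flow on
    L = 0.5 * ||W2 @ W1 @ X - Y||_F^2."""
    E = W2 @ W1 @ X - Y            # residual, shape (outputs, samples)
    dW1 = -W2.T @ E @ X.T          # gradient-flow velocity of W1 (= -dL/dW1)
    dW2 = -E @ X.T @ W1.T          # gradient-flow velocity of W2 (= -dL/dW2)
    return (dW2.T @ W2 + W2.T @ dW2) - (dW1 @ W1.T + W1 @ dW1.T)

rng = np.random.default_rng(0)
d, h, o, n = 3, 4, 2, 10           # hypothetical input, hidden, output, sample sizes
X = rng.normal(size=(d, n))
Y = rng.normal(size=(o, n))
W1 = rng.normal(size=(h, d))
W2 = 5.0 * rng.normal(size=(o, h)) # deliberately imbalanced layers

print("balance norm      :", np.linalg.norm(balance(W1, W2)))
print("balance rate norm :", np.linalg.norm(balance_rate(W1, W2, X, Y)))
```

With discrete gradient descent the conservation holds only up to terms of second order in the learning rate, so in practice the balance drifts slowly rather than staying exactly fixed.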

The multi-scale adaptive theory further shows that in linear networks, adaptive learning can often be subsumed at the output level into a rescaling of the initial GP kernel, but true directional feature learning appears as higher-order, non-isotropic corrections in the covariance, and is only fully manifest beyond the infinite-width/lazy limit (Rubin et al., 5 Feb 2025).

6. Implications for Biological and Artificial Systems

In both neuroscience and artificial intelligence, regime selection has profound implications:

Metabolic cost and plasticity: Rich regimes incur high synaptic change (and thus metabolic cost) and risk catastrophic interference, while lazy modes economize plasticity and maintain robust persistent representations (Liu et al., 2023).
Developmental specialization: Biological circuits may develop through a sequence of rich learning (low-rank, flexible), followed by consolidation into higher-rank or task-aligned lazy regimes that favor stability, matching observed development–maturity transitions in animals (Liu et al., 2023).
Task specificity and flexibility: Low-rank, task-aligned circuits can support high-accuracy, stable specialization, whereas high-rank circuits provide broad flexibility at the expense of per-task plasticity (Liu et al., 2023).
Applications in architecture and initialization: Upstream unbalanced initializations in practical deep networks (i.e., overparameterizing early layers or scaling their learning rates/variances) promote rapid feature alignment, accelerate grokking, and enhance interpretability of early feature detectors in vision tasks (Kunin et al., 2024).

Decisions about initialization, scaling, architecture, and training schedule can thus be systematically informed to target desired positions on the lazy–rich spectrum, with explicit trade-offs between rapid adaptation, transfer, stability, and efficiency.

7. Beyond the Dichotomy: Subtypes and Future Directions

The classical lazy–rich dichotomy is insufficient to describe the nuanced spectrum of representational adaptation observed in contemporary neural systems:

Subtypes of feature learning arise depending on architecture, learning rules, and data structure, including radius-dominated, dimension-dominated, and sacrificial (margin-for-dimension) strategies in representational manifold geometry (Chou et al., 23 Mar 2025).
Nested feature selection and variable degrees of feature reuse between tasks, particularly in pretraining+finetuning contexts, establish new regimes not reducible to the lazy or rich archetypes (Lippl et al., 2023).
Phase transitions in spiked random matrix models, multi-scale adaptive Bayesian frameworks, and diagrammatic expansions provide mathematical tools to further refine and interpolate between regimes, with precise critical points, explicit solutions, and quantifiable transition rates (Corti et al., 28 Aug 2025, Yarotsky et al., 4 Feb 2026).

These directions promise a systematic, predictive theory of deep learning dynamics rooted in rigorous distinctions between lazy and feature-learning regimes and their many interpolated, biologically and practically relevant variants.