Feature-Learning Regime Overview

Updated 21 April 2026

The feature-learning regime is a neural network training phase in which hidden representations actively adapt to the data; it is reached by tuning the output scaling and the learning rate.
This regime leverages architectural parameters like γ to transition from static NTK behavior to dynamic feature evolution, improving generalization.
Empirical guidelines identify regime shifts via loss plateaus, kernel adaptation, and stability metrics, aiding effective network optimization.

A feature-learning regime denotes the operational phase of a neural network in which training induces substantial evolution of hidden representations, in contrast with "lazy" or "kernel" regimes where features remain effectively fixed and the learning dynamics are described by kernel methods. The transition between these regimes is controlled by architectural parameterization (notably the output scaling $\gamma$ in maximal-update, or $\mu$P, scaling) and by key optimization hyperparameters such as the learning rate $\eta$. The feature-learning regime is characterized by rich dynamics, distinct loss curves, directional kernel adaptation, and significant improvements in generalization, especially for tasks outside the span of the initial neural tangent kernel (NTK) (Atanasov et al., 2024, Rubin et al., 5 Feb 2025, Bordelon et al., 2024).

1. Parameterization and Phase Demarcation: The Role of Output Scaling

The regime in which a neural network operates (lazy/kernel-like versus rich/feature-learning) is principally set by the final-layer scaling hyperparameter $\gamma$. In maximal-update ($\mu$P) scaling, the network output is divided by $\gamma$:

$\tilde f(x; \theta) = \frac{1}{\gamma} f(x; \theta)$

Lazy regime ($\gamma \rightarrow 0$): The network evolves in a small neighborhood of its initialization; the NTK remains essentially constant and feature evolution scales as $O(\gamma)$, resulting in negligible adaptation of hidden representations.
Feature-learning ("rich") regime ($\gamma \gg 1$): Hidden representations undergo $O(1)$ changes; the network must significantly adjust internal weights to fit the data, driving strong feature learning.

Tuning $\gamma$, along with appropriately scaling the learning rate $\eta$, transitions the network between these regimes, with $\gamma$ acting as a feature-learning strength knob (Atanasov et al., 2024).
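
The two regimes can be illustrated on a toy problem. The sketch below is not taken from the cited papers: it assumes a centered two-parameter linear model $\tilde f = (a w - a_0 w_0)/\gamma$, a single target $y = 1$, squared loss, and a hypothetical learning-rate rule $\eta = c \cdot \min(\gamma^2, \gamma)$ chosen only to keep gradient descent stable in this toy. It reports how far the hidden weight $w$ moves relative to its initial value.

```python
def train_toy(gamma, steps=500, c=0.25, y=1.0):
    """Gradient descent on a centered toy model f = (a*w - a0*w0) / gamma.

    Returns (final_loss, relative_movement_of_w). All names and the
    learning-rate rule are illustrative assumptions, not a library API.
    """
    a0 = w0 = 1.0
    a, w = a0, w0
    # Assumed step-size rule: small enough for stability at both extremes.
    eta = c * min(gamma ** 2, gamma)
    for _ in range(steps):
        residual = (a * w - a0 * w0) / gamma - y
        grad_a = residual * w / gamma
        grad_w = residual * a / gamma
        a -= eta * grad_a
        w -= eta * grad_w
    final_loss = 0.5 * ((a * w - a0 * w0) / gamma - y) ** 2
    return final_loss, abs(w - w0) / abs(w0)


lazy_loss, lazy_move = train_toy(gamma=0.01)
rich_loss, rich_move = train_toy(gamma=100.0)
print(f"gamma=0.01: loss={lazy_loss:.2e}, relative movement={lazy_move:.4f}")
print(f"gamma=100:  loss={rich_loss:.2e}, relative movement={rich_move:.4f}")
```

Both runs fit the target, but with small $\gamma$ the weight barely moves (a change of order $\gamma$), while with large $\gamma$ it must grow to roughly $\sqrt{\gamma}$ to produce an $O(1)$ output.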

2. Theoretical Scaling Laws and Regime Map

The mathematical structure of the feature-learning regime is specified concretely by scaling laws for the optimal learning rate as a function of $\gamma$ and network depth $L$.

For a feed-forward depth-$L$ network:

In the underparameterized/lazy regime ($\gamma \ll 1$), stability and optimizing step constraints yield $\eta^* \propto \gamma^2$.
In the feature-rich regime ($\gamma \gg 1$), stability and optimizability together confine the usable learning rate to a window whose optimum scales as $\eta^* \propto \gamma^{2/L}$.

These bounds define phase regions in the $(\gamma, \eta)$ plane:

No-training: $\eta$ 3, loss does not decrease.
Lazy-kernel: $\eta$ 4, $\eta$ 5.
Catapult (MSE only): $\eta$ 6, loss briefly diverges before settling.
Rich-feature-learning: $\eta$ 7, $\eta$ 8 (Atanasov et al., 2024).
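
One way to see where depth-dependent learning-rate scalings can come from is a scalar deep linear chain. The sketch below is an illustrative assumption rather than the derivation used by Atanasov et al.: it takes a centered model $\tilde f = (w_1 \cdots w_L - 1)/\gamma$ with all $L$ weights equal, a target $y = 1$, and squared loss. At the fitted solution the residual vanishes, so the Hessian reduces to its Gauss-Newton part with sharpness $\lambda = L\,(1+\gamma)^{2(L-1)/L}/\gamma^2$, and gradient descent is stable only for step sizes below $2/\lambda$. The code estimates the log-log slope of that bound with respect to $\gamma$ at small and large $\gamma$.

```python
import math


def max_stable_lr(gamma, depth):
    """Stability bound 2 / sharpness at the fitted solution of the toy chain.

    Toy model (an assumption for illustration): f = (w1*...*wL - 1) / gamma
    with equal weights, target y = 1, squared loss. At the solution the
    product equals 1 + gamma and the Hessian reduces to grad(f) grad(f)^T.
    """
    weight = (1.0 + gamma) ** (1.0 / depth)
    grad_sq_norm = depth * weight ** (2 * (depth - 1)) / gamma ** 2
    return 2.0 / grad_sq_norm


def loglog_slope(func, x_lo, x_hi):
    """Finite-difference slope of log(func) against log(x)."""
    return math.log(func(x_hi) / func(x_lo)) / math.log(x_hi / x_lo)


for depth in (2, 3, 4):
    lazy = loglog_slope(lambda g: max_stable_lr(g, depth), 1e-6, 1e-4)
    rich = loglog_slope(lambda g: max_stable_lr(g, depth), 1e4, 1e6)
    print(f"L={depth}: lazy slope={lazy:.3f}, rich slope={rich:.3f}, 2/L={2 / depth:.3f}")
```

In this toy the bound grows as $\gamma^2$ for small $\gamma$ and as $\gamma^{2/L}$ for large $\gamma$, so a deeper chain tolerates a smaller increase in step size as $\gamma$ grows.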

3. Empirical and Analytical Phenomenology in the Rich Regime

Networks tuned into the ultra-rich ($\gamma \gg 1$) regime exhibit distinctive optimization behavior and loss curves:

Long initial loss plateau: "Silent alignment"—representations reorient with respect to the data manifold, but external metrics (loss) change little.
Sudden loss drop-off: Once alignment is achieved, loss rapidly decreases.
Staircase decay: Multiple abrupt loss drops can occur (notably in deep or certain linear networks).

Loss curves for different $\gamma$ collapse onto a universal trajectory at early times under a $\gamma$-dependent rescaling of time, revealing an underlying time-reparameterization invariance of the rich regime. The duration of the plateau sets the trade-off between feature complexity and practical learning under a step-budget constraint (Atanasov et al., 2024).
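
A plateau followed by a sharp drop already appears in a minimal model. The sketch below is an assumption-laden illustration, not the experimental setup of Atanasov et al.: it uses a centered two-parameter model $\tilde f = (a w - 1)/\gamma$ with target $y = 1$ and a hypothetical step size $\eta = c\,\gamma$, and counts the gradient steps needed before the loss falls below half of its initial value.

```python
def plateau_steps(gamma, c=0.25, y=1.0, max_steps=10_000):
    """Count steps until the loss of the toy model drops below half its start.

    Toy model (illustrative assumption): f = (a*w - 1) / gamma, a = w = 1 at
    initialization, squared loss, step size eta = c * gamma.
    """
    a = w = 1.0
    eta = c * gamma
    initial_loss = 0.5 * y ** 2  # f = 0 at initialization
    for step in range(1, max_steps + 1):
        residual = (a * w - 1.0) / gamma - y
        grad_a = residual * w / gamma
        grad_w = residual * a / gamma
        a -= eta * grad_a
        w -= eta * grad_w
        loss = 0.5 * ((a * w - 1.0) / gamma - y) ** 2
        if loss < 0.5 * initial_loss:
            return step
    return max_steps


for gamma in (1e1, 1e2, 1e3, 1e4):
    print(f"gamma={gamma:g}: plateau lasts {plateau_steps(gamma)} steps")
```

In this toy the plateau lengthens roughly logarithmically with $\gamma$ at the assumed step size; the quantitative time rescaling reported for real networks is not reproduced here.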

4. Feature Kernel Adaptation: Scalar Rescaling vs. Directional Deformation

Analytical approaches expose two perspectives on how kernels adapt in the feature-learning regime (Rubin et al., 5 Feb 2025):

Kernel-rescaling theories: In linear networks and in the infinite-width mean-field limit, integration of the posterior yields a solution mathematically equivalent to kernel regression with a scalar-rescaled NNGP kernel:

$\gamma$ 3

Thus, on the mean output, feature learning appears as simple amplitude rescaling.

Adaptive kernel theories: The more general case, especially for finite-width and nonlinear networks, reveals directional deformation of the kernel, which selectively amplifies modes aligned with the data and residuals (i.e., low-rank corrections). The posterior covariance of the learned function includes rank-one (or higher) corrections that are missed by rescaling-only views, and these corrections directly correlate with the degree of feature learning (Rubin et al., 5 Feb 2025).

Directional feature learning is identified by projecting the output covariance onto candidate "features" and observing selective enhancement along task-relevant directions.
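
The difference between the two views can be made concrete with kernel-target alignment, $A(K, y) = y^\top K y / (\lVert K \rVert_F \lVert y \rVert^2)$, which is invariant to scalar rescaling of $K$. The sketch below is illustrative only and does not implement the theories of Rubin et al.: the four data points, the linear base kernel, the task direction `u`, and the deformation strength are all assumptions. It compares a scalar-rescaled kernel with a rank-one deformation along a task-relevant direction.

```python
import math


def linear_kernel(points, scale=1.0, direction=None, strength=0.0):
    """Gram matrix of k(x, z) = scale * x.z + strength * (x.u)(z.u)."""
    def dot(p, q):
        return sum(pi * qi for pi, qi in zip(p, q))

    gram = []
    for x in points:
        row = []
        for z in points:
            value = scale * dot(x, z)
            if direction is not None:
                value += strength * dot(x, direction) * dot(z, direction)
            row.append(value)
        gram.append(row)
    return gram


def alignment(gram, targets):
    """Kernel-target alignment y^T K y / (||K||_F * ||y||^2)."""
    n = len(targets)
    quad = sum(targets[i] * gram[i][j] * targets[j]
               for i in range(n) for j in range(n))
    frob = math.sqrt(sum(v * v for row in gram for v in row))
    return quad / (frob * sum(t * t for t in targets))


points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
targets = [x[0] for x in points]  # the task depends only on coordinate 0
u = (1.0, 0.0)                    # assumed task-relevant direction

base = alignment(linear_kernel(points), targets)
rescaled = alignment(linear_kernel(points, scale=3.0), targets)
deformed = alignment(linear_kernel(points, direction=u, strength=3.0), targets)
print(f"base={base:.4f}, rescaled={rescaled:.4f}, deformed={deformed:.4f}")
```

The rescaled kernel has the same alignment as the base kernel, whereas the rank-one deformation raises it. A scale-invariant diagnostic of this kind is therefore blind to amplitude rescaling but sensitive to directional adaptation.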

5. Feature-Learning Regime, Scaling Laws, and Task Difficulty

Feature learning transforms the scaling laws of neural network generalization, with the degree of improvement contingent on the task's alignment with the initial NTK RKHS (Bordelon et al., 2024):

Easy/super-easy tasks (within RKHS): Feature learning does not alter scaling exponents; test loss decays as a power law $t^{-\beta}$ with $\beta$ unchanged across lazy and rich regimes.
Hard tasks (outside RKHS): Feature learning nearly doubles the decay exponent, from $\beta$ (lazy) to $2\beta/(1+\beta)$ in the rich regime.

This acceleration is attributed to the evolving kernel norm, which boosts the learned-mode bandwidth. The improved exponent directly alters the compute-optimal trade-off between model size and training time.
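
To get a feel for what a change of exponent means in practice, the sketch below compares the training time needed to reach a given loss under two power laws. It assumes, following the hard-task result attributed to Bordelon et al. (2024), a lazy exponent $\beta$ and a rich exponent $2\beta/(1+\beta)$ with $\beta < 1$; the unit prefactors, the target loss, and the function names are illustrative assumptions.

```python
def rich_exponent(beta):
    """Assumed hard-task exponent in the rich regime: 2*beta / (1 + beta)."""
    return 2.0 * beta / (1.0 + beta)


def steps_to_reach(target_loss, exponent):
    """Invert loss = t ** (-exponent), assuming a unit prefactor."""
    return target_loss ** (-1.0 / exponent)


for beta in (0.1, 0.3, 0.5):
    rich = rich_exponent(beta)
    speedup = steps_to_reach(0.1, beta) / steps_to_reach(0.1, rich)
    print(f"beta={beta}: rich exponent={rich:.3f}, "
          f"ratio={rich / beta:.2f}, steps saved factor={speedup:.3g}")
```

The ratio of exponents approaches 2 as $\beta \to 0$, which is the sense in which the exponent nearly doubles; because the exponent enters the required training time through a power, even a modest change can shift the time to a fixed loss by orders of magnitude.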

6. Empirical Fingerprints, Regime Detection, and Practical Guidelines

Several metrics and structural observables diagnose entry into the feature-learning regime:

Feature movement: An $O(1)$ change of hidden features relative to initialization signals NTK evolution.
Eigenfeature and minimum-projection (CKA) metrics: A sharp transition in these metrics and in the effective rank identifies the minimal-feature regime, especially in vision models (Nam et al., 2024).
Soft rank: Stable or growing soft-rank of hidden feature matrices under SGD signals ongoing feature learning (Terjék, 18 Feb 2025).
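
Two of these diagnostics are easy to compute from stored activations. The sketch below is a minimal pure-Python illustration, assuming hidden features are available as lists of per-sample vectors before and after training; the function names and the toy activations are hypothetical, and linear CKA is used here as a generic similarity measure rather than the specific metric of Nam et al. It computes the relative feature movement and the linear CKA between initial and trained features.

```python
import math


def feature_movement(before, after):
    """Relative Frobenius-norm change ||after - before|| / ||before||."""
    diff_sq = sum((a - b) ** 2
                  for ra, rb in zip(after, before) for a, b in zip(ra, rb))
    base_sq = sum(b * b for rb in before for b in rb)
    return math.sqrt(diff_sq / base_sq)


def centered_gram(features):
    """Gram matrix of features after subtracting the per-dimension mean."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    return [[sum(p * q for p, q in zip(ri, rj)) for rj in centered]
            for ri in centered]


def frobenius_inner(m1, m2):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


def linear_cka(before, after):
    """Linear CKA: <K, L>_F / (||K||_F ||L||_F) for centered Gram matrices."""
    gram_b, gram_a = centered_gram(before), centered_gram(after)
    return frobenius_inner(gram_b, gram_a) / math.sqrt(
        frobenius_inner(gram_b, gram_b) * frobenius_inner(gram_a, gram_a))


# Hypothetical activations: 4 samples, 2 hidden units.
h_init = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
h_lazy = [[1.01, 0.0], [0.0, 0.99], [-1.01, 0.0], [0.0, -0.99]]
h_rich = [[2.0, 0.0], [0.0, 0.1], [-2.0, 0.0], [0.0, -0.1]]

for name, h in (("lazy", h_lazy), ("rich", h_rich)):
    print(f"{name}: movement={feature_movement(h_init, h):.3f}, "
          f"CKA with init={linear_cka(h_init, h):.3f}")
```

In the toy data the "lazy" features stay close to initialization (small movement, CKA near 1), while the "rich" features amplify one direction and suppress the other, which shows up as large movement and a clearly reduced CKA.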

Tuning advice for practitioners:

Always sweep both $\gamma$ and an appropriately scaled $\eta$ (scaled by $\gamma^2$ or $\gamma^{2/L}$, as the regime dictates).
For maximal feature-learning benefit, choose the largest $\gamma$ for which the initial plateau still fits within the total training budget.
Avoid $\eta$ values that induce loss catapults or instability in low-$\gamma$ settings (Atanasov et al., 2024).
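
A joint sweep can be generated by rescaling a set of base learning rates with $\gamma$. The sketch below assumes the scalings $\eta \propto \gamma^2$ for $\gamma < 1$ and $\eta \propto \gamma^{2/L}$ for $\gamma \geq 1$ reported by Atanasov et al. (2024); the function name, the base rates, and the switch at $\gamma = 1$ are illustrative choices, not a prescribed recipe.

```python
def sweep_grid(gammas, base_lrs, depth):
    """Return (gamma, eta) pairs with eta rescaled according to the regime."""
    grid = []
    for gamma in gammas:
        # Assumed regime switch at gamma = 1; both branches agree there.
        scale = gamma ** 2 if gamma < 1.0 else gamma ** (2.0 / depth)
        for base_lr in base_lrs:
            grid.append((gamma, base_lr * scale))
    return grid


grid = sweep_grid(gammas=[1e-2, 1.0, 1e2], base_lrs=[0.01, 0.1], depth=4)
for gamma, eta in grid:
    print(f"gamma={gamma:g}, eta={eta:.3g}")
```

Sweeping the base rate rather than $\eta$ itself keeps each column of the grid at a comparable distance from the stability edge as $\gamma$ changes.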

7. Broader Theoretical Landscape and Transfer Learning Implications

Feature-learning regimes admit rigorous characterization in mean-field/Bayesian frameworks, where phase transitions in order parameters correspond to the onset and qualitative strength of feature learning (Göring et al., 16 Oct 2025). At finite width, feature learning emerges via symmetry breaking, or through mechanisms such as self-reinforcing input feature selection (automatic relevance determination, ARD), which removes the dependence on the ambient dimension and reduces sample complexity thresholds to the intrinsic task dimension.

Transfer learning in the feature-learning regime, as opposed to the lazy regime, is governed by adapted feature kernels that depend on both source and target data/labels, with explicit interpolation (via elastic coupling penalties) between feature reuse and full re-learning (Lauditi et al., 6 Jul 2025). Empirical results confirm that strong feature learning and optimal transfer occur when hidden representations are sufficiently plastic to adapt to the downstream task, as governed by the feature-learning strength $\gamma$ and the coupling parameter.

References: