Feature-Learning Dynamics in Neural Networks
- Feature-learning dynamics are the processes by which neural networks progressively acquire, refine, and adapt representations during training, impacting generalization and task transfer.
- They involve mechanisms such as alignment, disalignment, and rescaling that dynamically adjust weight directions and magnitudes, distinguishing them from static kernel regimes.
- Understanding these dynamics guides practical strategies in hyperparameter tuning, architecture design, and scaling laws to enhance robustness and performance.
Feature-learning dynamics refer to the temporal evolution and mechanisms by which neural representations, or “features,” are acquired, transformed, and refined during training in machine learning models, particularly in neural networks. Rather than viewing learned representations as static, the study of feature-learning dynamics aims to rigorously describe, analyze, and exploit the processes through which models extract, retain, and adapt information from data over time. Understanding these dynamics is essential for explaining generalization, optimization behavior, and the transferability of representations across tasks and domains.
1. Theoretical Foundations: Feature Gap and Information-Theoretic Principles
A foundational theoretical framework (Rooyen et al., 2015) casts feature learning as the stochastic mapping of raw data $X$ to a feature space via a potentially randomized feature map $F$. The downstream task seeks to predict a label $Y$ from the features $F(X)$, with an overall risk defined by a loss function $\ell$. Central to this theory is the feature gap
$$\Delta_\ell(F) \;=\; \inf_{g}\,\mathbb{E}\big[\ell\big(Y, g(F(X))\big)\big] \;-\; \inf_{h}\,\mathbb{E}\big[\ell\big(Y, h(X)\big)\big],$$
the excess risk incurred by predicting from the features rather than from the raw data, which, by Theorem 1, can be written as an expected regret
$$\Delta_\ell(F) \;=\; \mathbb{E}\big[\operatorname{reg}_\ell\big(F(X),\,X\big)\big],$$
where $\operatorname{reg}_\ell$ quantifies the regret of using the Bayes act computed from $F(X)$ in place of the Bayes act computed from $X$.
Key implications:
- Feature extraction and supervised learning are decoupled by this framework; all impact of feature learning on the subsequent prediction task is funneled through the feature gap.
- A zero feature gap (i.e., perfect preservation of task-relevant information in the features) is equivalent, by the Blackwell–Sherman–Stein theorem, to the label being conditionally independent of the raw data given the features.
- Feature-map evaluation is connected to rate–distortion theory: the mutual information between the features and the label determines a lower bound on the achievable distortion, with higher mutual information permitting better predictive performance; further generalizations to $f$-divergences allow for tighter or more loss-specific performance bounds.
For unsupervised representation learning, Theorem 4 proves that reconstruction is both necessary and sufficient: features must support accurate recovery of the raw data $X$ (with low expected loss), ensuring the feature gap remains uniformly small for all potential downstream supervised tasks.
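As a toy illustration of the feature-gap idea (not the paper's construction), the gap under 0–1 loss can be computed exactly for a small discrete distribution by comparing the Bayes risk of predicting the label from the features with the Bayes risk of predicting it from the raw input; the distribution and the bit-dropping feature map below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete joint distribution: raw input x in {0,...,7}, binary label y.
p_x = np.full(8, 1 / 8)                      # uniform marginal over inputs
p_y_given_x = rng.uniform(0.1, 0.9, size=8)  # P(y = 1 | x)

# A deterministic feature map that keeps only the low bit of x,
# i.e. phi collapses the 8 inputs down to 2 feature values.
phi = np.arange(8) % 2

def bayes_risk_01(groups):
    """Bayes risk under 0-1 loss when predicting y from the group id of x."""
    risk = 0.0
    for g in np.unique(groups):
        mask = groups == g
        p_g = p_x[mask].sum()                                 # P(group = g)
        p_y1 = (p_x[mask] * p_y_given_x[mask]).sum() / p_g    # P(y = 1 | group)
        risk += p_g * min(p_y1, 1 - p_y1)                     # best constant act per group
    return risk

risk_raw = bayes_risk_01(np.arange(8))   # predict from x itself (finest grouping)
risk_feat = bayes_risk_01(phi)           # predict from the features phi(x)

feature_gap = risk_feat - risk_raw       # always >= 0; zero iff no predictive info is lost
print(f"Bayes risk from x: {risk_raw:.3f}, from phi(x): {risk_feat:.3f}, gap: {feature_gap:.3f}")
```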
2. Micro- and Macro-Dynamical Mechanisms
In explicit network models, feature-learning emerges through dynamic mechanisms distinct from those seen in the kernel regime. Three principal mechanisms have been delineated in one-hidden-layer linear networks (Xu et al., 13 Jan 2024):
- Learning by Alignment: Hidden-layer weights and output weights become more similar (“aligned”), as quantified by an alignment measure between the hidden-layer and output weight vectors. The alignment grows over time when the model must increase its output by concentrating capacity along shared directions.
- Learning by Disalignment: If weights are initially too aligned (e.g., due to orthogonal initialization), training reduces this alignment to fit the data, effecting necessary orthogonalization for proper feature representation.
- Learning by Rescaling: Changes in the norm of hidden and output weights rescale the effect of learned features, dynamically controlling output scaling and representational fidelity.
These mechanisms are absent in the kernel (NTK) regime, where features remain fixed and only output weights adapt, but they are pronounced in finite-width and deep models, as confirmed via empirical studies on nonlinear networks (Xu et al., 13 Jan 2024). The manifestation of these mechanisms depends critically on the choice of network initialization, hyperparameters, and the scaling of learning rates.
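A minimal sketch of how such diagnostics might be tracked, assuming a one-hidden-layer linear network trained by full-batch gradient descent on a synthetic linear-teacher task; the cosine-based alignment proxy and the norm bookkeeping below are illustrative stand-ins for the measures used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n, lr, steps = 20, 10, 500, 0.05, 2000

# Synthetic linear teacher: y = x @ beta with a unit-norm teacher direction.
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)
y = X @ beta

# One-hidden-layer linear student: f(x) = (x @ W.T) @ a, small initialization.
W = rng.normal(size=(h, d)) * 0.1
a = rng.normal(size=h) * 0.1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

for t in range(steps):
    err = (X @ W.T) @ a - y                       # residuals, shape (n,)
    if t % 500 == 0:
        align = cosine(a, W @ beta)               # alignment proxy: output weights vs. W along teacher
        print(f"step {t:4d}  loss {np.mean(err**2) / 2:.4f}  align {align:+.3f}  "
              f"|W| {np.linalg.norm(W):.3f}  |a| {np.linalg.norm(a):.3f}")
    grad_a = W @ X.T @ err / n                    # d(loss)/d(a)
    grad_W = np.outer(a, X.T @ err) / n           # d(loss)/d(W), shape (h, d)
    a -= lr * grad_a
    W -= lr * grad_W
```

Watching the alignment proxy and the two weight norms over training gives a rough picture of alignment/disalignment (direction changes) versus rescaling (norm changes).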
3. Sequential and Structured Feature Acquisition
Feature acquisition during training is sequential and often follows a well-defined order dictated by the spectral properties of the data or architecture. For shallow autoencoders trained on high-dimensional data, the principal components are learned one by one, with leading modes converging fastest due to their higher associated eigenvalues (Refinetti et al., 2022). Dynamically, this appears as stepwise reductions in reconstruction error, matching classical Hebbian learning results and supported by temporally separated “plateau and drop” behaviors in learning curves.
In Alternating Gradient Flows (AGF) (Kunin et al., 6 Jun 2025), this phenomenon is formalized as an alternation between utility-maximization phases, in which dormant neurons slowly align to the direction of maximal error reduction, and cost-minimization phases, in which newly activated features rapidly reduce the loss. The sequence in which features are acquired corresponds to dominant singular vectors or principal components for linear networks and to Fourier modes (ordered by coefficient magnitude) in modular addition tasks. Such dynamics can be characterized analytically, predicting the order, timing, and magnitude of loss drops.
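The sketch below illustrates the stepwise picture for a tied-weight linear autoencoder trained by gradient descent on Gaussian data with well-separated covariance eigenvalues; the per-component overlap diagnostic is an illustrative choice, not the analysis of the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n, lr, steps = 30, 3, 2000, 0.01, 1000

# Gaussian data whose covariance has a few dominant, well-separated eigenvalues.
eigvals = np.concatenate([[10.0, 4.0, 1.5], np.full(d - 3, 0.1)])
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))            # random orthonormal basis
X = rng.normal(size=(n, d)) * np.sqrt(eigvals) @ Q.T    # covariance = Q diag(eigvals) Q^T

# Tied-weight linear autoencoder: x_hat = (x @ W.T) @ W, with W of shape (k, d).
W = rng.normal(size=(k, d)) * 0.01

for t in range(steps + 1):
    R = (X @ W.T) @ W - X                               # reconstruction residuals, (n, d)
    if t % 100 == 0:
        # Overlap of the learned subspace with each leading principal direction.
        overlaps = " ".join(f"{np.linalg.norm(W @ Q[:, i]):.2f}" for i in range(k))
        print(f"step {t:4d}  recon MSE {np.mean(R**2):.3f}  PC overlaps {overlaps}")
    grad_W = W @ (R.T @ X + X.T @ R) / n                # gradient of (1/2n)||X W^T W - X||_F^2
    W -= lr * grad_W
```

Because the top eigenvalue drives the fastest growth, the first overlap saturates earliest, followed by the second and third, reproducing the plateau-and-drop shape of the reconstruction error.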
4. Task Difficulty, Scaling Laws, and Expressivity
Feature-learning dynamics influence not only training behavior but also neural scaling laws and compute-optimal strategies. Analyses in solvable models (Bordelon et al., 26 Sep 2024), supported empirically by deep networks, reveal that for “hard” tasks (those outside the initial NTK’s RKHS, as characterized by the spectral or Fourier exponent of the target), feature learning dramatically accelerates convergence: the power-law decay exponent of the training loss nearly doubles relative to its lazy/kernel value in the rich, feature-learning regime. For “easy” tasks within the RKHS, the scaling exponent remains unchanged, and for “super-easy” tasks, optimization is dominated by SGD noise and transient effects.
This insight has direct implications for compute allocation: for hard tasks, optimal compute scaling (i.e., balance of parameter and time scaling for a fixed compute budget) changes due to the improved exponents in the feature-learning regime. Thus, leveraging rich feature-learning dynamics is essential for practical efficiency and performance on complex tasks.
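To make such exponent comparisons concrete in practice, one can fit a power law to the tail of a measured training-loss curve in log–log coordinates and compare the fitted exponents of lazy and feature-learning runs; the synthetic curves below are placeholders for actual measurements.

```python
import numpy as np

def fit_powerlaw_exponent(steps, losses, tail_frac=0.5):
    """Fit loss ~ C * t^(-alpha) on the tail of a curve via log-log least squares."""
    n = len(steps)
    sl = slice(int(n * (1 - tail_frac)), n)
    logt, logl = np.log(steps[sl]), np.log(losses[sl])
    slope, intercept = np.polyfit(logt, logl, deg=1)
    return -slope, np.exp(intercept)          # (alpha, C)

t = np.arange(1, 10_001, dtype=float)
# Placeholder curves: a "lazy" run decaying like t^-0.5 and a "rich" run like t^-1.0.
noise = 1e-3 * np.random.default_rng(0).normal(size=t.size) * t ** -0.5
loss_lazy = 2.0 * t ** -0.5 + noise
loss_rich = 2.0 * t ** -1.0

for name, curve in [("lazy", loss_lazy), ("rich", loss_rich)]:
    alpha, _ = fit_powerlaw_exponent(t, curve)
    print(f"{name}: fitted decay exponent {alpha:.2f}")
```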
5. Modulating and Diagnosing Feature-Learning Regimes
Feature-learning strength can be externally controlled. Down-scaling the final-layer outputs by a richness hyperparameter $\gamma$ determines the degree to which internal representations evolve during training (Atanasov et al., 6 Oct 2024). For small $\gamma$ (lazy regime), the NTK dominates and features remain nearly static; for large $\gamma$ (ultra-rich regime), representations evolve substantially. Key findings include:
- The optimal learning rate scales as a power of $\gamma$ in the lazy regime and follows a different, depth-dependent power law in $\gamma$ in the ultra-rich regime (for network depth $L$).
- In the ultra-rich regime, loss curves show long plateaus (“silent alignment” of features to the task) followed by steep drops, often in a staircase pattern.
- Networks with a large, properly tuned $\gamma$ often demonstrate markedly improved performance, but this optimal regime may be missed without deliberate hyperparameter sweeps.
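A minimal sketch of one common way to implement this control knob, assuming a PyTorch model whose output is centered at its initial value and divided by a richness parameter gamma; the exact normalization and learning-rate coupling used in the cited works may differ.

```python
import copy
import torch
import torch.nn as nn

class ScaledOutput(nn.Module):
    """Wraps a base network so that f_gamma(x) = (base(x) - base_init(x)) / gamma.

    Under this convention, small gamma leaves the output large relative to weight
    motion (lazy-like behavior), while large gamma forces internal features to move
    substantially before the output can fit the targets (rich / ultra-rich regime).
    """
    def __init__(self, base: nn.Module, gamma: float):
        super().__init__()
        self.base = base
        self.gamma = gamma
        self.base_init = copy.deepcopy(base)           # frozen copy of the network at init
        for p in self.base_init.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        return (self.base(x) - self.base_init(x)) / self.gamma

net = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
model = ScaledOutput(net, gamma=32.0)                  # large gamma: push toward the rich regime
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.SGD(trainable, lr=1e-2)              # lr must be re-tuned per gamma in practice
```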
The same principle appears in warm restarts and grokking (Lyle et al., 26 Jul 2025): increasing the “effective learning rate” (the ratio of the update norm to the parameter norm) can drive the network from the lazy into the feature-learning regime, triggering substantial representational changes that accompany better generalization.
Diagnostic metrics based on changes in the feature covariance or in activation patterns during training allow one to quantitatively detect the onset of rich, nontrivial feature-learning dynamics and to distinguish them from the memorizing behavior of the lazy regime.
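Both diagnostics can be tracked with a little bookkeeping during training; the concrete choices below (a global update-to-parameter-norm ratio, and the relative Frobenius change of a feature Gram matrix on a fixed probe batch) are illustrative implementations rather than the exact metrics of the cited papers.

```python
import torch

@torch.no_grad()
def effective_learning_rate(params_before, params_after):
    """||theta_after - theta_before|| / ||theta_before||, pooled over all parameters."""
    num = sum(((b - a) ** 2).sum() for a, b in zip(params_before, params_after))
    den = sum((a ** 2).sum() for a in params_before)
    return (num / den).sqrt().item()

@torch.no_grad()
def feature_kernel_drift(feats_init, feats_now):
    """Relative Frobenius change of the feature Gram matrix on a fixed probe batch."""
    K0 = feats_init @ feats_init.T
    Kt = feats_now @ feats_now.T
    return (torch.norm(Kt - K0) / torch.norm(K0)).item()

# Usage sketch:
#   before = [p.detach().clone() for p in model.parameters()]
#   ... one optimizer step ...
#   after = [p.detach().clone() for p in model.parameters()]
#   elr = effective_learning_rate(before, after)
#   drift = feature_kernel_drift(penultimate_feats_at_init, penultimate_feats_now)
```

Sustained growth of the drift metric (rather than of the output weights alone) is the signature of the rich regime; in the lazy regime it stays near zero throughout training.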
6. Layerwise Feature Allocation and Macroscopic Models
Feature learning in deep architectures is distributed non-uniformly across layers and shaped by both architectural and optimization parameters. Mechanical analogies, such as the spring–block model (Shi et al., 28 Jul 2024), recast the evolution of data separation across layers as a “load curve,” in which each layer’s contribution is dynamically determined by the interplay of nonlinearity, noise, and friction-like regularization. The resulting noise–nonlinearity phase diagram reveals:
- High nonlinearity and low noise create concave load curves: deeper layers learn more.
- Increased noise or reduced nonlinearity shifts learning burden to shallower layers (convex load curves).
- Optimal generalization is consistently observed for networks whose load curves are approximately linear, i.e., when feature improvement (data separation) is evenly distributed across layers.
This macroscopic perspective provides actionable guidance for controlling training hyperparameters (learning rate, batch size, label noise) and nonlinearity to allocate feature extraction efficiently across the network depth.
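As a rough, hands-on version of the load-curve idea, one can measure a simple class-separation statistic after each layer of a trained classifier and examine how it accumulates with depth; the Fisher-style separation ratio and the hook-based activation capture below are illustrative choices assuming a PyTorch model.

```python
import torch

def separation_ratio(feats: torch.Tensor, labels: torch.Tensor) -> float:
    """Between-class over within-class variance of a batch of layer activations."""
    feats = feats.flatten(1)                       # (batch, features)
    global_mean = feats.mean(0)
    between, within = 0.0, 0.0
    for c in labels.unique():
        fc = feats[labels == c]
        mu_c = fc.mean(0)
        between += len(fc) * (mu_c - global_mean).pow(2).sum().item()
        within += (fc - mu_c).pow(2).sum().item()
    return between / max(within, 1e-12)

@torch.no_grad()
def load_curve(model: torch.nn.Module, layers, x, y):
    """Separation ratio after each listed layer; its growth with depth is the 'load curve'."""
    acts = {}
    hooks = [m.register_forward_hook(lambda mod, inp, out, name=name: acts.__setitem__(name, out))
             for name, m in layers]
    model(x)
    for h in hooks:
        h.remove()
    return {name: separation_ratio(acts[name], y) for name, _ in layers}

# Usage sketch:
#   layers = [(n, m) for n, m in model.named_modules() if isinstance(m, torch.nn.ReLU)]
#   print(load_curve(model, layers, x_probe, y_probe))
```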
7. Influence of Learning Rules and Problem Structure
The type of learning rule and the structure of the data strongly shape the feature-learning dynamics:
- In wide networks, the choice between gradient descent, feedback alignment, DFA, or error-modulated Hebbian rules dictates which layers contribute to feature learning, as captured by evolving effective neural tangent kernels and dynamical mean-field theory (Bordelon et al., 2022). For example, lazy DFA and Hebb only adapt the last layer features; rich regimes with gradient descent yield dynamic, depth-dependent kernels.
- Data with spurious correlations or with features of differing complexity induces multiscale feature-learning phases. For example, simple spurious features are learned early, forming a persistent “spurious” subnetwork that can retard learning of more complex, core features (Qiu et al., 5 Mar 2024). Interventions such as last-layer retraining can correct the resulting biases, but standard debiasing algorithms often fail when phase separation is weak.
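A minimal sketch of the last-layer retraining intervention just mentioned, assuming a PyTorch nn.Sequential whose final module is the linear head and assuming a group-balanced held-out batch is available; this follows the general recipe of refitting only the classifier on balanced data rather than any specific published pipeline.

```python
import torch
import torch.nn as nn

def retrain_last_layer(model: nn.Sequential, x_balanced: torch.Tensor,
                       y_balanced: torch.Tensor, n_classes: int,
                       epochs: int = 200, lr: float = 1e-2) -> nn.Linear:
    """Freeze the feature extractor and refit only a fresh linear head on balanced data."""
    backbone = model[:-1]                          # assumes the last module is the linear head
    for p in backbone.parameters():
        p.requires_grad_(False)
    with torch.no_grad():
        feats = backbone(x_balanced)               # features from the frozen backbone
    head = nn.Linear(feats.shape[1], n_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(feats), y_balanced)
        loss.backward()
        opt.step()
    return head
```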
Spectral properties of attention weights in transformers reflect two-stage training: syntactic features are encoded in the small eigenvalues (acquired early in training), while semantic features are encoded in the large eigenvalues (acquired late, aided by a gradual learning-rate reduction) (Gong et al., 28 Feb 2025).
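One simple way to probe such spectra is to track the singular values of the combined query–key matrix of each attention head across training checkpoints; the sketch below assumes separate q_proj and k_proj linear layers, which is a common but not universal module layout.

```python
import torch

@torch.no_grad()
def qk_spectrum(q_proj: torch.nn.Linear, k_proj: torch.nn.Linear) -> torch.Tensor:
    """Singular values of W_Q^T W_K, the bilinear form that scores token pairs in attention."""
    WQ, WK = q_proj.weight, k_proj.weight          # both have shape (d_proj, d_model)
    A = WQ.T @ WK                                  # (d_model, d_model) bilinear attention matrix
    return torch.linalg.svdvals(A)                 # sorted in descending order

# Usage sketch: log qk_spectrum(layer.self_attn.q_proj, layer.self_attn.k_proj) at each
# checkpoint and watch how spectral mass shifts from small to large values during training.
```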
8. Practical and Architectural Implications
- Explicit architectural patterns (e.g., polynomial growth of hidden layer width) and geometry-aware gradient scaling (Terjék, 18 Feb 2025) enable stable, high-capacity feature learning beyond the edge of stability, permitting networks to exploit rich representations without catastrophic loss explosions.
- In diffusion models for generative tasks, the training objective enforces balanced feature learning between signal and noise components, in contrast to discriminative training which aggressively amplifies the easiest (often spurious) components (Han et al., 2 Dec 2024). This balance underpins the robustness and flexibility of the resulting representations.
Models, algorithms, and domain-specific interventions that leverage feature-learning dynamics—by calibrating regularization, initialization, gradient scaling, or optimization schedules—can yield marked improvements in expressivity, generalization, and robustness. Conversely, over-regularization, excessive reliance on lazy regimes, or inattention to data structure can entrench sub-optimal or brittle representations.
9. Summary Table of Key Concepts
Concept/Mechanism | Description | Source |
---|---|---|
Feature gap | Quantifies the excess loss from predicting with features instead of raw data | (Rooyen et al., 2015) |
Sequential principal component learning | Leading modes learned first, in order of their eigenvalues | (Refinetti et al., 2022; Kunin et al., 6 Jun 2025) |
Alignment/disalignment/rescaling | Mechanisms for directional and norm changes in features | (Xu et al., 13 Jan 2024) |
AGF (Alternating Gradient Flows) | Alternates utility maximization (alignment) and cost minimization (loss drop) | (Kunin et al., 6 Jun 2025) |
Noise–nonlinearity phase diagram | Identifies shallow- vs. deep-loaded regimes via a macroscopic mechanical analogy | (Shi et al., 28 Jul 2024) |
Scaling-law exponent improvement | Feature learning nearly doubles the power-law exponent for hard tasks | (Bordelon et al., 26 Sep 2024) |
Effective learning rate | Update-to-parameter-norm ratio controlling the strength of feature learning | (Atanasov et al., 6 Oct 2024; Lyle et al., 26 Jul 2025) |
Load curve | Data separation per layer; approximately linear curves coincide with best generalization | (Shi et al., 28 Jul 2024) |
10. Conclusion
Feature-learning dynamics provide a unified, multiscale perspective on how representations in neural networks are acquired and evolve during training. They subsume microscopic mechanisms (alignment, disalignment, rescaling), sequential acquisition via macroscopic (often analytic) processes, and the interplay between network architecture, learning rules, and data structure. Theoretical and empirical results converge on the necessity of moving beyond lazy kernel regimes toward methods and model designs that foster adaptive, data-dependent feature learning. This understanding underlies recent advances in robust deep learning, scaling laws, and improvements in real-world generalization and transferability.