Phase Transitions between Accuracy Regimes in L2 regularized Deep Neural Networks (2505.06597v2)

Published 10 May 2025 in cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, and physics.data-an

Abstract: Increasing the L2 regularization of Deep Neural Networks (DNNs) causes a first-order phase transition into the under-parametrized phase -- the so-called onset-of-learning. We explain this transition via the scalar (Ricci) curvature of the error landscape. We predict new transition points as the data complexity is increased and, in accordance with the theory of phase transitions, the existence of hysteresis effects. We confirm both predictions numerically. Our results provide a natural explanation of the recently discovered phenomenon of 'grokking' as DNN models getting stuck in a local minimum of the error surface, corresponding to a lower accuracy phase. Our work paves the way for new probing methods of the intrinsic structure of DNNs in and beyond the L2 context.

Summary

  • The paper shows that L2 regularization drives a phase transition, shifting the DNN error landscape and triggering an onset-of-learning.
  • It employs geometric analysis using Ricci curvature and experimental testing on a two-hidden-layer network to highlight transitions at specific β values.
  • Findings reveal grokking and hysteresis effects under high data complexity, offering insights for enhanced training control in neural networks.

Detailed Summary of "Phase Transitions between Accuracy Regimes in L2 Regularized Deep Neural Networks" (2505.06597)

Introduction

The paper investigates the phenomena associated with L2 regularization in deep neural networks (DNNs), focusing on phase transitions in their error and loss landscapes. It provides a geometric interpretation of phase transitions during the training of DNNs, characterizing these transitions using the Ricci curvature of the error landscape. The study extends existing works by predicting additional transitions when data complexity increases and elucidates the role of geometry in grokking.
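
The paper's actual probe is the scalar (Ricci) curvature of the error landscape, whose construction is not reproduced here. As a rough, illustrative stand-in for "measuring local curvature of the loss", the sketch below estimates the trace of the loss Hessian with Hutchinson's estimator in PyTorch; the toy model, toy data, and all names are assumptions made for this summary, not the paper's setup.

```python
# Illustrative curvature probe (not the paper's Ricci scalar): Hutchinson
# estimate of the trace of the loss Hessian at the current parameters.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))  # toy model (assumed)
X, y = torch.randn(256, 8), torch.randn(256, 1)                        # toy data (assumed)

def hessian_trace(n_samples: int = 10) -> float:
    params = [p for p in model.parameters() if p.requires_grad]
    loss = nn.functional.mse_loss(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        # Rademacher probe vectors v with entries +/-1.
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)  # Hessian-vector products
        estimate += sum((hv * v).sum().item() for hv, v in zip(hvs, vs))
    return estimate / n_samples  # E[v^T H v] = tr(H)

print(hessian_trace())
```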

Geometric Interpretation of Phase Transitions

L2 regularization modifies the loss landscape, acting as a force in parameter space that shifts the global minimum towards the origin. The paper describes a first-order phase transition, termed the "onset-of-learning", at which a DNN passes from a useful model to a trivial one as the regularization strength β increases (Figure 1).

Figure 1: Sketch of a loss landscape with increasing L2 regularizer strength β.

This transition is characterized by a sudden shift in the parameters, interpreted as the model being pushed out of an error basin, and is observed as a discontinuous jump in error-surface metrics.
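
To make the "force toward the origin" concrete, here is a minimal sketch, written for this summary rather than taken from the paper, of a β-weighted L2 penalty added to a training objective in PyTorch. The gradient it contributes is 2βw, a linear pull on every weight toward zero that grows with β.

```python
# Minimal sketch of an explicit L2 penalty; `model`, `criterion`, and `beta`
# are illustrative names, not the paper's implementation.
import torch
import torch.nn as nn

def regularized_loss(model: nn.Module, inputs: torch.Tensor,
                     targets: torch.Tensor, beta: float) -> torch.Tensor:
    data_loss = nn.functional.mse_loss(model(inputs), targets)
    # beta * ||w||^2 summed over all trainable parameters; its gradient is
    # 2 * beta * w, i.e. a restoring force toward the origin.
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())
    return data_loss + beta * l2_penalty
```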

Experimental Verification

The authors present experimental results that confirm the predicted phase transitions. A two-hidden-layer network is trained at varying β values, illustrating transitions in the mean squared error, the curvature, and the parameter distance (Figure 2).

Figure 2: Phase transitions in a two-layer neural network with differing regularization strengths, showing transitions related to accuracy and error surface geometry.

The results show not only the onset-of-learning at β₀ but also an additional transition at β₁, pointing to complex interactions between regularization and data complexity.
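
A sketch of this kind of sweep, under assumed toy data and hyperparameters rather than the paper's exact experiment: train a small two-hidden-layer network from scratch at each β and record the final mean squared error and parameter norm, looking for a discontinuous jump as β crosses a critical value.

```python
# Hedged sketch of a beta sweep with fresh initializations; architecture,
# data, and optimizer settings are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 8)                       # toy inputs (assumed)
y = torch.sin(X.sum(dim=1, keepdim=True))     # toy regression targets (assumed)

def train_from_scratch(beta: float, epochs: int = 2000):
    model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(),
                          nn.Linear(32, 32), nn.Tanh(),
                          nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        l2 = sum((p ** 2).sum() for p in model.parameters())
        loss = mse(model(X), y) + beta * l2
        loss.backward()
        opt.step()
    with torch.no_grad():
        w_norm = torch.sqrt(sum((p ** 2).sum() for p in model.parameters())).item()
        return mse(model(X), y).item(), w_norm

# A jump in MSE (and a collapse of the parameter norm toward zero) as beta
# increases is the onset-of-learning transition described above.
for beta in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    print(beta, train_from_scratch(beta))
```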

Grokking and Hysteresis

Grokking, described as delayed convergence caused by the model being stuck in a local minimum, is explored as a hysteresis effect. This is especially pronounced in high-complexity scenarios, where models initially enter a low-accuracy phase before eventually achieving higher accuracy (Figure 3).

Figure 3: Mean squared error over epochs for networks with different initializations, showing delayed convergence due to hysteresis.

The paper illustrates how grokking can be induced by initializing models within different accuracy phases, highlighting how the time course of training is controlled by the error basin in which a model starts.
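
As an illustration of how hysteresis could be probed numerically (a sketch written for this summary, not the paper's protocol), one can sweep β upward and then downward while warm-starting each run from the previous solution. The two branches disagreeing over a range of β is the hysteresis signature, and a model initialized in the wrong basin takes much longer to reach the high-accuracy phase, mirroring grokking. All hyperparameters and the toy task below are assumptions.

```python
# Hedged hysteresis probe: warm-started beta sweeps in both directions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 8)                       # toy inputs (assumed)
y = torch.sin(X.sum(dim=1, keepdim=True))     # toy regression targets (assumed)

def make_model() -> nn.Module:
    return nn.Sequential(nn.Linear(8, 32), nn.Tanh(),
                         nn.Linear(32, 32), nn.Tanh(),
                         nn.Linear(32, 1))

def warm_started_sweep(betas, epochs: int = 2000):
    model, mse, branch = make_model(), nn.MSELoss(), []
    for beta in betas:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            opt.zero_grad()
            l2 = sum((p ** 2).sum() for p in model.parameters())
            loss = mse(model(X), y) + beta * l2
            loss.backward()
            opt.step()
        # Each beta continues from the previous minimum (warm start), so the
        # model can remain stuck in the basin it currently occupies.
        with torch.no_grad():
            branch.append((beta, mse(model(X), y).item()))
    return branch

betas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
ascending = warm_started_sweep(betas)        # useful -> trivial branch
descending = warm_started_sweep(betas[::-1]) # trivial -> useful branch
print(ascending)
print(descending)
```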

Conclusion

The research provides a comprehensive geometric framework for understanding phase transitions and hysteresis in L2 regularized DNNs. Through theoretical analysis complemented by numerical experiments, the paper deepens the understanding of how regularization interacts with data complexity. Future research directions include further characterizing the global error landscape and developing methods to mitigate undesirable effects such as grokking during training. The authors suggest that these insights could be harnessed to improve training efficiency and controllability in complex learning settings.
