
Grokking at the Edge of Numerical Stability (2501.04697v2)

Published 8 Jan 2025 in cs.LG, cs.AI, cs.CV, and stat.ML

Abstract: Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and ⊥Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods. Code for this paper is available at https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.

Summary

  • The paper demonstrates that floating-point errors in Softmax lead to 'Softmax Collapse', hindering grokking without regularization.
  • It introduces StableMax, a numerically stable activation function, and ⊥Grad, an optimizer designed to curb Naïve Loss Minimization.
  • Empirical results on modular arithmetic and Sparse Parity datasets confirm that mitigating numerical instabilities restores model generalization.

Analysis of "Grokking at the Edge of Numerical Stability"

The paper "Grokking at the Edge of Numerical Stability" addresses a perplexing phenomenon termed "grokking," characterized by sudden generalization in deep learning models following an extended phase of overfitting. Prior research has tied grokking to regularization techniques, particularly weight decay, prompting a broader interest in the mechanisms underlying this emergent behavior. This paper posits that numerical stability issues, specifically floating-point inaccuracies in the Softmax function, notably impede grokking when regularization is absent.

Key Contributions and Insights

The authors provide a comprehensive examination of grokking beyond the commonly investigated conditions necessitating regularization. They identify a critical failure point, termed "Softmax Collapse" (SC), which arises from floating point errors during the computation of the Softmax function. These errors produce zero-gradient conditions that arrest the learning process, often before any improvement in test accuracy is observed. This explains why models fail to generalize without regularization even as they perfectly fit the training data.
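The failure mode is easy to reproduce. Once logits are scaled far enough apart, float32 Softmax assigns exactly 1 to the predicted class and exactly 0 elsewhere, so the cross-entropy gradient with respect to the logits vanishes. A minimal illustration (the specific logit values are chosen only for demonstration):

```python
import numpy as np

def softmax(z, dtype=np.float32):
    # Standard softmax with the max-subtraction trick, in float32.
    z = z.astype(dtype)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Modest logit gaps: probabilities are well-behaved and gradients flow.
p = softmax(np.array([5.0, 2.0, 1.0]))

# After prolonged logit scaling, the gaps exceed float32 range:
# exp(-300) underflows to exactly 0, the top class absorbs all mass,
# and the cross-entropy gradient w.r.t. the logits is exactly zero.
q = softmax(np.array([500.0, 200.0, 100.0]))
print(q)  # [1. 0. 0.]
```

This saturation is what the paper calls Softmax Collapse: the loss can no longer decrease through the Softmax, so learning halts.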

The research introduces two interventions to circumvent SC, thereby enabling grokking in scenarios previously dependent on regularization. The first is a novel activation function called StableMax, a numerically stable alternative to conventional Softmax that avoids the floating-point underflow behind SC. The second is an optimization algorithm, ⊥Grad, which curtails Naïve Loss Minimization (NLM), the process by which gradients align with a direction that scales the logits without altering the model's predictions, undermining generalization.
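As defined in the paper, StableMax replaces the exponential in Softmax with a function that grows only linearly for non-negative inputs, s(x) = x + 1 for x ≥ 0 and s(x) = 1/(1 − x) for x < 0, and then normalizes. A minimal sketch (the vectorized NumPy form here is an illustration, not the authors' implementation):

```python
import numpy as np

def s(x):
    # Surrogate for exp: linear growth for x >= 0, smooth decay toward 0
    # for x < 0, so large logit gaps cannot cause overflow or underflow.
    return np.where(x >= 0, x + 1.0, 1.0 / (1.0 - x))

def stablemax(logits):
    vals = s(np.asarray(logits, dtype=np.float32))
    return vals / vals.sum()

# The same extreme logits that collapse float32 Softmax to [1, 0, 0]
# still yield strictly positive probabilities under StableMax.
out = stablemax([500.0, 200.0, 100.0])
print(out)
```

Because every entry stays strictly positive, the cross-entropy gradient never vanishes to exactly zero, which is what allows training to continue past the point where Softmax collapses.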

Empirical Findings

The empirical results are robust, demonstrating that both StableMax and ⊥Grad restore grokking capabilities in deep models trained on typical grokking datasets like modular arithmetic and Sparse Parity, under non-regularized settings. Notably, experimental validation with increased floating-point precision further substantiates that SC is instrumental in inhibiting the learning process. These interventions illustrate that weight decay or decreased weight norms are not prerequisites for grokking, countering previous hypotheses that emphasized regularization as essential.

Theoretical constructs proposed in this work, notably the conceptualization of NLM, provide an insightful lens for understanding the peculiar delay in generalization observed in grokking. By intercepting gradient dynamics at the onset of alignment with NLM directions, ⊥Grad catalyzes an expedited transition to generalization, bypassing the prolonged overfitting phase.
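The core idea behind ⊥Grad can be sketched as projecting out the gradient component that points along the current weights, since that component only rescales the logits. The per-tensor projection below is an illustrative simplification; the paper's actual update rule (e.g. whether the projection is applied globally, per layer, or with momentum) may differ:

```python
import numpy as np

def perp_grad_step(w, grad, lr=0.01, eps=1e-12):
    # Project out the gradient component along w (the NLM direction,
    # which only scales logits) and step along the orthogonal remainder,
    # which is the part that can actually change the decision boundary.
    w_flat, g_flat = w.ravel(), grad.ravel()
    nlm = (g_flat @ w_flat) / (w_flat @ w_flat + eps) * w_flat
    g_perp = (g_flat - nlm).reshape(grad.shape)
    return w - lr * g_perp

w = np.array([1.0, 2.0])
g = np.array([2.0, 4.0])        # parallel to w: a pure NLM gradient
w_new = perp_grad_step(w, g)
print(np.allclose(w_new, w))    # True: pure logit-scaling updates are blocked
```

When the gradient is entirely in the NLM direction, the update is zero, so the optimizer never spends steps merely inflating logits, which is the mechanism by which it avoids both the generalization delay and the eventual Softmax Collapse.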

Implications and Future Directions

The insights presented extend the understanding of grokking as not merely an artifact of model regularization but as significantly influenced by the numerical intricacies of deep learning computation. This reframing has practical implications, suggesting that attention to numerical precision, beyond architectural or regularization adjustments, can be pivotal in training stability and performance.

The paper also opens several avenues for future inquiry, notably in exploring alternative numerical stabilization techniques or extending the applicability of StableMax and ⊥Grad to more complex architectures and real-world datasets. Developing a deeper theoretical framework around NLM, particularly in quasi-homogeneous neural models, could further demystify the underpinnings of grokking and enhance generalization strategies in deep learning.

In conclusion, "Grokking at the Edge of Numerical Stability" comprehensively addresses the unresolved questions surrounding grokking. It offers a paradigm shift, situating numerical stability as central to understanding delayed generalization and opening new pathways for efficient deep learning model training.
