Is Grokking a Computational Glass Relaxation? (2505.11411v1)

Published 16 May 2025 in cs.LG and cond-mat.dis-nn

Abstract: Understanding the generalizability of neural networks (NNs) remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find that the memorization process resembles a rapid cooling of liquid into a non-equilibrium glassy state at low temperature and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.

Summary

Insights on the Computational Interpretation of Grokking

The paper examines "grokking" in neural networks, focusing on transformers trained on modular arithmetic tasks. Grokking is the delayed, abrupt onset of generalization that occurs long after a network has reached near-perfect training accuracy. The authors interpret it as a computational glass relaxation, drawing parallels between neural networks and physical systems and using concepts from statistical mechanics and glass physics to explain network behavior during grokking.
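To make the task setting concrete, here is a minimal sketch of a modular addition dataset of the kind commonly used in grokking experiments. The modulus `p = 97`, the 50% train fraction, and the function name `modular_addition_dataset` are illustrative assumptions, not values taken from the paper.

```python
import random


def modular_addition_dataset(p=97, train_fraction=0.5, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples.

    p and train_fraction are assumed values chosen for illustration only.
    """
    # Enumerate every input pair once; the label is the sum modulo p.
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = modular_addition_dataset()
print(len(train), len(test))
```

In a setup like this, grokking would appear as training accuracy saturating early while accuracy on `test` stays low for many more optimization steps before rising abruptly.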

The authors argue that grokking resembles a glassy relaxation process, contrary to the earlier suggestion by Rubin et al. that it is a first-order phase transition. This view rests on the observation that no entropy barrier separates the memorizing and generalizing states, which challenges the picture of grokking as a crossing of a free energy barrier. Treating the network's parameters as degrees of freedom and its training loss as the system energy, they sample the entropy landscape as a function of training loss and test accuracy and find that grokking looks more like a slow relaxation towards a high-entropy, stable state than a sharp phase transition.
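One way to write down this mapping, as a sketch in notation that is assumed here rather than taken from the paper ($\theta$ for the network parameters, $L_{\mathrm{train}}$ and $A_{\mathrm{test}}$ for training loss and test accuracy, $\Omega$ for the density of states):

```latex
E(\theta) \equiv L_{\mathrm{train}}(\theta), \qquad
\Omega(L, A) = \int d\theta \,
  \delta\big(L_{\mathrm{train}}(\theta) - L\big)\,
  \delta\big(A_{\mathrm{test}}(\theta) - A\big), \qquad
S(L, A) = k_B \ln \Omega(L, A)
```

In this picture, an entropy barrier would show up as a region of low $S(L, A)$ lying between the memorizing (low $L$, low $A$) and generalizing (low $L$, high $A$) regions. The paper reports finding no such region.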

The research identifies a significant high-entropy advantage under grokking: states with higher entropy tend to generalize better. This extends previous work linking entropy to generalizability, with a much more pronounced effect in the grokking setting. The paper further supports the finding by showing that networks constrained to a fixed weight norm can eliminate grokking while still exhibiting the entropy advantage.
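The summary does not say how the weight norm is held fixed. One common way to impose such a constraint is projected gradient descent, where the weights are rescaled back onto a sphere of fixed radius after every update. The sketch below illustrates that generic technique; the function names and the plain SGD update are assumptions, not the paper's procedure.

```python
import math


def project_to_norm(weights, target_norm):
    """Rescale a weight vector so its Euclidean norm equals target_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    scale = target_norm / norm
    return [w * scale for w in weights]


def constrained_sgd_step(weights, grads, lr, target_norm):
    """Take one plain SGD step, then project back onto the fixed-norm sphere."""
    stepped = [w - lr * g for w, g in zip(weights, grads)]
    return project_to_norm(stepped, target_norm)


# Hypothetical usage: the norm stays at 5.0 whatever the gradient is.
new_weights = constrained_sgd_step([3.0, 4.0], [1.0, -2.0], lr=0.1, target_norm=5.0)
print(math.sqrt(sum(w * w for w in new_weights)))
```

Under a constraint of this kind the weight norm cannot drift during training, so any remaining difference in generalization has to come from something other than norm evolution.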

Numerically, the paper shows that reducing or eliminating grokking need not degrade generalization. With WanD, a toy optimizer based on Wang-Landau molecular dynamics, the authors eliminate grokking without any constraints and find high-norm generalizing solutions. These serve as counterexamples to theories that attribute grokking solely to the evolution of the weight norm towards a specific range (the "Goldilocks zone"). The optimizer reportedly generalizes as well as conventional methods and hints at new avenues for optimizer design.
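WanD itself is not specified in this summary. For orientation, the sketch below shows the classic Wang-Landau Monte Carlo algorithm, the sampling idea that Wang-Landau molecular dynamics builds on, applied to a toy system of independent bits whose "energy" is the number of ones. The toy system, the function name `wang_landau_bits`, and all parameter values are illustrative assumptions. This is not the paper's optimizer.

```python
import math
import random


def wang_landau_bits(n_bits=8, ln_f_final=1e-4, flatness=0.8,
                     seed=0, max_steps=2_000_000):
    """Estimate ln g(E), the log density of states, for n_bits independent
    bits with energy E = number of ones (exact answer: ln C(n_bits, E))."""
    rng = random.Random(seed)
    state = [0] * n_bits
    energy = 0
    ln_g = [0.0] * (n_bits + 1)   # running estimate of ln g(E)
    hist = [0] * (n_bits + 1)     # visits per energy level in this stage
    ln_f = 1.0                    # modification factor, halved stage by stage
    steps = 0

    while ln_f > ln_f_final and steps < max_steps:
        # Propose flipping one bit; the energy changes by +1 or -1.
        i = rng.randrange(n_bits)
        new_energy = energy + (1 - 2 * state[i])

        # Accept with probability min(1, g(E_old) / g(E_new)).
        delta = ln_g[energy] - ln_g[new_energy]
        if delta >= 0 or rng.random() < math.exp(delta):
            state[i] = 1 - state[i]
            energy = new_energy

        # Penalize the current level so the walk is pushed to rarer ones.
        ln_g[energy] += ln_f
        hist[energy] += 1
        steps += 1

        # When the histogram is flat enough, refine: halve ln f and reset.
        if steps % 1000 == 0 and min(hist) > flatness * sum(hist) / len(hist):
            ln_f /= 2
            hist = [0] * (n_bits + 1)

    # Normalize so that the total number of states equals 2 ** n_bits.
    peak = max(ln_g)
    ln_total = peak + math.log(sum(math.exp(x - peak) for x in ln_g))
    shift = n_bits * math.log(2) - ln_total
    return [x + shift for x in ln_g]


estimate = wang_landau_bits()
exact = [math.log(math.comb(8, k)) for k in range(9)]
for k, (est, ref) in enumerate(zip(estimate, exact)):
    print(k, round(est, 2), round(ref, 2))
```

The point of the technique is that the walk is biased away from already-visited energy levels instead of being driven downhill, so it can estimate the entropy at every energy rather than settling into one basin.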

The implications of this research are both practical and theoretical. Practically, it points to optimizers that exploit higher-entropy configurations to improve generalization without the grokking lag. Theoretically, it opens up discussion of the physical analogy for neural networks and of applying statistical mechanics concepts to learning dynamics. Future research might develop this analogy further or draw on other statistical-physics methods to build better learning algorithms.

In summary, the authors assert that grokking is essentially a computational glass relaxation process. This interpretation offers a new perspective on the generalization properties of neural networks and may guide new methodologies in the field of AI.