
Explaining grokking through circuit efficiency (2309.02390v1)

Published 5 Sep 2023 in cs.LG

Abstract: One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation. Most strikingly, we demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.

Authors (5)
  1. Vikrant Varma (10 papers)
  2. Rohin Shah (31 papers)
  3. Zachary Kenton (18 papers)
  4. János Kramár (19 papers)
  5. Ramana Kumar (16 papers)
Citations (36)

Summary

Explaining Grokking through Circuit Efficiency: A Detailed Analysis

Introduction

The paper "Explaining grokking through circuit efficiency" by Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar addresses the enigmatic phenomenon of grokking observed in neural networks. Grokking, as initially identified by Power et al. (2021), describes a process where a neural network, initially showing perfect training accuracy but poor generalization, transitions to perfect generalization with further training. The current work proposes that grokking can be explained through the differential efficiency of memorizing and generalizing circuits within the network.

Key Contributions

The authors make several notable contributions:

  1. Theory of Circuit Efficiency: They introduce a theory positing that grokking arises from the interplay between a memorizing circuit family (Mem) and a generalizing circuit family (Gen). While Mem is learned quickly, Gen is more efficient but slower to learn. Efficiency here means producing larger logits with a lower parameter norm (a toy proxy for this notion is sketched after this list).
  2. Empirical Validation: Four predictions derived from the theory are empirically confirmed. These include the independence of Gen's efficiency from dataset size, the decreasing efficiency of Mem with increasing dataset size, the existence of a critical dataset size ($D_{\text{crit}}$) at which Gen becomes more efficient than Mem, and the occurrence of the novel phenomena of ungrokking and semi-grokking.
  3. New Behaviors: Two new behaviors, ungrokking (regression from generalization to memorization) and semi-grokking (delayed transition to partial rather than perfect generalization), are demonstrated and analyzed.
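
As a rough illustration of the efficiency notion above (logit size achieved per unit of parameter norm), the following sketch compares two hypothetical circuits. The `circuit_efficiency` helper and the specific numbers are illustrative assumptions, not the paper's formal metric.

```python
def circuit_efficiency(correct_logit: float, param_norm: float) -> float:
    """Toy efficiency proxy: logit mass produced per unit of L2 parameter norm.

    Illustrative assumption, not the paper's exact definition: a circuit is
    "more efficient" if it produces the same (or larger) logits with a smaller
    parameter norm, and is therefore favoured once weight decay is applied.
    """
    return correct_logit / param_norm

# Hypothetical numbers: both circuits reach the same correct-class logit,
# but Gen does so with half the parameter norm of Mem.
mem_efficiency = circuit_efficiency(correct_logit=10.0, param_norm=20.0)  # 0.5
gen_efficiency = circuit_efficiency(correct_logit=10.0, param_norm=10.0)  # 1.0
print(gen_efficiency > mem_efficiency)  # True
```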

Insights into Circuit Efficiency

Generalization vs. Memorization

The paper hypothesizes two families of circuits within the network:

  • Mem: A family of circuits that achieve perfect training performance through memorization but fail to generalize.
  • Gen: A family of circuits that achieve perfect generalization but are slower to learn.

The generalizing circuit family (Gen) is more efficient, defined as achieving similar logits to Mem but with a lower parameter norm. This efficiency becomes crucial as the dataset size increases, making Gen increasingly favored by weight decay.
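
To make the weight-decay argument concrete, here is a minimal sketch, assuming a two-class toy example and made-up parameter norms: when two circuits produce identical logits, the regularised loss is strictly lower for the circuit with the smaller parameter norm, so training with weight decay shifts weight toward it.

```python
import math

def regularised_loss(correct_logit: float, wrong_logit: float,
                     param_norm: float, weight_decay: float = 0.01) -> float:
    """Cross-entropy on a two-class toy example plus an L2 penalty.

    The logits and norms used below are assumptions for illustration only.
    """
    cross_entropy = -math.log(
        math.exp(correct_logit) / (math.exp(correct_logit) + math.exp(wrong_logit))
    )
    return cross_entropy + weight_decay * param_norm ** 2

# Identical logits => identical cross-entropy; only the L2 penalty differs.
loss_mem = regularised_loss(correct_logit=10.0, wrong_logit=0.0, param_norm=20.0)
loss_gen = regularised_loss(correct_logit=10.0, wrong_logit=0.0, param_norm=10.0)
print(loss_gen < loss_mem)  # True: weight decay favours the more efficient circuit
```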

Empirical Observations

Empirical evidence is provided to support the hypothesis:

  • Efficiency Analysis: Mem's efficiency decreases with increasing dataset size, whereas Gen's efficiency remains constant irrespective of dataset size. This is quantitatively validated through experiments on a modular addition task and other algorithmic tasks.
  • Critical Dataset Size: The existence of a pivotal dataset size $D_{\text{crit}}$ is confirmed. When the training set is larger than $D_{\text{crit}}$, Gen circuits dominate, leading to grokking. Conversely, when it is smaller than $D_{\text{crit}}$, Mem circuits dominate, preventing generalization. A toy model of this crossover is sketched after this list.
  • Ungrokking: When grokked networks (trained on large datasets) are retrained on significantly smaller datasets, a phase transition occurs, leading to a marked decrease in test accuracy, termed ungrokking.
  • Semi-grokking: Observed when training on dataset sizes around $D_{\text{crit}}$, where the network shows delayed generalization to partial rather than perfect test accuracy.
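
The crossover behind the critical dataset size can be caricatured with a toy model in which the parameter norm Mem needs grows with the number of memorised examples while Gen's norm stays fixed. The square-root scaling and the constants below are assumptions chosen purely to make the crossover visible, not values from the paper.

```python
def mem_norm(num_examples: int) -> float:
    """Assumed toy scaling: memorising more examples requires more parameter norm."""
    return 4.0 * num_examples ** 0.5

GEN_NORM = 80.0          # assumed constant: Gen's norm is independent of dataset size
TARGET_LOGIT = 10.0      # both circuits are assumed to reach the same logits

def efficiency(param_norm: float) -> float:
    return TARGET_LOGIT / param_norm

# The smallest dataset size at which Gen is at least as efficient as Mem
# plays the role of the critical dataset size D_crit in this toy model.
d_crit = next(d for d in range(1, 10_000)
              if efficiency(GEN_NORM) >= efficiency(mem_norm(d)))
print(d_crit)  # 400 with these made-up constants

# Training well above d_crit favours Gen (grokking); retraining a grokked network
# on a much smaller dataset flips the preference back to Mem (ungrokking); near
# d_crit the two are comparably efficient, which is where semi-grokking appears.
```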

Theoretical and Practical Implications

The research has several implications:

  • Understanding Regularization: It emphasizes the role of regularization techniques such as weight decay in favoring efficient circuits, which contribute to generalization over memorization.
  • Design of Neural Networks: Insights on circuit efficiency can inform the design of neural networks, potentially aiding in balancing memorization and generalization capacities.
  • Future Research: The phenomena of ungrokking and semi-grokking open new avenues for exploring the dynamics of learning and generalization within neural networks.

Future Directions

Potential future developments based on the findings include:

  1. Extended Analysis of Implicit Regularization: Investigating other implicit regularization effects that may contribute to grokking, especially in the absence of explicit weight decay.
  2. Broader Applicability: Extending the findings to more complex, real-world tasks and architectures beyond the algorithmic scope.
  3. Finer-Grained Circuit Analysis: Developing more precise techniques to isolate and analyze the roles and efficiencies of different circuit families within neural networks.

Conclusion

This paper provides a significant step forward in understanding the grokking phenomenon in neural networks. By elucidating the role of circuit efficiency and confirming novel predictions empirically, the authors offer a comprehensive theoretical framework with important implications for neural network training and generalization. The introduction of concepts like critical dataset size and behaviors such as ungrokking and semi-grokking marks a substantial contribution to the field, paving the way for further inquiries into the intricacies of deep learning mechanisms.
