
Incorporate Training Dynamics to Explain Double Descent and Grokking

Develop an extension of the concept–text and skill–concept graph framework that explicitly models training time and the number of training epochs, and determine whether this extended framework can reproduce and explain the double descent and grokking phenomena observed in neural network training.


Background

The paper introduces a unified information-theoretic framework based on concept–text and skill–concept bipartite graphs to explain compute-optimal scaling, emergence, and plateaus in LLMs. While the framework captures many scaling phenomena via static graph properties and iterative peeling (analogous to LDPC decoding), it abstracts away training dynamics.
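To make the peeling analogy concrete, the sketch below shows a generic iterative peeling process on a bipartite graph, in the spirit of erasure decoding for LDPC codes. This is a minimal illustration under simplifying assumptions, not the paper's exact model: the node names, the seed set, and the rule "a text resolves a concept once all its other concepts are learned" are hypothetical choices made for the example.

```python
# Minimal sketch (illustrative, not the paper's exact model): iterative peeling
# on a bipartite text-concept graph, analogous to LDPC erasure decoding.
from collections import defaultdict

def peel(edges, seed_concepts):
    """edges: list of (text, concept) pairs; seed_concepts: concepts known at the start."""
    text_to_concepts = defaultdict(set)
    for text, concept in edges:
        text_to_concepts[text].add(concept)

    learned = set(seed_concepts)
    changed = True
    while changed:  # iterate until no text can resolve a new concept
        changed = False
        for concepts in text_to_concepts.values():
            unknown = concepts - learned
            if len(unknown) == 1:  # classic peeling condition: one unresolved neighbor left
                learned |= unknown
                changed = True
    return learned

if __name__ == "__main__":
    edges = [("t1", "counting"), ("t1", "addition"),
             ("t2", "addition"), ("t2", "multiplication")]
    print(peel(edges, seed_concepts={"counting"}))
    # -> {'counting', 'addition', 'multiplication'}
```

Note that the process depends only on the static graph structure and the seed set; nothing in it varies with training time, which is precisely the limitation the open question targets.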

In the conclusion, the authors explicitly note that training time is not modeled, which prevents the framework from addressing dynamic phenomena such as double descent and grokking. They flag this as an open question and suggest either incorporating training epochs into the framework or proposing a different framework to address these phenomena.

References

There are some open questions and considerations worth exploring. We do not take training time into account in our framework. Therefore, we do not explain (or attempt to explain) empirical phenomena such as double descent or grokking. Perhaps future work can either incorporate training epochs in our framework or propose a different novel framework to explain them.

An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models (2410.01243 - Nayak et al., 2 Oct 2024) in Conclusion, final paragraph