Incorporate Training Dynamics to Explain Double Descent and Grokking
Develop an extension of the concept–text and skill–concept graph framework that explicitly models training time and the number of training epochs, and determine whether this extended framework can reproduce and explain the double descent and grokking phenomena observed in neural network training.
References
There are some open questions and considerations worth exploring. We do not take training time into account in our framework. Therefore, we do not explain (or attempt to explain) empirical phenomena such as double descent or grokking. Perhaps future work can either incorporate training epochs in our framework or propose a different novel framework to explain them.
— An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models
(2410.01243 - Nayak et al., 2 Oct 2024) in Conclusion, final paragraph