Impact of spreading memorization from practical activations on large-scale performance
Determine whether the spreading memorization behavior induced by activations with strictly decreasing φ(x)=σ′(x)/x (such as ReLU, SiLU, Tanh, and Sigmoid) leads to better results than focused memorization in large-scale settings for group arithmetic tasks learned by two-layer nonlinear networks.
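The claimed dichotomy can be checked numerically: a minimal sketch (assuming the closed-form derivatives of each activation) that evaluates φ(x) = σ′(x)/x on a grid x > 0 and confirms it is strictly decreasing for ReLU, SiLU, Tanh, and Sigmoid, but constant for the power activation σ(x) = x².

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(0.1, 5.0, 200)  # restrict to x > 0 to avoid division by zero

# sigma'(x) for each practical activation, written out analytically
derivs = {
    "ReLU":    np.ones_like(x),                           # d/dx max(0, x) = 1 for x > 0
    "SiLU":    sigmoid(x) * (1 + x * (1 - sigmoid(x))),   # d/dx [x * sigmoid(x)]
    "Tanh":    1 - np.tanh(x) ** 2,
    "Sigmoid": sigmoid(x) * (1 - sigmoid(x)),
}

# phi(x) = sigma'(x) / x should be strictly decreasing for all of these
for name, d in derivs.items():
    phi = d / x
    assert np.all(np.diff(phi) < 0), f"{name}: phi not strictly decreasing"
    print(f"{name}: phi strictly decreasing on (0, 5]")

# Power activation sigma(x) = x^2: phi(x) = 2x / x = 2, constant (focused memorization)
phi_pow = (2 * x) / x
assert np.allclose(phi_pow, 2.0)
print("x^2: phi constant, not strictly decreasing")
```

This only probes a finite grid, not a proof, but it matches the paper's classification of which activations induce spreading versus focused memorization.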
References
We can verify that power activations (e.g., \sigma(x) = x^2) lead to focused memorization, while more practical ones (e.g., ReLU, SiLU, Tanh and Sigmoid) lead to spreading memorization. We leave it for future work whether this property leads to better results in large scale settings.
— Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
(2509.21519 - Tian, 25 Sep 2025) in Section "Stage II: Independent feature learning", Subsection "The Scaling Laws of the boundary of memorization and generalization" (discussion after Theorem "Memorization solution")