
Impact of spreading memorization from practical activations on large-scale performance

Determine whether the spreading memorization induced by activations with strictly decreasing φ(x) = σ′(x)/x (e.g., ReLU, SiLU, Tanh, and Sigmoid) outperforms focused memorization in large-scale settings for group arithmetic tasks learned by two-layer nonlinear networks.


Background

The paper proves that, under limited and skewed data for a single target in group arithmetic tasks, the optimal features correspond to memorization, with the type of memorization determined by the activation's derivative-to-input ratio φ(x) = σ′(x)/x. Power activations (e.g., σ(x) = x²), for which φ is constant, produce focused memorization of single pairs, whereas practical activations (ReLU, SiLU, Tanh, Sigmoid), for which φ is strictly decreasing, yield spreading memorization across multiple pairs.
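The distinction above hinges on the monotonicity of φ(x) = σ′(x)/x. A minimal numerical sketch (assuming closed-form derivatives for each activation; the grid and tolerance are illustrative choices, not from the paper) checks that φ is strictly decreasing on x > 0 for the practical activations and constant for the power activation σ(x) = x²:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Closed-form derivatives sigma'(x) of each activation, valid on x > 0.
derivs = {
    "ReLU":    lambda x: np.ones_like(x),                               # sigma(x) = max(0, x)
    "Tanh":    lambda x: 1.0 - np.tanh(x) ** 2,
    "Sigmoid": lambda x: sigmoid(x) * (1.0 - sigmoid(x)),
    "SiLU":    lambda x: sigmoid(x) + x * sigmoid(x) * (1.0 - sigmoid(x)),
    "x^2":     lambda x: 2.0 * x,                                       # power activation
}

x = np.linspace(0.1, 5.0, 200)  # grid on x > 0 (illustrative range)
for name, dsigma in derivs.items():
    phi = dsigma(x) / x          # phi(x) = sigma'(x) / x
    dphi = np.diff(phi)
    if name == "x^2":
        print(name, "-> phi constant:", bool(np.allclose(dphi, 0.0)))
    else:
        print(name, "-> phi strictly decreasing:", bool(np.all(dphi < 0.0)))
```

For ReLU this reduces to φ(x) = 1/x, and for σ(x) = x² to φ(x) = 2, matching the two regimes the paper associates with spreading and focused memorization, respectively.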

The authors explicitly pose the question of whether this spreading memorization property translates into better performance at scale, leaving it as future work. This problem seeks to evaluate the practical implications of the theoretically characterized memorization regimes in large-scale training contexts.

References

We can verify that power activations (e.g., σ(x) = x²) lead to focused memorization, while more practical ones (e.g., ReLU, SiLU, Tanh and Sigmoid) lead to spreading memorization. We leave it for future work whether this property leads to better results in large-scale settings.

Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking (2509.21519 - Tian, 25 Sep 2025) in Section "Stage II: Independent feature learning", Subsection "The Scaling Laws of the boundary of memorization and generalization" (discussion after Theorem "Memorization solution")