- The paper introduces the linear mapping number (LMN) as a novel complexity measure that explains network compression and delayed generalization in grokking.
- Experimental results on modular addition, permutation group S4, and multi-digit XOR tasks validate LMN’s ability to capture decreasing complexity and reveal double-descent patterns.
- The study challenges the adequacy of the traditional L2 norm by proposing LMN as a neural network analogue of Kolmogorov complexity, offering new insights into generalization.
Grokking as Compression: A Nonlinear Complexity Perspective
The paper "Grokking as Compression: A Nonlinear Complexity Perspective" by Ziming Liu, Ziqian Zhong, and Max Tegmark offers a detailed exploration of the grokking phenomenon via network complexity measures. Grokking is a notable occurrence in deep learning, exhibiting delayed generalization following memorization. The authors introduce the linear mapping number (LMN), an innovative complexity metric designed to better measure network behavior compared to traditional L2 norms.
Core Contributions and Methodology
The paper postulates that grokking can be attributed to a compression phase in which the network's internal complexity diminishes, eventually leading to generalization. The authors argue that the L2 norm is inadequate as a complexity measure and advocate the LMN instead. The proposed LMN metric extends the linear region number, traditionally defined for ReLU networks, to networks with arbitrary activation functions. LMN quantifies the number of distinct linear mappings a network implements, capturing its complexity through the partitioning of the input space into regions, each governed by a local linear mapping.
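To make the underlying idea concrete, the sketch below estimates the classical linear region count for a ReLU network by counting distinct activation patterns over sampled inputs. The paper's LMN generalizes this notion to other activation functions and its exact computation may differ; the network shape and probe distribution here are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

def count_linear_regions(model: nn.Sequential, inputs: torch.Tensor) -> int:
    """Count distinct ReLU activation patterns over a batch of probe inputs.

    Each distinct pattern corresponds to one region of input space on which
    the network computes a single affine mapping, so the count is a crude
    complexity proxy of the kind the LMN refines.
    """
    patterns = []
    x = inputs
    with torch.no_grad():
        for layer in model:
            x = layer(x)
            if isinstance(layer, nn.ReLU):
                # Record which units are active; this sign pattern identifies
                # the local linear region each input falls into.
                patterns.append((x > 0).int())
    codes = torch.cat(patterns, dim=1)           # one binary code per input
    return torch.unique(codes, dim=0).shape[0]   # number of distinct codes

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(),
                        nn.Linear(32, 32), nn.ReLU(),
                        nn.Linear(32, 1))
    samples = torch.rand(10_000, 2)              # probe points in [0, 1]^2
    print("estimated linear regions:", count_linear_regions(net, samples))
```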
Using tasks including modular addition, the permutation group S4, and multi-digit XOR, the authors empirically demonstrate that LMN decreases after the memorization phase, substantiating the compression hypothesis. A notable finding arises in the multi-digit XOR task, where LMN exhibits a double-descent pattern after grokking, suggesting the existence of multiple generalization solutions.
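For context, the following sketch builds the kind of modular addition dataset commonly used in grokking experiments: all pairs (a, b) under addition mod p, split into train and test halves. The modulus p = 97, the 50% split, and the one-hot encoding are assumptions for illustration rather than the paper's exact configuration.

```python
import itertools
import torch

# Illustrative modular-addition setup (p, split, and encoding are assumptions).
P = 97
pairs = torch.tensor(list(itertools.product(range(P), repeat=2)))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P                           # a + b mod p

# Random 50/50 split of the full pair table into train and test sets.
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

# One-hot encode the two operands and concatenate them as network input.
inputs = torch.cat([torch.nn.functional.one_hot(pairs[:, 0], P),
                    torch.nn.functional.one_hot(pairs[:, 1], P)], dim=1).float()

print(inputs.shape, labels.shape, len(train_idx), len(test_idx))
```

In such a setup, grokking appears as training accuracy saturating long before test accuracy rises; the paper tracks LMN alongside these curves to expose the compression phase.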
Implications and Future Directions
The implications of this work are multifaceted. The introduction of LMN as a complexity measure that correlates linearly with test loss offers a potentially more intuitive handle on network compression and the grokking phenomenon. The paper further suggests that LMN could serve as a neural network analogue of Kolmogorov complexity, emphasizing the importance of incorporating local linear computations into complexity assessments. This perspective invites further study of LMN across diverse architectures and tasks to establish its robustness and generality.
Future research could probe the mechanistic underpinnings of observed phenomena such as the XOR task's double-descent pattern, which hints at richer dynamics among competing generalization solutions. Such investigations could yield deeper insight into the interplay between network structure and function during the transition from memorization to generalization.
In summary, this paper contributes a compelling perspective on network complexity, offering LMN as a novel tool to interpret the intricate dynamics of learning and generalization in deep networks, particularly within the context of grokking. It encourages the community to reassess traditional complexity measures and consider more nuanced, computation-informed metrics that align with the operational nature of modern artificial neural networks.