Rank Minimization in Grokking
- The paper finds that neural networks transition from high-rank representations to low-rank solutions, with a sharp rank collapse tightly correlating (Pearson r > 0.9) with improved test accuracy.
- Regularization techniques like nuclear norm promote rank collapse in parameter matrices, inducing a two-phase training dynamic where memorization is followed by low-rank generalizing solutions.
- Grokking as low-rank tensor learning demonstrates that structured tasks benefit from decomposing high-dimensional targets into low-rank factors, offering new insights into neural-network representation learning.
Rank minimization in grokking refers to the empirical and theoretical observation that neural networks, during extended training, transition from fitting the data with high-rank internal representations or parameter matrices to discovering low-rank solutions that generalize, and that the timing and sharpness of this rank collapse closely coincide with dramatic improvements in test accuracy. This process is now recognized as a universal mechanism underlying the grokking phenomenon, whether in deep multilayer perceptrons, overparameterized two-layer networks learning group-theoretic tasks, or linear models regularized to induce low rank. Across these contexts, rank minimization can be tracked via numerical feature ranks of activations, ranks of learned tensors, or singular value spectra of parameter matrices, with distinct and measurable transitions aligned with generalization.
1. Feature Rank in Deep Neural Networks
In deep neural networks exhibiting grokking, feature rank is defined via the singular values of the activation matrix for each layer. If $H_\ell \in \mathbb{R}^{n \times d_\ell}$ denotes the activations of layer $\ell$ over a batch of $n$ examples, the sample covariance matrix is constructed as $\Sigma_\ell = \tfrac{1}{n} H_\ell^\top H_\ell$. The numerical rank of layer $\ell$ is then the number of singular values of $\Sigma_\ell$ exceeding a machine-precision-dependent threshold. This rank is estimated per layer throughout training.
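As a concrete illustration, the following NumPy sketch computes such a numerical rank from one layer's activation matrix, working directly with the (centered) activations, whose rank equals that of the sample covariance; the centering step and the tolerance, which mirrors `numpy.linalg.matrix_rank`'s default, are implementation assumptions rather than details drawn from the cited work.

```python
import numpy as np

def numerical_feature_rank(H, rtol=None):
    """Numerical rank of a layer's activation matrix H of shape (batch, features).

    Counts singular values above a machine-precision-dependent threshold,
    mirroring numpy.linalg.matrix_rank's default tolerance.
    """
    H = H - H.mean(axis=0, keepdims=True)            # center features over the batch
    s = np.linalg.svd(H, compute_uv=False)           # singular values, descending
    if s[0] == 0.0:
        return 0
    if rtol is None:
        rtol = max(H.shape) * np.finfo(H.dtype).eps  # default machine-precision tolerance
    return int(np.sum(s > rtol * s[0]))

# Usage: collect per-layer activations (e.g., via forward hooks) and track ranks over training.
# ranks = [numerical_feature_rank(h) for h in layer_activations]
```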
Empirically, during initial overfitting, feature ranks are maximal and decrease only gradually. The onset of sharp generalization—the grokking transition—is tightly coupled to a sudden drop in one or more layer-wise feature ranks. In deep networks, this rank minimization can proceed in multiple stages, echoing the double-descent phenomenon: test accuracy jumps, plateaus, and then surges again after a second rank collapse. This pattern is prominent in deep MLPs (e.g., 12-layer networks trained on MNIST), whereas shallow architectures typically exhibit a single transition. Correlation analyses report Pearson coefficients exceeding 0.9 between the timing of rank collapse and test-accuracy improvement, a much tighter predictive relationship than can be obtained from parameter norm metrics (Fan et al., 29 May 2024).
2. Regularization-Induced Rank Collapse
Grokking can be induced by regularization strategies that bias model parameters toward low-rank or sparse solutions. In linear matrix sensing, the application of nuclear norm ($\|W\|_*$) regularization on the parameter matrix $W$ produces a two-phase training dynamic. The initial memorization phase exhibits rapid training error convergence but no rank reduction and no generalization. Once the gradient of the empirical loss vanishes, gradient descent with the nuclear norm penalty continues to shrink the singular values of $W$ linearly until excess singular values cross zero, effecting discrete drops in rank.
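The following toy sketch illustrates this dynamic on a small matrix-sensing problem, using proximal gradient descent with singular-value soft-thresholding as a stand-in for the exact training procedure of the cited work; all dimensions and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix sensing: recover a rank-2 ground truth from noiseless linear measurements.
d, r, m = 20, 2, 300                        # matrix size, true rank, number of measurements
W_star = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
A = rng.normal(size=(m, d, d))              # random sensing matrices A_k
y = np.einsum('mij,ij->m', A, W_star)       # measurements y_k = <A_k, W*>

W = rng.normal(size=(d, d))                 # full-rank ("memorizing") initialization
eta, lam = 0.1, 0.1                         # learning rate eta, nuclear-norm strength lambda

for step in range(5001):
    resid = np.einsum('mij,ij->m', A, W) - y
    grad = np.einsum('m,mij->ij', resid, A) / m               # gradient of the squared loss
    U, s, Vt = np.linalg.svd(W - eta * grad, full_matrices=False)
    s = np.maximum(s - eta * lam, 0.0)                        # shrink singular values linearly
    W = (U * s) @ Vt                                          # proximal (soft-threshold) step
    if step % 1000 == 0:
        rank = int(np.sum(s > 1e-8))
        rel_err = np.linalg.norm(W - W_star) / np.linalg.norm(W_star)
        print(f"step {step:4d}  rank {rank:2d}  relative recovery error {rel_err:.3f}")
```

Because the measurements underdetermine the full matrix, the training loss is fit quickly, while the regularizer then removes excess singular values one threshold at a time, producing the stepwise rank drops described above.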
The total grokking duration scales inversely with the strength of regularization. Experimentally, in low-rank linear tasks, the parameter rank collapses stepwise from full to minimal during an extended interval after a perfect train fit, matching the sharp drop in test error. $\ell_2$ norm regularization, by contrast, induces only uniform scaling of singular values and never steers the model to exact low-rank solutions, thereby failing to reliably induce grokking or rank collapse (Notsawo et al., 6 Jun 2025).
3. Grokking as Low-Rank Tensor Learning
In structured tasks such as learning group operations, the function to be learned can be represented as a 3-tensor whose entries encode the combinatorial output of group word operations. The tensor rank provides a measure of representational complexity. A two-layer neural network, or equivalently a Hadamard network, approximates this tensor as a sum of rank-one factors, with the required network width matching the minimal rank needed to represent or approximate the tensor.
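For the cyclic group $\mathbb{Z}_n$, this low-rank structure can be made explicit: the modular-addition tensor admits a rank-$n$ CP decomposition over the complex numbers whose rank-one factors are the group's Fourier characters. The short NumPy check below constructs the tensor and verifies that decomposition numerically (the group size $n$ is an arbitrary choice).

```python
import numpy as np

n = 7                                             # illustrative group size
T = np.zeros((n, n, n))
for a in range(n):
    for b in range(n):
        T[a, b, (a + b) % n] = 1.0                # T[a, b, c] = 1 iff a + b = c (mod n)

# Rank-n CP decomposition over the complex numbers via Fourier characters:
# T[a, b, c] = (1/n) * sum_k  w^{ka} * w^{kb} * conj(w^{kc}),   w = exp(2*pi*i/n)
k = np.arange(n)
F = np.exp(2j * np.pi * np.outer(k, np.arange(n)) / n)   # F[k, a] = w^{ka}
T_hat = np.einsum('ka,kb,kc->abc', F, F, F.conj()).real / n
assert np.allclose(T_hat, T)
print("rank-n Fourier decomposition of the modular-addition tensor verified")
```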
Empirically, networks first fit the training data via high-rank approximations (high-width solutions) that do not generalize. Eventually, under continued training and where the model has sufficient width, the network discovers a low-rank decomposition aligned with the problem’s algebraic or group-theoretic structure, at which point test loss collapses—i.e., grokking occurs. Theoretical analyses provide tight bounds on the minimal achievable rank via group fusion and efficient matrix multiplication algorithms (Strassen-type bounds) and demonstrate that gradient descent dynamics prefer these low-rank solutions once width is sufficient and training is prolonged. These transitions are visible in the singular value structure and result in distinctly narrower valleys in parameter space associated with generalizing solutions (Shutman et al., 8 Sep 2025).
4. Quantitative and Qualitative Indicators of Rank Minimization
Across all experimental paradigms, the most reliable quantitative indicator of the grokking transition is a precipitous drop in internal feature rank, tensor rank, or matrix rank of parameters, as measured by singular value thresholding. This is observed at precise points in training corresponding to the first rise in test accuracy after a prolonged overfitting phase. By contrast, metrics such as the global parameter norm (even under weight decay) exhibit only smooth or monotonic changes with no sharp signature at the grokking threshold. Linear-probe accuracy of hidden representations increases in tandem with feature rank reduction (Fan et al., 29 May 2024).
The following table summarizes key empirical observables and their alignment with grokking events; a minimal monitoring sketch follows the table:
| Observable | Behavior at Grokking | Predictive Value |
|---|---|---|
| Layerwise Feature Rank | Sharp Decrease | High (r > 0.9) |
| Parameter Matrix Rank | Discrete Step Down | High |
| Nuclear Norm ($\|W\|_*$) | Linear Decrease | High |
| Weight Norm ($\ell_2$) | Monotonic, Smooth | Low |
| Test Accuracy | Sudden Improvement | – |
| Training Loss | Already Near Zero | – |
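These observables can be computed from a snapshot of a model's weight matrices with a few lines of NumPy; the sketch below is a minimal monitoring utility in which the singular-value tolerance and the choice of which matrices to track are assumptions.

```python
import numpy as np

def grokking_observables(weight_matrices, tol=1e-6):
    """Per-matrix observables from the table above for one training snapshot.

    A minimal monitoring sketch: the singular-value tolerance and the choice
    of which matrices to track are assumptions, not values from the papers.
    """
    metrics = []
    for W in weight_matrices:
        s = np.linalg.svd(W, compute_uv=False)
        metrics.append({
            "rank": int(np.sum(s > tol * s[0])) if s[0] > 0 else 0,  # thresholded matrix rank
            "nuclear_norm": float(s.sum()),                          # sum of singular values
            "weight_norm": float(np.linalg.norm(W)),                 # Frobenius norm
        })
    return metrics

# Usage (e.g., with a PyTorch model, logged once per evaluation step):
# snapshot = [p.detach().cpu().numpy() for p in model.parameters() if p.ndim == 2]
# print(grokking_observables(snapshot))
```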
5. Theoretical Underpinnings and Open Problems
Theoretical analyses substantiate that, for linear and low-rank matrix recovery with nuclear norm regularization, the two-phase grokking dynamic arises rigorously: first via fast error minimization, then via slow rank reduction as singular values are shrunk to zero. The time to achieve rank minimization and full generalization is proportional to $1/(\eta\lambda)$, with $\eta$ the learning rate and $\lambda$ the regularization strength. Similar statements hold for low-rank tensor approximation in group-based classification, where the minimal achievable rank is determined by group structure and fusion algebra.
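A sketch of the argument behind this rate, under the assumption that the training-loss gradient has effectively vanished and that $W_t$ has SVD $U_t \Sigma_t V_t^\top$ with all retained singular values positive:

```latex
% eta = learning rate, lambda = nuclear-norm coefficient, sigma_i = i-th singular value of W_t
\begin{aligned}
W_{t+1} &= W_t - \eta\,\nabla_W\!\left(\mathcal{L}_{\mathrm{train}}(W_t) + \lambda \|W_t\|_*\right)
        \;\approx\; W_t - \eta\lambda\, U_t V_t^{\top}
        && \bigl(\nabla_W \mathcal{L}_{\mathrm{train}} \approx 0,\ \partial\|W\|_* \ni U V^{\top}\bigr) \\
\sigma_i(t+1) &= \sigma_i(t) - \eta\lambda
        && \text{(each singular value shrinks linearly)} \\
t_i &\approx \frac{\sigma_i(t_0)}{\eta\lambda}
        \quad\Longrightarrow\quad t_{\mathrm{grok}} \propto \frac{1}{\eta\lambda}.
\end{aligned}
```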
Empirical and combinatorial results demonstrate the universality of rank minimization in grokking, but several open questions remain. These include precise characterization of the attractors in gradient descent—i.e., whether low-rank minima are globally or merely locally attractive for general classes of architectures—and extension of analytical results to generic nonlinear activations. Further, the connection between feature rank collapse and the emergence of linearly separable internal representations is a promising area for mechanistic study (Fan et al., 29 May 2024, Notsawo et al., 6 Jun 2025, Shutman et al., 8 Sep 2025).
6. Comparative Perspective: Rank Minimization vs. Other Generalization Mechanisms
While early work hypothesized that generalization in grokking might be mediated by weight-norm drift toward a “Goldilocks zone,” empirical studies decisively indicate that this is neither necessary nor sufficient. The generalization transition is governed specifically by the compression of high-dimensional representations—formally, a drop in rank—rather than any scalar function of weight magnitude. Rank collapse is observed as the primary, and often sole, signature of impending grokking in deep and shallow networks, across varying regularization paradigms and model depths.
A plausible implication is that future mechanistic explanations of generalization and delayed test accuracy improvement in overparameterized networks should focus on the geometry and dynamics of rank-minimizing trajectories in both the activation and parameter space, rather than on norm-based considerations alone (Fan et al., 29 May 2024, Notsawo et al., 6 Jun 2025).
7. Significance and Future Directions
Rank minimization as the driver of grokking has broad significance for understanding representational learning, sample complexity, and the nature of inductive bias in both shallow and deep overparameterized models. Empirical evidence across paradigms establishes that gradient descent, especially when coupled with suitable regularization or network architecture, naturally uncovers and aligns with low-rank decompositions that encode the simplest sufficient solution for the data.
Ongoing questions concern the detailed mechanisms by which networks escape high-rank overfitting minima, the universality of rank-driven grokking in non-synthetic or real-world data, and the design of initialization, regularization, or architectural schemes that either promote or suppress multistage grokking phenomena—such as the double-descent in deep networks. Further, the algebraic structure of learning targets (as in group-theoretic tasks) points toward new connections between learning dynamics, tensor decomposition, and the broader mathematics of representation theory (Shutman et al., 8 Sep 2025).