Isolating architectural effects from optimization in grokking comparisons

Develop an experimental methodology that fully isolates the causal effect of model architecture, specifically Multilayer Perceptrons (MLPs) versus Transformers, on grokking dynamics. The methodology must control for confounding factors introduced by optimizer choice (e.g., SGD versus AdamW) and regularization strength (e.g., weight decay), so that architecture-level comparisons are not conflated with optimization and regularization differences.

Background

The paper shows that differences in grokking speed between Transformers and MLPs can be substantially driven by optimizer and regularization choices (e.g., AdamW vs. SGD, and differing weight decay) rather than by architecture alone. Even with careful configuration matching, the authors note that architectural comparisons in practice require different optimizers and regularization regimes to ensure stable training.

Because optimization and regularization significantly affect grokking dynamics, the authors identify fully isolating the architectural contribution as an unresolved problem, highlighting the need for a methodology that disentangles architecture from optimization and regularization effects in controlled comparisons.
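One standard way to disentangle such confounds is a fully crossed factorial design: run every architecture under every optimizer and weight-decay setting (with multiple seeds), rather than pairing each architecture with its "usual" optimizer. The sketch below illustrates this; the specific factor levels (optimizers, weight-decay values, seed count) are illustrative assumptions, not the paper's actual protocol.

```python
from itertools import product

# Hypothetical factor levels for a fully crossed design; the concrete
# values here are assumptions chosen for illustration.
ARCHITECTURES = ["mlp", "transformer"]
OPTIMIZERS = ["sgd", "adamw"]
WEIGHT_DECAYS = [0.0, 0.1, 1.0]
SEEDS = [0, 1, 2]


def factorial_configs():
    """Enumerate every architecture x optimizer x weight-decay x seed cell.

    Crossing all factors lets the architectural effect on grokking be
    estimated as a marginal over optimization settings, instead of being
    confounded with the optimizer/regularization regime each architecture
    is typically trained under. Cells where training diverges should be
    recorded as unstable rather than silently dropped, since instability
    itself is informative about the architecture-optimizer interaction.
    """
    return [
        {"arch": a, "optimizer": o, "weight_decay": wd, "seed": s}
        for a, o, wd, s in product(ARCHITECTURES, OPTIMIZERS,
                                   WEIGHT_DECAYS, SEEDS)
    ]


configs = factorial_configs()
print(len(configs))  # 2 * 2 * 3 * 3 = 36 cells
```

With results from all 36 cells, the architecture main effect can be compared against the architecture-by-optimizer interaction term (e.g., via a two-way analysis of grokking time), making explicit how much of the apparent architectural difference is attributable to optimization choices.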

References

Architectural comparisons require different optimizers and regularization regimes (e.g., SGD for MLPs, AdamW for Transformers) to ensure stable training. Although we carefully match configurations, fully isolating architecture from optimization is an open challenge.

A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization  (2603.25009 - Manir et al., 26 Mar 2026) in Discussion — Limitations subsection