Isolating architectural effects from optimization in grokking comparisons
Develop an experimental methodology that isolates the causal effect of model architecture (specifically, multilayer perceptrons versus Transformers) on grokking dynamics from confounding factors introduced by optimizer choice (e.g., SGD versus AdamW) and regularization strength (e.g., weight decay). The goal is to ensure that architecture-level comparisons are not conflated with differences in optimization and regularization.
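One standard way to approach such deconfounding is a full factorial design: train every architecture under every optimizer and regularization setting, then compare grokking metrics only within matched cells. The sketch below enumerates such a grid; all architecture names, optimizer names, and weight-decay values are illustrative assumptions, not settings from the cited study.

```python
from itertools import product

# Illustrative factor levels (assumed, not taken from the cited paper).
ARCHITECTURES = ["mlp", "transformer"]
OPTIMIZERS = ["sgd", "adamw"]
WEIGHT_DECAYS = [0.0, 0.1, 1.0]

def factorial_grid():
    """Enumerate every (architecture, optimizer, weight_decay) cell.

    Because each architecture appears under every optimization and
    regularization setting, differences in grokking behavior measured
    within a matched (optimizer, weight_decay) cell can be attributed
    to architecture rather than to the training recipe.
    """
    return [
        {"arch": a, "optimizer": o, "weight_decay": wd}
        for a, o, wd in product(ARCHITECTURES, OPTIMIZERS, WEIGHT_DECAYS)
    ]

grid = factorial_grid()
print(len(grid))  # 2 architectures x 2 optimizers x 3 decays = 12 cells
```

In practice some cells may fail to train stably (the limitation the authors note), so a refinement is to report per-cell training stability alongside grokking metrics rather than silently dropping unstable cells.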
References
Architectural comparisons require different optimizers and regularization regimes (e.g., SGD for MLPs, AdamW for Transformers) to ensure stable training. Although we carefully match configurations, fully isolating architecture from optimization is an open challenge.
— A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
(2603.25009 - Manir et al., 26 Mar 2026) in Discussion — Limitations subsection