Analytical characterization of why gradient-based training reaches global minima
Establish a rigorous analytical characterization of why gradient-based optimization methods such as Adam converge to global minima of the single-head tied-attention empirical risk minimization problem defined in Equation (2), with squared loss and weight decay, in the high-dimensional regime considered in the paper, despite the non-convexity of the objective.
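The sketch below illustrates the kind of training run the problem refers to: minimizing a squared loss with weight decay over a single-head attention model with a tied query-key matrix, using Adam. The specific parameterization (query = key = W, identity values), the synthetic planted-teacher data, and all hyperparameters are illustrative assumptions for this sketch, not the exact setup of Equation (2) in the paper.

```python
# Minimal sketch (assumed setup, not the paper's exact Equation (2)):
# Adam training of a single-head tied-attention model with squared loss
# and decoupled weight decay, in JAX/optax.
import jax
import jax.numpy as jnp
import optax

d, L, n = 32, 8, 256          # embedding dim, sequence length, sample size
key = jax.random.PRNGKey(0)
kX, kT, kW = jax.random.split(key, 3)

def attention(W, X):
    # Single head with tied query/key matrix W and identity value matrix.
    scores = (X @ W) @ (X @ W).T / jnp.sqrt(d)   # (L, L) attention scores
    return jax.nn.softmax(scores, axis=-1) @ X    # (L, d) output

# Synthetic data from a planted teacher matrix (illustrative only).
X = jax.random.normal(kX, (n, L, d)) / jnp.sqrt(d)
W_star = jax.random.normal(kT, (d, d)) / jnp.sqrt(d)
Y = jax.vmap(lambda x: attention(W_star, x))(X)

def empirical_risk(W, X, Y):
    # Mean squared error over the n samples.
    preds = jax.vmap(lambda x: attention(W, x))(X)
    return jnp.mean(jnp.sum((preds - Y) ** 2, axis=(1, 2)))

# Adam with decoupled weight decay, playing the role of the ridge penalty.
opt = optax.adamw(learning_rate=1e-2, weight_decay=1e-3)
W = jax.random.normal(kW, (d, d)) / jnp.sqrt(d)
state = opt.init(W)

@jax.jit
def step(W, state):
    loss, grads = jax.value_and_grad(empirical_risk)(W, X, Y)
    updates, state = opt.update(grads, state, W)
    return optax.apply_updates(W, updates), state, loss

for t in range(500):
    W, state, loss = step(W, state)
```

The open problem asks for an analytical account of why runs of this kind reach global minima of the non-convex objective in the high-dimensional regime, rather than getting trapped in spurious local minima.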
References
We observe an excellent match for both the test error and the training loss, which given the non-convexity of the loss is a highly non-trivial phenomenon, whose analytical characterization is a challenging problem that we leave for future work.
— Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions
(arXiv:2509.24914, Boncoraglio et al., 29 Sep 2025), Section 4, "Behavior of gradient descent" (after Figures 1–2)