Analytical characterization of why gradient-based training reaches global minima

Establish a rigorous analytical characterization of why gradient-based optimization methods such as Adam converge to global minima when applied to the single-head tied-attention empirical risk minimization problem defined in Equation (2), with squared loss and weight decay, in the high-dimensional regime considered in the paper, despite the non-convexity of the objective.
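
For concreteness, the sketch below illustrates the kind of training setup the problem refers to: a single-head attention layer with tied query/key weights (W_Q = W_K = W), fit by Adam under squared loss with weight decay. The specific parametrization, the linear readout v, the synthetic Gaussian data, and all hyperparameters are illustrative assumptions; they do not reproduce the paper's Equation (2), its target function, or its high-dimensional scaling.

```python
# Minimal sketch (not the paper's exact Equation (2)): a single-head
# tied-attention model trained with Adam on squared loss plus weight decay.
# The tied form W_Q = W_K = W, the readout v, and the synthetic data are
# assumptions made here for illustration only.
import torch

torch.manual_seed(0)
n, L, d = 256, 8, 32  # samples, sequence length, embedding dimension (placeholders)

class TiedAttention(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(d, d) / d**0.5)  # tied query/key weights
        self.v = torch.nn.Parameter(torch.randn(d) / d**0.5)     # linear readout

    def forward(self, X):                                  # X: (batch, L, d)
        scale = X.shape[-1] ** 0.5
        scores = X @ self.W @ self.W.T @ X.transpose(1, 2) / scale  # (batch, L, L)
        A = torch.softmax(scores, dim=-1)                  # attention matrix
        return (A @ X) @ self.v                            # one output per token: (batch, L)

X = torch.randn(n, L, d)   # synthetic Gaussian inputs (placeholder)
y = torch.randn(n, L)      # placeholder targets

model = TiedAttention(d)
# Weight decay enters via Adam's weight_decay argument (an L2 penalty on the parameters).
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-3)

for step in range(500):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()  # empirical risk with squared loss
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.4f}")
```

The open problem is to explain analytically why Adam (or gradient descent) on such a non-convex objective reliably reaches global minima in the high-dimensional regime studied in the paper.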

Background

The paper shows excellent empirical agreement between theoretical predictions and Adam-based simulations for both the training loss and the test error when optimizing the non-convex empirical risk of the single-head tied-attention model. Nevertheless, the authors note that a theoretical explanation of why gradient-based methods succeed in finding global minima is still lacking.

They explicitly defer a formal analysis of the optimization dynamics responsible for this behavior, highlighting the need for a rigorous understanding of convergence properties and the landscape structure that enables global minimization in this setting.

References

We observe an excellent match for both the test error and the training loss, which given the non-convexity of the loss is a highly non-trivial phenomenon, whose analytical characterization is a challenging problem that we leave for future work.

Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions (Boncoraglio et al., 29 Sep 2025, arXiv:2509.24914), Section 4, "Behavior of gradient descent" (after Figures 1–2).