SGD dynamics and sample complexity on finite datasets
Analyze the stochastic gradient descent (SGD) training dynamics of one-layer transformers with softmax attention (e.g., the architecture studied for the q-sparse token selection task) under empirical risk minimization on a finite dataset, and derive sample complexity bounds that clarify convergence behavior and generalization requirements beyond the population-loss setting.
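The sketch below is a minimal, illustrative version of the setting this problem refers to, not the construction of Wang et al.: a toy q-sparse token selection task (predict the mean of q selected tokens, with the selection encoded as a q-hot indicator), a single softmax-attention layer with a simplified bilinear score parameterization, and plain minibatch SGD run over a fixed finite training set so the train/held-out gap can be observed. All hyperparameters (L, d, q, sample sizes, learning rate) are assumptions chosen only for illustration.

```python
# Minimal sketch (assumed setup, not the paper's exact task or architecture):
# one softmax-attention layer trained with SGD on a finite sample of a toy
# q-sparse token selection task, tracking train vs. held-out loss.
import torch

torch.manual_seed(0)
L, d, q = 16, 8, 3            # sequence length, token dim, sparsity (illustrative)
n_train, n_test = 512, 4096   # finite training set vs. large held-out set

def sample_batch(n):
    X = torch.randn(n, L, d)                                     # random token contents
    S = torch.stack([torch.randperm(L)[:q] for _ in range(n)])   # selected positions
    sel = torch.zeros(n, L).scatter_(1, S, 1.0)                  # q-hot selection indicator
    y = (sel.unsqueeze(-1) * X).sum(1) / q                       # target: mean of selected tokens
    return X, sel, y

class OneLayerSoftmaxAttention(torch.nn.Module):
    # Attention logits come from a bilinear interaction between the q-hot
    # indicator and the attended position; the value map is the identity
    # (a simplifying assumption for brevity).
    def __init__(self):
        super().__init__()
        self.A = torch.nn.Parameter(0.01 * torch.randn(L, L))

    def forward(self, X, sel):
        scores = sel @ self.A                         # (n, L) attention logits over positions
        attn = torch.softmax(scores, dim=-1)          # softmax attention weights
        return torch.einsum("nl,nld->nd", attn, X)    # attention-weighted average of tokens

model = OneLayerSoftmaxAttention()
opt = torch.optim.SGD(model.parameters(), lr=0.2)
Xtr, sel_tr, ytr = sample_batch(n_train)
Xte, sel_te, yte = sample_batch(n_test)

for step in range(2001):
    idx = torch.randint(0, n_train, (64,))            # minibatch SGD over the finite sample (ERM)
    loss = ((model(Xtr[idx], sel_tr[idx]) - ytr[idx]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            test = ((model(Xte, sel_te) - yte) ** 2).mean()
        print(f"step {step}: train {loss.item():.4f}  test {test.item():.4f}")
```

Under this parameterization the attention layer can solve the toy task by driving the diagonal of A up, which makes the softmax concentrate (approximately uniformly) on the q selected positions; the open problem asks how SGD reaches such solutions from finite data and how many samples are needed for the held-out loss to track the training loss.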
References
It would be an interesting open problem to analyze the SGD dynamics and sample complexity on any of the existing tasks in the literature.
— Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
(arXiv:2406.06893, Wang et al., 11 Jun 2024), Appendix, Section “Limitation and discussion”