SGD dynamics and sample complexity on finite datasets
Analyze the stochastic gradient descent (SGD) training dynamics of one-layer transformers with softmax attention (e.g., the architecture studied for the q-sparse token selection task) under empirical risk minimization on a finite dataset, and derive sample complexity bounds that clarify convergence behavior and generalization requirements beyond the population-loss setting.
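The sketch below is a minimal, illustrative version of the setting this problem refers to, not the construction of Wang et al.: a toy q-sparse token selection task (predict the mean of q selected tokens, with the selection encoded as a q-hot indicator), a single softmax-attention layer with a simplified bilinear score parameterization, and plain minibatch SGD run over a fixed finite training set so the train/held-out gap can be observed. All hyperparameters (L, d, q, sample sizes, learning rate) are assumptions chosen only for illustration.

```python
# Minimal sketch (assumed setup, not the paper's exact task or architecture):
# one softmax-attention layer trained with SGD on a finite sample of a toy
# q-sparse token selection task, tracking train vs. held-out loss.
import torch

torch.manual_seed(0)
L, d, q = 16, 8, 3            # sequence length, token dim, sparsity (illustrative)
n_train, n_test = 512, 4096   # finite training set vs. large held-out set

def sample_batch(n):
    X = torch.randn(n, L, d)                                     # random token contents
    S = torch.stack([torch.randperm(L)[:q] for _ in range(n)])   # selected positions
    sel = torch.zeros(n, L).scatter_(1, S, 1.0)                  # q-hot selection indicator
    y = (sel.unsqueeze(-1) * X).sum(1) / q                       # target: mean of selected tokens
    return X, sel, y

class OneLayerSoftmaxAttention(torch.nn.Module):
    # Attention logits come from a bilinear interaction between the q-hot
    # indicator and the attended position; the value map is the identity
    # (a simplifying assumption for brevity).
    def __init__(self):
        super().__init__()
        self.A = torch.nn.Parameter(0.01 * torch.randn(L, L))

    def forward(self, X, sel):
        scores = sel @ self.A                         # (n, L) attention logits over positions
        attn = torch.softmax(scores, dim=-1)          # softmax attention weights
        return torch.einsum("nl,nld->nd", attn, X)    # attention-weighted average of tokens

model = OneLayerSoftmaxAttention()
opt = torch.optim.SGD(model.parameters(), lr=0.2)
Xtr, sel_tr, ytr = sample_batch(n_train)
Xte, sel_te, yte = sample_batch(n_test)

for step in range(2001):
    idx = torch.randint(0, n_train, (64,))            # minibatch SGD over the finite sample (ERM)
    loss = ((model(Xtr[idx], sel_tr[idx]) - ytr[idx]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            test = ((model(Xte, sel_te) - yte) ** 2).mean()
        print(f"step {step}: train {loss.item():.4f}  test {test.item():.4f}")
```

Under this parameterization the attention layer can solve the toy task by driving the diagonal of A up, which makes the softmax concentrate (approximately uniformly) on the q selected positions; the open problem asks how SGD reaches such solutions from finite data and how many samples are needed for the held-out loss to track the training loss.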
References
It would be an interesting open problem to analyze the SGD dynamics and sample complexity on any of the existing tasks in the literature.
— Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
(arXiv:2406.06893, Wang et al., 11 Jun 2024), Appendix, Section “Limitation and discussion”