Sample complexity beyond population-loss analysis
Establish sample complexity guarantees for learning the q-sparse token selection task via gradient-based training of a one-layer transformer with softmax attention and stochastic positional encoding, under empirical risk minimization with finitely many training samples. This would move beyond the existing analysis, which is restricted to the population loss.
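To make the setting concrete, the sketch below gives a minimal NumPy rendering of the q-sparse token selection task (target = average of the q tokens indexed by a random set S) together with an illustrative one-layer softmax-attention forward pass. The dimensions N, d, q, the encoding of S as a summed query over positional vectors, and the single weight matrix W are assumptions for illustration only, not the paper's exact architecture or its stochastic positional-encoding scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): sequence length N,
# token dimension d, sparsity level q.
N, d, q = 16, 8, 3

def sample_instance(rng):
    """One draw of the q-sparse token selection task:
    the target is the mean of the q tokens indexed by S."""
    X = rng.standard_normal((N, d))           # token matrix
    S = rng.choice(N, size=q, replace=False)  # sparse index set
    y = X[S].mean(axis=0)                     # target: average of selected tokens
    return X, S, y

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# One-layer softmax attention, illustrative parameterization: a query built
# from the index set S attends over (here fixed) positional encodings.
P = rng.standard_normal((N, d))         # positional encodings (stochastic in the paper's scheme)
W = 0.01 * rng.standard_normal((d, d))  # trainable attention weights

def forward(X, S, W):
    q_vec = P[S].sum(axis=0)   # encode the index set S as a query vector
    scores = (P @ W) @ q_vec   # attention logits over the N positions
    attn = softmax(scores)     # softmax attention weights
    return attn @ X            # value aggregation: weighted average of tokens

X, S, y = sample_instance(rng)
pred = forward(X, S, W)
loss = np.mean((pred - y) ** 2)  # per-sample squared loss
print(loss)
```

Under this toy parameterization, a perfect solution would place attention mass 1/q on exactly the positions in S; the open question above asks how many i.i.d. samples gradient-based empirical risk minimization needs to find such a solution.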
References
There are still many open questions. For instance, can we move beyond population loss and show a sample complexity guarantee?
— Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
(arXiv:2406.06893, Wang et al., 11 Jun 2024), Conclusion