Sample complexity beyond population-loss analysis

Establish sample complexity guarantees for learning the q-sparse token selection task using gradient-based training of a one-layer transformer with softmax attention and stochastic positional encoding under empirical risk minimization with finitely many training samples, thereby moving beyond analyses restricted to population loss.

Background

The paper proves global convergence of gradient descent on the population loss for a one-layer transformer with softmax attention and stochastic positional encoding on the q-sparse token selection task, and establishes a separation in expressive power from fully-connected networks (FCNs). However, all analyses are conducted at the population level, i.e., they assume access to the exact data-generating distribution rather than a finite-sample empirical risk.

The authors explicitly note the need to move beyond population-loss results to obtain finite-sample guarantees. This entails quantifying how many training samples are required for the training algorithm, run on the empirical risk, to achieve small generalization error.
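To fix notation for the finite-sample question (a minimal sketch; the symbols $f_\theta$, $\ell$, $\mathcal{D}$, and the polynomial dependence below are illustrative placeholders rather than the paper's exact definitions), the population loss and the empirical risk over $n$ samples are

$$\mathcal{L}(\theta) = \mathbb{E}_{(X, y)\sim \mathcal{D}}\big[\ell\big(f_\theta(X), y\big)\big], \qquad \widehat{\mathcal{L}}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(X^{(i)}), y^{(i)}\big).$$

A sample complexity guarantee would then assert that, with probability at least $1-\delta$ over the draw of the $n$ training samples, gradient-based training on $\widehat{\mathcal{L}}_n$ returns parameters $\widehat{\theta}$ with $\mathcal{L}(\widehat{\theta}) \le \epsilon$ whenever $n \ge \mathrm{poly}\big(q, d, 1/\epsilon, \log(1/\delta)\big)$, where $d$ stands in for the relevant problem dimensions; the exact dependence is precisely what remains open.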

References

There are still many open questions. For instance, can we move beyond population loss and show a sample complexity guarantee?

Wang et al., "Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot" (arXiv:2406.06893, 11 Jun 2024), Conclusion.