Training dynamics with fixed positional encodings
Analyze the gradient descent training dynamics and convergence of a one-layer transformer with softmax attention that uses a fixed near-orthogonal positional encoding (rather than a stochastic positional encoding) on the q-sparse token selection task, with respect to the in-distribution (population) loss.
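A minimal sketch of the setup in question, not the paper's exact parameterization: gradient descent on a Monte-Carlo estimate of the in-distribution (population) loss for a one-layer softmax-attention model with a fixed, near-orthogonal positional encoding. The instantiation of the q-sparse token selection task below (query token given by the sum of the positional encodings of the q selected positions, target given by the mean of the selected tokens) and all dimensions, distributions, and hyperparameters are illustrative assumptions.

```python
# Sketch only: a plausible instantiation of q-sparse token selection with a
# fixed near-orthogonal positional encoding; hyperparameters are assumptions.
import torch

torch.manual_seed(0)
d, L, q = 16, 10, 3          # embedding dim, sequence length, sparsity level
lr, steps, batch = 0.2, 2000, 512

# Fixed near-orthogonal positional encodings: unit-norm random Gaussian
# vectors in R^d are nearly orthogonal with high probability for large d.
P = torch.randn(L, d)
P = P / P.norm(dim=1, keepdim=True)

# Trainable attention parameter (a single query-key matrix W);
# the value map is taken to be the identity for simplicity.
W = torch.zeros(d, d, requires_grad=True)

def sample_batch(n):
    """Draw (tokens, query encodings, targets) from the assumed data distribution."""
    X = torch.randn(n, L, d)                                      # i.i.d. Gaussian tokens
    S = torch.stack([torch.randperm(L)[:q] for _ in range(n)])    # random q-subsets of positions
    mask = torch.zeros(n, L).scatter_(1, S, 1.0)                  # 0/1 selection mask
    y = (mask.unsqueeze(-1) * X).sum(1) / q                       # target: mean of selected tokens
    query = mask @ P                                              # query encodes S via sum of p_i
    return X, query, y

opt = torch.optim.SGD([W], lr=lr)
for t in range(steps):
    X, query, y = sample_batch(batch)            # fresh samples as a proxy for the population loss
    scores = (query @ W) @ P.T                   # attention logits over positions
    attn = torch.softmax(scores, dim=-1)         # softmax attention weights
    pred = (attn.unsqueeze(-1) * X).sum(1)       # attention-weighted average of tokens
    loss = ((pred - y) ** 2).sum(-1).mean()      # squared-error in-distribution loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    if t % 500 == 0:
        print(f"step {t:4d}  loss {loss.item():.4f}")
```

The open problem asks for a theoretical analysis of the dynamics and convergence of this kind of training run when the encodings P are held fixed, in contrast to the stochastic positional encoding analyzed in the paper.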
References
Nevertheless, analyzing the dynamics with a set of fixed positional encodings on the in-distribution loss can be an interesting open problem.
— Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
(arXiv:2406.06893, Wang et al., 11 Jun 2024), Appendix, Section “Limitation and discussion”