Training dynamics with fixed positional encodings
Analyze the gradient descent training dynamics and convergence of a one-layer transformer with softmax attention that uses a fixed near-orthogonal positional encoding (rather than a stochastic positional encoding) on the q-sparse token selection task, with respect to the in-distribution (population) loss.
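A minimal sketch of the setup in question, not the paper's exact parameterization: gradient descent on a Monte-Carlo estimate of the in-distribution (population) loss for a one-layer softmax-attention model with a fixed, near-orthogonal positional encoding. The instantiation of the q-sparse token selection task below (query token given by the sum of the positional encodings of the q selected positions, target given by the mean of the selected tokens) and all dimensions, distributions, and hyperparameters are illustrative assumptions.

```python
# Sketch only: a plausible instantiation of q-sparse token selection with a
# fixed near-orthogonal positional encoding; hyperparameters are assumptions.
import torch

torch.manual_seed(0)
d, L, q = 16, 10, 3          # embedding dim, sequence length, sparsity level
lr, steps, batch = 0.2, 2000, 512

# Fixed near-orthogonal positional encodings: unit-norm random Gaussian
# vectors in R^d are nearly orthogonal with high probability for large d.
P = torch.randn(L, d)
P = P / P.norm(dim=1, keepdim=True)

# Trainable attention parameter (a single query-key matrix W);
# the value map is taken to be the identity for simplicity.
W = torch.zeros(d, d, requires_grad=True)

def sample_batch(n):
    """Draw (tokens, query encodings, targets) from the assumed data distribution."""
    X = torch.randn(n, L, d)                                      # i.i.d. Gaussian tokens
    S = torch.stack([torch.randperm(L)[:q] for _ in range(n)])    # random q-subsets of positions
    mask = torch.zeros(n, L).scatter_(1, S, 1.0)                  # 0/1 selection mask
    y = (mask.unsqueeze(-1) * X).sum(1) / q                       # target: mean of selected tokens
    query = mask @ P                                              # query encodes S via sum of p_i
    return X, query, y

opt = torch.optim.SGD([W], lr=lr)
for t in range(steps):
    X, query, y = sample_batch(batch)            # fresh samples as a proxy for the population loss
    scores = (query @ W) @ P.T                   # attention logits over positions
    attn = torch.softmax(scores, dim=-1)         # softmax attention weights
    pred = (attn.unsqueeze(-1) * X).sum(1)       # attention-weighted average of tokens
    loss = ((pred - y) ** 2).sum(-1).mean()      # squared-error in-distribution loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    if t % 500 == 0:
        print(f"step {t:4d}  loss {loss.item():.4f}")
```

The open problem asks for a theoretical analysis of the dynamics and convergence of this kind of training run when the encodings P are held fixed, in contrast to the stochastic positional encoding analyzed in the paper.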
References
Nevertheless, analyzing the dynamics with a set of fixed positional encodings on the in-distribution loss can be an interesting open problem.
— Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
(arXiv:2406.06893, Wang et al., 11 Jun 2024), Appendix, Section “Limitation and discussion”