Optimization and generalization in the Θ(nω) parameter regime
Investigate the optimization and generalization behavior of one-layer multi-head decoder-only transformers in the Θ(nω) parameter regime, where the number of model parameters scales on the order of nω, with n the number of distinct contexts and ω the vocabulary size.
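
To make the regime concrete, the sketch below (a rough illustration, not from the paper) counts the parameters of a one-layer multi-head decoder-only transformer under one common parameterization and finds the smallest embedding width whose total reaches n·ω. The choice of parameterization, the 4× feed-forward expansion, and the function names are all assumptions, not details fixed by the excerpt.

```python
def transformer_param_count(d_model: int, d_ff: int, vocab: int) -> int:
    """Parameters in a one-layer multi-head decoder-only transformer.

    Assumes a standard parameterization (an assumption; the excerpt does
    not fix one): token embedding, combined Q/K/V/O attention projections,
    a two-layer feed-forward block, and an untied unembedding. The count
    is independent of the number of heads when the per-head dimensions
    sum to d_model, so multi-head vs. single-head does not change the order.
    """
    embed = vocab * d_model          # token embedding: omega x d
    attn = 4 * d_model * d_model     # Q, K, V, O projections across all heads
    mlp = 2 * d_model * d_ff         # up- and down-projections of the MLP
    unembed = d_model * vocab        # output logits over the vocabulary: d x omega
    return embed + attn + mlp + unembed


def width_in_regime(n_contexts: int, vocab: int) -> int:
    """Smallest d_model whose total parameter count reaches n * omega.

    Illustrative only: the Theta(n * omega) regime fixes the order of the
    parameter count, not a particular width.
    """
    d = 1
    while transformer_param_count(d, d_ff=4 * d, vocab=vocab) < n_contexts * vocab:
        d += 1
    return d


if __name__ == "__main__":
    n, omega = 10_000, 256           # n distinct contexts, vocabulary size omega
    d = width_in_regime(n, omega)
    total = transformer_param_count(d, d_ff=4 * d, vocab=omega)
    print(f"n*omega = {n * omega:,}  d_model = {d}  parameters = {total:,}")
```

Under these assumptions the total is roughly 2ωd + 12d², so the Θ(nω) regime corresponds to an embedding width on the order of √(nω); for n = 10,000 and ω = 256 this lands at a few hundred dimensions.
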
References
Theoretically investigating optimization and generalization in the Θ(nω) parameter regime is left as a future direction.
— Next-token prediction capacity: general upper bounds and a lower bound for transformers
(Madden et al., 22 May 2024, arXiv:2405.13718), Section 9 (Conclusion), first paragraph