
Next-token prediction capacity: general upper bounds and a lower bound for transformers (2405.13718v3)

Published 22 May 2024 in cs.LG and math.OC

Abstract: Given a sequence of tokens, such as words, the task of next-token prediction is to predict the conditional probability distribution of the next token. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences for which a decoder-only transformer can interpolate next-token distributions has not been established. To fill this gap, we prove upper and lower bounds on this number that are equal up to a multiplicative constant. We prove these bounds both in the general setting, where next-token distributions can be arbitrary, and in the empirical setting, where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers, and our proofs highlight an important injectivity property satisfied by self-attention. Furthermore, we provide numerical evidence that the minimal number of parameters for memorization is sufficient for training the model to the entropy lower bound.
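
To make the "empirical setting" and the entropy lower bound mentioned in the abstract concrete, here is a minimal sketch (not code from the paper; the function name, the choice of full prefixes as contexts, and the toy corpus are our own illustrative assumptions). It computes the empirical next-token distribution for every distinct context in a small corpus and the resulting entropy lower bound, i.e. the smallest average cross-entropy any next-token predictor could achieve on that corpus.

```python
from collections import Counter, defaultdict
from math import log

def entropy_lower_bound(documents):
    """Empirical next-token distributions and the entropy lower bound.

    documents: list of token sequences (lists of hashable tokens).
    Returns the context-frequency-weighted average entropy of the
    empirical next-token distributions, in nats.
    """
    # Count next-token occurrences for every distinct context prefix.
    counts = defaultdict(Counter)
    for doc in documents:
        for t in range(1, len(doc)):
            context = tuple(doc[:t])   # full prefix used as the context
            counts[context][doc[t]] += 1

    total = sum(sum(c.values()) for c in counts.values())
    bound = 0.0
    for context, c in counts.items():
        n = sum(c.values())
        # Entropy of the empirical distribution p(. | context),
        # weighted by how often this context occurs in the corpus.
        h = -sum((k / n) * log(k / n) for k in c.values())
        bound += (n / total) * h
    return bound

# Toy usage: three short documents over a three-symbol vocabulary.
docs = [["a", "b", "c", "b"], ["a", "b", "b"], ["a", "c", "b"]]
print(f"entropy lower bound: {entropy_lower_bound(docs):.4f} nats")
```

A model that memorizes the corpus in the sense of the paper reproduces each of these empirical conditional distributions exactly, so its average cross-entropy training loss reaches this bound.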
