From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers (2402.13512v1)

Published 21 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Modern LLMs rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model samples the output token according to a context-conditioned Markov chain (CCMC) that weights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.
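
The mapping in the abstract rests on a simple observation: with no positional encoding, a 1-layer self-attention sampler depends only on the identity of the last (query) token and on how often each token appears in the prompt, so its next-token distribution equals a row of a base Markov chain reweighted by the prompt's token counts. Below is a minimal, self-contained numerical sketch of that equivalence under simplifying assumptions (identity readout, no positional encoding); the names `E`, `W`, `attention_next_token_dist`, and `ccmc_next_token_dist` are illustrative and not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 8                      # vocabulary size, embedding dimension
E = rng.standard_normal((K, d))  # token embeddings
W = rng.standard_normal((d, d))  # combined key-query weights of the attention layer

def attention_next_token_dist(prompt):
    """Next-token distribution of a simplified 1-layer self-attention sampler.

    The last prompt token acts as the query; softmax attention over the prompt
    positions is aggregated per token id to give a distribution over the vocabulary.
    """
    x = E[prompt]                          # (T, d) prompt embeddings
    q = E[prompt[-1]]                      # query = last token
    scores = x @ W @ q                     # (T,) attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                     # softmax over prompt positions
    dist = np.zeros(K)
    np.add.at(dist, prompt, attn)          # aggregate attention weight per token id
    return dist

def ccmc_next_token_dist(prompt):
    """The same distribution viewed as a context-conditioned Markov chain:
    the base-chain row of the last token is reweighted by the prompt's
    token counts and renormalized."""
    last = prompt[-1]
    base_row = np.exp(E @ W @ E[last])     # unnormalized base-chain transition row
    counts = np.bincount(prompt, minlength=K).astype(float)
    row = base_row * counts                # context reweighting by prompt content
    return row / row.sum()

prompt = [1, 3, 3, 0, 2]
print(np.allclose(attention_next_token_dist(prompt),
                  ccmc_next_token_dist(prompt)))   # True: the two views coincide
```

Iterating such a sampler on its own outputs keeps reweighting the base chain by tokens already present in the context, which is the intuition behind the non-mixing, winner-takes-all collapse onto a small token subset described in the abstract; with positional encoding, the same construction would additionally scale each transition by a position-dependent factor.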

Authors (5)
  1. M. Emrullah Ildiz (8 papers)
  2. Yixiao Huang (16 papers)
  3. Yingcong Li (16 papers)
  4. Ankit Singh Rawat (64 papers)
  5. Samet Oymak (94 papers)
Citations (12)