From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers (2402.13512v1)

Published 21 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Modern LLMs rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model samples the output token according to a context-conditioned Markov chain (CCMC) that weights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.
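
The mapping in the abstract rests on a simple observation: with no positional encoding, a 1-layer self-attention sampler depends only on the identity of the last (query) token and on how often each token appears in the prompt, so its next-token distribution equals a row of a base Markov chain reweighted by the prompt's token counts. Below is a minimal, self-contained numerical sketch of that equivalence under simplifying assumptions (identity readout, no positional encoding); the names `E`, `W`, `attention_next_token_dist`, and `ccmc_next_token_dist` are illustrative and not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 8                      # vocabulary size, embedding dimension
E = rng.standard_normal((K, d))  # token embeddings
W = rng.standard_normal((d, d))  # combined key-query weights of the attention layer

def attention_next_token_dist(prompt):
    """Next-token distribution of a simplified 1-layer self-attention sampler.

    The last prompt token acts as the query; softmax attention over the prompt
    positions is aggregated per token id to give a distribution over the vocabulary.
    """
    x = E[prompt]                          # (T, d) prompt embeddings
    q = E[prompt[-1]]                      # query = last token
    scores = x @ W @ q                     # (T,) attention scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                     # softmax over prompt positions
    dist = np.zeros(K)
    np.add.at(dist, prompt, attn)          # aggregate attention weight per token id
    return dist

def ccmc_next_token_dist(prompt):
    """The same distribution viewed as a context-conditioned Markov chain:
    the base-chain row of the last token is reweighted by the prompt's
    token counts and renormalized."""
    last = prompt[-1]
    base_row = np.exp(E @ W @ E[last])     # unnormalized base-chain transition row
    counts = np.bincount(prompt, minlength=K).astype(float)
    row = base_row * counts                # context reweighting by prompt content
    return row / row.sum()

prompt = [1, 3, 3, 0, 2]
print(np.allclose(attention_next_token_dist(prompt),
                  ccmc_next_token_dist(prompt)))   # True: the two views coincide
```

Iterating such a sampler on its own outputs keeps reweighting the base chain by tokens already present in the context, which is the intuition behind the non-mixing, winner-takes-all collapse onto a small token subset described in the abstract; with positional encoding, the same construction would additionally scale each transition by a position-dependent factor.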

Authors (5)
  1. M. Emrullah Ildiz (8 papers)
  2. Yixiao Huang (16 papers)
  3. Yingcong Li (16 papers)
  4. Ankit Singh Rawat (64 papers)
  5. Samet Oymak (94 papers)
Citations (12)