How Transformers Get Rich: Approximation and Dynamics Analysis (2410.11474v3)
Abstract: Transformers have demonstrated exceptional in-context learning capabilities, yet the theoretical understanding of the underlying mechanisms remains limited. A recent work (Elhage et al., 2021) identified a ``rich'' in-context mechanism known as induction head, contrasting with ``lazy'' $n$-gram models that overlook long-range dependencies. In this work, we provide both approximation and dynamics analyses of how transformers implement induction heads. In the {\em approximation} analysis, we formalize both standard and generalized induction head mechanisms, and examine how transformers can efficiently implement them, with an emphasis on the distinct role of each transformer submodule. For the {\em dynamics} analysis, we study the training dynamics on a synthetic mixed target, composed of a 4-gram and an in-context 2-gram component. This controlled setting allows us to precisely characterize the entire training process and uncover an {\em abrupt transition} from lazy (4-gram) to rich (induction head) mechanisms as training progresses.
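As a rough illustration of the in-context 2-gram (induction head) rule referenced in the abstract, the following Python sketch predicts the next token by finding the most recent earlier occurrence of the current token and copying the token that followed it. The function name, the uniform fallback, and the example sequence are illustrative assumptions, not the paper's transformer construction.

```python
def induction_head_predict(tokens, vocab_size):
    """In-context 2-gram ("induction head") rule: locate the most recent
    earlier occurrence of the current token and copy the token that
    followed it. Returns a probability vector over the vocabulary.
    (Illustrative sketch only; not the paper's construction.)"""
    probs = [1.0 / vocab_size] * vocab_size  # uniform fallback if no match
    current = tokens[-1]
    # Scan backwards over earlier positions for a matching token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            probs = [0.0] * vocab_size
            probs[tokens[i + 1]] = 1.0  # copy the token that followed the match
            break
    return probs

# Example: the in-context pattern "2 -> 7" seen earlier is reused when 2 reappears.
seq = [5, 2, 7, 1, 3, 2]
print(induction_head_predict(seq, vocab_size=10))  # all mass on token 7
```

A lazy $n$-gram model, by contrast, would condition only on a fixed window of the most recent tokens, ignoring such long-range repeats.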