Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers (2411.12118v4)
Abstract: In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers whose number of layers grows at least logarithmically with the input size. I empirically show that LLMs can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation of the task. Successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps of the trained transformers. I also study the training process, finding that attention heads always emerge in a specific sequence guided by the implicit curriculum.
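The abstract does not spell out the task itself, so the following is only a hedged sketch of one plausible formulation: a multi-hop key-value retrieval problem in which the model must follow a chain of lookups from a query token to an answer token. The function names, vocabulary size, and the pointer-doubling depth estimate are illustrative assumptions, not the paper's actual construction.

# Hedged sketch, not the paper's exact setup: one plausible "retrieval problem"
# as a multi-hop key-value chase. The abstract only states that the required
# depth grows logarithmically with input size; the intuition assumed here is
# pointer doubling, where each attention layer can compose two hops.
import math
import random

def make_retrieval_example(num_pairs: int, num_hops: int, vocab_size: int = 1000):
    """Build one synthetic example: a shuffled list of (key, value) pairs that
    chain into a path, plus a query key; the target is the token reached after
    `num_hops` lookups starting from the query. (Illustrative formulation.)"""
    assert num_hops < num_pairs <= vocab_size // 2
    tokens = random.sample(range(vocab_size), num_pairs + 1)
    # Chain: tokens[0] -> tokens[1] -> ... -> tokens[num_pairs]
    pairs = list(zip(tokens[:-1], tokens[1:]))
    random.shuffle(pairs)  # force retrieval rather than positional reading
    query = tokens[0]
    answer = tokens[num_hops]
    # Flatten into a prompt-like token sequence: k1 v1 k2 v2 ... q
    sequence = [t for pair in pairs for t in pair] + [query]
    return sequence, answer

def min_layers_estimate(num_hops: int) -> int:
    """Illustrative depth intuition only: with hop composition (pointer
    doubling), the needed depth scales like log2 of the number of hops."""
    return max(1, math.ceil(math.log2(num_hops)) + 1)

if __name__ == "__main__":
    seq, ans = make_retrieval_example(num_pairs=8, num_hops=4)
    print("input tokens:", seq)
    print("target token:", ans)
    print("rough depth needed for 4 hops:", min_layers_estimate(4))

Under this assumed formulation, a single attention layer can resolve one lookup, while stacked layers can compose lookups, which is one way a logarithmic depth requirement could arise; the paper's own construction and proof should be consulted for the precise statement.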