
Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data (2402.00743v2)

Published 1 Feb 2024 in cs.LG, cs.CL, and stat.ML

Abstract: LLMs are powerful models that can learn concepts at the inference stage via in-context learning (ICL). While theoretical studies, e.g., \cite{zhang2023trained}, attempt to explain the mechanism of ICL, they assume that the input $x_i$ and the output $y_i$ of each demonstration example are placed in the same token (i.e., structured data). In practice, however, the examples are usually text, and all words, regardless of their logical relationship, are stored in different tokens (i.e., unstructured data \cite{wibisono2023role}). To understand how LLMs learn from unstructured data in ICL, this paper studies the role of each component in the transformer architecture and provides a theoretical understanding of why the architecture succeeds. In particular, we consider a simple transformer with one or two attention layers and linear regression tasks for the ICL prediction. We observe that (1) a transformer with two layers of (self-)attention and a look-ahead attention mask can learn from the prompt with unstructured data, and (2) positional encoding can match the $x_i$ and $y_i$ tokens to achieve better ICL performance.
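To make the setup concrete, the sketch below is an illustrative assumption rather than the paper's construction: it builds an unstructured linear-regression prompt in which each $x_i$ and $y_i$ occupy separate tokens, adds a sinusoidal positional encoding (one possible choice), and passes the sequence through two self-attention layers with a look-ahead (causal) mask. The weights are random and untrained, so it only demonstrates the prompt format, masking, and data flow, not the trained ICL predictor analyzed in the paper.

```python
# Minimal sketch of the unstructured ICL setup described in the abstract:
# separate x_i / y_i tokens, additive positional encoding, and two
# causally-masked self-attention layers. Weights are random (untrained),
# so the output is only a shape/data-flow illustration.
import numpy as np

rng = np.random.default_rng(0)

d, n = 4, 8                          # input dimension, number of demonstrations
w_star = rng.normal(size=d)          # ground-truth weights of the regression task

# Unstructured prompt: [x_1, y_1, x_2, y_2, ..., x_n, y_n, x_query],
# each token in R^{d+1}; x tokens carry the input, y tokens carry the label.
tokens = []
for _ in range(n):
    x = rng.normal(size=d)
    y = w_star @ x
    tokens.append(np.concatenate([x, [0.0]]))           # x_i token
    tokens.append(np.concatenate([np.zeros(d), [y]]))   # y_i token (separate token)
x_query = rng.normal(size=d)
tokens.append(np.concatenate([x_query, [0.0]]))         # query token
Z = np.stack(tokens)                                    # shape (2n+1, d+1)

# Additive sinusoidal positional encoding (an assumed choice) so that the
# model can associate each y_i with the x_i that precedes it.
seq_len, dim = Z.shape
pos = np.arange(seq_len)[:, None] / (10000 ** (np.arange(dim)[None, :] / dim))
Z = Z + np.where(np.arange(dim) % 2 == 0, np.sin(pos), np.cos(pos))

def attention_layer(H, W_q, W_k, W_v):
    """One self-attention layer with a look-ahead (causal) mask and residual."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = np.triu(np.ones((len(H), len(H)), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # each token attends only to earlier tokens
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return H + A @ V

# Two attention layers with random weights; in the paper's analysis the
# trained weights implement the ICL prediction.
layers = [tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
          for _ in range(2)]
H = Z
for W_q, W_k, W_v in layers:
    H = attention_layer(H, W_q, W_k, W_v)

# Read the prediction off the label coordinate of the final (query) token.
y_hat = H[-1, -1]
print("predicted:", y_hat, " true:", w_star @ x_query)
```

The separate-token layout is what distinguishes this unstructured setting from prior work that stacks $(x_i, y_i)$ into a single token; the causal mask and positional encoding are the two components the paper credits for recovering the pairing between $x_i$ and $y_i$.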

Authors (6)
  1. Yue Xing (47 papers)
  2. Xiaofeng Lin (13 papers)
  3. Namjoon Suh (10 papers)
  4. Qifan Song (37 papers)
  5. Guang Cheng (136 papers)
  6. Chenheng Xu (4 papers)
Citations (3)