In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization (2402.14951v1)
Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, any estimator that uses only linear attention must incur an irreducible, additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$), in the sense that every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.
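To make the $\mathsf{GD}\text{-}\mathbf{\beta}$ correspondence concrete, below is a minimal NumPy sketch of a one-step gradient descent estimator with a learnable initialization $\beta$ and a learnable preconditioner; the function name `gd_beta_predict`, the matrix `Gamma`, and the $1/n$ gradient scaling are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch (not the paper's code) of a GD-beta style estimator for
# in-context linear regression: one preconditioned gradient descent step on
# the context examples, started from a learnable initialization beta.
# Gamma (a learnable preconditioner absorbing the step size) and the 1/n
# scaling are assumptions made here for illustration.
import numpy as np

def gd_beta_predict(X, y, x_query, beta, Gamma):
    """One-step GD prediction with learnable initialization.

    X       : (n, d) context inputs
    y       : (n,)   context labels
    x_query : (d,)   query input
    beta    : (d,)   learnable initialization of the weight vector
    Gamma   : (d, d) learnable preconditioner
    """
    n = X.shape[0]
    residual = y - X @ beta           # errors of the initial guess on the context
    grad = -(X.T @ residual) / n      # gradient of 0.5 * mean squared error at beta
    w_one_step = beta - Gamma @ grad  # single preconditioned GD step
    return x_query @ w_one_step       # linear prediction at the query

# Toy usage: a task whose weight prior has a non-zero mean.
rng = np.random.default_rng(0)
d, n = 5, 32
w_star = np.ones(d) + 0.1 * rng.standard_normal(d)  # task weights near the prior mean
X = rng.standard_normal((n, d))
y = X @ w_star + 0.01 * rng.standard_normal(n)
x_q = rng.standard_normal(d)

beta = np.ones(d)        # initialization at the (assumed) prior mean
Gamma = 0.5 * np.eye(d)  # simple isotropic preconditioner
print(gd_beta_predict(X, y, x_q, beta, Gamma), x_q @ w_star)
```

Intuitively, fixing $\beta = 0$ collapses this to a plain one-step GD predictor, while a learnable non-zero $\beta$ lets the estimator exploit the non-zero prior mean, in line with the abstract's point that the MLP component removes the otherwise irreducible approximation error of linear attention alone.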
- Ruiqi Zhang
- Jingfeng Wu
- Peter L. Bartlett