This paper "Linear Transformers are Versatile In-Context Learners" (Vladymyrov et al., 21 Feb 2024 ) investigates the in-context learning capabilities of linear transformers, particularly focusing on their ability to implicitly implement sophisticated optimization algorithms. The core finding is that even simple linear transformers, when trained on data provided within the input sequence, learn to solve the task by executing a variant of gradient descent, and this capability extends to discovering more complex, adaptive algorithms for challenging problems like noisy linear regression with mixed noise levels.
The paper establishes theoretically that any linear transformer layer maintains an implicit linear model of the input data. Specifically, if the input tokens are $e_i = (x_i, y_i)$, then the tokens after $\ell$ layers can be expressed as a linear transformation of the original input:

$x_i^{(\ell)} = M^{(\ell)} x_i + u^{(\ell)} y_i, \qquad y_i^{(\ell)} = a^{(\ell)} y_i - \langle w^{(\ell)}, x_i \rangle,$

for some matrix $M^{(\ell)}$, vectors $u^{(\ell)}$ and $w^{(\ell)}$, and scalar $a^{(\ell)}$. These parameters are not static weights but are recursively updated from the layer's attention parameters and aggregated statistics of the current layer's tokens, such as $\frac{1}{n}\sum_j x_j^{(\ell)} x_j^{(\ell)\top}$, $\frac{1}{n}\sum_j y_j^{(\ell)} x_j^{(\ell)}$, and $\frac{1}{n}\sum_j \big(y_j^{(\ell)}\big)^2$. The prediction for the query $x_{n+1}$ is read off the $y$-component of its final-layer token, $\hat{y}_{n+1} = -y_{n+1}^{(L)}$.
This implies that the linear transformer's forward pass over a sequence of data points effectively executes an iterative algorithm, where each layer represents one step. The algorithm is learning parameters of a linear model for predicting the query. Since the query token enters with $y_{n+1} = 0$, the prediction simplifies to $\hat{y}_{n+1} = \langle w^{(L)}, x_{n+1} \rangle$, meaning the network is effectively learning a weight vector $w^{(L)}$ for predicting $y$ from $x$.
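As a concrete sketch, this forward pass fits in a few lines of NumPy, assuming the merged key-query / value-projection parameterization of linear self-attention common in this line of work; the matrix names `P` and `Q`, the masking of the query token, and the random toy data are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def linear_attention_layer(E, P, Q, n_ctx):
    """e_i <- e_i + (1/n) * P @ (sum_j e_j e_j^T) @ Q @ e_i, with the sum taken over
    the n_ctx context tokens only (the query token attends but is not attended to)."""
    S = E[:n_ctx].T @ E[:n_ctx]              # aggregated second-moment statistics of the context
    return E + (E @ Q.T @ S @ P.T) / n_ctx   # the same linear update applied to every token

def forward(E, layers, n_ctx):
    for P, Q in layers:
        E = linear_attention_layer(E, P, Q, n_ctx)
    return -E[-1, -1]                        # prediction: (minus) the y-component of the query token

# Toy usage: d-dimensional linear regression presented in-context, with untrained weights.
rng = np.random.default_rng(0)
d, n = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
E = np.vstack([np.column_stack([X, y]), np.append(rng.normal(size=d), 0.0)])
layers = [(0.1 * rng.normal(size=(d + 1, d + 1)), 0.1 * rng.normal(size=(d + 1, d + 1)))
          for _ in range(3)]
print(forward(E, layers, n_ctx=n))
```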
A key practical consideration explored is the use of diagonal attention matrices for each head. This restriction significantly simplifies the architecture and computation, reducing the number of attention parameters per head from quadratic to linear in the token dimension $d$. The paper shows that for diagonal attention, the layer updates can be re-parameterized by four scalar variables per layer, denoted here $\omega_{xx}^{(\ell)}, \omega_{xy}^{(\ell)}, \omega_{yx}^{(\ell)}, \omega_{yy}^{(\ell)}$. These variables control how the aggregated statistics of the $x$ and $y$ components feed into each other's updates across layers. Despite this simplification, the diagonal linear transformer (Diag) retains significant expressive power.
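Written directly in terms of those aggregated statistics, one layer of the diagonal variant is a short NumPy function; the $\omega$ scalars follow the notation introduced above, and the code is a sketch rather than the authors' implementation:

```python
import numpy as np

def diag_layer(X, y, w_xx, w_xy, w_yx, w_yy, n_ctx):
    """One diagonal-linear-transformer layer acting on x-components X of shape (n+1, d)
    and y-components y of shape (n+1,); the last position is the query, and the
    statistics are computed over the n_ctx context tokens only."""
    Xc, yc = X[:n_ctx], y[:n_ctx]
    S_xx = Xc.T @ Xc / n_ctx              # (1/n) sum_j x_j x_j^T
    s_xy = Xc.T @ yc / n_ctx              # (1/n) sum_j y_j x_j
    s_yy = yc @ yc / n_ctx                # (1/n) sum_j y_j^2
    X_new = X + w_xx * X @ S_xx + w_xy * np.outer(y, s_xy)
    y_new = y + w_yx * X @ s_xy + w_yy * s_yy * y
    return X_new, y_new
```

Each scalar multiplies one block of the tokens' second-moment matrix, which is exactly the "flow of information between the $x$ and $y$ components" described above.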
The paper relates the learned algorithm to gradient descent. The GD++ model of von Oswald et al., a restricted diagonal linear transformer, is shown to implement a form of preconditioned gradient descent. For standard least-squares problems, the paper proves that GD++ can reach high accuracy in a number of steps that grows only logarithmically with the condition number $\kappa$ of the data covariance matrix (and doubly logarithmically with the target accuracy), suggesting second-order optimization behavior similar to Newton's method. This indicates that simple linear transformers can learn efficient algorithms for well-defined problems.
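For comparison with the four-scalar layer above, GD++ keeps only two scalars per layer. A sketch of one step, following the gradient-descent correspondence (the names `gamma` and `eta` are illustrative):

```python
import numpy as np

def gdpp_layer(X, y, gamma, eta, n_ctx):
    """One GD++ step: a gradient-descent update carried in the y-channel plus a
    preconditioning step that shrinks the x-channel along high-variance directions."""
    Xc, yc = X[:n_ctx], y[:n_ctx]
    grad = Xc.T @ yc / n_ctx                     # gradient direction for the implicit weight vector
    y_new = y - eta * X @ grad                   # GD step applied to every token's y (incl. the query)
    X_new = X - gamma * X @ (Xc.T @ Xc) / n_ctx  # precondition: x <- (I - gamma * Sigma) x
    return X_new, y_new
```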
The most compelling demonstration of the linear transformer's versatility comes from experiments on noisy linear regression with mixed noise variance. In this setup, each training sequence is generated with a different noise level $\sigma$, drawn per sequence from a distribution (e.g., uniform or categorical). For a known $\sigma$ (and a unit-variance Gaussian prior on the task vector $w$), the optimal solution is ridge regression, $\hat{w} = (X^\top X + \sigma^2 I)^{-1} X^\top y$. However, the noise level is unknown to the model and varies per sequence, so the linear transformer must learn an in-context algorithm that adapts to the noise level of the current sequence to make good predictions.
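A minimal sketch of this data-generating process and of the fixed-noise ridge oracle, under the assumption of a unit-variance Gaussian prior on $w$ (so the oracle regularizer equals the noise variance $\sigma^2$):

```python
import numpy as np

def sample_sequence(n, d, noise_levels, rng):
    """One mixed-noise training sequence: a fresh task vector w, fresh inputs, and a
    per-sequence noise level sigma drawn from `noise_levels` (unknown to the model)."""
    sigma = rng.choice(noise_levels)
    w = rng.normal(size=d)                       # assumed prior w ~ N(0, I)
    X = rng.normal(size=(n + 1, d))              # n context points plus one query
    y = X @ w + sigma * rng.normal(size=n + 1)
    return X[:n], y[:n], X[n], y[n], sigma

def ridge_oracle(X, y, sigma):
    """Bayes-optimal estimate when sigma is known and w ~ N(0, I):
    w_hat = (X^T X + sigma^2 I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + sigma ** 2 * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
Xc, yc, x_q, y_q, sigma = sample_sequence(n=20, d=5, noise_levels=[0.1, 0.5, 1.0, 2.0], rng=rng)
print(ridge_oracle(Xc, yc, sigma) @ x_q, y_q)    # oracle prediction vs. noisy query label
```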
The paper's reverse-engineering of the learned algorithm in the Diag model reveals mechanisms for noise adaptation:
- The $\omega_{yy}$ term can lead to adaptive rescaling of the $y$ component based on the statistic $\frac{1}{n}\sum_j \big(y_j^{(\ell)}\big)^2$, which is correlated with the noise level. A negative $\omega_{yy}$ effectively shrinks predictions more when the noise is higher, aligning with ridge regression, which shrinks the OLS solution more strongly for higher regularization (analogous to higher noise).
- A second term modulates the step size of the implicit gradient descent. The analysis suggests it creates an effective step size that depends on the residual variance $\frac{1}{n}\sum_j \big(y_j^{(\ell)}\big)^2$, another quantity correlated with the noise. Higher residual variance leads to a smaller effective step size, consistent with the intuition of effectively 'early stopping' or regularizing more strongly when the noise is high; a toy sketch of this intuition follows the list.
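Both effects can be mimicked by a toy gradient-descent solver whose step size shrinks with the running residual variance. This is only an illustration of the intuition, not the algorithm recovered from the trained weights; `base_lr` and `c` are made-up knobs:

```python
import numpy as np

def adaptive_gd(X, y, n_steps=5, base_lr=1.0, c=0.5):
    """Least-squares GD whose effective step size shrinks when the running residual
    variance -- a proxy for the unknown noise level -- is large."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        r = y - X @ w                                  # residuals carried in the y-channel
        step = base_lr / (1.0 + c * np.mean(r ** 2))   # smaller steps when residuals are noisy
        w += step * X.T @ r / n
    return w

rng = np.random.default_rng(0)
d, n = 5, 40
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
for sigma in (0.1, 2.0):
    y = X @ w_true + sigma * rng.normal(size=n)
    print(sigma, np.linalg.norm(adaptive_gd(X, y)))    # noisier sequence -> more heavily shrunk estimate
```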
In experiments, both the Full and Diag linear transformers significantly outperform standard ridge-regression baselines (ConstRR, AdaRR) and the simpler GD++ model on mixed-noise-variance problems. The Diag model performs comparably to the Full model across the noise settings and layer counts tested (up to 7 layers). This is a crucial finding for practical implementation, as the efficiency of Diag makes it much more suitable for longer sequences and larger models.
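For reference, the two ridge baselines can be sketched as follows; the noise estimator in `ada_rr` (degrees-of-freedom-corrected OLS residuals) is a simple stand-in and may differ from the estimator used in the paper:

```python
import numpy as np

def const_rr(X, y, lam):
    """ConstRR-style baseline: ridge regression with a single regularizer lam shared
    across all sequences (lam would be tuned on the training noise distribution)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ada_rr(X, y):
    """AdaRR-style baseline: estimate the per-sequence noise variance, then plug it into ridge."""
    n, d = X.shape
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2_hat = np.sum((y - X @ w_ols) ** 2) / max(n - d, 1)   # illustrative noise estimator
    return np.linalg.solve(X.T @ X + sigma2_hat * np.eye(d), X.T @ y)
```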
For implementation:
- The model architecture is a stack of linear self-attention layers. For the diagonal variant, the attention computation can be specialized to use only diagonal parameter matrices, or equivalently the per-layer scalar re-parameterization $(\omega_{xx}, \omega_{xy}, \omega_{yx}, \omega_{yy})$.
- Training minimizes the prediction error on the query token with a standard optimizer such as Adam, over sequences containing multiple $(x_i, y_i)$ pairs followed by the query token $(x_{n+1}, 0)$; a training sketch is given after this list.
- The number of layers is a hyperparameter corresponding to the number of steps in the learned optimization algorithm. Experiments show performance improves with more layers, but significant gains are seen even with 3-5 layers.
- The comparison between Diag and Full suggests that for tasks requiring adaptive linear estimation, the computational savings of the diagonal architecture come with minimal performance loss. This is highly relevant for scaling these models to larger data contexts.
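Putting these pieces together, a compact PyTorch sketch of the diagonal model and its training loop under the assumptions above (hyperparameters and the noise distribution are illustrative):

```python
import torch

class DiagLinearTransformer(torch.nn.Module):
    """Minimal diagonal linear transformer for in-context regression; a sketch of the
    setup described above, not the authors' code. Each layer has four learned scalars."""
    def __init__(self, n_layers):
        super().__init__()
        self.omega = torch.nn.Parameter(0.01 * torch.randn(n_layers, 4))

    def forward(self, X, y):
        # X: (batch, n+1, d), y: (batch, n+1); the last position is the query with y = 0.
        n = X.shape[1] - 1
        for w_xx, w_xy, w_yx, w_yy in self.omega:
            Xc, yc = X[:, :n], y[:, :n]
            S_xx = Xc.transpose(1, 2) @ Xc / n             # (batch, d, d)
            s_xy = (Xc * yc.unsqueeze(-1)).sum(1) / n      # (batch, d)
            s_yy = (yc ** 2).sum(1) / n                    # (batch,)
            X_new = X + w_xx * X @ S_xx + w_xy * y.unsqueeze(-1) * s_xy.unsqueeze(1)
            y_new = y + w_yx * (X * s_xy.unsqueeze(1)).sum(-1) + w_yy * s_yy.unsqueeze(-1) * y
            X, y = X_new, y_new
        return -y[:, -1]                                   # prediction read off the query token

def make_batch(batch, n, d, noise_levels):
    """Fresh mixed-noise regression sequences: one noise level per sequence."""
    sigma = noise_levels[torch.randint(len(noise_levels), (batch, 1))]
    w = torch.randn(batch, d, 1)
    X = torch.randn(batch, n + 1, d)
    y = (X @ w).squeeze(-1) + sigma * torch.randn(batch, n + 1)
    y_target = y[:, -1].clone()
    y[:, -1] = 0.0                                         # the query's label is hidden from the model
    return X, y, y_target

model = DiagLinearTransformer(n_layers=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noise_levels = torch.tensor([0.1, 0.5, 1.0, 2.0])
for step in range(1000):
    X, y, y_target = make_batch(batch=256, n=20, d=5, noise_levels=noise_levels)
    loss = ((model(X, y) - y_target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```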
The paper demonstrates that linear transformers, even with diagonal constraints, can move beyond simple gradient descent to learn sophisticated, adaptive algorithms directly from data presentation. This ability to learn data-dependent optimization strategies in-context holds promise for applications where models need to quickly adapt to varying data characteristics without explicit retraining or complex meta-learning setups. Future work could explore if this phenomenon generalizes to more complex tasks and model architectures.