OLS as Attention Mechanism

Updated 12 October 2025
  • The paper demonstrates that ordinary least squares can be reinterpreted as an attention mechanism through similarity-based weighting, integrating theoretical, computational, and empirical insights.
  • OLS regression is framed as aggregating information via inner product similarities, paralleling transformer-style attention in modern neural networks.
  • The analysis extends to practical models like ensemble regularization, adaptive filtering, and local linear attention, revealing enhanced in-context learning and algorithmic efficiency.

Ordinary least squares (OLS) can be interpreted as an attention mechanism when classical regression analysis is reframed in the context of similarity-based prediction and optimal information integration. This perspective draws connections between OLS and the computation performed in Transformer-style attention modules, both of which generate outputs by aggregating information weighted by learned or statistically optimal coefficients. Recent research has elucidated this connection through diverse methodologies, including similarity-based embeddings, ensemble regularization, extensions to linear self-attention, tensor calculus, and adaptive filtering. The following sections provide a comprehensive survey of OLS as an attention mechanism, integrating theoretical, computational, and empirical considerations.

1. OLS Predictions as Attention: Similarity-Based Reformulation

The standard OLS prediction for new data $\boldsymbol{X}_\text{test}$ is given by

$$\hat{\boldsymbol{y}}_\text{test} = \boldsymbol{X}_\text{test} \hat{\boldsymbol{\beta}},$$

where

$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X}_\text{train}' \boldsymbol{X}_\text{train})^{-1} \boldsymbol{X}_\text{train}' \boldsymbol{y}_\text{train}.$$

By eigendecomposition of $\boldsymbol{X}_\text{train}' \boldsymbol{X}_\text{train} = \boldsymbol{U} \Lambda \boldsymbol{U}'$, one may rewrite the test and train inputs in a transformed regressor (factor) space,
$$\boldsymbol{F}_{\text{train}} = \boldsymbol{X}_{\text{train}} \boldsymbol{U}\Lambda^{-1/2}, \quad \boldsymbol{F}_{\text{test}} = \boldsymbol{X}_{\text{test}} \boldsymbol{U} \Lambda^{-1/2},$$
yielding

$$\hat{\boldsymbol{y}}_\text{test} = \boldsymbol{F}_{\text{test}} \boldsymbol{F}_{\text{train}}' \boldsymbol{y}_\text{train}.$$

This formulation reveals that OLS prediction is a weighted sum over training outcomes, with weights determined by inner products (similarity) in the factor space. For each test point $j$,

$$\hat{y}_j = \sum_{i=1}^N \langle F_j, F_i \rangle y_i,$$

establishing that OLS "attends" more to training points similar to the query data in the transformed space. The encoding and decoding steps,

$$\boldsymbol{W}_{\text{train}} = \boldsymbol{U} \Lambda^{-1/2}, \quad \boldsymbol{W}_{\text{test}} = \boldsymbol{U} \Lambda^{-1/2},$$

correspond to the query, key, and value architecture in attention modules (Coulombe, 13 Apr 2025). If the softmax in standard attention is replaced by an identity activation, OLS can be seen as a restricted, similarity-based linear attention mechanism.
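As a concrete illustration, the following minimal NumPy sketch (synthetic data and illustrative variable names, not drawn from the cited paper) verifies numerically that the factor-space "attention" form reproduces the usual OLS prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
X_test = rng.normal(size=(5, 3))

# Classical OLS prediction: y_hat = X_test (X'X)^{-1} X' y
beta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
y_hat_ols = X_test @ beta_hat

# Attention-style form: encode into factor space F = X U Lambda^{-1/2},
# then predict by similarity-weighted sums of the training targets.
eigvals, U = np.linalg.eigh(X_train.T @ X_train)
W = U @ np.diag(eigvals ** -0.5)           # shared encoder W_train = W_test
F_train, F_test = X_train @ W, X_test @ W
attn_weights = F_test @ F_train.T          # "attention" scores <F_j, F_i>
y_hat_attn = attn_weights @ y_train

assert np.allclose(y_hat_ols, y_hat_attn)  # the two formulations agree
```

Replacing the identity activation on `attn_weights` with a softmax would recover the usual attention nonlinearity, at the cost of exact equivalence with OLS.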

2. Implicit Regularization via OLS Ensembles and Feature Masking

Ensembles of OLS predictors constructed by fitting on random subsets of features or examples produce an averaging effect that mimics explicit regularization. For data $(X \in \mathbb{R}^{n \times p}, y \in \mathbb{R}^n)$, each base estimator $\beta^{(i)}$ fits on subsampled features $S_i$ and samples $T_i$:
$$\beta^{(i)} = S_{i} \left[ (T_i' X S_i)^\dagger \right] T_i' y,$$
where $S_i$ selects features and $(\cdot)^\dagger$ is the pseudoinverse. The ensemble predictor is the mean:
$$\beta^{\text{ens}} = \frac{1}{k} \sum_{i=1}^k \beta^{(i)}.$$
Under Gaussianity, with limiting feature proportion $\alpha$ and sample ratio $\eta$, the asymptotic risk of the large ensemble aligns exactly with that of the optimally tuned ridge regression:
$$R_\alpha^{\text{ens}} = \frac{(1-\alpha)^2 + \sigma^2 \alpha^2 \gamma}{1-\alpha^2 \gamma},$$
where $\gamma = \lim p/n$ and $\sigma^2$ is the output noise variance. Optimizing $\alpha$ yields

$$\min_{\alpha < \gamma^{-1}} R_\alpha^{\text{ens}} = \inf_\lambda R(\hat{\beta}_\lambda^\text{ridge}),$$

indicating that OLS ensembles "attend" to subsampled feature sets, and this implicit stochastic masking acts as a form of structured, adaptive (ridge-like) regularization (LeJeune et al., 2019). The process is analogous to multi-head attention, where each head attends to a different feature subset, and the aggregator controls bias and variance through averaging.
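A minimal sketch of the construction follows, assuming synthetic Gaussian data and hand-picked subsample sizes and ridge penalty (none of which are tuned to realize the asymptotic equivalence); it only illustrates the shrinkage effect of feature masking relative to plain OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 40, 500                     # samples, features, ensemble size
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def subsampled_ols(X, y, n_feat, n_samp, rng):
    """One base learner: OLS (via pseudoinverse) on a random feature
    subset S_i and sample subset T_i, embedded back into R^p."""
    feats = rng.choice(X.shape[1], size=n_feat, replace=False)
    rows = rng.choice(X.shape[0], size=n_samp, replace=False)
    beta = np.zeros(X.shape[1])
    beta[feats] = np.linalg.pinv(X[np.ix_(rows, feats)]) @ y[rows]
    return beta

# Ensemble estimate: average of k subsampled OLS fits
beta_ens = np.mean(
    [subsampled_ols(X, y, n_feat=20, n_samp=150, rng=rng) for _ in range(k)],
    axis=0,
)

# Reference points: explicitly regularized ridge and unregularized OLS
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.pinv(X) @ y

print("ensemble vs ridge distance:", np.linalg.norm(beta_ens - beta_ridge))
print("ensemble vs plain OLS distance:", np.linalg.norm(beta_ens - beta_ols))
```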

3. In-Context Learning and Connections to Softmax Regression

Transformer-based LLMs exhibit in-context learning by performing regression directly via self-attention. In the softmax regression formulation,

$$\min_x \left\| \frac{\exp(Ax)}{\langle \exp(Ax),\, \mathbf{1}_n \rangle} - b \right\|_2,$$

the prediction function

$$f(x) = \frac{\exp(Ax)}{\langle \exp(Ax),\, \mathbf{1}_n \rangle}$$

emulates the softmax-attention output (analogous to $D^{-1} \exp(QK^\top)V$ in Transformers). Gradient descent on this loss,

$$L(x) = \frac{1}{2} \bigl\| f(x) - b \bigr\|_2^2,$$

yields controlled, bounded stepwise updates, similar in effect to self-attention layer updates. Specifically, Lipschitz analysis establishes that

$$\|f(x_{t+1}) - f(x_t)\|_2 \leq M \|x_{t+1} - x_t\|_2,$$

where $M$ depends exponentially on bounds for $A$ and the input, mirroring the bounded changes during attention operations (Li et al., 2023). This analysis implies that OLS-like regression and attention mechanisms in Transformers both effect similar, bounded transformations during in-context training.
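The sketch below runs plain gradient descent on this loss using the standard analytic softmax Jacobian (an assumption of this illustration, not the specific analysis of the cited paper), and tracks how much the prediction $f(x)$ moves per step.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4
A = rng.normal(size=(n, d))
b = rng.dirichlet(np.ones(n))              # target on the probability simplex

def f(x):
    """Softmax regression prediction f(x) = exp(Ax) / <exp(Ax), 1>."""
    z = np.exp(A @ x - np.max(A @ x))      # shift for numerical stability
    return z / z.sum()

def grad_L(x):
    """Gradient of L(x) = 0.5 * ||f(x) - b||^2 via the softmax Jacobian."""
    p = f(x)
    J = (np.diag(p) - np.outer(p, p)) @ A  # d f / d x
    return J.T @ (p - b)

x, eta = np.zeros(d), 0.5
for t in range(500):
    x_new = x - eta * grad_L(x)
    step = np.linalg.norm(f(x_new) - f(x))  # bounded per-step change of f
    x = x_new

print("final loss:", 0.5 * np.linalg.norm(f(x) - b) ** 2)
print("last prediction step size:", step)
```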

4. OLS and Attention via Tensor Calculus and Generalized Linear Mapping

By lifting regression analysis into Hilbert spaces and tensor products, OLS can be represented as a universal linear mapping akin to attention. For feature-target pairs $\{(u_k, f(u_k))\}$, OLS seeks a tensor $B$ such that

$$f(u) \approx B\llbracket 1\rrbracket u,$$

with residual minimization

$$\mathcal{L}(X) = \sum_{k=1}^p \|f(u_k) - X \otimes u_k\|^2,$$

and closed-form solution

$$\mathbf{B} = Z X (X^T X)^{-1},$$

where $Z$ is the target matrix and $X$ the feature matrix (Algarte, 19 Nov 2024). In attention modules,

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$

the softmax computes dynamic, data-dependent weights. In the tensor OLS formulation, the optimal BB acts as a unique transformation that "attends" to feature axes for error minimization. Practical implications include clearer interpretability of attention weights and extendibility to higher-order or multi-head attention.
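A brief sketch of the closed-form multi-output OLS map, under the assumption that targets are stacked as columns of $Z$ and samples as rows of the feature matrix; the data and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 100, 6, 3                        # samples, features, output dims
U = rng.normal(size=(n, p))                # feature matrix (rows = samples u_k)
B_true = rng.normal(size=(m, p))
F = U @ B_true.T + 0.05 * rng.normal(size=(n, m))   # targets f(u_k)

# Closed-form OLS map: with targets as columns (Z = F^T) and features X = U,
# the optimal linear map is B = Z X (X^T X)^{-1}.
Z = F.T
B_hat = Z @ U @ np.linalg.inv(U.T @ U)

# B_hat minimizes sum_k ||f(u_k) - B u_k||^2; it matches per-output lstsq.
B_lstsq = np.linalg.lstsq(U, F, rcond=None)[0].T
assert np.allclose(B_hat, B_lstsq)
```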

5. Extended Linear Self-Attention and OLS Algorithm Emulation

Extended linear self-attention (ELSA) introduces explicit bias matrices into the standard attention parametrization,

$$\text{ELSA}(A) = (W_3 + B_3) (W_1 + B_1)^T (W_2 + B_2),$$

where WlW_l are weights and BlB_l are biases (Hagiwara, 31 Mar 2025). This parameterization allows flexible "mask-and-move" operations to select, insert, or manipulate submatrices and constants—effectively enabling direct emulation of key algorithmic primitives required for OLS and ridge regression, such as

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

and the iterative gradient descent update

$$\beta_t = \beta_{t-1} - \eta (X^T X \beta_{t-1} - X^T y).$$

Bias matrices can create skip connections or constant output components essential for algorithmic representation. This represents an expressive generalization of standard attention that facilitates in-context algorithmic implementations, including OLS, via prompt and parameter design.
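The following minimal sketch checks that the iterative gradient-descent update above does converge to the OLS closed form on synthetic data; it illustrates only the algorithmic primitive that ELSA layers are argued to emulate, not the ELSA construction itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Closed-form OLS solution (the target the iteration should reach)
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient-descent emulation: beta_t = beta_{t-1} - eta (X'X beta_{t-1} - X'y)
eta = 1.0 / np.linalg.norm(X.T @ X, 2)    # step size below 2/L for convergence
beta = np.zeros(p)
for _ in range(2000):
    beta = beta - eta * (X.T @ X @ beta - X.T @ y)

print("max abs difference from closed form:", np.max(np.abs(beta - beta_closed)))
```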

6. Adaptive Filter Attention and the Estimation-Theoretic View

Adaptive Filter Attention (AFA) reframes attention as optimal state estimation for linear stochastic differential equation (SDE) models (Racioppo, 4 Sep 2025). Each input token is a noisy measurement of a latent state, and temporal propagation uses the transition matrix $A$:
$$x(t) \gets e^{A (t_j - t_i)} z(t_j).$$
Pairwise measurement uncertainty is propagated via the solution to the differential Lyapunov equation:
$$dV_F/ds = A V_F + V_F A^T + Q.$$
Maximum likelihood estimation aggregates multiple propagated measurements by precision-weighted averaging:
$$\bar{x}_i = \left(\sum_j P_{ij}\right)^{-1} \sum_j P_{ij} \hat{x}_{i \to j},$$
with $P_{ij}$ the inverse covariance. This operation is mathematically equivalent to OLS (or IRLS when robust weights $w_{ij} \sim [1 + (r_{ij}^T P_{ij} r_{ij}) / \nu]^{-1}$ are used), establishing that attention in this setting is an optimal least squares estimator. In the limit where the dynamics vanish, standard dot-product softmax attention is recovered.
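A hedged numerical sketch of the aggregation step follows: measurements are propagated with a matrix exponential as in the formula above and combined by inverse-covariance weights. The dynamics matrix, the synthetic tokens, and the simple time-decaying precision are illustrative stand-ins, not the covariance propagation of the cited paper.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
d = 2
A = np.array([[0.0, 1.0], [-1.0, -0.2]])   # assumed latent dynamics
times = np.array([0.0, 0.5, 1.0, 1.5])
z = rng.normal(size=(len(times), d))        # noisy token "measurements"

# Estimate the state associated with token i = 0 from every token j:
# propagate each measurement with e^{A (t_j - t_i)}, following the formula above,
# and assign it a precision P_ij (here a synthetic choice decaying with time gap).
t_i = times[0]
x_hats, precisions = [], []
for t_j, z_j in zip(times, z):
    x_hats.append(expm(A * (t_j - t_i)) @ z_j)
    precisions.append(np.eye(d) / (1.0 + abs(t_j - t_i)))

# Precision-weighted average: x_bar = (sum_j P_ij)^{-1} sum_j P_ij x_hat_{i->j}
P_sum = np.sum(precisions, axis=0)
weighted = np.sum([P @ x for P, x in zip(precisions, x_hats)], axis=0)
x_bar = np.linalg.solve(P_sum, weighted)
print("aggregated state estimate:", x_bar)
```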

7. Local Linear Attention: Test-Time Regression as OLS

Local Linear Attention (LLA) is derived from nonparametric local linear regression, generalizing the standard softmax (Nadaraya-Watson estimator) attention to include first-order corrections—effectively performing a local OLS regression per query (Zuo et al., 1 Oct 2025). The weighted least squares objective,

$$\min_{b, W} \frac{1}{2} \sum_{j=1}^{i} w_{ij} \|v_j - b - W(k_j - q_i)\|^2 + \lambda\|W\|_F^2,$$

gives rise to the attention weights

$$s_{ij} = \frac{w_{ij} (1 - (k_j - q_i)^T \rho_i)}{\omega_i - \mu_i^T \rho_i}, \quad \rho_i = \Sigma_i^{-1} \mu_i,$$

where

$$\omega_i = \sum_{j=1}^{i} w_{ij}, \quad \mu_i = \sum_{j=1}^{i} w_{ij}(k_j - q_i), \quad \Sigma_i = \sum_{j=1}^{i} w_{ij}(k_j - q_i)(k_j - q_i)^T + \lambda I.$$

The local linear correction reduces boundary bias, and theoretical analysis proves an improved integrated mean squared error,
$$E\left[\int_D \|f_{LL}(x) - f(x)\|^2 \, dx\right] = O(n^{-4/(d+4)}),$$
as compared to softmax attention's $O(n^{-3/(d+3)})$ rate. LLA can be efficiently implemented via blockwise algorithms and matrix-free inversion (FlashLLA), ensuring scalability. Experimental results confirm that LLA, interpreted as OLS-based local regression, improves in-context learning, associative recall, and adaptability to non-stationarity.
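A naive quadratic-time NumPy sketch of the weights $s_{ij}$ as written above is given below (a Gaussian kernel for $w_{ij}$ is an illustrative choice); the output for query $i$ is formed as $\sum_j s_{ij} v_j$. This is not the blockwise FlashLLA algorithm, only a direct evaluation of the formula.

```python
import numpy as np

rng = np.random.default_rng(6)
T, d, dv, lam = 16, 8, 4, 1e-2
q = rng.normal(size=(T, d))                 # queries
k = rng.normal(size=(T, d))                 # keys
v = rng.normal(size=(T, dv))                # values

def lla_output(i):
    """Local linear attention output for query i (causal: keys j <= i)."""
    diff = k[: i + 1] - q[i]                            # k_j - q_i
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=1))        # kernel weights w_ij
    omega = w.sum()
    mu = (w[:, None] * diff).sum(axis=0)
    Sigma = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    Sigma += lam * np.eye(d)
    rho = np.linalg.solve(Sigma, mu)
    s = w * (1.0 - diff @ rho) / (omega - mu @ rho)     # attention weights s_ij
    return s @ v[: i + 1]

out = np.stack([lla_output(i) for i in range(T)])
print(out.shape)   # (T, dv)
```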

Conclusion

The equivalence between OLS and attention mechanisms has been formalized across multiple lines of research. OLS, classically understood as the optimal linear estimator, maps naturally onto the structure and operations found in attention modules, whether through similarity-based aggregation, implicit regularization, algorithmic emulation, or adaptive filtering. The computational primitives underlying regression—inner products, masking, aggregation, and inversion—correspond directly to the mechanisms driving attention in modern neural architectures. This synthesis not only provides a rigorous, interpretable foundation for attention models in deep learning but also opens avenues for hybrid OLS-attention algorithms with desirable statistical and computational properties.
