OLS as Attention Mechanism

Updated 12 October 2025
  • The paper demonstrates that ordinary least squares can be reinterpreted as an attention mechanism through similarity-based weighting, integrating theoretical, computational, and empirical insights.
  • OLS regression is framed as aggregating information via inner product similarities, paralleling transformer-style attention in modern neural networks.
  • The analysis extends to practical models like ensemble regularization, adaptive filtering, and local linear attention, revealing enhanced in-context learning and algorithmic efficiency.

Ordinary least squares (OLS) can be interpreted as an attention mechanism when classical regression analysis is reframed in the context of similarity-based prediction and optimal information integration. This perspective draws connections between OLS and the computation performed in Transformer-style attention modules, both of which generate outputs by aggregating information weighted by learned or statistically optimal coefficients. Recent research has elucidated this connection through diverse methodologies, including similarity-based embeddings, ensemble regularization, extensions to linear self-attention, tensor calculus, and adaptive filtering. The following sections provide a comprehensive survey of OLS as an attention mechanism, integrating theoretical, computational, and empirical considerations.

1. OLS Predictions as Attention: Similarity-Based Reformulation

The standard OLS prediction for new data $\boldsymbol{X}_\text{test}$ is given by

$$\hat{\boldsymbol{y}}_\text{test} = \boldsymbol{X}_\text{test} \hat{\boldsymbol{\beta}},$$

where

$$\hat{\boldsymbol{\beta}} = (\boldsymbol{X}_\text{train}' \boldsymbol{X}_\text{train})^{-1} \boldsymbol{X}_\text{train}' \boldsymbol{y}_\text{train}.$$

By eigendecomposition of $\boldsymbol{X}_\text{train}' \boldsymbol{X}_\text{train} = \boldsymbol{U} \Lambda \boldsymbol{U}'$, one may rewrite the test and train inputs in a transformed regressor (factor) space,
$$\boldsymbol{F}_{\text{train}} = \boldsymbol{X}_{\text{train}} \boldsymbol{U}\Lambda^{-1/2}, \quad \boldsymbol{F}_{\text{test}} = \boldsymbol{X}_{\text{test}} \boldsymbol{U} \Lambda^{-1/2},$$
yielding

$$\hat{\boldsymbol{y}}_\text{test} = \boldsymbol{F}_{\text{test}} \boldsymbol{F}_{\text{train}}' \boldsymbol{y}_\text{train}.$$

This formulation reveals that OLS prediction is a weighted sum over training outcomes, with weights determined by inner products (similarity) in the factor space. For each test point $j$,

$$\hat{y}_j = \sum_{i=1}^N \langle F_j, F_i \rangle y_i,$$

establishing that OLS "attends" more to training points similar to the query data in the transformed space. The encoding and decoding steps,

$$\boldsymbol{W}_{\text{train}} = \boldsymbol{U} \Lambda^{-1/2}, \quad \boldsymbol{W}_{\text{test}} = \boldsymbol{U} \Lambda^{-1/2},$$

correspond to the query, key, and value architecture in attention modules (Coulombe, 13 Apr 2025). If the softmax in standard attention is replaced by an identity activation, OLS can be seen as a restricted, similarity-based linear attention mechanism.
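As a concrete illustration, the following minimal NumPy sketch (synthetic data and illustrative variable names, not drawn from the cited paper) verifies numerically that the factor-space "attention" form reproduces the usual OLS prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
X_test = rng.normal(size=(5, 3))

# Classical OLS prediction: y_hat = X_test (X'X)^{-1} X' y
beta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
y_hat_ols = X_test @ beta_hat

# Attention-style form: encode into factor space F = X U Lambda^{-1/2},
# then predict by similarity-weighted sums of the training targets.
eigvals, U = np.linalg.eigh(X_train.T @ X_train)
W = U @ np.diag(eigvals ** -0.5)           # shared encoder W_train = W_test
F_train, F_test = X_train @ W, X_test @ W
attn_weights = F_test @ F_train.T          # "attention" scores <F_j, F_i>
y_hat_attn = attn_weights @ y_train

assert np.allclose(y_hat_ols, y_hat_attn)  # the two formulations agree
```

Replacing the identity activation on `attn_weights` with a softmax would recover the usual attention nonlinearity, at the cost of exact equivalence with OLS.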

2. Implicit Regularization via OLS Ensembles and Feature Masking

Ensembles of OLS predictors constructed by fitting on random subsets of features or examples produce an averaging effect that mimics explicit regularization. For data $(X \in \mathbb{R}^{n \times p}, y \in \mathbb{R}^n)$, each base estimator $\beta^{(i)}$ fits on subsampled features $S_i$ and samples $T_i$:
$$\beta^{(i)} = S_{i} \left[ (T_i' X S_i)^\dagger \right] T_i' y,$$
where $S_i$ selects features and $(\cdot)^\dagger$ is the pseudoinverse. The ensemble predictor is the mean:
$$\beta^{\text{ens}} = \frac{1}{k} \sum_{i=1}^k \beta^{(i)}.$$
Under Gaussianity, with limiting feature proportion $\alpha$ and sample ratio $\eta$, the asymptotic risk of the large ensemble aligns exactly with that of the optimally tuned ridge regression:
$$R_\alpha^{\text{ens}} = \frac{(1-\alpha)^2 + \sigma^2 \alpha^2 \gamma}{1-\alpha^2 \gamma},$$
where $\gamma = \lim p/n$ and $\sigma^2$ is the output noise variance. Optimizing $\alpha$ yields

$$\min_{\alpha < \gamma^{-1}} R_\alpha^{\text{ens}} = \inf_\lambda R(\hat{\beta}_\lambda^\text{ridge}),$$

indicating that OLS ensembles "attend" to subsampled feature sets, and this implicit stochastic masking acts as a form of structured, adaptive (ridge-like) regularization (LeJeune et al., 2019). The process is analogous to multi-head attention, where each head attends to a different feature subset, and the aggregator controls bias and variance through averaging.
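A minimal sketch of the construction follows, assuming synthetic Gaussian data and hand-picked subsample sizes and ridge penalty (none of which are tuned to realize the asymptotic equivalence); it only illustrates the shrinkage effect of feature masking relative to plain OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 40, 500                     # samples, features, ensemble size
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def subsampled_ols(X, y, n_feat, n_samp, rng):
    """One base learner: OLS (via pseudoinverse) on a random feature
    subset S_i and sample subset T_i, embedded back into R^p."""
    feats = rng.choice(X.shape[1], size=n_feat, replace=False)
    rows = rng.choice(X.shape[0], size=n_samp, replace=False)
    beta = np.zeros(X.shape[1])
    beta[feats] = np.linalg.pinv(X[np.ix_(rows, feats)]) @ y[rows]
    return beta

# Ensemble estimate: average of k subsampled OLS fits
beta_ens = np.mean(
    [subsampled_ols(X, y, n_feat=20, n_samp=150, rng=rng) for _ in range(k)],
    axis=0,
)

# Reference points: explicitly regularized ridge and unregularized OLS
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.pinv(X) @ y

print("ensemble vs ridge distance:", np.linalg.norm(beta_ens - beta_ridge))
print("ensemble vs plain OLS distance:", np.linalg.norm(beta_ens - beta_ols))
```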

3. In-Context Learning and Connections to Softmax Regression

Transformer-based LLMs exhibit in-context learning by performing regression directly via self-attention. In the softmax regression formulation,

$$\min_x \left\| \frac{\exp(Ax)}{\langle \exp(Ax),\, \mathbf{1}_n \rangle} - b \right\|_2,$$

the prediction function

$$f(x) = \frac{\exp(Ax)}{\langle \exp(Ax),\, \mathbf{1}_n \rangle}$$

emulates the softmax-attention output (analogous to $D^{-1} \exp(QK^\top)V$ in Transformers). Gradient descent on this loss,

$$L(x) = \frac{1}{2} \bigl\| f(x) - b \bigr\|_2^2,$$

yields controlled, bounded stepwise updates, similar in effect to self-attention layer updates. Specifically, Lipschitz analysis establishes that

$$\|f(x_{t+1}) - f(x_t)\|_2 \leq M \|x_{t+1} - x_t\|_2,$$

where $M$ depends exponentially on bounds for $A$ and the input, mirroring the bounded changes during attention operations (Li et al., 2023). This analysis implies that OLS-like regression and attention mechanisms in Transformers both effect similar, bounded transformations during in-context training.
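The sketch below runs plain gradient descent on this loss using the standard analytic softmax Jacobian (an assumption of this illustration, not the specific analysis of the cited paper), and tracks how much the prediction $f(x)$ moves per step.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4
A = rng.normal(size=(n, d))
b = rng.dirichlet(np.ones(n))              # target on the probability simplex

def f(x):
    """Softmax regression prediction f(x) = exp(Ax) / <exp(Ax), 1>."""
    z = np.exp(A @ x - np.max(A @ x))      # shift for numerical stability
    return z / z.sum()

def grad_L(x):
    """Gradient of L(x) = 0.5 * ||f(x) - b||^2 via the softmax Jacobian."""
    p = f(x)
    J = (np.diag(p) - np.outer(p, p)) @ A  # d f / d x
    return J.T @ (p - b)

x, eta = np.zeros(d), 0.5
for t in range(500):
    x_new = x - eta * grad_L(x)
    step = np.linalg.norm(f(x_new) - f(x))  # bounded per-step change of f
    x = x_new

print("final loss:", 0.5 * np.linalg.norm(f(x) - b) ** 2)
print("last prediction step size:", step)
```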

4. OLS and Attention via Tensor Calculus and Generalized Linear Mapping

By lifting regression analysis into Hilbert spaces and tensor products, OLS can be represented as a universal linear mapping akin to attention. For feature-target pairs $\{(u_k, f(u_k))\}$, OLS seeks a tensor $B$ such that

$$f(u) \approx B\llbracket 1\rrbracket u,$$

with residual minimization

$$\mathcal{L}(X) = \sum_{k=1}^p \|f(u_k) - X \otimes u_k\|^2,$$

and closed-form solution

$$\mathbf{B} = Z X (X^T X)^{-1},$$

where $Z$ is the target matrix and $X$ the feature matrix (Algarte, 19 Nov 2024). In attention modules,

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$

the softmax computes dynamic, data-dependent weights. In the tensor OLS formulation, the optimal BB acts as a unique transformation that "attends" to feature axes for error minimization. Practical implications include clearer interpretability of attention weights and extendibility to higher-order or multi-head attention.
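A brief sketch of the closed-form multi-output OLS map, under the assumption that targets are stacked as columns of $Z$ and samples as rows of the feature matrix; the data and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 100, 6, 3                        # samples, features, output dims
U = rng.normal(size=(n, p))                # feature matrix (rows = samples u_k)
B_true = rng.normal(size=(m, p))
F = U @ B_true.T + 0.05 * rng.normal(size=(n, m))   # targets f(u_k)

# Closed-form OLS map: with targets as columns (Z = F^T) and features X = U,
# the optimal linear map is B = Z X (X^T X)^{-1}.
Z = F.T
B_hat = Z @ U @ np.linalg.inv(U.T @ U)

# B_hat minimizes sum_k ||f(u_k) - B u_k||^2; it matches per-output lstsq.
B_lstsq = np.linalg.lstsq(U, F, rcond=None)[0].T
assert np.allclose(B_hat, B_lstsq)
```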

5. Extended Linear Self-Attention and OLS Algorithm Emulation

Extended linear self-attention (ELSA) introduces explicit bias matrices into the standard attention parametrization,

$$\text{ELSA}(A) = (W_3 + B_3) (W_1 + B_1)^T (W_2 + B_2),$$

where WlW_l are weights and BlB_l are biases (Hagiwara, 31 Mar 2025). This parameterization allows flexible "mask-and-move" operations to select, insert, or manipulate submatrices and constants—effectively enabling direct emulation of key algorithmic primitives required for OLS and ridge regression, such as

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

and the iterative gradient descent update

$$\beta_t = \beta_{t-1} - \eta (X^T X \beta_{t-1} - X^T y).$$

Bias matrices can create skip connections or constant output components essential for algorithmic representation. This represents an expressive generalization of standard attention that facilitates in-context algorithmic implementations, including OLS, via prompt and parameter design.
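The following minimal sketch checks that the iterative gradient-descent update above does converge to the OLS closed form on synthetic data; it illustrates only the algorithmic primitive that ELSA layers are argued to emulate, not the ELSA construction itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Closed-form OLS solution (the target the iteration should reach)
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient-descent emulation: beta_t = beta_{t-1} - eta (X'X beta_{t-1} - X'y)
eta = 1.0 / np.linalg.norm(X.T @ X, 2)    # step size below 2/L for convergence
beta = np.zeros(p)
for _ in range(2000):
    beta = beta - eta * (X.T @ X @ beta - X.T @ y)

print("max abs difference from closed form:", np.max(np.abs(beta - beta_closed)))
```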

6. Adaptive Filter Attention and the Estimation-Theoretic View

Adaptive Filter Attention (AFA) reframes attention as optimal state estimation for linear stochastic differential equation (SDE) models (Racioppo, 4 Sep 2025). Each input token is a noisy measurement of a latent state, and temporal propagation uses the transition matrix $A$:
$$x(t) \gets e^{A (t_j - t_i)} z(t_j).$$
Pairwise measurement uncertainty is propagated via the solution to the differential Lyapunov equation:
$$dV_F/ds = A V_F + V_F A^T + Q.$$
Maximum likelihood estimation aggregates multiple propagated measurements by precision-weighted averaging:
$$\bar{x}_i = \left(\sum_j P_{ij}\right)^{-1} \sum_j P_{ij} \hat{x}_{i \to j},$$
with $P_{ij}$ the inverse covariance. This operation is mathematically equivalent to OLS (or IRLS when robust weights $w_{ij} \sim [1 + (r_{ij}^T P_{ij} r_{ij}) / \nu]^{-1}$ are used), establishing that attention in this setting is an optimal least squares estimator. In the limit where the dynamics vanish, standard dot-product softmax attention is recovered.
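A hedged numerical sketch of the aggregation step follows: measurements are propagated with a matrix exponential as in the formula above and combined by inverse-covariance weights. The dynamics matrix, the synthetic tokens, and the simple time-decaying precision are illustrative stand-ins, not the covariance propagation of the cited paper.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
d = 2
A = np.array([[0.0, 1.0], [-1.0, -0.2]])   # assumed latent dynamics
times = np.array([0.0, 0.5, 1.0, 1.5])
z = rng.normal(size=(len(times), d))        # noisy token "measurements"

# Estimate the state associated with token i = 0 from every token j:
# propagate each measurement with e^{A (t_j - t_i)}, following the formula above,
# and assign it a precision P_ij (here a synthetic choice decaying with time gap).
t_i = times[0]
x_hats, precisions = [], []
for t_j, z_j in zip(times, z):
    x_hats.append(expm(A * (t_j - t_i)) @ z_j)
    precisions.append(np.eye(d) / (1.0 + abs(t_j - t_i)))

# Precision-weighted average: x_bar = (sum_j P_ij)^{-1} sum_j P_ij x_hat_{i->j}
P_sum = np.sum(precisions, axis=0)
weighted = np.sum([P @ x for P, x in zip(precisions, x_hats)], axis=0)
x_bar = np.linalg.solve(P_sum, weighted)
print("aggregated state estimate:", x_bar)
```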

7. Local Linear Attention: Test-Time Regression as OLS

Local Linear Attention (LLA) is derived from nonparametric local linear regression, generalizing the standard softmax (Nadaraya-Watson estimator) attention to include first-order corrections—effectively performing a local OLS regression per query (Zuo et al., 1 Oct 2025). The weighted least squares objective,

$$\min_{b, W} \frac{1}{2} \sum_{j=1}^{i} w_{ij} \|v_j - b - W(k_j - q_i)\|^2 + \lambda\|W\|_F^2,$$

gives rise to the attention weights

$$s_{ij} = \frac{w_{ij} (1 - (k_j - q_i)^T \rho_i)}{\omega_i - \mu_i^T \rho_i}, \quad \rho_i = \Sigma_i^{-1} \mu_i,$$

where

$$\omega_i = \sum_{j=1}^{i} w_{ij}, \quad \mu_i = \sum_{j=1}^{i} w_{ij}(k_j - q_i), \quad \Sigma_i = \sum_{j=1}^{i} w_{ij}(k_j - q_i)(k_j - q_i)^T + \lambda I.$$

The local linear correction reduces boundary bias, and theoretical analysis proves an improved integrated mean squared error,
$$E\left[\int_D \|f_{LL}(x) - f(x)\|^2 \, dx\right] = O(n^{-4/(d+4)}),$$
as compared to softmax attention's $O(n^{-3/(d+3)})$ rate. LLA can be efficiently implemented via blockwise algorithms and matrix-free inversion (FlashLLA), ensuring scalability. Experimental results confirm that LLA, interpreted as OLS-based local regression, improves in-context learning, associative recall, and adaptability to non-stationarity.
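A naive quadratic-time NumPy sketch of the weights $s_{ij}$ as written above is given below (a Gaussian kernel for $w_{ij}$ is an illustrative choice); the output for query $i$ is formed as $\sum_j s_{ij} v_j$. This is not the blockwise FlashLLA algorithm, only a direct evaluation of the formula.

```python
import numpy as np

rng = np.random.default_rng(6)
T, d, dv, lam = 16, 8, 4, 1e-2
q = rng.normal(size=(T, d))                 # queries
k = rng.normal(size=(T, d))                 # keys
v = rng.normal(size=(T, dv))                # values

def lla_output(i):
    """Local linear attention output for query i (causal: keys j <= i)."""
    diff = k[: i + 1] - q[i]                            # k_j - q_i
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=1))        # kernel weights w_ij
    omega = w.sum()
    mu = (w[:, None] * diff).sum(axis=0)
    Sigma = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    Sigma += lam * np.eye(d)
    rho = np.linalg.solve(Sigma, mu)
    s = w * (1.0 - diff @ rho) / (omega - mu @ rho)     # attention weights s_ij
    return s @ v[: i + 1]

out = np.stack([lla_output(i) for i in range(T)])
print(out.shape)   # (T, dv)
```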

Conclusion

The equivalence between OLS and attention mechanisms has been formalized across multiple lines of research. OLS, classically understood as the optimal linear estimator, maps naturally onto the structure and operations found in attention modules, whether through similarity-based aggregation, implicit regularization, algorithmic emulation, or adaptive filtering. The computational primitives underlying regression—inner products, masking, aggregation, and inversion—correspond directly to the mechanisms driving attention in modern neural architectures. This synthesis not only provides a rigorous, interpretable foundation for attention models in deep learning but also opens avenues for hybrid OLS-attention algorithms with desirable statistical and computational properties.
