
OLS-Attention Nexus: Bridging Regression & Attention

Updated 12 October 2025
  • The OLS-Attention Nexus is a conceptual framework that recasts classical ordinary least squares regression as an attention mechanism by interpreting predictions as weighted sums based on similarity.
  • It bridges traditional statistical estimation with modern machine learning, showing how eigen-decomposition and sparse recovery techniques inform the design of attention-based models.
  • The approach supports tuning-free, robust inference and adaptive variable selection in high-dimensional settings, aligning regression with dynamic feature attention in neural networks.

The OLS-Attention Nexus denotes the connection between ordinary least squares (OLS) regression and modern attention mechanisms, including analogies in statistical estimation, machine learning, sparse recovery, and econometrics. Recent theoretical work shows that OLS predictions can be understood as restricted attention modules operating on similarity in transformed regressor spaces, with direct parallels to query-key-value structures in Transformer architectures. This conceptual bridge illuminates shared mathematical mechanisms, sheds light on sparsity and selection criteria, and highlights how classical regression ideas underpin key features of deep learning models.

1. OLS as an Attention Mechanism

The mapping of OLS into an attention framework is established via the transformation

$$\hat{y}_{\text{test}} = X_{\text{test}} (X_{\text{train}}' X_{\text{train}})^{-1} X_{\text{train}}' y_{\text{train}}.$$

Upon eigendecomposition,

$$(X_{\text{train}}' X_{\text{train}})^{-1} = U \Lambda^{-1} U',$$

where $U$ and $\Lambda$ collect the eigenvectors and eigenvalues, OLS predictions can be recast as

$$\hat{y}_{\text{test}} = F_{\text{test}} F_{\text{train}}' y_{\text{train}}$$

with factor scores $F = X U \Lambda^{-1/2}$. In this form, every test outcome is a weighted sum of training outcomes, with weights determined by the similarity $\langle F_j, F_i \rangle$ between the encoded test and training vectors. This mirrors the query-key-value attention mechanism, where the "queries" are encoded test features, the "keys" are encoded training features, and the "values" are outcomes (Coulombe, 13 Apr 2025).
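
The equivalence can be checked numerically. The following NumPy sketch (our illustration; the synthetic data and variable names are not from the cited paper) forms the factor scores $F = X U \Lambda^{-1/2}$ and confirms that similarity-weighted sums of training outcomes reproduce the usual OLS predictions.

```python
# Minimal sketch of OLS-as-attention on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 200, 5, 3
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n_train)

# Standard OLS prediction: y_hat = X_test (X'X)^{-1} X' y.
G = X_train.T @ X_train
y_hat_ols = X_test @ np.linalg.solve(G, X_train.T @ y_train)

# Attention view: eigendecompose (X'X)^{-1} = U Lambda^{-1} U' and build factor scores F = X U Lambda^{-1/2}.
eigvals, U = np.linalg.eigh(G)
F_train = X_train @ U @ np.diag(eigvals ** -0.5)
F_test = X_test @ U @ np.diag(eigvals ** -0.5)

# Each test prediction is a similarity-weighted sum of training outcomes, with weights <F_j, F_i>.
attention_weights = F_test @ F_train.T      # (n_test, n_train) "attention" matrix
y_hat_attn = attention_weights @ y_train

print(np.allclose(y_hat_ols, y_hat_attn))   # True: the two views give identical predictions
```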

2. Similarity Structures and Embedding Optimization

The OLS-attention reinterpretation frames regression not as estimation of coefficients, but as construction of an embedding that minimizes prediction error by optimally encoding predictors for similarity-based aggregation. The prediction for test case $j$ is

$$\hat{y}_j = \sum_{i=1}^{N} \langle F_j, F_i \rangle y_i$$

where $\langle F_j, F_i \rangle$ captures both magnitude and cosine similarity in the decorrelated factor space. The alternative optimization objective becomes

$$\min_{\Omega} \| y - X_{\text{train}} \Omega X_{\text{train}}' y \|^2,$$

with optimal mapping $\Omega = (X_{\text{train}}' X_{\text{train}})^{-1}$. This aligns with the attention paradigm, where encoding and decoding matrices are learned to compare inputs by inner products, paralleling the self-attention operation (Coulombe, 13 Apr 2025).
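
As a quick numerical check on this embedding view, the sketch below (ours, on synthetic data) evaluates the objective at $\Omega = (X_{\text{train}}' X_{\text{train}})^{-1}$ and at random perturbations of it; no perturbation improves on the closed-form encoding.

```python
# Sanity check that Omega = (X'X)^{-1} minimizes || y - X Omega X' y ||^2 (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ rng.normal(size=4) + 0.2 * rng.normal(size=100)

def loss(Omega):
    return np.sum((y - X @ Omega @ X.T @ y) ** 2)

Omega_star = np.linalg.inv(X.T @ X)
base = loss(Omega_star)

# Random perturbations of the optimal encoding never reduce the objective.
never_better = all(loss(Omega_star + 0.01 * rng.normal(size=(4, 4))) >= base
                   for _ in range(100))
print(base, never_better)   # residual sum of squares at the optimum; True
```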

3. Sparse Recovery, Coherence, and Greedy Pursuit

Sparse recovery algorithms such as OLS and orthogonal matching pursuit (OMP) can be understood through their coherence-based selection and iterative attention to dictionary atoms. The mutual coherence $\mu = \max_{i \neq j} |\langle a_i, a_j \rangle|$ governs when exact support recovery is possible. Notably, when partial support is known (i.e., some true atoms have already been correctly attended to), the recovery bound relaxes from Tropp's classical result $\mu < 1/(2k-1)$ to

$$\mu < \frac{1}{2k - \ell - 1}$$

with $\ell < k$ the number of support elements already selected. This illustrates how initial correct attention leads to relaxed constraints in subsequent selection, drawing a direct parallel between greedy pursuit strategies and sequential attention mechanisms in neural networks (Herzet et al., 2012).
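
The greedy "attention over atoms" step is easy to make concrete. The sketch below is a generic OMP implementation on a random dictionary together with the mutual coherence computation; it is illustrative only and does not reproduce the partial-support analysis of Herzet et al.

```python
# Generic OMP with a mutual-coherence check (illustrative sketch).
import numpy as np

def mutual_coherence(A):
    """Largest absolute inner product between distinct normalized columns of A."""
    An = A / np.linalg.norm(A, axis=0)
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)
    return G.max()

def omp(A, y, k):
    """Greedily select k atoms: at each step attend to the column most correlated with the residual."""
    residual, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    return support, coef

rng = np.random.default_rng(2)
A = rng.normal(size=(64, 128))
A /= np.linalg.norm(A, axis=0)
true_support = [3, 40, 77]
x = np.zeros(128)
x[true_support] = [1.5, -2.0, 1.0]
y = A @ x

support, _ = omp(A, y, k=3)
# The coherence of a random 64 x 128 dictionary is well above 1/(2k-1), yet OMP
# usually still recovers this easy noiseless support.
print(mutual_coherence(A), sorted(support) == sorted(true_support))
```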

4. Tuning-Free and Statistics-Oblivious Sparse Attention

Tuning-free incremental greedy pursuits (TF-IGP) and residual ratio threshold (RRT-IGP) frameworks enable OLS/OMP methods to perform sparse recovery without prior knowledge of sparsity or noise statistics. These frameworks track the residual ratio

$$\mathrm{RR}(k) = \frac{\|r^{(k)}\|_2}{\|r^{(k-1)}\|_2}$$

to dynamically determine the stopping point—effectively letting the data dictate how many elements should be attended to. These adaptive approaches operate under restricted isometry property (RIP) guarantees and suggest an architecture where attention weights are assigned and the number of attended features is dynamically chosen, aligning with the adaptive nature of neural attention blocks (Kallummil et al., 2017).
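
A minimal sketch of residual-ratio stopping is given below. The fixed threshold is a placeholder rather than the RRT statistic derived in the cited work, so the code only conveys the mechanism of letting the data choose how many atoms are attended to.

```python
# OMP variant with residual-ratio stopping (illustrative sketch; the threshold is a placeholder).
import numpy as np

def omp_residual_ratio(A, y, rr_threshold=0.9, max_iter=None):
    """Add atoms while RR(k) = ||r^(k)|| / ||r^(k-1)|| stays below the threshold."""
    n, p = A.shape
    max_iter = max_iter or min(n, p)
    residual, support = y.copy(), []
    prev_norm = np.linalg.norm(y)
    for _ in range(max_iter):
        if prev_norm == 0:
            break
        candidate = support + [int(np.argmax(np.abs(A.T @ residual)))]
        coef, *_ = np.linalg.lstsq(A[:, candidate], y, rcond=None)
        new_residual = y - A[:, candidate] @ coef
        if np.linalg.norm(new_residual) / prev_norm > rr_threshold:
            break   # little is gained by attending to another atom: stop here
        support, residual = candidate, new_residual
        prev_norm = np.linalg.norm(residual)
    return support

rng = np.random.default_rng(11)
A = rng.normal(size=(64, 128))
A /= np.linalg.norm(A, axis=0)
x = np.zeros(128)
x[[5, 60, 101]] = [2.0, -1.5, 1.0]
y = A @ x + 0.05 * rng.normal(size=64)
print(omp_residual_ratio(A, y))   # typically returns the three true support indices
```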

5. Parsimonious Regression, Covariance, and Sparse Attention in Inference

In high-dimensional inference, sparse attention is operationalized by running individual OLS regressions on subsets (or singletons) of regressors. The positive definiteness of the asymptotic covariance matrix $V$ of OLS estimators in such parsimonious regressions ensures reliable and efficient inference, notably for max-type tests:

$$V = D^{-1} (A_{xx} - A_{zx}' A_{zz}^{-1} A_{zx}) D^{-1}$$

This property underpins simulation-based p-value calculation and provides statistical foundations for sparse attention channels in econometrics, supporting robust variable selection and significance testing in overparameterized models (Nagakura, 2020).
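
The sketch below illustrates the parsimonious-regression idea on synthetic data: one singleton OLS per candidate regressor and a max-type statistic over the resulting t-statistics. It deliberately omits the simulation-based critical values that would use the covariance $V$ above.

```python
# Singleton OLS regressions and a max-type statistic (illustrative sketch, synthetic data).
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 50
X = rng.normal(size=(n, p))
y = 0.4 * X[:, 7] + rng.normal(size=n)           # only regressor 7 matters

t_stats = []
for j in range(p):
    xj = np.column_stack([np.ones(n), X[:, j]])  # singleton regression with intercept
    beta, *_ = np.linalg.lstsq(xj, y, rcond=None)
    resid = y - xj @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(xj.T @ xj)[1, 1])
    t_stats.append(beta[1] / se)

# The max-type statistic "attends" to the most significant channel.
print(int(np.argmax(np.abs(t_stats))), float(np.max(np.abs(t_stats))))
```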

6. Weighted-Average Interpretation and Attention in IV vs OLS Estimation

The difference between IV and OLS estimates is interpreted through the lens of "attention" (weighting) over covariate and treatment margins. Both IV and OLS coefficients admit expressions as weighted averages:

$$\beta_{IV} = \iint \left[ \frac{\partial g(x, w)}{\partial x} \right] \omega_Z(x,w) \, dF_W(w)\, dx$$

$$\beta_{OLS} = \iint \left[ \frac{\partial m(x, w)}{\partial x} \right] \omega_X(x,w) \, dF_W(w)\, dx$$

The IV-OLS gap decomposes into differences over covariate weights, treatment-level weights, and endogeneity bias. This diagnostic framework identifies how IV and OLS attend differentially across population subgroups and treatment intensities, echoing attention allocation in neural networks (Ishimaru, 2021).
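
A small simulation (ours, for intuition only) shows the two estimators attending to different variation in the treatment: OLS uses all variation in $x$, including the confounded part, while IV uses only the instrument-driven part.

```python
# OLS vs IV on a simulated endogenous treatment (illustrative sketch).
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)          # endogenous treatment
y = 1.0 * x + 2.0 * u + rng.normal(size=n)    # true effect of x is 1.0

beta_ols = np.cov(x, y)[0, 1] / np.cov(x, x)[0, 1]    # uses all variation in x, including u
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]     # uses only instrument-driven variation
print(beta_ols, beta_iv)    # OLS is biased upward; IV lands near 1.0
```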

7. Orthogonal Attention, Compression, and Scaling

Modern attention modules such as LAVO (Linear Attention via Orthogonal Memory) operationalize attention as projection onto a fixed set of orthogonal bases, compressing a long-range context into a bounded memory:

$$\tilde{X} = B \odot H; \quad H = \frac{1}{n} \sum_{t=1}^n (B \cdot x_t)$$

Such designs minimize redundancy and parallel the use of orthogonal projections in OLS, with sequential updates and local context windows capturing both global and fine-grained detail. Conceptually, this memory compression is akin to regression onto a basis, facilitating efficient long-context attention in causal language modeling (Zhang et al., 2023).
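
The sketch below conveys only the compression idea and is not the LAVO implementation: a long sequence of token states is projected onto a small fixed orthonormal basis, yielding a memory whose size does not grow with context length.

```python
# Compressing a long context into a bounded orthogonal memory (conceptual sketch, not LAVO itself).
import numpy as np

rng = np.random.default_rng(5)
n_tokens, d_model, r = 4096, 64, 8

X = rng.normal(size=(n_tokens, d_model))              # token representations x_t
B, _ = np.linalg.qr(rng.normal(size=(d_model, r)))    # fixed orthonormal basis (d_model x r)

H = (X @ B).mean(axis=0)                              # bounded memory: r coefficients, independent of n_tokens
context_summary = B @ H                               # read-out back into model space

print(H.shape, context_summary.shape)                 # (8,) (64,)
```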

8. Dimensionality Reduction: PLS, OLS, and Krylov Subspaces

Dimensionality reduction via partial least squares (PLS) is tightly connected to OLS, with PLS solutions residing in Krylov subspaces generated by the regressor covariance structure. The proximity between PLS and OLS estimators,

$$\|\widehat{\beta}_{PLS}^{(L)} - \widehat{\beta}_{OLS}\|^2 = \min_{Q_L \in \Omega_L} \sum_{d=1}^D Q_L(\lambda_d)^2 \lambda_d \xi_d^2$$

is dictated by eigenvalue clustering in $X'X$. When clusters are tight, PLS approximates OLS with few components; heterogeneous spectral distributions require more components. This provides theoretical guidelines for hybrid attention systems leveraging low-dimensional embeddings (Val et al., 2023).
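
This behavior is easy to check numerically. The hedged sketch below builds a design whose covariance has two tight eigenvalue clusters and compares PLS coefficients (via scikit-learn's PLSRegression) with OLS as the number of components grows.

```python
# PLS approaches OLS quickly when eigenvalues of X'X are clustered (illustrative sketch).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
n, p = 500, 10

# Two tight eigenvalue clusters, around 10 and around 1.
eigs = np.concatenate([10 + 0.01 * rng.normal(size=5), 1 + 0.01 * rng.normal(size=5)])
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
X = rng.normal(size=(n, p)) @ (Q * np.sqrt(eigs)) @ Q.T
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

for L in (1, 2, 3):
    beta_pls = PLSRegression(n_components=L, scale=False).fit(X, y).coef_.ravel()
    # Distance to OLS typically drops sharply once L matches the number of clusters.
    print(L, np.linalg.norm(beta_pls - beta_ols))
```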

9. Robustness of OLS in Heavy-Tail Estimation

In estimation of Pareto tail exponents, the OLS estimator, after an appropriate small-sample correction, offers robustness compared with the classical Hill MLE, especially when Pareto behavior holds only in the upper tail or when distributions are merely regularly varying. The shifted OLS estimator applies uniform weights, minimizing excessive influence from small observations:

$$\hat{d}_{sOLS} = g(n) \cdot \hat{d}_{OLS}$$

with $g(n)$ a bias-correcting multiplicative factor. Empirical analyses demonstrate greater stability and less sensitivity to support choices, suggesting a practical advantage of OLS-based attention for distributional inference in the presence of structural model uncertainty (Santos et al., 16 Sep 2024).
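
The sketch below compares a rank-size OLS tail-index estimate with the Hill estimator on simulated Pareto data. The 1/2 rank shift used here is a standard bias correction, not necessarily the $g(n)$ factor of the cited work.

```python
# OLS (log-log rank-size) vs Hill tail-index estimation on Pareto data (illustrative sketch).
import numpy as np

rng = np.random.default_rng(7)
alpha, n = 2.0, 5000
x = np.sort(rng.pareto(alpha, size=n) + 1.0)[::-1]   # Pareto(alpha) sample, sorted descending

k = 500                                              # upper-tail observations used
tail = x[:k]
ranks = np.arange(1, k + 1)

# OLS of log(rank - 1/2) on log(x): the slope estimates -alpha.
slope = np.polyfit(np.log(tail), np.log(ranks - 0.5), 1)[0]
alpha_ols = -slope

# Hill estimator over the same k order statistics.
alpha_hill = 1.0 / np.mean(np.log(tail[:-1]) - np.log(tail[-1]))

print(alpha_ols, alpha_hill)                         # both should land near the true alpha = 2.0
```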


The OLS-Attention Nexus formalizes deep mathematical and algorithmic correspondences between least squares regression and attention mechanisms. By reframing regression and feature selection as similarity-driven encoding, embedding optimization, and adaptive focus over sparse supports, this body of research clarifies how statistical estimation procedures can illuminate and improve both traditional and modern attention-based inference and learning systems.
