Attention-Lasso: Sparse Local Modeling

Updated 17 December 2025
  • Attention-lasso is a supervised learning technique that enhances lasso regression by integrating attention-based instance weighting to build local adaptive models.
  • It employs ridge-softmax and random-forest proximity schemes to compute attention weights, ensuring that relevant examples influence each test-specific lasso fit.
  • Empirical findings reveal significant improvements in prediction accuracy and interpretability across tabular, time-series, spatial, and multi-omics data.

Attention-lasso refers to a class of supervised learning algorithms that combine the feature selection and sparsity of lasso regression with attention-based instance weighting mechanisms. These methods flexibly learn local models for individual prediction points by selectively emphasizing relevant training examples according to a supervised similarity metric. The approach improves adaptability to heterogeneous data structures, retains interpretability of model decisions, and can be generalized to settings involving tabular, time-series, spatial, and multi-omics data (Craig et al., 10 Dec 2025, Alharbi et al., 30 Aug 2024).

1. Mathematical Formalism and Objective Function

For traditional lasso regression, the model minimizes the $\ell_1$-penalized least-squares loss across all training samples. Attention-lasso extends this by introducing attention weights for each training example, specific to each test point $x^*$. Let $X\in\mathbb{R}^{n\times p}$ be the training data, $y\in\mathbb{R}^n$ the responses, and $x^*\in\mathbb{R}^p$ a test point. Define attention weights $a_1(x^*),\dots,a_n(x^*)\geq 0$ with $\sum_{i=1}^n a_i(x^*)=1$. The attention-weighted lasso solves:

$$\hat{\beta}(x^*) = \underset{\beta\in\mathbb{R}^p}{\arg\min}\; \sum_{i=1}^n a_i(x^*)\,(y_i - x_i^\top \beta)^2 + \lambda\|\beta\|_1,$$

where $\lambda\geq 0$ is the regularization parameter shared across all local fits (Craig et al., 10 Dec 2025).
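Since the objective is a weighted least-squares loss with an $\ell_1$ penalty, each local fit reduces to an ordinary lasso on rows rescaled by $\sqrt{a_i(x^*)}$. A minimal sketch, assuming the attention weights are already computed and using scikit-learn's Lasso (whose objective carries an extra $1/(2n)$ factor, so its alpha is a rescaled $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso

def attention_lasso_fit(X, y, attn, lam):
    """Solve the attention-weighted lasso for a single test point.

    Minimizing sum_i a_i (y_i - x_i' beta)^2 + lam * ||beta||_1 is equivalent
    to an ordinary lasso on rows rescaled by sqrt(a_i).
    """
    w = np.sqrt(attn)                                   # attn: nonnegative, sums to one
    Xw, yw = X * w[:, None], y * w
    model = Lasso(alpha=lam, fit_intercept=False)       # alpha plays the role of lambda here
    model.fit(Xw, yw)
    return model.coef_

# usage with placeholder data: uniform weights recover an ordinary (rescaled) lasso fit
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
beta_local = attention_lasso_fit(X, y, np.full(200, 1 / 200), lam=0.1)
```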

In multi-omics graph settings (as in LASSO-MOGAT), lasso is first used to sparsify features per omics modality. Then, selected features become nodes in a protein-protein interaction (PPI) graph, with Graph Attention Network (GAT) layers learning edge-level attention coefficients:

$$e_{ij} = \mathrm{LeakyReLU}\!\left(a^\top[W h_i \,\|\, W h_j]\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}(i)} \exp(e_{ik})}.$$

Feature vectors are aggregated via $\alpha_{ij}$, propagating attention across molecular networks (Alharbi et al., 30 Aug 2024).
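A minimal numpy sketch of these two formulas, restricted to a single attention head and a single layer; the weight matrix $W$, attention vector $a$, node features, and neighbor lists below are toy placeholders:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_attention(h, W, a, neighbors):
    """Edge attention coefficients alpha_ij for one GAT head.

    h: (n_nodes, d_in) node features, W: (d_out, d_in), a: (2 * d_out,),
    neighbors: dict mapping node i to the list of its neighbor indices.
    """
    Wh = h @ W.T
    alpha = {}
    for i, nbrs in neighbors.items():
        e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in nbrs])
        e = np.exp(e - e.max())             # softmax normalized over the neighborhood of i
        alpha[i] = dict(zip(nbrs, e / e.sum()))
    return alpha

# toy PPI-like graph with three nodes and two-dimensional features
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(gat_attention(h, W=np.eye(2), a=np.ones(4), neighbors={0: [1, 2], 1: [0], 2: [0, 1]}))
```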

2. Construction of Attention Weights

Two principal schemes have been formalized for tabular Attention-Lasso (Craig et al., 10 Dec 2025):

  • Ridge-based Softmax Attention: Fit a ridge regression $y\approx X\beta^{\text{ridge}}$, with $\beta^{\text{ridge}}=\arg\min_\beta\|y-X\beta\|_2^2+\alpha\|\beta\|_2^2$. The feature importances $W = \mathrm{Diag}(|\beta^{\text{ridge}}_1|,\dots,|\beta^{\text{ridge}}_p|)$ are used to score similarity:

$$s_i(x^*) = x^{*\top} W x_i, \qquad a_i(x^*) = \frac{\exp(s_i(x^*)/\tau)}{\sum_j \exp(s_j(x^*)/\tau)},$$

where $\tau$ modulates the concentration versus diffusion of attention.

  • Random-Forest Proximity-Based Attention: Fit a random forest on $(X,y)$; for $x^*$ and $x_i$, the proximity $\mathrm{prox}(x^*,x_i)$ is defined as the fraction of trees in which $x^*$ and $x_i$ share a terminal leaf. The attention weights are:

$$a_i(x^*) = \frac{\exp(\mathrm{prox}(x^*,x_i)/\tau)}{\sum_j \exp(\mathrm{prox}(x^*,x_j)/\tau)}.$$

These weighting schemes allow the local lasso model for $x^*$ to prioritize training examples likely to share relevant predictive structure.
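As an illustration, both constructions can be sketched in a few lines using scikit-learn; the softmax helper, default values of $\tau$ and the ridge penalty, and the random placeholder data are assumptions for the example, not prescriptions from the source.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

def softmax(scores, tau):
    z = np.exp((scores - scores.max()) / tau)   # max-shift for numerical stability
    return z / z.sum()

def ridge_softmax_attention(X, y, x_star, tau=1.0, ridge_alpha=1.0):
    """Similarity s_i = x*^T Diag(|beta_ridge|) x_i, passed through a softmax."""
    ridge = Ridge(alpha=ridge_alpha, fit_intercept=False).fit(X, y)
    W = np.abs(ridge.coef_)                     # diagonal of Diag(|beta_ridge|)
    return softmax((X * W) @ x_star, tau)

def rf_proximity_attention(X, y, x_star, tau=0.1, n_trees=500):
    """Proximity = fraction of trees in which x* and x_i share a terminal leaf."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X, y)
    leaves = rf.apply(X)                        # (n, n_trees) leaf indices
    leaves_star = rf.apply(x_star.reshape(1, -1))
    return softmax((leaves == leaves_star).mean(axis=1), tau)

# placeholder data and a training point reused as the query
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
a_ridge = ridge_softmax_attention(X, y, X[0])
a_rf = rf_proximity_attention(X, y, X[0])
```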

3. Algorithmic Pipeline and Computational Considerations

The standard computational procedure is:

  1. Fit a random forest to $(X,y)$ to derive similarity measures (attention weights).
  2. For each test point $x^*_i$, compute $a_j(x^*_i)$ for all training points.
  3. Fit a global lasso model on all data to select $\lambda$ via cross-validation.
  4. For each test point, solve an attention-weighted lasso using the shared $\lambda$.
  5. Combine baseline and attention-lasso predictions via a convex blend, $y_\text{final}(x^*_i) = (1-m)\,\hat{y}_\text{base}(x^*_i) + m\,\hat{y}_\text{attn}(x^*_i)$, for $m\in[0,1]$ chosen by CV.

The localized lasso fits are parallelizable and have computational complexity comparable to leave-one-out CV. The penalty $\lambda$ is fixed by the global fit and reused for all localized models (Craig et al., 10 Dec 2025).
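Assuming the helpers sketched in earlier sections (rf_proximity_attention and attention_lasso_fit), the five steps can be put together roughly as follows; the blend weight m and temperature tau would in practice be chosen by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def attention_lasso_predict(X, y, X_test, m=0.5, tau=0.1):
    """Steps 1-5: shared lambda from a global lasso, local weighted fits, convex blend."""
    base = LassoCV(cv=5, fit_intercept=False).fit(X, y)     # step 3: lambda via CV
    y_base = X_test @ base.coef_
    y_attn = np.empty(len(X_test))
    for i, x_star in enumerate(X_test):                     # steps 1-2 and 4, per test point
        # (in practice the forest is fit once and its leaf assignments reused)
        a = rf_proximity_attention(X, y, x_star, tau=tau)
        y_attn[i] = x_star @ attention_lasso_fit(X, y, a, lam=base.alpha_)
    return (1 - m) * y_base + m * y_attn                    # step 5: convex blend
```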

In the multi-omics graph domain, lasso feature selection is performed prior to constructing the GAT model. GAT then optimizes attention coefficients and propagates embeddings through multiple layers, followed by a final classification (Alharbi et al., 30 Aug 2024).

4. Theoretical Properties and Mixture Model Analysis

Under a two-component mixture model:

  • $Z_i\in\{1,2\}$ with probabilities $\pi_1,\pi_2$.
  • $y_i = x_i^\top\beta_{Z_i} + \varepsilon_i$, with $\varepsilon_i\sim N(0,\sigma^2)$.
  • $x_i \mid Z_i=k \sim N(0,\Sigma_k)$.

Standard lasso asymptotically estimates a global blend of subgroup coefficients, with cluster mismatch bias. Ideal attention weighting enables the test-point model to upweight data from the same latent subgroup, reducing mean squared error:

$$\mathrm{MSE}_\text{att} / \mathrm{MSE}_\text{lasso} \to (W_2/\pi_2)^2 < 1.$$

Attention-lasso thus strictly reduces irreducible bias from averaging over heterogeneous subgroups, assuming the similarity metric is effective (Craig et al., 10 Dec 2025).
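The effect can be probed with a small simulation of the two-component mixture, comparing a global lasso against a local fit with oracle attention weights that put all mass on the test point's own subgroup; the coefficients, mixture proportion, and noise level below are arbitrary illustrative choices, and attention_lasso_fit is the helper sketched earlier:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, pi2, sigma = 500, 10, 0.3, 0.5
beta = {1: np.r_[np.ones(5), np.zeros(5)], 2: np.r_[np.zeros(5), -np.ones(5)]}

Z = rng.choice([1, 2], size=n, p=[1 - pi2, pi2])            # latent subgroup labels
X = rng.normal(size=(n, p))
y = np.array([X[i] @ beta[Z[i]] for i in range(n)]) + sigma * rng.normal(size=n)

# global lasso targets a blend of beta_1 and beta_2
beta_global = Lasso(alpha=0.05, fit_intercept=False).fit(X, y).coef_

# oracle attention for a subgroup-2 test point: uniform mass on same-group samples
attn = (Z == 2).astype(float)
beta_oracle = attention_lasso_fit(X, y, attn / attn.sum(), lam=0.05)

print("coef error, global lasso:", round(np.linalg.norm(beta_global - beta[2]), 3))
print("coef error, oracle attn :", round(np.linalg.norm(beta_oracle - beta[2]), 3))
```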

5. Interpretability and Coefficient Analysis

Attention-lasso achieves interpretability at two levels:

  • Feature-level: Each local lasso model $\hat{\beta}(x^*)$ is sparse, highlighting features predictive for the specific test point.
  • Example-level: Attention weights $a(x^*)$ provide an explicit ranking of training samples by relevance.

Aggregated coefficient vectors $\bar{\beta}(x^*_i)$ can be clustered across many test points to reveal latent subgroups and their signature models. Visualization pipelines include heatmaps, dendrograms, and within-cluster comparisons of prediction squared error (PSE) (Craig et al., 10 Dec 2025).
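For example, the stacked coefficient vectors can be fed to a hierarchical clustering routine to recover such subgroups; the sketch below uses a random stand-in matrix in place of the actual $\hat{\beta}(x^*_i)$ values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# betas: (n_test, p) matrix of local lasso coefficients, one row per test point
# (random stand-in here; in practice stack the beta_hat(x*) vectors)
betas = np.random.default_rng(3).normal(size=(60, 10))

link = linkage(betas, method="ward")        # feeds dendrogram / heatmap ordering
labels = fcluster(link, t=2, criterion="maxclust")

# per-cluster "signature model": the mean coefficient vector of each subgroup
for k in np.unique(labels):
    print(k, np.round(betas[labels == k].mean(axis=0), 2))
```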

In graph models, interpretability is afforded through analysis of high-attention edges, identifying key protein-protein interactions implicated in particular cancer types. For example, KRAS–BRAF and TP53–MDM2 edges receive the highest average attention coefficients in relevant contexts, whereas edges with $\alpha<0.001$ are effectively disregarded (Alharbi et al., 30 Aug 2024).
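Extracting such high-attention edges amounts to ranking edges by their average coefficient and discarding those below a threshold; the following generic sketch assumes a dense attention matrix and gene-name list and is not the LASSO-MOGAT implementation:

```python
import numpy as np

def top_attention_edges(alpha, names, top_k=10, threshold=1e-3):
    """Rank undirected edges by average attention, dropping near-zero edges."""
    avg = (alpha + alpha.T) / 2                         # average over both directions
    i, j = np.triu_indices_from(avg, k=1)
    edges = [(names[a], names[b], avg[a, b])
             for a, b in zip(i, j) if avg[a, b] >= threshold]
    return sorted(edges, key=lambda e: -e[2])[:top_k]
```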

6. Empirical Performance Across Domains

Empirical evaluation demonstrates that attention-lasso methods consistently outperform standard lasso in heterogeneous data settings, both in tabular and graph-based applications:

  • Tabular regression: Attention-lasso surpassed standard lasso on 11/12 UCI datasets, with average relative PSE improvements up to 93%, and was competitive with XGBoost, LightGBM, RF, and KNN (Craig et al., 10 Dec 2025).
  • Time-series and spatial prediction: Outperformed lasso by up to 54% lower PSE on GDP forecasting and improved AUC from 0.59 to 0.65 in spatial mass-spectrometry data.
  • Cancer multi-omics: In LASSO-MOGAT, single-omics GAT baselines yielded 87.6–92.2% accuracy, two-omics up to 94.1%, and full multi-omics via attention-lasso achieved 94.7% accuracy, macro-F1 = 0.8987 (Alharbi et al., 30 Aug 2024).

High attention coefficients were concentrated on biologically meaningful edges and modules, demonstrating enhanced interpretability and actionable insights.

7. Extensions, Practical Advice, and Limitations

Extensions to attention-lasso include:

  • Adaptation to other base learners (LightGBM, XGBoost) via weighted fitting or leaf-wise reweighting.
  • Data-drift adaptation with attention-weighted residual correction.
  • Heterogeneity estimation in causal inference settings using attention-weighted penalty terms.
  • Specialized attention mechanisms for time-series and spatial data via contextually defined similarity functions.

Recommended practices include fixing the $\ell_1$ penalty $\lambda$ globally, using random-forest proximity to better capture non-linear structure (default 500 trees), and choosing $\tau$ to control soft versus hard neighbor selection. For large-scale problems, approximate schemes and mini-batch constructions are suggested.
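The role of $\tau$ is visible directly in the softmax: small values concentrate nearly all weight on the closest neighbors (hard selection), while large values approach uniform weighting (soft selection). A toy illustration with made-up proximity scores:

```python
import numpy as np

prox = np.array([0.9, 0.7, 0.4, 0.1])          # toy proximities to four training points

for tau in (0.05, 0.5, 5.0):
    w = np.exp(prox / tau)
    w /= w.sum()
    print(f"tau={tau:>4}: {np.round(w, 3)}")   # small tau: near one-hot; large tau: near uniform
```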

Potential pitfalls arise if the similarity measure fails to separate latent subgroups; in such cases, attention-weighting may not reduce bias. Diagnostic inspection of attention distributions and subgroup clustering remains essential (Craig et al., 10 Dec 2025).

Table: Key Algorithmic Components of Attention-Lasso

| Component | Tabular Attention-Lasso (Craig et al., 10 Dec 2025) | Multi-Omics Graph (Alharbi et al., 30 Aug 2024) |
|---|---|---|
| Attention Construction | Ridge-softmax & RF-proximity | GAT edge coefficients on PPI graph |
| Feature Selection | $\ell_1$ lasso (per test point / local) | Offline $\ell_1$ lasso (per modality) |
| Interpretability | Sparse $\beta$, attention weights | Edge-level attention in PPI network |

The combination of local attention-based weighting with 1\ell_1 sparsity demonstrates robust gains in predictive performance and interpretability, with broad applicability across tabular, time-series, spatial, and multi-omics domains.
