uniLasso: Univariate-Guided Sparse Regression
- uniLasso is a sparse regression technique that leverages univariate effect sizes to stabilize and interpret multivariate models.
- It adopts a two-stage approach: first fitting univariate regressions, then applying a nonnegative Lasso to their leave-one-out predictions to enforce sign consistency.
- Empirical and theoretical studies show uniLasso achieves comparable predictive accuracy with sparser solutions relative to standard Lasso.
Univariate-Guided Sparse Regression, also termed "uniLasso", designates a class of statistical procedures for regression and feature selection that harness univariate effect sizes to stabilize and interpret multivariate sparse estimators. The methodology has been developed for linear, generalized linear, and survival models, with specific adaptations to high-dimensional biobank-scale -omics datasets. Its distinguishing principle is a two-stage approach: univariate regressions inform and constrain multivariate selection. This ensures sign-consistency between marginal and combined effects and preferentially retains variables with strong univariate predictive capacity. Across empirical and theoretical studies, uniLasso demonstrates improved stability, interpretable sign patterns, and sparser solutions relative to the standard Lasso, with comparable or better predictive accuracy in numerous regimes (Chatterjee et al., 30 Jan 2025, Richland et al., 27 Nov 2025).
1. Mathematical Formulation and Principle
Let $X \in \mathbb{R}^{n \times p}$ denote the design matrix, $y \in \mathbb{R}^n$ the response vector, and $\beta_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$ the intercept and coefficients of the linear model $y = \beta_0 + X\beta + \varepsilon$. The uniLasso comprises two sequential steps:
Stage 1 (Univariate regression screening):
For each feature $j = 1, \dots, p$, fit the univariate linear model $y_i \approx \beta_{0j} + \beta_j x_{ij}$ and compute the leave-one-out (LOO) prediction $\eta_j^{-i}$ for each observation $i$.
Stage 2 (Guided Nonnegative Lasso):
Solve

$$(\hat\theta_0, \hat\theta) = \arg\min_{\theta_0 \in \mathbb{R},\ \theta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^n \left( y_i - \theta_0 - \sum_{j=1}^p \theta_j \eta_j^{-i} \right)^2 + \lambda \sum_{j=1}^p |\theta_j| \right\} \quad \text{subject to } \theta_j \ge 0 \ \forall j,$$

where $\lambda > 0$ is the regularization parameter.
The final predictor for any $x \in \mathbb{R}^p$ is $\hat f(x) = \hat\gamma_0 + \sum_{j=1}^p \hat\gamma_j x_j$, where $\hat\gamma_j = \hat\theta_j \hat\beta_j$ and $\hat\gamma_0 = \hat\theta_0 + \sum_{j=1}^p \hat\theta_j \hat\beta_{0j}$.
Features are not standardized, which preserves the scale of the univariate effects. The nonnegativity constraint ensures $\mathrm{sign}(\hat\gamma_j) = \mathrm{sign}(\hat\beta_j)$ or $\hat\gamma_j = 0$ for every $j$.
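As a concrete illustration, here is a minimal sketch of the two-stage fit using NumPy and scikit-learn; `unilasso_fit` and all internal names are illustrative rather than the authors' reference implementation, and the closed-form LOO formula assumes each feature has nonzero sample variance.

```python
import numpy as np
from sklearn.linear_model import Lasso

def unilasso_fit(X, y, lam):
    """Two-stage uniLasso sketch: univariate LOO fits, then nonnegative Lasso."""
    n, p = X.shape
    eta = np.empty((n, p))                  # LOO univariate predictions
    beta0, beta = np.empty(p), np.empty(p)  # univariate intercepts and slopes
    for j in range(p):
        x = X[:, j]
        xbar, ybar = x.mean(), y.mean()
        sxx = np.sum((x - xbar) ** 2)
        beta[j] = np.sum((x - xbar) * (y - ybar)) / sxx
        beta0[j] = ybar - beta[j] * xbar
        yhat = beta0[j] + beta[j] * x
        h = 1.0 / n + (x - xbar) ** 2 / sxx     # hat-matrix diagonal
        eta[:, j] = (yhat - h * y) / (1.0 - h)  # closed-form LOO predictions
    # Stage 2: nonnegative Lasso on the unstandardized LOO predictions
    fit = Lasso(alpha=lam, positive=True).fit(eta, y)
    theta0, theta = fit.intercept_, fit.coef_
    gamma = theta * beta                    # final slopes, sign-aligned with beta
    gamma0 = theta0 + np.dot(theta, beta0)  # final intercept
    return gamma0, gamma
```

Predictions at new points are $\hat\gamma_0 + X_{\text{new}}\hat\gamma$; note that scikit-learn's Lasso minimizes $\frac{1}{2n}\lVert y - X\theta\rVert_2^2 + \alpha\lVert\theta\rVert_1$, so its `alpha` corresponds to $\lambda/2$ in the display above.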
2. Theoretical Guarantees
Under the model $y = \beta_0 + x^\top \beta + \varepsilon$, with $\varepsilon$ mean-zero and independent of the features, and feature support $S = \{ j : \beta_j \neq 0 \}$, the population univariate slope is $\beta_j^{\mathrm{uni}} = \mathrm{Cov}(x_j, y) / \mathrm{Var}(x_j)$. If each active variable satisfies $\mathrm{sign}(\beta_j^{\mathrm{uni}}) = \mathrm{sign}(\beta_j)$ (sign-alignment), the covariance matrix of the active features is nondegenerate, and the features have sub-Gaussian tails, then for appropriately chosen $\lambda$:
- All inactive coefficients ($j \notin S$) are estimated as exactly zero.
- Each active coefficient satisfies $\mathrm{sign}(\hat\gamma_j) = \mathrm{sign}(\beta_j)$. Both statements hold with high probability (tending to one as $n \to \infty$).
This contrasts with the standard Lasso, which requires incoherence ("irrepresentable") conditions: strong collinearity between relevant and irrelevant features in the Gram matrix can produce spurious selections. UniLasso instead relies on small marginal correlations for inactive variables and sign-alignment for active effects, a weaker and more interpretable set of sufficient conditions.
Sign-alignment is preserved under the condition that, for all $j \in S$, $\beta_j^{\mathrm{uni}}$ satisfies $\beta_j^{\mathrm{uni}} > 0$ if $\beta_j > 0$ and $\beta_j^{\mathrm{uni}} < 0$ if $\beta_j < 0$; thus, for each $j$, $\hat\gamma_j = \hat\theta_j \hat\beta_j$ preserves the sign of $\hat\beta_j$ and hence of $\beta_j$ (Chatterjee et al., 30 Jan 2025).
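To make the sign-alignment condition concrete, the population univariate slope can be decomposed under the linear model above (a standard covariance identity, using $\mathrm{Cov}(x_j, \varepsilon) = 0$):

$$\beta_j^{\mathrm{uni}} = \frac{\mathrm{Cov}(x_j, y)}{\mathrm{Var}(x_j)} = \beta_j + \sum_{k \neq j} \beta_k\, \frac{\mathrm{Cov}(x_j, x_k)}{\mathrm{Var}(x_j)}.$$

Sign-alignment for an active $j$ therefore holds whenever the aggregate cross-covariance term does not overwhelm $\beta_j$, for example when correlations among active features are weak or act in the same direction as $\beta_j$.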
3. Algorithmic Procedure and Computational Aspects
Pseudocode (linear regression):
```python
for j in range(p):
    # Fit univariate regression y ~ beta_{0j} + beta_j * x_{ij}
    # Compute leave-one-out predictions eta_j^{-i} for i in 1..n
    ...
```
Stage 1 is distributable across features and runs in $O(np)$ time, since the LOO predictions are available in closed form from the hat-matrix diagonal. Stage 2 is handled by standard coordinate-descent Lasso solvers with nonnegativity constraints and no feature standardization (e.g., glmnet or Adelie). Memory efficiency at scale is achieved via feature filtering (e.g., on minor allele frequency in genomic data) and denominator stabilization.
For biobank-scale omics datasets, adaptations include the following (a code sketch of the first two items follows the list):
- Minor-allele-frequency (MAF) filtering to exclude rare variants.
- Stabilization of denominators for features with very low variance.
- Logistic regression adaptation for binary traits using IRLS and leave-one-out formulae (Richland et al., 27 Nov 2025).
- Parallelization over features for fast computation.
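A minimal sketch of the first two adaptations, assuming a genotype matrix `G` coded 0/1/2 and illustrative thresholds `maf_min` and `eps` (not values from the papers):

```python
import numpy as np

def filter_and_stabilize(G, maf_min=0.01, eps=1e-8):
    """Drop rare variants and guard small-variance denominators."""
    # Allele frequency from 0/1/2 genotype coding, folded to the minor allele
    af = G.mean(axis=0) / 2.0
    maf = np.minimum(af, 1.0 - af)
    keep = maf >= maf_min                  # MAF filter
    G = G[:, keep]
    # Stabilized variances for the univariate slope denominators
    var = G.var(axis=0) + eps
    return G, keep, var
```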
4. Extensions and Generalizations
The two-stage uniLasso procedure generalizes to other response models:
- Generalized Linear Models (GLMs): Univariate fits via iteratively reweighted least squares (IRLS); LOO approximations via influence-function formulas. Stage 2 applies the $\ell_1$ penalty and nonnegativity constraints to the new set of univariate predictors (see the sketch after this list).
- Cox regression (survival analysis): Univariate fit by partial-likelihood score; second stage penalized Cox-Lasso (with nonnegativity if desired).
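For the logistic case, the following is a simplified Stage-1 sketch: it uses full-sample univariate logits in place of the IRLS leave-one-out approximations described above, and all names (including the large-`C` trick for an approximately unpenalized fit) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unilasso_logistic_stage1(X, y):
    """Univariate logistic fits; returns the matrix of per-feature logits."""
    n, p = X.shape
    eta = np.empty((n, p))
    for j in range(p):
        # Approximately unpenalized univariate logistic regression (large C)
        fit = LogisticRegression(C=1e6).fit(X[:, [j]], y)
        eta[:, j] = fit.intercept_[0] + fit.coef_[0, 0] * X[:, j]
    return eta
```

Stage 2 then fits a nonnegative $\ell_1$-penalized logistic model on `eta`; scikit-learn's logistic solver does not expose a nonnegativity constraint, so in practice this step is delegated to solvers such as glmnet (via coefficient lower bounds) or Adelie.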
For scenarios with available external summary statistics (e.g., GWAS effect sizes), uniLasso admits a single-stage, adaptively weighted Lasso formulation with enforced sign agreement (uniLasso ES). Given univariate slopes $\tilde\beta_j$ from the external data, fit

$$(\hat\theta_0, \hat\theta) = \arg\min_{\theta_0 \in \mathbb{R},\ \theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \Big( y_i - \theta_0 - \sum_{j=1}^p \theta_j \tilde\beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p |\theta_j| \quad \text{subject to } \theta_j \ge 0 \ \text{for all } j$$

(Richland et al., 27 Nov 2025).
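Operationally, this is a nonnegative Lasso on features rescaled column-wise by the external slopes; a minimal sketch, assuming a vector `beta_ext` of external univariate slopes aligned with the columns of `X`:

```python
import numpy as np
from sklearn.linear_model import Lasso

def unilasso_es_fit(X, y, beta_ext, lam):
    """Single-stage uniLasso ES: rescale by external slopes, nonnegative Lasso."""
    Z = X * beta_ext              # column-wise rescaling by external slopes
    fit = Lasso(alpha=lam, positive=True).fit(Z, y)
    gamma = fit.coef_ * beta_ext  # map back to the original feature scale
    return fit.intercept_, gamma
```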
5. Model Selection, Interpretability, and Empirical Performance
Hyperparameters are selected via cross-validation, typically performed in Stage 2 on the composite model by minimizing prediction error. In biobank contexts, fixed train/validation/test splits are preferred over $K$-fold CV for efficiency.
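A minimal sketch of Stage-2 selection with scikit-learn's `LassoCV`, assuming the LOO prediction matrix `eta` and response `y` from the Stage-1 sketch above:

```python
from sklearn.linear_model import LassoCV

# Cross-validate the Stage-2 penalty over the LOO prediction matrix,
# keeping the nonnegativity constraint and no feature standardization
stage2 = LassoCV(positive=True, cv=5).fit(eta, y)
print(f"selected penalty (sklearn alpha): {stage2.alpha_:.5f}")
```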
Interpretability gains are substantial: uniLasso's sign-consistency property ($\mathrm{sign}(\hat\gamma_j) = \mathrm{sign}(\hat\beta_j)$ or $\hat\gamma_j = 0$) eliminates the sign flips found in standard Lasso solutions. Variables with large marginal (univariate) effects incur less penalization and are thus preferentially selected. Empirically, uniLasso yields substantially sparser solutions than standard Lasso or PRS-CS, with matched or superior prediction. Table 1 from (Richland et al., 27 Nov 2025) quantifies sparsity (entries are numbers of nonzero coefficients):
| Phenotype | Lasso | uniLasso | uniLasso ES |
|---|---|---|---|
| Height | 55,038 | 34,256 | 44,788 |
| BMI | 33,076 | 18,833 | 24,598 |
| CHD | 2,615 | 1,009 | 1,310 |
| Asthma | 5,556 | 3,030 | 5,698 |
Predictive accuracy (R² for the quantitative traits, AUC for the binary traits) is comparable across all approaches, with uniLasso matching Lasso while selecting roughly 40% fewer features:
| Phenotype | Lasso | uniLasso | uniLasso ES | PRS-CS |
|---|---|---|---|---|
| Height | 0.713 | 0.707 | 0.716 | 0.649 |
| BMI | 0.119 | 0.103 | 0.118 | 0.070 |
| Asthma | 0.619 | 0.620 | 0.623 | 0.596 |
| CHD | 0.758 | 0.757 | 0.760 | 0.760 |
Compute times for biobank-scale analyses are manageable with parallelization; uniLasso’s additional screening step incurs modest extra computation relative to Lasso, offset by faster training in its external-score variant.
6. Practical Recommendations and Limitations
- Nonnegativity constraints and avoidance of feature standardization in the guided Lasso stage are essential for maintaining the link between marginal and joint effects.
- MAF filtering and denominator stabilization are critical for numerical stability when $p$ is very large.
- When signs of marginal effects are volatile among correlated variables, uniLasso may underperform Lasso; it is advised to compare out-of-sample errors for both methods.
- The method is especially suitable in settings where interpretability and sign stability are prioritized, such as genomics and epidemiological risk modeling.
- In applications with available external summary statistics, the uniLasso ES variant is recommended due to training efficiency and preserved sparsity.
A plausible implication is that uniLasso, by enforcing sign-concordance and capitalizing on marginal effect sizes, offers a principled bridge between marginal screening and joint penalized regression, resolving some instabilities of classical Lasso in highly correlated and high-dimensional data (Chatterjee et al., 30 Jan 2025, Richland et al., 27 Nov 2025).