
Sparse Linear Regression

Updated 16 November 2025
  • Sparse linear regression is the estimation of a sparse parameter vector in a linear model, crucial for high-dimensional data analysis.
  • Graph-based Square-Root Estimation (GSRE) employs a square-root loss and overlapping group penalties to interpolate between Lasso, group Lasso, and ridge-type methods.
  • Efficient ADMM algorithms and strong theoretical guarantees, including finite-sample error bounds and asymptotic normality, support its application in statistics, machine learning, and signal processing.

Sparse linear regression is the statistical problem of estimating a parameter vector $\beta^*$ in the linear model $y = X\beta^* + \varepsilon$, under the assumption that $\beta^*$ is sparse (i.e., most of its entries are zero). This formulation is central in high-dimensional statistics, machine learning, signal processing, and computational biology, where the number of variables $p$ may greatly exceed the number of observations $n$. Sparse linear regression is closely connected to concepts in optimization, computation, and graphical modeling, and its theoretical and algorithmic aspects remain areas of active research.

1. General Framework and Graph-based Square-Root Estimation

In the contemporary mathematical formulation, given $X\in\mathbb{R}^{n\times p}$ and $y\in\mathbb{R}^n$, the goal is to recover $\beta^*\in\mathbb{R}^p$ satisfying $y = X\beta^* + \varepsilon$, under sparsity constraints. The "Graph-based Square-Root Estimation" (GSRE) model provides a flexible framework for incorporating prior structural information among predictors and addressing the high-dimensional regime.

The GSRE estimator solves the optimization problem:

$$\min_{\beta\in\mathbb{R}^p} \;\; \frac{1}{\sqrt{n}}\,\|y - X\beta\|_2 + \frac{\lambda}{n}\,\|\beta\|_{G,\tau}$$

where:

  • The square-root loss $\frac{1}{\sqrt{n}}\|y - X\beta\|_2$ renders the regularization parameter $\lambda$ pivotal, i.e., independent of the unknown noise standard deviation.
  • The graph-based norm $\|\beta\|_{G,\tau}$ depends on an undirected "predictor graph" $G=(V,E)$ over the variables, with local neighborhoods $\mathcal{N}_i$ and positive weights $\tau_i$. It is defined as

$$\|\beta\|_{G,\tau} := \min_{\sum_{i=1}^p V^{(i)}=\beta,\;\; \mathrm{supp}(V^{(i)})\subseteq \mathcal{N}_i} \;\sum_{i=1}^p \tau_i\,\|V^{(i)}\|_2.$$

This norm encodes overlapping group sparsity regularization adapted to the graphical structure of predictors.
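The latent-decomposition form of the norm translates directly into a convex program. The following is a minimal sketch using cvxpy, assuming a `neighborhoods` list where `neighborhoods[i]` gives the indices of $\mathcal{N}_i$ (containing $i$ itself); it illustrates the GSRE objective under those assumptions and is not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def gsre_cvxpy(X, y, neighborhoods, tau, lam):
    """Sketch of the GSRE program via the latent overlapping-group decomposition.

    neighborhoods[i] -- list of indices in N_i (assumed to contain i)
    tau              -- positive node weights tau_i
    lam              -- regularization parameter lambda
    """
    n, p = X.shape
    # One latent vector V^(i) per node, supported only on its neighborhood N_i.
    V = [cp.Variable(len(Ni)) for Ni in neighborhoods]

    # beta is the sum of the latent vectors scattered back into R^p.
    terms = []
    for Ni, Vi in zip(neighborhoods, V):
        S = np.zeros((p, len(Ni)))
        S[Ni, np.arange(len(Ni))] = 1.0      # scatter matrix: R^{|N_i|} -> R^p
        terms.append(S @ Vi)
    beta = sum(terms)

    # Square-root loss plus the graph-based norm (weighted sum of group l2 norms).
    loss = cp.norm(y - X @ beta, 2) / np.sqrt(n)
    penalty = sum(t_i * cp.norm(Vi, 2) for t_i, Vi in zip(tau, V))
    cp.Problem(cp.Minimize(loss + (lam / n) * penalty)).solve()
    return beta.value
```

For instance, passing `neighborhoods = [[i] for i in range(p)]` reduces the program to a square-root Lasso, matching the special cases discussed next.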

Depending on $G$, GSRE recovers a range of classic estimators (the corresponding neighborhood constructions are sketched after this list):

  • For $G$ with no edges, $\|\beta\|_{G,\tau} = \sum_i \tau_i|\beta_i|$; GSRE becomes the classic square-root Lasso.
  • For $G$ a disjoint union of complete graphs, GSRE reduces to the group square-root Lasso with group-specific $\ell_2$-penalties.
  • For complete $G$, the estimator induces an $\ell_2$ (ridge-type) penalty.
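In terms of the hypothetical `neighborhoods` argument used in the sketch above, the three cases correspond to three simple graph constructions (illustrative only; the exact weight choices are left unspecified):

```python
p = 100
groups = [list(range(0, 30)), list(range(30, 60)), list(range(60, 100))]  # illustrative partition

# No edges: every neighborhood is a singleton  ->  classic square-root Lasso.
nbhd_lasso = [[i] for i in range(p)]

# Disjoint complete subgraphs: each node's neighborhood is its whole group
# ->  group square-root Lasso (up to how the weights tau_i are chosen).
nbhd_group = [None] * p
for g in groups:
    for j in g:
        nbhd_group[j] = list(g)

# Complete graph: every neighborhood is {0, ..., p-1}  ->  ridge-type penalty.
nbhd_ridge = [list(range(p)) for _ in range(p)]
```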

2. Theoretical Guarantees: Error Bounds, Asymptotics, and Model Selection

GSRE admits comprehensive theoretical properties concerning both estimation and variable selection.

Finite-Sample Error Bounds:

Suppose $\kappa>0$ is a restricted eigenvalue or compatibility constant for $X$. Under compatibility conditions and mild overlap assumptions on the graph, GSRE enjoys:

$$\frac{\|X(\hat\beta-\beta^*)\|_2}{\sqrt{n}} \;<\; C\,\frac{\sigma\,\lambda\,\sqrt{s^*}}{n\,\kappa}, \qquad \|\hat\beta-\beta^*\|_2 \;<\; C''\,\frac{\sigma\,\lambda\,s^*}{n\,\kappa^2\,\tau_{\min}}$$

for true sparsity $s^*=|\mathrm{supp}(\beta^*)|$, noise level $\sigma$, and constants $C, C''$ related to the graph overlap.
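To see the familiar near-oracle rate emerge from the prediction bound, one can substitute a pivotal tuning choice; the scaling $\lambda \asymp \sqrt{n\log p}$ below is the standard choice for square-root-loss estimators and is used here only as an illustrative assumption, not a prescription from the source:

$$\frac{\|X(\hat\beta-\beta^*)\|_2}{\sqrt{n}} \;\lesssim\; \frac{\sigma\,\sqrt{n\log p}\,\sqrt{s^*}}{n\,\kappa} \;=\; \frac{\sigma}{\kappa}\,\sqrt{\frac{s^*\log p}{n}}.$$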

Asymptotic Normality:

For fixed $p$ and suitable $\lambda$, GSRE is asymptotically efficient on the active set:

$$\sqrt{n}\,\big(\hat\beta_{S^*}-\beta^*_{S^*}\big) \Longrightarrow \mathcal{N}\!\left(0,\;\sigma^2\big(X_{S^*}^\top X_{S^*}/n\big)^{-1}\right)$$

with $\hat\beta_{(S^*)^c}\to 0$ in probability.
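In practical terms, this limit justifies standard Wald-type inference on the active set; the interval below is a generic consequence of asymptotic normality (with $\sigma$ replaced by any consistent estimate $\hat\sigma$), not a procedure prescribed by the source:

$$\hat\beta_j \;\pm\; z_{1-\alpha/2}\,\frac{\hat\sigma}{\sqrt{n}}\,\sqrt{\Big[\big(X_{S^*}^\top X_{S^*}/n\big)^{-1}\Big]_{jj}}, \qquad j\in S^*.$$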

Model-Selection Consistency:

If $p\to\infty$, the method selects the correct support with probability tending to one under an irrepresentable-type condition, $s^*=o(n^\delta)$, and sufficient signal strength ($\min|\beta_i^*|$ large relative to the noise).

3. Algorithmic Strategies: Efficient Computation via ADMM

The GSRE estimator is computed by introducing auxiliary variables and solving an augmented Lagrangian via the Alternating Direction Method of Multipliers (ADMM). The algorithm iterates through:

  • Quadratic minimization for the $\beta$-variable (efficient via Sherman–Morrison–Woodbury when $p\gg n$).
  • Proximal updates for $u$ given the square-root loss and for $v$ given the graph-based norm.
  • Customized proximal maps for group-wise projections in computing $\mathrm{prox}_{\|\cdot\|_{G,\tau}}$.
  • Dual-variable updates with step size $\tau\in\big(0,(1+\sqrt{5})/2\big)$, where standard 2-block ADMM convergence theory applies.

This yields a scalable procedure for high-dimensional structured-sparsity problems.
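The non-trivial pieces of these updates have closed forms: the proximal operator of a weighted Euclidean norm is block soft-thresholding, which serves both the square-root-loss residual update and each latent group block of the graph norm, while the $\beta$-step reduces to a ridge-type linear solve that Woodbury makes cheap when $p\gg n$. The sketch below illustrates those maps under these assumptions; the variable names and the exact splitting are not the authors' code.

```python
import numpy as np

def prox_l2_norm(z, t):
    """Prox of t * ||.||_2 at z: block soft-thresholding.

    Used for the residual block (square-root loss) and, with t = step * tau_i,
    for each latent group block V^(i) of the graph-based norm.
    """
    nz = np.linalg.norm(z)
    if nz <= t:
        return np.zeros_like(z)
    return (1.0 - t / nz) * z

def prox_graph_norm(V_blocks, tau, step):
    """Group-wise prox for the graph norm: shrink each latent block separately."""
    return [prox_l2_norm(Vi, step * t_i) for Vi, t_i in zip(V_blocks, tau)]

def ridge_solve(X, rhs, rho):
    """Solve (rho*I + X^T X) beta = rhs via Sherman-Morrison-Woodbury when p >> n."""
    n, p = X.shape
    # (rho I_p + X^T X)^{-1} = I/rho - X^T (rho I_n + X X^T)^{-1} X / rho
    K = rho * np.eye(n) + X @ X.T          # n x n system, cheap when n << p
    w = np.linalg.solve(K, X @ rhs)
    return rhs / rho - (X.T @ w) / rho
```

In a full ADMM loop these maps alternate with the dual-variable updates described above.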

4. Special Cases and Relation to Classical Methods

The GSRE framework encompasses, as special cases:

| Graph Structure | Penalty Term | Resulting Estimator |
| --- | --- | --- |
| No edges (diagonal) | $\sum_i \tau_i \lvert\beta_i\rvert$ | Classic square-root Lasso |
| Disjoint complete subgraphs | $\sum_j \tilde\tau_j \lVert\beta^{(j)}\rVert_2$ | Group square-root Lasso |
| Complete graph | $\tilde\lambda\,\lVert\beta\rVert_2^2$ | Square-root ridge-type penalty |

Thus, GSRE interpolates naturally between element-wise, group, and ridge-type regularization within a unified square-root-loss framework.

5. Empirical Performance and Robustness

Extensive synthetic and real-data experiments demonstrate the empirical advantages of GSRE over traditional methods:

  • In the $p > n$ regime ($n=20,40,60$; $p=100$), GSRE achieves:
    • the lowest average $\ell_2$-error (e.g., $\sim 3.4$ versus $8$–$12$ for the Lasso, adaptive Lasso (Alasso), and square-root Lasso (SRL)),
    • the best Relative Prediction Error (RPE),
    • a nearly zero false-negative rate, with a false-positive rate below 3%.
  • Under non-Gaussian and heavy-tailed noise (Student $t$, Laplace, Uniform), GSRE outperforms the alternatives owing to the robustness induced by the square-root loss.
  • On high-dimensional real datasets (e.g., bodyfat2, miRNA–cancer survival), 10-fold CV shows that GSRE's median test MSE is about half or less of that of the Lasso, Elastic-Net, square-root Lasso, or least-squares graph-penalty estimators. For instance, on the bodyfat2 dataset, GSRE's median MSE is $0.38$ versus $>1.0$ for competing approaches (an illustrative CV protocol is sketched below).
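A comparison of that kind can be reproduced schematically with a standard 10-fold CV loop. The sketch below uses scikit-learn's Lasso as a baseline and a placeholder `fit_predict` callable standing in for any GSRE solver (such as the cvxpy formulation above); it illustrates the evaluation protocol only, not the paper's exact experimental setup or tuning.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

def cv_median_mse(X, y, fit_predict, n_splits=10, seed=0):
    """Median held-out MSE over a K-fold split for any fit/predict callable."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mses = []
    for train, test in kf.split(X):
        y_hat = fit_predict(X[train], y[train], X[test])
        mses.append(np.mean((y[test] - y_hat) ** 2))
    return np.median(mses)

# Baseline: Lasso with a fixed, illustrative alpha.
lasso_fp = lambda Xtr, ytr, Xte: Lasso(alpha=0.1).fit(Xtr, ytr).predict(Xte)

# Hypothetical GSRE wrapper, e.g. built on the cvxpy sketch above:
# gsre_fp = lambda Xtr, ytr, Xte: Xte @ gsre_cvxpy(Xtr, ytr, neighborhoods, tau, lam)
```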

6. Significance and Practical Impact

The GSRE model addresses key challenges in sparse linear regression:

  • Unknown Noise Level: Pivotal regularization via square-root loss eliminates the need to know or estimate the noise variance, simplifying parameter selection and enhancing robustness.
  • Graphical Structure Utilization: Node-wise overlapping group penalties encode arbitrary predictor relationships—collinearity, clusters, group structure—leading to improved estimation and feature selection.
  • Scalability and Adaptability: Efficient ADMM-based solvers, together with the flexibility in structural prior specification, facilitate application to high-dimensional and complex-structure regression problems.

These properties, together with finite-sample near-oracle error rates, asymptotic normality, and strong empirical results, establish GSRE as a general and effective methodology for modern sparse linear regression tasks in high-dimensional statistics and machine learning (Li et al., 2024).
