
Sparse Linear Regression

Updated 16 November 2025
  • Sparse linear regression is the estimation of a sparse parameter vector in a linear model, crucial for high-dimensional data analysis.
  • Graph-based Square-Root Estimation (GSRE) employs a square-root loss and overlapping group penalties to interpolate between Lasso, group Lasso, and ridge-type methods.
  • Efficient ADMM algorithms and strong theoretical guarantees, including finite-sample error bounds and asymptotic normality, support its application in statistics, machine learning, and signal processing.

Sparse linear regression is the statistical problem of estimating a parameter vector $\beta^*$ in the linear model $y = X\beta^* + \varepsilon$, under the assumption that $\beta^*$ is sparse (i.e., most of its entries are zero). This formulation is central in high-dimensional statistics, machine learning, signal processing, and computational biology, where the number of variables $p$ may greatly exceed the number of observations $n$. Sparse linear regression is closely connected to concepts in optimization, computation, and graphical modeling, and its theoretical and algorithmic aspects remain areas of active research.

1. General Framework and Graph-based Square-Root Estimation

In the contemporary mathematical formulation, given $X\in\mathbb{R}^{n\times p}$ and $y\in\mathbb{R}^n$, the goal is to recover $\beta^*\in\mathbb{R}^p$ satisfying $y = X\beta^* + \varepsilon$, under sparsity constraints. The "Graph-based Square-Root Estimation" (GSRE) model provides a flexible framework for incorporating prior structural information among predictors and addressing the high-dimensional regime.

The GSRE estimator solves the optimization problem:

$$\min_{\beta\in\mathbb{R}^p} \;\; \frac{1}{\sqrt{n}}\,\|y - X\beta\|_2 + \frac{\lambda}{n}\,\|\beta\|_{G,\tau}$$

where:

  • The square-root loss $\frac{1}{\sqrt{n}}\|y - X\beta\|_2$ renders the regularization parameter $\lambda$ pivotal, i.e., independent of the unknown noise standard deviation.
  • The graph-based norm $\|\beta\|_{G,\tau}$ depends on an undirected "predictor graph" $G=(V,E)$ over the variables, with local neighborhoods $\mathcal{N}_i$ and positive weights $\tau_i$. It is defined as

$$\|\beta\|_{G,\tau} := \min_{\sum_{i=1}^p V^{(i)}=\beta,\;\; \mathrm{supp}(V^{(i)})\subseteq \mathcal{N}_i} \;\sum_{i=1}^p \tau_i\,\|V^{(i)}\|_2.$$

This norm encodes overlapping group sparsity regularization adapted to the graphical structure of predictors.
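The latent-decomposition form of the norm translates directly into a convex program. The following is a minimal sketch using cvxpy, assuming a `neighborhoods` list where `neighborhoods[i]` gives the indices of $\mathcal{N}_i$ (containing $i$ itself); it illustrates the GSRE objective under those assumptions and is not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def gsre_cvxpy(X, y, neighborhoods, tau, lam):
    """Sketch of the GSRE program via the latent overlapping-group decomposition.

    neighborhoods[i] -- list of indices in N_i (assumed to contain i)
    tau              -- positive node weights tau_i
    lam              -- regularization parameter lambda
    """
    n, p = X.shape
    # One latent vector V^(i) per node, supported only on its neighborhood N_i.
    V = [cp.Variable(len(Ni)) for Ni in neighborhoods]

    # beta is the sum of the latent vectors scattered back into R^p.
    terms = []
    for Ni, Vi in zip(neighborhoods, V):
        S = np.zeros((p, len(Ni)))
        S[Ni, np.arange(len(Ni))] = 1.0      # scatter matrix: R^{|N_i|} -> R^p
        terms.append(S @ Vi)
    beta = sum(terms)

    # Square-root loss plus the graph-based norm (weighted sum of group l2 norms).
    loss = cp.norm(y - X @ beta, 2) / np.sqrt(n)
    penalty = sum(t_i * cp.norm(Vi, 2) for t_i, Vi in zip(tau, V))
    cp.Problem(cp.Minimize(loss + (lam / n) * penalty)).solve()
    return beta.value
```

For instance, passing `neighborhoods = [[i] for i in range(p)]` reduces the program to a square-root Lasso, matching the special cases discussed next.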

Depending on $G$, GSRE recovers a range of classic estimators (the corresponding neighborhood constructions are sketched after this list):

  • For $G$ with no edges, $\|\beta\|_{G,\tau} = \sum_i \tau_i|\beta_i|$; GSRE becomes the classic square-root Lasso.
  • For $G$ a disjoint union of complete graphs, GSRE reduces to the group square-root Lasso with group-specific $\ell_2$-penalties.
  • For complete $G$, the estimator induces an $\ell_2$ (ridge-type) penalty.
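In terms of the hypothetical `neighborhoods` argument used in the sketch above, the three cases correspond to three simple graph constructions (illustrative only; the exact weight choices are left unspecified):

```python
p = 100
groups = [list(range(0, 30)), list(range(30, 60)), list(range(60, 100))]  # illustrative partition

# No edges: every neighborhood is a singleton  ->  classic square-root Lasso.
nbhd_lasso = [[i] for i in range(p)]

# Disjoint complete subgraphs: each node's neighborhood is its whole group
# ->  group square-root Lasso (up to how the weights tau_i are chosen).
nbhd_group = [None] * p
for g in groups:
    for j in g:
        nbhd_group[j] = list(g)

# Complete graph: every neighborhood is {0, ..., p-1}  ->  ridge-type penalty.
nbhd_ridge = [list(range(p)) for _ in range(p)]
```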

2. Theoretical Guarantees: Error Bounds, Asymptotics, and Model Selection

GSRE admits comprehensive theoretical properties concerning both estimation and variable selection.

Finite-Sample Error Bounds:

Suppose $\kappa>0$ is a restricted eigenvalue or compatibility constant for $X$. Under compatibility conditions and mild overlap assumptions on the graph, GSRE enjoys:

$$\frac{\|X(\hat\beta-\beta^*)\|_2}{\sqrt{n}} \;<\; C\,\frac{\sigma\,\lambda\,\sqrt{s^*}}{n\,\kappa}, \qquad \|\hat\beta-\beta^*\|_2 \;<\; C''\,\frac{\sigma\,\lambda\,s^*}{n\,\kappa^2\,\tau_{\min}}$$

for true sparsity $s^*=|\mathrm{supp}(\beta^*)|$, noise level $\sigma$, and constants $C, C''$ related to the graph overlap.
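To see the familiar near-oracle rate emerge from the prediction bound, one can substitute a pivotal tuning choice; the scaling $\lambda \asymp \sqrt{n\log p}$ below is the standard choice for square-root-loss estimators and is used here only as an illustrative assumption, not a prescription from the source:

$$\frac{\|X(\hat\beta-\beta^*)\|_2}{\sqrt{n}} \;\lesssim\; \frac{\sigma\,\sqrt{n\log p}\,\sqrt{s^*}}{n\,\kappa} \;=\; \frac{\sigma}{\kappa}\,\sqrt{\frac{s^*\log p}{n}}.$$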

Asymptotic Normality:

For fixed $p$ and suitable $\lambda$, GSRE is asymptotically efficient on the active set:

$$\sqrt{n}\,\big(\hat\beta_{S^*}-\beta^*_{S^*}\big) \Longrightarrow \mathcal{N}\!\left(0,\;\sigma^2\big(X_{S^*}^\top X_{S^*}/n\big)^{-1}\right)$$

with $\hat\beta_{(S^*)^c}\to 0$ in probability.
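In practical terms, this limit justifies standard Wald-type inference on the active set; the interval below is a generic consequence of asymptotic normality (with $\sigma$ replaced by any consistent estimate $\hat\sigma$), not a procedure prescribed by the source:

$$\hat\beta_j \;\pm\; z_{1-\alpha/2}\,\frac{\hat\sigma}{\sqrt{n}}\,\sqrt{\Big[\big(X_{S^*}^\top X_{S^*}/n\big)^{-1}\Big]_{jj}}, \qquad j\in S^*.$$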

Model-Selection Consistency:

If $p\to\infty$, the method selects the correct support with probability tending to one under an irrepresentable-type condition, $s^*=o(n^\delta)$, and sufficient signal strength ($\min|\beta_i^*|$ large relative to the noise).

3. Algorithmic Strategies: Efficient Computation via ADMM

The GSRE estimator is computed by introducing auxiliary variables and solving an augmented Lagrangian via the Alternating Direction Method of Multipliers (ADMM). The algorithm iterates through:

  • Quadratic minimization for the $\beta$-variable (efficient via Sherman–Morrison–Woodbury when $p\gg n$).
  • Proximal updates for $u$ given the square-root loss and for $v$ given the graph-based norm.
  • Customized proximal maps for group-wise projections in computing $\mathrm{prox}_{\|\cdot\|_{G,\tau}}$.
  • Dual-variable updates with step size $\tau\in\big(0,(1+\sqrt{5})/2\big)$, where standard 2-block ADMM convergence theory applies.

This yields a scalable procedure for high-dimensional structured-sparsity problems.
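The non-trivial pieces of these updates have closed forms: the proximal operator of a weighted Euclidean norm is block soft-thresholding, which serves both the square-root-loss residual update and each latent group block of the graph norm, while the $\beta$-step reduces to a ridge-type linear solve that Woodbury makes cheap when $p\gg n$. The sketch below illustrates those maps under these assumptions; the variable names and the exact splitting are not the authors' code.

```python
import numpy as np

def prox_l2_norm(z, t):
    """Prox of t * ||.||_2 at z: block soft-thresholding.

    Used for the residual block (square-root loss) and, with t = step * tau_i,
    for each latent group block V^(i) of the graph-based norm.
    """
    nz = np.linalg.norm(z)
    if nz <= t:
        return np.zeros_like(z)
    return (1.0 - t / nz) * z

def prox_graph_norm(V_blocks, tau, step):
    """Group-wise prox for the graph norm: shrink each latent block separately."""
    return [prox_l2_norm(Vi, step * t_i) for Vi, t_i in zip(V_blocks, tau)]

def ridge_solve(X, rhs, rho):
    """Solve (rho*I + X^T X) beta = rhs via Sherman-Morrison-Woodbury when p >> n."""
    n, p = X.shape
    # (rho I_p + X^T X)^{-1} = I/rho - X^T (rho I_n + X X^T)^{-1} X / rho
    K = rho * np.eye(n) + X @ X.T          # n x n system, cheap when n << p
    w = np.linalg.solve(K, X @ rhs)
    return rhs / rho - (X.T @ w) / rho
```

In a full ADMM loop these maps alternate with the dual-variable updates described above.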

4. Special Cases and Relation to Classical Methods

The GSRE framework encompasses, as special cases:

| Graph Structure | Penalty Term | Resulting Estimator |
| --- | --- | --- |
| No edges (diagonal) | $\sum_i \tau_i \lvert\beta_i\rvert$ | Classic square-root Lasso |
| Disjoint complete subgraphs | $\sum_j \tilde\tau_j \lVert\beta^{(j)}\rVert_2$ | Group square-root Lasso |
| Complete graph | $\tilde\lambda\,\lVert\beta\rVert_2^2$ | Square-root ridge-type penalty |

Thus, GSRE interpolates naturally between element-wise, group, and ridge-type regularization within a unified square-root-loss framework.

5. Empirical Performance and Robustness

Extensive synthetic and real-data experiments demonstrate the empirical advantages of GSRE over traditional methods:

  • In the $p > n$ regime ($n=20,40,60$; $p=100$), GSRE achieves:
    • the lowest average $\ell_2$-error (e.g., $\sim 3.4$ versus $8$–$12$ for the Lasso, adaptive Lasso (Alasso), and square-root Lasso (SRL)),
    • the best Relative Prediction Error (RPE),
    • a nearly zero false-negative rate, with a false-positive rate below 3%.
  • Under non-Gaussian and heavy-tailed noise (Student $t$, Laplace, Uniform), GSRE outperforms the alternatives owing to the robustness induced by the square-root loss.
  • On high-dimensional real datasets (e.g., bodyfat2, miRNA–cancer survival), 10-fold CV shows that GSRE's median test MSE is about half or less of that of the Lasso, Elastic-Net, square-root Lasso, or least-squares graph-penalty estimators. For instance, on the bodyfat2 dataset, GSRE's median MSE is $0.38$ versus $>1.0$ for competing approaches (an illustrative CV protocol is sketched below).
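A comparison of that kind can be reproduced schematically with a standard 10-fold CV loop. The sketch below uses scikit-learn's Lasso as a baseline and a placeholder `fit_predict` callable standing in for any GSRE solver (such as the cvxpy formulation above); it illustrates the evaluation protocol only, not the paper's exact experimental setup or tuning.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

def cv_median_mse(X, y, fit_predict, n_splits=10, seed=0):
    """Median held-out MSE over a K-fold split for any fit/predict callable."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mses = []
    for train, test in kf.split(X):
        y_hat = fit_predict(X[train], y[train], X[test])
        mses.append(np.mean((y[test] - y_hat) ** 2))
    return np.median(mses)

# Baseline: Lasso with a fixed, illustrative alpha.
lasso_fp = lambda Xtr, ytr, Xte: Lasso(alpha=0.1).fit(Xtr, ytr).predict(Xte)

# Hypothetical GSRE wrapper, e.g. built on the cvxpy sketch above:
# gsre_fp = lambda Xtr, ytr, Xte: Xte @ gsre_cvxpy(Xtr, ytr, neighborhoods, tau, lam)
```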

6. Significance and Practical Impact

The GSRE model addresses key challenges in sparse linear regression:

  • Unknown Noise Level: Pivotal regularization via square-root loss eliminates the need to know or estimate the noise variance, simplifying parameter selection and enhancing robustness.
  • Graphical Structure Utilization: Node-wise overlapping group penalties encode arbitrary predictor relationships—collinearity, clusters, group structure—leading to improved estimation and feature selection.
  • Scalability and Adaptability: Efficient ADMM-based solvers, together with the flexibility in structural prior specification, facilitate application to high-dimensional and complex-structure regression problems.

These properties, together with finite-sample near-oracle error rates, asymptotic normality, and strong empirical results, establish GSRE as a general and effective methodology for modern sparse linear regression tasks in high-dimensional statistics and machine learning (Li et al., 2024).
