
Sparse Group Lasso: Concepts & Applications

Updated 12 March 2026
  • Sparse Group Lasso is a convex regularization method that enforces both group-level and element-wise sparsity, unifying the strengths of lasso and group lasso.
  • It employs block coordinate descent and adaptive tuning parameters to optimize model performance in high-dimensional, structured data.
  • SGL is widely applied in genetics, neuroimaging, and deep learning, providing enhanced model recovery, interpretability, and robust selection performance.

The Sparse Group Lasso (SGL) is a convex regularization framework that enables simultaneous group-level and within-group (element-wise) sparsity in high-dimensional models. By unifying the strengths of the lasso and the group lasso, SGL has become foundational in structured regression, compressed sensing, large-scale machine learning, genetics, and interpretable deep learning, offering enhanced recovery, statistical power, and interpretability when predictors possess meaningful group structures.

1. Mathematical Definition and Regularization Structure

The canonical SGL estimator for a linear model takes responses $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$ whose columns are partitioned into $G$ disjoint groups, with group $g$ containing $p_g$ features. With regression vector $\beta = (\beta^{(1)}, \ldots, \beta^{(G)})$, the SGL solves

$$\min_{\beta \in \mathbb{R}^p} \ \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda_1 \sum_{g=1}^G w_g \|\beta^{(g)}\|_2 + \lambda_2 \|\beta\|_1,$$

where $w_g > 0$ are group weights and $\lambda_1, \lambda_2 \geq 0$ are regularization strengths controlling group-level and element-wise sparsity, respectively (Friedman et al., 2010, Liang et al., 2022).

A common reparametrization introduces a joint penalty level $\lambda > 0$ and a mixing parameter $\alpha \in [0,1]$:

$$\min_{\beta} \ \frac{1}{2n} \|y - X\beta\|_2^2 + (1-\alpha)\lambda \sum_{g=1}^G w_g \|\beta^{(g)}\|_2 + \alpha\lambda \|\beta\|_1.$$

When $\alpha = 0$ this reduces to the group lasso (pure group sparsity); when $\alpha = 1$ it recovers the lasso (pure $\ell_1$ sparsity) (Chen et al., 2021, Liang et al., 2022). The SGL penalty is thus a convex combination of group $\ell_2$ norms and an overall $\ell_1$ norm, promoting block sparsity and element sparsity simultaneously.
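This convex combination is straightforward to evaluate directly. The helper below is a hypothetical illustration (not from any cited package), using the common default weights $w_g = \sqrt{p_g}$; it shows that $\alpha = 1$ collapses the penalty to the plain lasso term.

```python
import numpy as np

def sgl_penalty(beta, groups, lam, alpha):
    """Evaluate (1-alpha)*lam * sum_g w_g ||beta_g||_2 + alpha*lam * ||beta||_1,
    with default weights w_g = sqrt(p_g)."""
    beta = np.asarray(beta, dtype=float)
    group_term = 0.0
    for g in np.unique(groups):
        block = beta[groups == g]
        group_term += np.sqrt(block.size) * np.linalg.norm(block)
    return (1 - alpha) * lam * group_term + alpha * lam * np.abs(beta).sum()

beta = np.array([1.0, -2.0, 0.0, 3.0])
groups = np.array([0, 0, 1, 1])
print(sgl_penalty(beta, groups, lam=0.5, alpha=1.0))  # lasso endpoint: 0.5 * 6 = 3.0
print(sgl_penalty(beta, groups, lam=0.5, alpha=0.0))  # group-lasso endpoint
```

Sweeping $\alpha$ between these endpoints traces out the family of penalties discussed above.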

2. Sparsity-Inducing Mechanism and Theoretical Properties

SGL induces a hierarchical sparsity pattern: entire blocks $\beta^{(g)}$ can be zeroed out (group sparsity), while individual coordinates within active groups can also be eliminated (within-group sparsity) (Friedman et al., 2010, Ndiaye et al., 2016). Key implications include:

  • For moderate $\alpha$, SGL interpolates between the group lasso and the lasso, providing a flexible trade-off in structured models (Wang et al., 2014, Chen et al., 2021).
  • In both noisy and noiseless regimes, under sub-Gaussian designs and an "irrepresentable condition," SGL achieves optimal sample complexity for double-sparse $(s, s_g)$ signals:

$$n \gtrsim s_g \log\frac{d}{s_g} + s \log(e s_g b)$$

is both necessary and sufficient for exact recovery, matching information-theoretic lower bounds (Cai et al., 2019).

  • In the high-dimensional limit, SGL achieves the minimax-optimal estimation rate

$$\|\hat\beta - \beta^*\|_2^2 \lesssim \sigma^2\, \frac{s_g \log(d/s_g) + s \log(e s_g b)}{n}$$

within the simultaneous sparsity class $\{\beta : \|\beta\|_0 \leq s,\ \|\beta\|_{0,2} \leq s_g\}$ (Cai et al., 2019).

Adaptive SGL variants with data-driven weights on both penalties ($\alpha_j, \omega_g$) provably restore the oracle property (variable selection consistency and asymptotic normality) under both fixed- and increasing-dimension regimes (Poignard, 2016).
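The hierarchical mechanism is visible in the proximal operator of the SGL penalty, which composes an elementwise soft-threshold (within-group sparsity) with a groupwise shrink-or-kill step (group sparsity). The sketch below is illustrative Python, assuming the default weights $w_g = \sqrt{p_g}$:

```python
import numpy as np

def soft(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sgl_prox(z, groups, lam, alpha, step=1.0):
    """Prox of the SGL penalty: soft-threshold each coordinate, then
    shrink (or zero out) each group's surviving block."""
    out = np.zeros_like(z, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        s = soft(z[idx], step * alpha * lam)           # element-wise sparsity
        nrm = np.linalg.norm(s)
        if nrm > 0.0:
            w = np.sqrt(idx.sum())                     # w_g = sqrt(p_g)
            s *= max(0.0, 1.0 - step * (1.0 - alpha) * lam * w / nrm)
        out[idx] = s
    return out

z = np.array([0.05, -0.02, 2.0, 0.03])
groups = np.array([0, 0, 1, 1])
# Group 0 is eliminated entirely; in group 1 only the strong coordinate survives.
print(sgl_prox(z, groups, lam=0.5, alpha=0.5))
```

At $\alpha = 1$ the group step is inert and the operator reduces to plain soft-thresholding; at $\alpha = 0$ only whole groups are zeroed.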

3. Optimization Algorithms and Computational Aspects

Block coordinate descent is the dominant technique for SGL, exploiting the groupwise structure to optimize each block $\beta^{(g)}$ while holding the others fixed (Friedman et al., 2010, Vincent et al., 2012, Liang et al., 2022). Practical implementations often combine:

  • Exact or majorization-minimization coordinate updates leveraging soft-thresholding or groupwise line searches (Foygel et al., 2010, Friedman et al., 2010).
  • Screening and active-set rules (e.g., SSR, DFR, TLFre, and GAP Safe) that discard inactive features or groups before each update, provably so in the safe variants, yielding order-of-magnitude runtime reductions and making $p \gg n$ problems tractable (Wang et al., 2014, Feser et al., 2024, Ndiaye et al., 2016).
  • Warm starts and pathwise solutions over $\lambda$ for fast model selection, particularly effective when working on a grid over $\alpha$ and $\lambda$ (Liang et al., 2022).

R packages such as "sparsegl", "SGL", and "msgl" provide highly optimized SGL solvers for generalized linear models and multinomial problems, scaling up to hundreds of thousands of features with sparse design matrices (Liang et al., 2022, Vincent et al., 2012).
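For intuition about pathwise solvers, the toy sketch below fits SGL by proximal gradient descent with warm starts down a decreasing $\lambda$ grid. This is an illustration only: the packages above use block coordinate descent with screening, which is far faster, but the proximal step here encodes the same penalty.

```python
import numpy as np

def sgl_prox(z, groups, lam, alpha, step):
    """Prox of the SGL penalty (weights w_g = sqrt(p_g)): elementwise
    soft-threshold, then groupwise shrink-or-kill."""
    out = np.zeros_like(z)
    for g in np.unique(groups):
        idx = groups == g
        s = np.sign(z[idx]) * np.maximum(np.abs(z[idx]) - step * alpha * lam, 0.0)
        nrm = np.linalg.norm(s)
        if nrm > 0.0:
            s *= max(0.0, 1.0 - step * (1.0 - alpha) * lam * np.sqrt(idx.sum()) / nrm)
        out[idx] = s
    return out

def sgl_fit(X, y, groups, lam, alpha, n_iter=500, beta0=None):
    """Proximal-gradient (ISTA) solver for the SGL objective
    (1/2n)||y - X beta||^2 + penalty; toy reference implementation."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2        # 1/L with L = ||X||_2^2 / n
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = sgl_prox(beta - step * grad, groups, lam, alpha, step)
    return beta

# Synthetic double-sparse problem: signal confined to part of one group.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
groups = np.repeat(np.arange(4), 5)             # 4 groups of 5 features
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta = None                                     # warm starts along the path
for lam in np.geomspace(1.0, 0.05, 10):
    beta = sgl_fit(X, y, groups, lam, alpha=0.5, beta0=beta)
```

At the end of the path, the three true coefficients are recovered up to shrinkage bias, and the coefficients in the inactive groups are (near-)zero.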

4. Extensions, Generalizations, and Adaptations

The basic SGL model has been extended along several axes:

  • Adaptive SGL: Both elementwise and groupwise penalty weights are adaptively set using preliminary estimates (e.g., OLS, PCA, PLS, or ridge), sharpening recovery of true sparse support and improving selection stability. Adaptive SGL achieves the oracle property under suitable regularity and tuning (Poignard, 2016, Civieta et al., 2019).
  • SGL for non-Gaussian loss: SGL adapts straightforwardly to logistic, multinomial, or quantile loss functions via proximal methods or block coordinate descent (Civieta et al., 2019, Vincent et al., 2012).
  • Structured/overlapping SGL: Graph-guided or hierarchical variants incorporate additional biological, network, or relational context (e.g., SGLGG for GWAS with gene-level graph constraints) (Yang et al., 2017).
  • SGL in deep learning: Groupwise penalties are imposed on connections from neurons or features, enabling automatic feature and neuron selection at all layers. The sparse-group penalty results in highly compact neural nets with minimal loss in accuracy (Scardapane et al., 2016).

SGL frameworks also enable debiased inference via nodewise-Lasso adjustments, producing asymptotically valid confidence intervals for active components (Cai et al., 2019).

5. Statistical Analysis, Phase Transitions, and Asymptotics

The asymptotic properties of SGL have been characterized using state-evolution (SE) for AMP-type solvers and classical convex analysis (Chen et al., 2021, Poignard, 2016):

  • AMP analysis shows that SGL's two-level thresholding yields favorable estimation error, false discovery, and true positive rates. For well-aligned groupings, SGL with low $\alpha$ (emphasizing group sparsity) outperforms the lasso in MSE and FDP, while transitioning smoothly to the lasso for less informative groups.
  • Performance trade-offs are directly governed by $\alpha$ and the quality of the groupings. If groups are perfect, SGL can achieve both low MSE and low FDP; poor groupings cause SGL to revert to lasso-like behavior (Chen et al., 2021).
  • Theoretical guarantees extend to adaptive SGL, with consistency, asymptotic normality, and minimax optimality holding for appropriately chosen weights and penalty scaling (Cai et al., 2019, Poignard, 2016, Civieta et al., 2019).

6. Practical Considerations, Model Selection and Implementation

Model selection involves a two-dimensional sweep over $\lambda$ and $\alpha$, often via cross-validation or (generalized) information criteria (Friedman et al., 2010, Wang et al., 2014, Liang et al., 2022). Key guidelines include:

  • Standardize predictors and choose group weights, typically $w_g = \sqrt{p_g}$, to balance shrinkage across groups.
  • For signal structures with within-group heterogeneity, favor higher $\alpha$; for group-wise block patterns, favor lower $\alpha$.
  • Strong screening (DFR, GAP Safe, TLFre) and active sets dramatically reduce computational expense, enabling grid search even in ultra-high-dimensional settings (Feser et al., 2024, Wang et al., 2014, Ndiaye et al., 2016).
  • In deep learning, integrate SGL as a regularizer into gradient-based architectures. Post-training thresholding yields compact, interpretable networks (Scardapane et al., 2016).
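The first guideline is mechanical; a minimal sketch (with a hypothetical helper name) that standardizes columns and builds the default $w_g = \sqrt{p_g}$ weights:

```python
import numpy as np

def standardize_and_weight(X, groups):
    """Center/scale each column of X and return default group weights
    w_g = sqrt(p_g), where p_g is the size of group g."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return Xs, np.sqrt(np.bincount(groups))

X = np.arange(12.0).reshape(4, 3)
groups = np.array([0, 0, 1])          # group sizes 2 and 1
Xs, w = standardize_and_weight(X, groups)
print(w)                              # [sqrt(2), 1.0]
```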

Example R usage from (Liang et al., 2022):

```r
library(sparsegl)

# Fit an SGL path; sparsegl calls the l1 mixing weight "asparse"
fit <- sparsegl(X, y, group = group, asparse = 0.2)

# Cross-validate over lambda and extract coefficients at lambda.1se
cvf <- cv.sparsegl(X, y, group = group, asparse = 0.2, nfolds = 10)
coef(cvf, s = "lambda.1se")
```

7. Applications and Comparative Performance

SGL is broadly applied in:

  • Genomics and GWAS: enabling discovery of genetic risk factors incorporating gene or pathway information (Yang et al., 2017).
  • Neuroimaging: mapping complex spatially-organized predictors with groupings across voxels or different imaging modalities (Ndiaye et al., 2016, Liang et al., 2022).
  • High-dimensional classification: multinomial models for text or genetics, outperforming plain lasso in both accuracy and parsimony (Vincent et al., 2012).
  • Deep learning pruning and compact representation: joint selection of input features, neurons, and connections (Scardapane et al., 2016).

Empirically, SGL dominates the lasso and the group lasso in double-sparse regimes, achieving lower error and higher selection specificity (Cai et al., 2019); adaptive variants further enhance sparsity recovery and inferential validity (Poignard, 2016). The grouping effect is preserved within highly correlated blocks (see the theoretical result in Ahsen et al., 2014). Advanced screening rules (TLFre, DFR, GAP Safe) yield speedups of $10\times$ to $100\times$ with no loss of accuracy (Feser et al., 2024, Wang et al., 2014, Ndiaye et al., 2016, Liang et al., 2022).


References:

  • (Friedman et al., 2010) Friedman, Hastie, Tibshirani, "A note on the group lasso and a sparse group lasso"
  • (Foygel et al., 2010) Foygel & Drton, "Exact block-wise optimization in group lasso and sparse group lasso for linear regression"
  • (Vincent et al., 2012) Vincent & Hansen, "Sparse group lasso and high dimensional multinomial classification"
  • (Wang et al., 2014) Xiang et al., "Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets"
  • (Ahsen et al., 2014) Abbasi et al., "Two New Approaches to Compressed Sensing Exhibiting Both Robust Sparse Recovery and the Grouping Effect"
  • (Ndiaye et al., 2016) Ndiaye et al., "GAP Safe Screening Rules for Sparse-Group-Lasso"
  • (Scardapane et al., 2016) Scardapane et al., "Group Sparse Regularization for Deep Neural Networks"
  • (Poignard, 2016) Poignard, "Asymptotic Theory of the Sparse Group LASSO"
  • (Yang et al., 2017) Liu et al., "Identifying Genetic Risk Factors via Sparse Group Lasso with Group Graph Structure"
  • (Cai et al., 2019) Yuan et al., "Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference"
  • (Civieta et al., 2019) Sobotka et al., "Quantile regression: a penalization approach"
  • (Chen et al., 2021) Liao et al., "Asymptotic Statistical Analysis of Sparse Group LASSO via Approximate Message Passing Algorithm"
  • (Wang et al., 2021) Wang & Tang, "A dual semismooth Newton based augmented Lagrangian method for large-scale linearly constrained sparse group square-root Lasso problems"
  • (Liang et al., 2022) Liang et al., "sparsegl: An R Package for Estimating Sparse Group Lasso"
  • (Feser et al., 2024) Feser & Evangelou, "Dual feature reduction for the sparse-group lasso and its adaptive variant"
