Group Lasso Extensions
- Group lasso extensions are regularization techniques that enforce sparsity and select groups of variables using ℓ₂/ℓ₁ penalties.
- They include variants such as sparse, overlapping, hierarchical, and exclusive group lasso, each tailored to different data structures and applications.
- Advanced algorithms like block coordinate descent and proximal methods enable scalability and strong theoretical guarantees in high-dimensional settings.
The group lasso extends the classical lasso by enforcing sparsity at the group level via an ℓ₂/ℓ₁ penalty, so that all variables in a group enter or leave the model together. Group lasso extensions generalize this framework to handle structured sparsity more flexibly: complex group structures (including overlaps and hierarchy), joint feature- and group-level selection, and scalable algorithms. This article covers the mathematical definitions, key algorithms, principal extensions such as sparse and exclusive group lasso, overlapping group penalties, statistical guarantees, and selected applications.
1. Sparse Group Lasso: Joint Group and Within-Group Selection
The sparse group lasso (SGL) combines an ℓ₁ penalty (lasso) with the group lasso's ℓ₂/ℓ₁ block penalty, yielding solutions with both group-level and within-group sparsity. The canonical linear model formulation is

$$
\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda_1 \sum_{g=1}^{G} \lVert \beta_{(g)} \rVert_2 \;+\; \lambda_2 \lVert \beta \rVert_1,
$$

where β is partitioned into G disjoint groups indexed by g, and λ₁, λ₂ ≥ 0 control the group and elementwise penalties. When λ₂ = 0 this reduces to the group lasso; λ₁ = 0 recovers the lasso. With both nonzero, SGL produces two-level sparsity: entire groups can be eliminated and, within surviving groups, individual coefficients may be zero (Friedman et al., 2010, Liang et al., 2022).
An efficient block coordinate descent algorithm updates one group at a time by:
- Computing a group-level soft-thresholding test to decide whether the entire group can be set to zero.
- Running inner coordinate descent for surviving groups, where each coordinate update applies soft-thresholding for the ℓ₁ part; when the group design is orthonormal, the block update reduces to closed-form groupwise soft-thresholding.
These algorithms guarantee monotonic decrease of the objective and convergence to a global minimizer under convexity (Friedman et al., 2010, Foygel et al., 2010). SGL is practically important in genomics and imaging, where variable groupings represent gene sets or connectivity clusters, but only a sparse subset of variables within relevant groups may contribute to prediction.
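A minimal NumPy sketch of this block update follows. The group-level zero test mirrors the soft-thresholding condition described above; the inner coordinate update shown here treats the group norm as fixed at its current value, which is a simplification of the exact inner solve, and the function and argument names (sgl_block_update, lam1 for the group penalty, lam2 for the ℓ₁ penalty) are illustrative rather than taken from any reference implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sgl_block_update(X, y, beta, groups, g, lam1, lam2, n_inner=100, tol=1e-8):
    """One sparse group lasso block update for group g (illustrative sketch).

    First applies the group-level zeroing test via soft-thresholding of the
    group gradient; if the group survives, runs approximate inner coordinate
    descent that treats the group norm as fixed at its current value."""
    n = X.shape[0]
    idx = groups[g]
    Xg = X[:, idx]
    # Partial residual with every other group's fit removed.
    r = y - X @ beta + Xg @ beta[idx]
    grad = Xg.T @ r / n
    # Group-level test: zero the whole group if the soft-thresholded gradient is small.
    if np.linalg.norm(soft_threshold(grad, lam2)) <= lam1:
        beta[idx] = 0.0
        return beta
    bg = beta[idx].copy()
    if not np.any(bg):                                 # warm start an all-zero block
        bg = soft_threshold(grad, lam2)
    for _ in range(n_inner):
        bg_old = bg.copy()
        for j in range(len(idx)):
            rj = r - Xg @ bg + Xg[:, j] * bg[j]        # residual excluding coordinate j
            zj = Xg[:, j] @ rj / n
            cj = Xg[:, j] @ Xg[:, j] / n
            norm_bg = max(np.linalg.norm(bg), 1e-12)   # guard against division by zero
            # l1 soft-thresholding plus groupwise shrinkage (fixed-point approximation).
            bg[j] = soft_threshold(zj, lam2) / (cj + lam1 / norm_bg)
        if np.linalg.norm(bg - bg_old) < tol:
            break
    beta[idx] = bg
    return beta
```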
2. Overlapping and Latent Group Lasso
Classical group lasso assumes a partition of variables, but biological and structured domains often require overlapping groupings. The latent group lasso (LGL) penalty addresses this by introducing latent copies of the coefficient vector, one restricted to the support of each group g, with the coupling constraint that the copies sum to β. Its penalty is

$$
\Omega_{\mathrm{LGL}}(\beta) \;=\; \min\Big\{ \sum_{g} d_g \,\lVert v^{(g)} \rVert_2 \;:\; \sum_{g} v^{(g)} = \beta,\;\; \operatorname{supp}\big(v^{(g)}\big) \subseteq g \Big\},
$$

where d_g > 0 are group weights. For any β, Ω_LGL(β) is the minimal weighted sum of groupwise norms over decompositions of β into group-supported components. The dual norm is the maximal weighted groupwise ℓ₂ norm, max_g ‖z_g‖₂/d_g, and the subdifferential can be characterized explicitly (Obozinski et al., 2011, Villa et al., 2012). Non-redundant group design requires strictly increasing weights for nested groups.
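To make the latent decomposition concrete, the following sketch evaluates the LGL penalty of a fixed vector by solving the defining minimization directly; it assumes cvxpy is available, and the example groups and unit weights are illustrative.

```python
import numpy as np
import cvxpy as cp

def latent_group_lasso_norm(beta, groups, weights=None):
    """Evaluate the latent group lasso penalty of a fixed vector beta:
    the minimal weighted sum of groupwise l2 norms over all decompositions
    beta = sum_g v_g with supp(v_g) contained in group g."""
    p = len(beta)
    weights = weights if weights is not None else [1.0] * len(groups)
    v = [cp.Variable(p) for _ in groups]          # one latent copy per group
    constraints = [sum(v) == beta]                # copies must reconstruct beta
    for vg, g in zip(v, groups):
        outside = [j for j in range(p) if j not in g]
        if outside:
            constraints.append(vg[outside] == 0)  # copy g is supported on group g only
    objective = cp.Minimize(sum(w * cp.norm(vg, 2) for w, vg in zip(weights, v)))
    prob = cp.Problem(objective, constraints)
    prob.solve()
    return prob.value, [vg.value for vg in v]

# Two overlapping groups sharing coordinate 1.
beta = np.array([1.0, 2.0, -1.5])
val, parts = latent_group_lasso_norm(beta, [[0, 1], [1, 2]])
print(val)   # minimal sum of groupwise norms that reconstructs beta
```

For disjoint groups this recovers the ordinary group norm; with the overlapping groups in the example, the shared coordinate is split between the two latent copies so as to minimize the total penalty.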
Optimization leverages either replication of variables (covariate duplication) or, more efficiently, dedicated proximal algorithms that handle the coupling constraints without variable expansion. First-order solvers such as FISTA, with inner projections onto the intersection of groupwise norm-constraint cylinders, yield convergence guarantees even though the proximal operator must itself be computed by an inner iterative projection (Villa et al., 2012). Active-set strategies further accelerate computation by restricting projections to groups with violated KKT conditions.
Latent group lasso generalizes well: for disjoint groups it reduces to the standard group norm; for tree, chain, or graph structures it can encode pathway or spatial constraints. In empirical settings, including high-throughput biology, LGL yields exact group-support recovery aligned with natural group unions when weights are properly chosen (Obozinski et al., 2011).
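The covariate-duplication route mentioned above can be sketched in a few lines: replicate the columns of each group so the expanded problem has disjoint groups, fit any standard group lasso solver there, then sum the latent copies back onto the original coordinates. The helper names below are illustrative and the solver itself is left abstract.

```python
import numpy as np

def duplicate_covariates(X, groups):
    """Expand a design matrix with overlapping groups into one with disjoint
    groups by replicating shared columns, as used for the latent group lasso.

    Returns the expanded design, the disjoint groups in the expanded space,
    and a map from expanded columns back to original columns."""
    blocks, new_groups, col_map = [], [], []
    start = 0
    for g in groups:
        blocks.append(X[:, g])                        # replicate the columns of group g
        new_groups.append(list(range(start, start + len(g))))
        col_map.extend(g)                             # expanded column -> original column
        start += len(g)
    return np.hstack(blocks), new_groups, np.array(col_map)

def collapse_coefficients(beta_expanded, col_map, p):
    """Sum the latent copies back onto the original coordinates: beta_j is the
    sum of the coefficients of every replica of column j."""
    beta = np.zeros(p)
    np.add.at(beta, col_map, beta_expanded)
    return beta

# Example: three features, two overlapping groups sharing feature 1.
X = np.random.default_rng(0).normal(size=(10, 3))
groups = [[0, 1], [1, 2]]
X_tilde, disjoint_groups, col_map = duplicate_covariates(X, groups)
# A standard (disjoint) group lasso fit on X_tilde with disjoint_groups would
# return beta_expanded; collapse_coefficients(beta_expanded, col_map, 3)
# recovers the original-space coefficients.
```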
3. Hierarchical and Structured Extensions
Hierarchical modeling requires that certain variables can only be nonzero if others (e.g., "parent" variables in a DAG) are nonzero. Two principal frameworks have been proposed (Yan et al., 2015):
- Standard Group Lasso (GL): Penalty over descendant-sets; leads to aggressive shrinkage of deep nodes (variables appearing in many groups).
- Latent Overlapping Group Lasso (LOG): Penalty over ancestor-sets using latent variables; yields balanced shrinkage regardless of group hierarchy depth.
LOG formulations admit closed-form path-proximal operators on chain/path hierarchies and efficient path-based block coordinate descent methods for DAGs. For tasks such as covariance estimation with banded or hierarchical sparsity, LOG achieves statistical rates matching or surpassing GL, but with simpler tuning and less sensitivity to depth-dependent regularization (Yan et al., 2015). Modified GL (mGL) can interpolate between these but requires nontrivial, bespoke weightings.
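The two encodings are easy to see on a chain hierarchy 0 → 1 → ⋯ → p−1 (used here purely for concreteness): GL penalizes descendant sets, so deep nodes appear in many groups, while LOG penalizes ancestor sets through latent copies, so supports are forced to be ancestor-closed without depth-dependent over-shrinkage.

```python
def chain_descendant_groups(p):
    """GL encoding on a chain: group j contains node j and all its descendants,
    so deep nodes appear in many groups and are shrunk aggressively."""
    return [list(range(j, p)) for j in range(p)]

def chain_ancestor_groups(p):
    """LOG encoding on a chain: group j contains node j and all its ancestors,
    so a node can be selected only together with its ancestors."""
    return [list(range(0, j + 1)) for j in range(p)]

print(chain_descendant_groups(4))  # [[0, 1, 2, 3], [1, 2, 3], [2, 3], [3]]
print(chain_ancestor_groups(4))    # [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
```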
4. Exclusive Group Lasso and Variants
A distinct extension is the exclusive group lasso (EGL), designed to promote sparsity within groups while encouraging many groups to be active. Its atomic norm composes the group ℓ₁ norms through an ℓ₂ norm,

$$
\Omega_{\mathrm{EGL}}(\beta) \;=\; \Big( \sum_{g=1}^{G} \lVert \beta_{(g)} \rVert_1^2 \Big)^{1/2}.
$$

Unlike the group lasso (ℓ₁ across group ℓ₂ norms), EGL reverses the composition to ℓ₂ across group ℓ₁ norms (Gregoratti et al., 2021, Sun et al., 2020). This configuration strongly incentivizes at most one nonzero per group (in the extreme), which is relevant when interpretability demands features distributed across as many groups as possible, with only a few selected per group.
Optimization employs proximal algorithms with water-filling–style blockwise soft-thresholding or active-set methods. EGL can be extended to the "unknown group" setting by random group assignment and stability selection; here, selection frequencies over random partitions yield robust feature identification. In simulation and TCR-sequencing applications, extended EGL reveals more comprehensive and stable feature sets than standard lasso when the signal is diffuse or strongly correlated within groups (Sun et al., 2020). Asymptotic theory establishes support-consistency under incoherence and balanced group size conditions (Gregoratti et al., 2021).
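The water-filling-style blockwise update mentioned above admits a short closed form for the squared penalty Σ_g ‖β₍g₎‖₁², which is how the EGL penalty is typically applied inside a proximal solver; the sketch below is a simplified derivation-by-KKT version, not the reference algorithm of either paper.

```python
import numpy as np

def prox_exclusive_block(v, lam):
    """Prox for one group of the penalty lam * (||x||_1)^2: minimizes
    0.5*||x - v||^2 + lam*(sum_i |x_i|)^2 by a sorted 'water-filling' search
    for the active set, then soft-thresholds at 2*lam*S."""
    a = np.sort(np.abs(v))[::-1]                 # magnitudes, descending
    csum = np.cumsum(a)
    k_best, S = 0, 0.0
    for k in range(1, len(a) + 1):
        S_k = csum[k - 1] / (1.0 + 2.0 * lam * k)
        if a[k - 1] > 2.0 * lam * S_k:           # coordinate k still survives
            k_best, S = k, S_k
    thr = 2.0 * lam * S
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def prox_exclusive_group_lasso(v, groups, lam):
    """Apply the blockwise prox group by group (groups index disjoint blocks)."""
    x = np.array(v, dtype=float)
    for g in groups:
        x[g] = prox_exclusive_block(x[g], lam)
    return x

# Within each group the weaker coordinates are thresholded away, while every
# group with a nonzero input keeps at least its strongest entry.
v = np.array([3.0, 1.0, 0.5, -2.0, 0.3])
print(prox_exclusive_group_lasso(v, [[0, 1, 2], [3, 4]], lam=0.25))
```

In the printout, the weaker coordinates within each group are zeroed while every group retains its strongest entry, which is exactly the within-group sparsity and across-group activity that EGL is designed to induce.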
5. Generalized Models and Theoretical Guarantees
Group lasso and its extensions have been developed beyond Gaussian models to generalized linear models (GLM), nonparametric functional models, and Poisson regression (Blazère et al., 2013, Ivanoff et al., 2014). For these settings, theoretical analysis focuses on:
- Oracle inequalities: Under restricted eigenvalue-type or group Stabil conditions, the group lasso and its variants attain optimal prediction and estimation error rates that scale with the number and size of the active groups rather than with the ambient dimension (Blazère et al., 2013, Ivanoff et al., 2014).
- Consistency: Necessary and sufficient irrepresentable conditions are available for exact recovery in both finite- and infinite-dimensional (RKHS) settings, with adaptive weighting restoring consistency when these conditions fail (Bach, 2008).
- Group square-root lasso (GSRL): Attaching the ℓ₂/ℓ₁ group penalty to the square-root loss ‖y − Xβ‖₂/√n enables tuning that does not require an estimate of the noise variance, adapts to unknown sparsity, and matches group lasso rates, with globally convergent algorithms (Bunea et al., 2013); a sketch follows this list.
- Support recovery: Both standard and latent group lasso admit explicit conditions for correct selection of active groups, even in the presence of overlapping group structures (Obozinski et al., 2011, Villa et al., 2012, Yan et al., 2015).
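As referenced in the GSRL item above, here is a minimal sketch of the group square-root lasso objective solved as a second-order cone program; it assumes cvxpy, and the √|g| group weights and the tuning value are illustrative choices, not prescriptions from the cited paper.

```python
import numpy as np
import cvxpy as cp

def group_sqrt_lasso(X, y, groups, lam):
    """Group square-root lasso: minimize ||y - X beta||_2 / sqrt(n)
    + lam * sum_g sqrt(|g|) * ||beta_g||_2.  Because the loss is the
    unsquared residual norm, lam can be chosen without estimating the
    noise level."""
    n, p = X.shape
    beta = cp.Variable(p)
    loss = cp.norm(y - X @ beta, 2) / np.sqrt(n)
    penalty = sum(np.sqrt(len(g)) * cp.norm(beta[g], 2) for g in groups)
    cp.Problem(cp.Minimize(loss + lam * penalty)).solve()
    return beta.value

# Example with two groups of three features each; only the first group is active.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(np.round(group_sqrt_lasso(X, y, [[0, 1, 2], [3, 4, 5]], lam=0.3), 3))
```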
The following table summarizes some statistical guarantees for prominent penalties:
| Penalty Class | Model Consistency Conditions | Oracle Error Rate |
|---|---|---|
| Group Lasso | Block irrepresentable condition | Optimal; scales with number and size of active groups |
| Latent Group Lasso | Extended block irrepresentable (overlaps) | Comparable to group lasso, over unions of groups |
| Sparse Group Lasso | KKT plus block/coordinate RE | Matches group lasso |
| Group Square-Root Lasso | Compatibility, group irrepresentable | Matches group lasso; noise-variance-free tuning |
| Exclusive Group Lasso | Incoherence, group balance | Exponential support consistency |
6. Algorithmic Advances and Scalability
Recent algorithmic advances focus on block coordinate descent, majorization-minimization, proximal splitting, and consensus-based parallelization:
- Exact block-wise updates: Single Line Search (SLS) and Signed SLS (SSLS) compute exact groupwise updates for both group lasso and sparse group lasso, efficient for moderate group size (Foygel et al., 2010).
- Block Coordinate Descent: Fast Newton/bisection plus active-set cycling, supporting pathwise regularization (Yang et al., 2024).
- Screening rules: Sequential strong rules and KKT-violation checks eliminate unnecessary updates over regularization paths (Liang et al., 2022, Yang et al., 2024); a sketch of this screening step follows the list.
- Parallelization: Divide-and-conquer methods can split both group lasso and overlapping group lasso across distributed machines with simple majority voting for support, followed by coefficient averaging; wall-clock time then stays nearly flat as massive data sets are spread over more machines (Chen et al., 2016).
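The screening step referenced in the list can be sketched as follows. This is one common form of the sequential strong rule adapted to groups (not the exact rule of any particular package): a group is kept as a candidate at the new regularization level only if its gradient norm at the previous solution clears a threshold, and discarded groups are re-checked against the KKT conditions after fitting.

```python
import numpy as np

def strong_rule_candidates(X, resid_prev, groups, lam_new, lam_prev, weights=None):
    """Sequential strong rule screening for the group lasso (one common form):
    group g stays a candidate at lam_new only if the norm of its gradient at
    the previous solution exceeds w_g * (2*lam_new - lam_prev)."""
    n = X.shape[0]
    weights = weights if weights is not None else [np.sqrt(len(g)) for g in groups]
    keep = []
    for g, w in zip(groups, weights):
        grad_norm = np.linalg.norm(X[:, g].T @ resid_prev) / n
        if grad_norm >= w * (2.0 * lam_new - lam_prev):
            keep.append(g)
    return keep

def kkt_violations(X, resid_new, groups, lam_new, active, weights=None):
    """Post-fit KKT check: any screened-out group whose gradient norm exceeds
    w_g * lam_new must be added back and the fit repeated."""
    n = X.shape[0]
    weights = weights if weights is not None else [np.sqrt(len(g)) for g in groups]
    return [g for g, w in zip(groups, weights)
            if g not in active and np.linalg.norm(X[:, g].T @ resid_new) / n > w * lam_new]
```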
Comparison table (solver approaches):
| Extension | Blockwise Closed-form | Overlaps | Parallel/Distributed | Screening |
|---|---|---|---|---|
| SGL, GSRL | Yes (orthonormal) | No | Yes (via BCD) | Yes |
| LGL, LOG | Pathwise/BCD/Proj | Yes | Select-and-discard | Partial |
| Exclusive GL | Water-filling/BCD | Yes | No | Not standard |
| DC-gLasso | Yes (BCD) | Yes | Yes | No |
7. Applications and Practical Guidance
Group lasso extensions are widely used in high-dimensional studies where variables have natural groupings, feature selection is nontrivial, and interpretable structure is essential:
- Genomics: Variable selection over gene sets, pathways, or interaction networks, leveraging overlapping/latent penalties to reflect biological redundancy (Obozinski et al., 2011, Villa et al., 2012).
- Neuroimaging: Sparse and group-sparse models map connectivity, with variable groupings representing connectivity clusters.
- Covariance estimation: Hierarchical/LOG formulations recover banded or structured precision matrices efficiently (Yan et al., 2015).
- Poisson and count data: Group lasso adapted via concentration inequalities and data-driven tuning to accommodate heteroscedastic regression (Ivanoff et al., 2014).
- High-throughput applications: Highly optimized packages (e.g., adelie, sparsegl) employ efficient C/Fortran backends, sparse matrix support, and outpace earlier solvers by orders of magnitude (Liang et al., 2022, Yang et al., 2024).
- Correlated feature selection: Exclusive group lasso and its extended forms demonstrate advanced capabilities in comprehensive marker identification under strong signal collinearity (Sun et al., 2020).
In practice, the ℓ₁-versus-group penalty weighting, the group design (overlaps, sizes), and computational constraints determine the optimal extension. Block coordinate descent with screening and warm starts is currently the state of the art for large-scale problems, with active-set methods dominating when the true support is highly sparse or structured. Extensions targeting unknown group structure employ randomization and stability selection to address this uncertainty; a sketch of that scheme follows.
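The sketch below illustrates the randomization-plus-stability-selection idea for unknown group structure; the fitting routine fit_exclusive_gl is a user-supplied placeholder (any exclusive-group-lasso style estimator returning a coefficient vector), and the helper names and thresholds are illustrative.

```python
import numpy as np

def stability_selection_random_groups(X, y, fit_exclusive_gl, n_groups,
                                      n_repeats=100, seed=0):
    """Repeatedly assign features to random groups, fit an exclusive-group-lasso
    style estimator on each partition, and record how often each feature is
    selected. Returns per-feature selection frequencies."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    freq = np.zeros(p)
    for _ in range(n_repeats):
        assignment = rng.integers(0, n_groups, size=p)              # random partition
        groups = [np.where(assignment == k)[0].tolist() for k in range(n_groups)]
        groups = [g for g in groups if g]                           # drop empty groups
        beta = fit_exclusive_gl(X, y, groups)                       # user-supplied fit
        freq += (np.abs(beta) > 1e-10)                              # record selections
    return freq / n_repeats
```

Features whose selection frequency stays high across many random partitions are retained, which makes the final feature set less sensitive to any single arbitrary grouping.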
References:
- "A note on the group lasso and a sparse group lasso" (Friedman et al., 2010)
- "Group Lasso with Overlaps: the Latent Group Lasso approach" (Obozinski et al., 2011)
- "Proximal methods for the latent group lasso penalty" (Villa et al., 2012)
- "Hierarchical Sparse Modeling: A Choice of Two Group Lasso Formulations" (Yan et al., 2015)
- "Exclusive Group Lasso for Structured Variable Selection" (Gregoratti et al., 2021)
- "Correlated Feature Selection with Extended Exclusive Group Lasso" (Sun et al., 2020)
- "A Fast and Scalable Pathwise-Solver for Group Lasso and Elastic Net Penalized Regression via Block-Coordinate Descent" (Yang et al., 2024)
- "sparsegl: An R Package for Estimating Sparse Group Lasso" (Liang et al., 2022)
- "Oracle inequalities for a Group Lasso procedure applied to generalized linear models in high dimension" (Blazère et al., 2013)
- "Adaptive Lasso and group-Lasso for functional Poisson regression" (Ivanoff et al., 2014)
- "The Group Square-Root Lasso: Theoretical Properties and Fast Algorithms" (Bunea et al., 2013)
- "Exact block-wise optimization in group lasso and sparse group lasso for linear regression" (Foygel et al., 2010)
- "A Communication-Efficient Parallel Method for Group-Lasso" (Chen et al., 2016)
- "Consistency of the group Lasso and multiple kernel learning" (0707.3390)