Latent Group Lasso
- Latent Group Lasso is a structured sparsity method that selects unions of potentially overlapping feature groups using latent variable decomposition.
- It employs a convex penalty summing groupwise L2 norms, ensuring accurate model recovery and interpretability under high-dimensional constraints.
- Efficient algorithms, such as covariate duplication and proximal methods, enable practical application in genomics, hierarchical modeling, and network analysis.
The Latent Group Lasso (LGL) is a structured sparsity-inducing technique that generalizes the classical group Lasso to settings where groups of features may overlap. By modeling latent variables supported on predefined groups and penalizing the sum of their groupwise norms, LGL enables selection of unions of potentially overlapping groups while maintaining convexity and meaningful support properties. This formalism supports a rich class of model-encoded dependencies, allowing for principled penalized regression under complex structured sparsity—an essential tool in genomics, hierarchical modeling, and applications requiring domain-driven constraints.
1. Mathematical Formulation and Norm Structure
Given observations $y \in \mathbb{R}^n$, a predictor matrix $X \in \mathbb{R}^{n \times p}$, and a collection of groups $\mathcal{G} \subseteq 2^{\{1,\dots,p\}}$ whose union covers all variables (overlaps are permitted), the LGL norm is defined via latent variables $v^{(g)} \in \mathbb{R}^p$, one for each group $g \in \mathcal{G}$:
- $w = \sum_{g \in \mathcal{G}} v^{(g)}$ with $\operatorname{supp}(v^{(g)}) \subseteq g$ (the regression coefficient vector)
The LGL penalty is
$$\Omega(w) = \min_{\substack{(v^{(g)})_{g \in \mathcal{G}}:\ \operatorname{supp}(v^{(g)}) \subseteq g,\ \sum_g v^{(g)} = w}} \ \sum_{g \in \mathcal{G}} d_g \,\|v^{(g)}\|_2,$$
where $d_g > 0$ are group weights.
The penalized objective takes the standard form $\min_{w \in \mathbb{R}^p} \tfrac{1}{n} L(y, Xw) + \lambda\, \Omega(w)$, with $L$ a convex loss (e.g., squared error or logistic).
The dual norm is $\Omega^*(s) = \max_{g \in \mathcal{G}} \|s_g\|_2 / d_g$,
and the subdifferential at $w$ comprises the vectors $s$ with $\Omega^*(s) \le 1$ and $\langle s, w \rangle = \Omega(w)$, where, for each $g \in \mathcal{G}$ and any optimal decomposition $(v^{(g)})_g$: if $v^{(g)} \neq 0$ then $s_g = d_g\, v^{(g)}_g / \|v^{(g)}\|_2$, and if $v^{(g)} = 0$, then $\|s_g\|_2 \le d_g$ (Obozinski et al., 2011).
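These dual-norm and subgradient conditions are easy to verify numerically. The following is a minimal NumPy sketch, not code from the cited work: the function names, the index-list encoding of groups, and the toy inputs are all illustrative assumptions.

```python
import numpy as np

def lgl_dual_norm(s, groups, weights):
    """Dual norm of the latent group lasso penalty: max_g ||s_g||_2 / d_g."""
    return max(np.linalg.norm(s[list(g)]) / d for g, d in zip(groups, weights))

def in_subdifferential(s, decomposition, groups, weights, tol=1e-8):
    """Check s in the subdifferential of Omega at w, given an optimal latent
    decomposition (v^(g))_g, each v^(g) a length-p vector supported on g.

    Requires Omega*(s) <= 1 and, for every active group (v^(g) != 0),
    s_g = d_g * v^(g)_g / ||v^(g)||_2.
    """
    if lgl_dual_norm(s, groups, weights) > 1 + tol:
        return False
    for g, d, v in zip(groups, weights, decomposition):
        nv = np.linalg.norm(v)
        if nv > tol:  # active group: equality condition on its block
            idx = list(g)
            if np.linalg.norm(s[idx] - d * v[idx] / nv) > tol:
                return False
    return True  # inactive groups are covered by the dual-norm bound
```

With two overlapping groups $\{1,2\}$ and $\{2,3\}$ and unit weights, a vector $s$ aligned with the single active latent component passes the check, while a misaligned one fails.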
2. Support Properties and Model Selection Consistency
The support induced by LGL is the union of those groups $g$ for which $v^{(g)} \neq 0$ in the optimal latent decomposition (a "weak group-support"). For the linear model $y = Xw^* + \varepsilon$, LGL achieves model recovery under incoherence-type assumptions involving the Gram matrix $Q = X^\top X / n$ restricted to the true support $J$. Specifically, excluding spurious groups requires an irrepresentability-type condition bounding the dual norm of the off-support correlations by one, with strict inequality for exact exclusion (Obozinski et al., 2011).
Under suitable decay of the regularization parameter ($\lambda_n \to 0$ but $\lambda_n \sqrt{n} \to \infty$), LGL selects no false-positive groups with high probability as $n \to \infty$. When the optimal decomposition is essentially unique, the group-support is also exactly recovered.
3. Role and Selection of Group Weights
The weights $d_g$ crucially determine the admissible supports and calibrate the penalty between groups of different sizes or nesting relationships:
- To prevent redundancy, if $g \subsetneq h$, require $d_g < d_h$; otherwise any latent mass supported on $g$ could be carried by $h$ at no greater cost, and $g$ would never enter an optimal decomposition.
- To avoid dominance by large groups, the weights must scale sufficiently steeply with group size; for instance, $d_g \propto \sqrt{|g|}$ suffices to control spurious group activation under pure noise.
- Alternative weightings, such as $d_g = |g|^{\theta}$ with $\theta \in [0, 1]$, are viable for fine-tuning FDR/FNR trade-offs, with $\theta = 1/2$ yielding balanced selection (Obozinski et al., 2011).
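As a toy illustration of these calibration rules, one can compute size-based weights and flag groups made redundant by a superset with a weight that is no larger. This sketch uses assumed, simplified semantics; the function names and examples are not from the cited work.

```python
def group_weights(groups, theta=0.5):
    """Size-based weights d_g = |g|^theta; theta = 0.5 gives d_g = sqrt(|g|)."""
    return [len(g) ** theta for g in groups]

def redundant_groups(groups, weights):
    """Indices of groups g for which some superset h has d_h <= d_g:
    latent mass on g can then always be moved to h at equal or lower cost,
    so g never appears in an optimal decomposition."""
    redundant = set()
    for i, (g, dg) in enumerate(zip(groups, weights)):
        for j, (h, dh) in enumerate(zip(groups, weights)):
            if i != j and set(g) <= set(h) and dg >= dh:
                redundant.add(i)
    return sorted(redundant)
```

For nested groups $\{1\} \subset \{1,2\} \subset \{1,2,3\}$, flat weights make the two smaller groups redundant, while $d_g = \sqrt{|g|}$ keeps all three meaningful.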
4. Algorithms and Computational Strategies
Efficient LGL algorithms exploit the special structure of latent decomposition:
- Covariate duplication: construct an expanded design matrix by duplicating each variable $j$ once for every group $g \ni j$. The optimization reduces to a standard disjoint group Lasso in the expanded space, admitting block coordinate descent:
```
Initialize v^(g) = 0 for all g ∈ G.
Repeat until convergence:
    for each group g ∈ G in cyclic order:
        r ← y − Σ_{h≠g} X_h v^(h)        # partial residual
        z ← X_gᵀ r / n                   # groupwise correlation
        v^(g) ← (1 − λ d_g / ‖z‖)₊ · z   # groupwise soft-threshold
w ← Σ_g v^(g)
```
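The block coordinate descent above translates directly to NumPy. The sketch below is a minimal illustration under the assumption that each $X_g$ satisfies $X_g^\top X_g / n = I$ (otherwise the closed-form soft-threshold is only an approximate block step); all names and the synthetic example are illustrative.

```python
import numpy as np

def lgl_bcd(X, y, groups, weights, lam, n_iter=200):
    """Block coordinate descent for the latent group lasso via covariate
    duplication: one latent coefficient vector v^(g) per (possibly
    overlapping) group, with w = sum_g v^(g) on the original coordinates.

    The closed-form update assumes X_g'X_g / n = I within each group.
    """
    n, p = X.shape
    V = [np.zeros(len(g)) for g in groups]              # latent blocks
    fitted = [X[:, g] @ v for g, v in zip(groups, V)]   # per-group fits
    for _ in range(n_iter):
        for k, (g, d) in enumerate(zip(groups, weights)):
            r = y - (sum(fitted) - fitted[k])           # partial residual
            z = X[:, g].T @ r / n                       # groupwise correlation
            nz = np.linalg.norm(z)
            # groupwise soft-threshold (1 - lam*d/||z||)_+ * z
            V[k] = max(0.0, 1.0 - lam * d / nz) * z if nz > 0 else 0.0 * z
            fitted[k] = X[:, g] @ V[k]
    w = np.zeros(p)
    for g, v in zip(groups, V):
        w[g] += v                                       # w = sum_g v^(g)
    return w, V
```

On an orthonormal design with disjoint groups (a special case of the same routine), the result matches the closed-form groupwise soft-threshold, and a large enough $\lambda$ zeroes out every group.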
- Proximal methods: alternative approaches apply accelerated proximal-gradient descent directly on $w$, computing the proximal operator $\operatorname{prox}_{\lambda\Omega}$ by projection onto an intersection of norm-constrained cylinders, possibly using a dual projected-Newton method for the Euclidean case (Villa et al., 2012).
- Active-set strategies: iteratively restrict attention to the currently active groups, i.e., those with $v^{(g)} \neq 0$ or violating the groupwise optimality condition. This yields major computational speedups, especially as the solution sparsifies and the number of active groups stabilizes (Villa et al., 2012).
- MKL interpretations: LGL can be viewed as Multiple Kernel Learning, with each group contributing a kernel and a convex combination optimized under constraints tied to the weights $d_g$ (Obozinski et al., 2011).
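The active-set idea reduces to a groupwise KKT check on the current residual. A minimal sketch follows (names and the toy data are illustrative assumptions); at $w = 0$ it amounts to testing whether $\Omega^*(X^\top y / n) \le \lambda$, i.e., whether the all-zero solution is already optimal.

```python
import numpy as np

def violating_groups(X, y, w_fit, groups, weights, lam):
    """Active-set screening: indices of groups whose dual-norm block at the
    current residual exceeds the threshold lam * d_g, i.e., candidate groups
    that must be added to the working set."""
    n = X.shape[0]
    r = y - X @ w_fit                       # current residual
    return [k for k, (g, d) in enumerate(zip(groups, weights))
            if np.linalg.norm(X[:, g].T @ r / n) > lam * d]
```

An outer loop would solve the restricted problem on the working set, recheck this condition, and stop once no group violates it.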
5. Extensions: Hierarchical, Overlapping, and Rule-Based Grouping
LGL underpins a suite of structured penalties including:
- Hierarchical models: When groups are derived from a DAG structure, LGL (more precisely, "Latent Overlapping Group Lasso" or LOG) regularizes via latent vectors on ancestor sets. This ensures that if a deep parameter is nonzero, all its ancestors are, enforcing atomic-level hierarchical zero patterns (Yan et al., 2015).
- Complex selection rules: LGL readily encodes arbitrary combinatorial rules among predictors through appropriate group collection design. This enables regularization under domain-mandated constraints (e.g., strong heredity, force-in groups, logical interaction inclusion). The support of the estimator then exactly aligns with the allowed model dictionary as determined by these rules (Wang et al., 2022).
- Network and latent structure: LGL has been generalized to latent group structures induced by networks, where no explicit group labels are required. Here, the penalty is defined by a heat-flow on a Laplacian-encoded graph, smoothly interpolating between standard Lasso and classical group Lasso by varying the diffusion time, and admitting efficient local Monte Carlo optimization schemes (Ghosh et al., 20 Jul 2025).
- Time-varying and panel data contexts: Estimation of latent group structures in time-varying panel data leverages adaptive group fused-Lasso penalties, identifying group homogeneity in coefficient trajectories while maintaining oracle and clustering consistency under theoretical guarantees (Haimerl et al., 29 Mar 2025).
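For intuition on the hierarchical (LOG) construction above: defining one group per node as the node plus all of its ancestors makes every union of selected groups ancestor-closed, so a nonzero deep coefficient forces its whole root-to-node path into the support. The parent-map tree encoding below is an illustrative assumption.

```python
def ancestor_groups(parent):
    """One LOG group per node: {v} plus all ancestors of v, given a parent
    map with parent[root] = None. Any union of such groups is closed under
    taking ancestors, which enforces hierarchical zero patterns."""
    groups = {}
    for v in parent:
        g, u = set(), v
        while u is not None:   # walk up to the root, collecting ancestors
            g.add(u)
            u = parent[u]
        groups[v] = g
    return groups
```

These groups would then be passed, with suitable weights, to any latent group lasso solver.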
6. Empirical Performance and Applications
LGL exhibits marked empirical advantages over standard Lasso and disjoint group Lasso:
- Simulation studies: in synthetic problems with overlapping groups, LGL achieves nearly perfect support recovery (e.g., ≈99%) where plain Lasso fails consistently. For chain graphs, recovery of larger contiguous blocks is possible only with LGL, surpassing Lasso (Obozinski et al., 2011).
- Biological applications: On breast cancer microarray data, employing groups from KEGG pathways, LGL delivers improvements in balanced accuracy (2–12%), reduces model complexity, and identifies more interpretable and reproducible biological signatures (Obozinski et al., 2011, Villa et al., 2012).
- Structured network domains: graph-structured LGL selects much larger and more coherent subnetworks, aligning better with biological relevance than unstructured alternatives (Obozinski et al., 2011).
- Hierarchical modeling and covariance estimation: in time series and banded covariance estimation, LGL matches or exceeds the group Lasso in estimation accuracy and support recovery, while offering simpler and more computationally efficient proximal operators, especially for path- and tree-structured groupings (Yan et al., 2015).
- Prediction models under selection constraints: LGL encoding of clinical selection rules yields sparser models that respect all mandated dependencies and improves cross-validated risk compared to unconstrained (adaptive) Lasso (Wang et al., 2022).
7. Connections, Generalizations, and Complexity Considerations
- LGL’s convexity and latent variable formulation allow seamless extension to overlapping, nested, or network-defined groupings while ensuring tractable optimization via modern first-order primal-dual or block coordinate strategies.
- The computational overhead due to group overlap is mitigated by active-set acceleration, projection-based proximal computation, and efficient handling of the latent variable decomposition, enabling practical deployment in high-dimensional regimes without the need for dimensionality-reduction pre-processing (Villa et al., 2012).
- The formulation recovers the classical Lasso and the disjoint group Lasso as special cases by, respectively, taking all groups to be singletons or a disjoint partition.
- Network-induced latent group penalties facilitate learning under latent grouping not specified a priori, with sample complexity scaling logarithmically in ambient dimension and no requirement for variable pre-clustering (Ghosh et al., 20 Jul 2025).
A plausible implication is that LGL and its modern extensions now serve as essential tools for interpretable and structure-aware feature selection in high-dimensional statistics, with diverse applications from biomedicine to econometrics and statistical network analysis.