Sparse Regularization and Feature Selection
- Sparse regularization and feature selection are techniques that impose penalties to enforce sparsity, enabling clearer model interpretation in high-dimensional settings.
- They utilize both convex (e.g., LASSO) and nonconvex (e.g., SCAD, MCP) penalties to selectively reduce redundant features and improve statistical recovery.
- Modern approaches integrate these methods into deep learning and unsupervised frameworks, achieving superior variable selection, prediction accuracy, and computational scalability.
Sparse regularization and feature selection are fundamental concepts in modern statistical learning and machine learning, underpinning modeling strategies in high-dimensional regimes and supporting interpretability, generalization, and computational efficiency. At their core, these approaches introduce explicit penalties or constraints that drive most of a model’s coefficients, weights, or input connections to exact zeros, leaving only a small, data-adaptive subset—effectively performing variable selection or structural compression. Methodologies for sparse regularization have evolved from convex penalties such as the LASSO ($\ell_1$) to nonconvex quasi-norms ($\ell_p$, $0 < p < 1$), concave penalties (SCAD, MCP), and specialized strategies within deep learning and unsupervised settings. The landscape now includes methods for both linear models and highly nonlinear architectures, offering provable statistical guarantees, scalable optimization, and enhanced empirical performance in variable selection and prediction.
1. Mathematical Foundations and Regularization Structures
Sparse regularization is formalized by penalties or constraints that favor models with a small active set of parameters. Key structures include:
- Convex penalties: The LASSO ($\ell_1$ norm) enforces individual sparsity and is tractable via coordinate descent and proximal algorithms. The group LASSO ($\ell_{2,1}$ norm) promotes group/row-wise sparsity, e.g., entire features or groups are selected or discarded together (Luo et al., 2023).
- Nonconvex penalties: SCAD (Smoothly Clipped Absolute Deviation) and MCP (Minimax Concave Penalty) reduce estimator bias and can more accurately recover true sparse supports, with penalties such as the MCP
$$p_\lambda(t) = \begin{cases} \lambda |t| - \frac{t^2}{2\gamma}, & |t| \le \gamma\lambda, \\ \frac{\gamma\lambda^2}{2}, & |t| > \gamma\lambda, \end{cases} \qquad \gamma > 1,$$
and SCAD, defined through its derivative $p'_\lambda(t) = \lambda\{\mathbf{1}(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\,\mathbf{1}(t > \lambda)\}$ for $t \ge 0$, $a > 2$ (Luo et al., 2023, Laporte et al., 2015).
- Cardinality constraints: Direct $\ell_0$ or group-$\ell_0$ penalties explicitly limit the number of nonzeros but yield combinatorially hard problems, requiring surrogate relaxations (Boolean relaxation, iterative thresholding) or discrete optimization (Sun et al., 2020, Bertsimas et al., 2019).
- Quasi-norms: $\ell_p$-norms for $0 < p < 1$ more closely approximate the $\ell_0$-norm, yielding sparser solutions than LASSO at the cost of nonconvexity (Peng et al., 2015).
These structures are embedded into supervised and unsupervised frameworks, extend to nonparametric/functional models (e.g., sparse PCA), and are now foundational in deep architectures via group-wise and entry-wise penalties, stochastic feature gates, and subnetwork top-$k$ selection. The thresholding (proximal) operators associated with the main penalty types are sketched below.
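As a concrete illustration of how these penalty structures differ, the following sketch implements the proximal (thresholding) operators associated with the $\ell_1$, MCP, and cardinality penalties in NumPy. The operator forms are standard; the function names and the choice $\gamma = 3$ are illustrative, not taken from any cited paper.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the l1 (LASSO) penalty: shrinks every entry toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mcp_threshold(z, lam, gamma=3.0):
    """Proximal operator of the MCP penalty (gamma > 1): soft-thresholding for small
    entries, no shrinkage beyond gamma*lam, which reduces estimation bias."""
    return np.where(
        np.abs(z) <= gamma * lam,
        gamma / (gamma - 1.0) * soft_threshold(z, lam),
        z,
    )

def hard_threshold(z, k):
    """Projection onto the l0 ball: keep only the k largest-magnitude entries."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

z = np.array([-2.5, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(z, 0.5))   # every nonzero survivor is shrunk by 0.5
print(mcp_threshold(z, 0.5))    # large entries pass through unshrunk
print(hard_threshold(z, 2))     # only the two largest-magnitude entries survive
```

The contrast between the three outputs reflects the bias/sparsity trade-off discussed above: the $\ell_1$ operator shrinks everything, MCP leaves large coefficients untouched, and the cardinality projection enforces an exact count of nonzeros.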
2. Core Methodologies and Algorithmic Strategies
Sparse regularization schemes require careful optimization strategies attuned to nonconvexity and non-differentiability. Representative methods include:
- Proximal and thresholding algorithms: Proximal gradient descent and group thresholding solve composite objectives for $\ell_1$, group LASSO, SCAD, and MCP penalties. Closed-form soft/hard thresholding operators enable efficient parameter updates; a minimal group-LASSO example follows this list (Luo et al., 2023, Xu et al., 2022, Sun et al., 2020).
- Iterative reweighted schemes: Concave penalties such as the $\ell_p$ quasi-norm ($0 < p < 1$) and other nonconvex surrogates are optimized via iterative reweighted $\ell_1$/$\ell_2$ minimization or majorization–minimization, updating the surrogate weights from the previous iterate (Han et al., 2014, Peng et al., 2015, Laporte et al., 2015).
- Alternating block minimization: Joint learning of feature selectors and auxiliary variables (e.g. graphs, pseudo-labels in unsupervised feature selection) makes use of block-coordinate or PAM (proximal alternating minimization) strategies (Xiu et al., 22 Dec 2024, Sun et al., 2020).
- Pathwise continuation: Homotopy and backward (dense-to-sparse) continuation strategies are essential for stability and convergence, especially in nonconvex neural settings (Luo et al., 2023, Lemhadri et al., 2019).
- Stochastic methods: Binary stochastic filtering (BSF) stochastically gates feature inputs with learnable probabilities, trained via straight-through estimators to allow SGD-based optimization (Trelin et al., 2020); a minimal gate implementation is sketched at the end of this section.
- Top-$k$ regularization: Enforces a hard selection of $k$ features per submodel or subnetwork through masking, ranking, and explicit subnetwork loss components (Wu et al., 2021).
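To make the proximal strategy concrete, the sketch below runs proximal gradient descent on a group-LASSO-penalized least-squares problem, as referenced in the first bullet above. It is a minimal NumPy implementation under assumed data shapes, a fixed step size equal to the inverse Lipschitz constant, and an illustrative penalty weight; it is not the exact algorithm of any cited paper.

```python
import numpy as np

def group_soft_threshold(w, tau):
    """Proximal operator of the group-LASSO penalty for a single group:
    shrink the whole group's norm by tau, or zero the group out entirely."""
    norm = np.linalg.norm(w)
    return np.zeros_like(w) if norm <= tau else (1.0 - tau / norm) * w

def group_lasso_prox_gradient(X, y, groups, lam, n_iter=500):
    """Minimize 0.5*||y - Xw||^2 + lam * sum_g ||w_g||_2 by proximal gradient descent."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))        # gradient step on the smooth loss
        for g in groups:                          # group-wise thresholding (prox) step
            w[g] = group_soft_threshold(z[g], step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 12))
w_true = np.zeros(12); w_true[:3] = [2.0, -1.5, 1.0]          # only the first group is active
y = X @ w_true + 0.1 * rng.standard_normal(100)
groups = [list(range(0, 3)), list(range(3, 6)), list(range(6, 9)), list(range(9, 12))]
w_hat = group_lasso_prox_gradient(X, y, groups, lam=10.0)
print([round(float(np.linalg.norm(w_hat[g])), 3) for g in groups])  # per-group norms of the estimate
```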
Optimization complexity varies with scheme (e.g., roughly $O(np)$ per coordinate-descent sweep for the LASSO, and considerably higher for large matrix eigenproblems in sparse PCA or for iterative hard thresholding in $\ell_0$-constrained minimization), but most modern algorithms scale to high-dimensional problems with careful implementation (Bertsimas et al., 2019, Sun et al., 2020).
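As an example of the stochastic-gating idea mentioned above, the following PyTorch sketch implements a per-feature Bernoulli gate trained with a straight-through estimator. The layer name, the deterministic test-time rule, and the penalty on gate probabilities are illustrative assumptions rather than the exact formulation of the cited BSF method.

```python
import torch
import torch.nn as nn

class StochasticFeatureGate(nn.Module):
    """Per-feature Bernoulli gate with a straight-through gradient estimator."""
    def __init__(self, n_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_features))   # one gate logit per input feature

    def forward(self, x):
        p = torch.sigmoid(self.logits)                         # keep probabilities
        if self.training:
            b = torch.bernoulli(p.detach())                    # hard 0/1 sample
            gate = b + p - p.detach()                          # forward uses b, backward flows through p
        else:
            gate = (p > 0.5).float()                           # deterministic gate at test time
        return x * gate, p

torch.manual_seed(0)
n, d = 64, 20
gate, net = StochasticFeatureGate(d), nn.Linear(d, 1)
opt = torch.optim.Adam(list(gate.parameters()) + list(net.parameters()), lr=1e-2)

x = torch.randn(n, d)
y = x[:, :3].sum(dim=1, keepdim=True)                          # only the first 3 features matter
for _ in range(500):
    gated, p = gate(x)
    loss = ((net(gated) - y) ** 2).mean() + 0.05 * p.sum()     # fit loss plus sparsity pressure on the gates
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.sigmoid(gate.logits).detach())                     # learned keep probabilities per feature
```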
3. Statistical Guarantees and Theoretical Properties
Sparse regularization methods provide nonasymptotic and asymptotic guarantees under specific conditions:
- Oracle properties and support recovery: MCP, SCAD, and $\ell_0$-type regularizations can achieve consistent identification of the true set of nonzero coefficients (support recovery), provided appropriate conditions such as restricted eigenvalue or irrepresentability and suitable regularization-parameter scaling (e.g., $\lambda \asymp \sigma\sqrt{\log p / n}$) (Luo et al., 2023, Bertsimas et al., 2019).
- Estimation and prediction consistency: Adaptive penalties can yield estimation error rates matching those of the oracle estimator, with rates of order $\sqrt{s \log p / n}$ (for a true support of size $s$) in parametric or sufficiently rich nonlinear models (Luo et al., 2023, Xu et al., 2022).
- Approximation bounds: For top-$k$ regularization and sparse neural models, uniform error bounds can be shown for approximating high-dimensional sparse functions, with rates depending on the number of hidden units and intrinsic sparsity (Wu et al., 2021).
- Phase transition behavior: In high-dimensional regimes, the probability of exact support recovery exhibits sharp phase transitions depending on the sample size to dimension ratio and sparsity level, extending classical compressed sensing results to nonlinear and neural settings (Sardy et al., 26 Nov 2024).
A plausible implication is that, for many practical problems, nonconvex penalties or hard cardinality constraints provide superior support recovery and feature screening compared to standard convex surrogates (LASSO) when computational budgets are adequate and signal-to-noise ratios are not too low.
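A minimal simulation in this spirit, using scikit-learn's Lasso with the standard $\lambda \asymp \sigma\sqrt{\log p / n}$ tuning rule as an assumed choice, is sketched below; the constants, dimensions, and number of replications are illustrative rather than calibrated to any cited phase-transition result.

```python
import numpy as np
from sklearn.linear_model import Lasso

def exact_recovery_rate(n, p, s, sigma=0.5, n_rep=20, seed=0):
    """Fraction of replications in which the LASSO support equals the true support."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        beta = np.zeros(p); beta[:s] = 1.0
        y = X @ beta + sigma * rng.standard_normal(n)
        lam = sigma * np.sqrt(2.0 * np.log(p) / n)     # theory-motivated regularization scaling
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y)
        hits += set(np.flatnonzero(fit.coef_)) == set(range(s))
    return hits / n_rep

for n in (50, 100, 200, 400):                           # empirical recovery rate across sample sizes
    print(n, exact_recovery_rate(n=n, p=200, s=5))
```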
4. Deep Learning and Nonlinear Feature Selection
Sparse regularization has been successfully propagated into neural networks via several principled approaches:
- Sparse-input and group-sparse neural networks: Penalties applied to the $\ell_2$ norm of the first-layer weights emanating from each input node (group LASSO, SCAD, MCP) effect feature selection by zeroing entire groups, enabling combinatorial feature selection in nonparametric function estimation (Luo et al., 2023, Xu et al., 2022); a first-layer group-penalty sketch follows this list.
- Architectural constraints: LassoNet enforces a “strong hierarchy” where features are usable in hidden layers only if their linear weights are active, ensuring interpretable, globally sparse deep nets (Lemhadri et al., 2019).
- Binary stochastic filtering: Stochastic feature gates enable learnable, hard feature selection with minimal computational overhead (Trelin et al., 2020).
- HarderLASSO and universal thresholding: Nonconvex, entrywise penalties interpolating between the $\ell_1$ and $\ell_0$ penalties enable exact zeros with block-coordinate and thresholding updates, and can be calibrated via the quantile universal threshold, eliminating the need for cross-validation (Sardy et al., 26 Nov 2024).
- Top-$k$ masking and subnetwork regularization: By explicitly optimizing a subnetwork defined by the $k$ largest-magnitude input weights, one can enforce exact feature sparsity at each iteration within arbitrary DNNs, with links to theoretical approximation guarantees (Wu et al., 2021); a sketch of this masking scheme closes this section.
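For the sparse-input approach in the first bullet, the sketch below applies a group-LASSO proximal step to the columns of the first-layer weight matrix after every gradient update, so that zeroed columns correspond to dropped input features. The architecture, optimizer, step size, and penalty weight are assumptions for illustration, not the training scheme of the cited papers.

```python
import torch
import torch.nn as nn

def prox_group_columns(W, tau):
    """Group soft-thresholding of each input column of a first-layer weight matrix.
    Zeroing column j disconnects input feature j from the rest of the network."""
    norms = W.norm(dim=0, keepdim=True)                        # one l2 norm per input feature
    scale = torch.clamp(1.0 - tau / (norms + 1e-12), min=0.0)
    return W * scale

torch.manual_seed(0)
d, lr, lam = 15, 0.05, 0.1
net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(net.parameters(), lr=lr)

X = torch.randn(256, d)
y = (2 * X[:, 0] - X[:, 1]).unsqueeze(1)                       # only features 0 and 1 are relevant
for _ in range(2000):
    loss = ((net(X) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                      # proximal (group-thresholding) step
        W = net[0].weight                                      # shape (32, d): column j feeds from input j
        net[0].weight.copy_(prox_group_columns(W, tau=lr * lam))

print(net[0].weight.norm(dim=0))                               # per-feature first-layer norms (zeros = dropped)
```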
Empirical studies confirm that nonconvex group penalties (SCAD, MCP), structured regularization (e.g., control via groupings, feature costs), and stochastic/architectural gating can dramatically outperform plain or group LASSO in controlling false positives, minimization bias, and model size without deteriorating predictive power (Luo et al., 2023, Xu et al., 2022, Sardy et al., 26 Nov 2024, Wu et al., 2021).
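A hedged sketch of the top-$k$ masking idea from the last bullet of this section: at each step, only the $k$ inputs with the largest first-layer column norms are passed to a subnetwork loss, while a dense loss keeps all columns learning. The combination of the two losses and the specific ranking rule are illustrative assumptions, not the exact objective of the cited work.

```python
import torch
import torch.nn as nn

def topk_input_mask(first_layer_weight, k):
    """Binary mask over input features keeping the k largest first-layer column norms."""
    scores = first_layer_weight.norm(dim=0)                    # one importance score per input feature
    mask = torch.zeros_like(scores)
    mask[torch.topk(scores, k).indices] = 1.0
    return mask

torch.manual_seed(0)
d, k = 20, 3
net = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

X = torch.randn(512, d)
y = (1.5 * X[:, 0] - X[:, 4] + 0.5 * X[:, 9]).unsqueeze(1)     # a 3-sparse target function
for _ in range(1000):
    mask = topk_input_mask(net[0].weight.detach(), k)          # recompute the active subnetwork
    dense_loss = ((net(X) - y) ** 2).mean()                    # full-network loss keeps all columns learning
    sparse_loss = ((net(X * mask) - y) ** 2).mean()            # subnetwork loss on the k retained inputs
    loss = dense_loss + sparse_loss
    opt.zero_grad(); loss.backward(); opt.step()

print(topk_input_mask(net[0].weight.detach(), k).nonzero().flatten())  # indices of the k retained features
```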
5. Unsupervised and Multi-label Feature Selection
Feature selection in unsupervised settings (lacking labels) requires coupling sparse regularization with geometric or information-theoretic structure discovery:
- Sparse PCA frameworks: Embedding row-sparse or hybrid (bi-sparse, i.e., simultaneously row- and entry-sparse) penalties under PCA or matrix factorization enables variable selection based on the reconstructive variance explained, offering model selection without supervision (Xiu et al., 22 Dec 2024, Li et al., 2020, Allen et al., 2013); a generic thresholded power-iteration sketch appears at the end of this section.
- Structured and adaptive manifold regularization: Row-sparse penalties (e.g., $\ell_{2,0}$ or $\ell_{2,1}$) combined with graph- and entropy-based structure learning drive sparse subset selection, preserving both local data geometry and global structure (Sun et al., 2020).
- Joint regularization for multi-label tasks: Mixtures of group-sparse ($\ell_{2,1}$) and Frobenius-norm penalties (elastic-net style) address both sparsity and multicollinearity, and, when integrated with random-walk manifold regularizers, deliver state-of-the-art multi-label feature selection (Li et al., 2022).
- Block-coordinate and alternating minimization algorithms: These methods provide practical ways to traverse the nonconvex, high-dimensional optimization landscape, with global convergence guarantees in some settings (notably for PAM-Riemannian solvers in bi-sparse PCA (Xiu et al., 22 Dec 2024)).
Recent empirical evidence demonstrates that hybrid sparsity objectives (e.g., enforcing both group-level and entry-level sparsity via bi-sparse norms) improve clustering and recovery of the true latent structure compared to single-type sparse penalties (Xiu et al., 22 Dec 2024).
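To make the sparse-PCA idea concrete, here is a minimal thresholded (truncated) power iteration for a single sparse loading vector; it is a generic heuristic under an assumed hard-sparsity level $k$, not the bi-sparse PAM algorithm of the cited work.

```python
import numpy as np

def sparse_pca_loading(X, k, n_iter=100, seed=0):
    """One sparse principal loading via hard-thresholded power iteration:
    keep only the k largest-magnitude entries of the loading at each step."""
    rng = np.random.default_rng(seed)
    S = X.T @ X / X.shape[0]                       # sample covariance (X assumed centered)
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = S @ v
        keep = np.argsort(np.abs(v))[-k:]          # indices of the k largest-magnitude entries
        mask = np.zeros_like(v); mask[keep] = 1.0
        v = v * mask
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
n, p = 300, 40
z = rng.standard_normal(n)                          # latent factor loading on the first 5 variables
X = np.outer(z, np.r_[np.ones(5), np.zeros(p - 5)]) + 0.3 * rng.standard_normal((n, p))
X -= X.mean(axis=0)
v = sparse_pca_loading(X, k=5)
print(np.flatnonzero(np.abs(v) > 1e-8))             # support of the estimated sparse loading
```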
6. Extensions, Applications, and Limitations
Sparse regularization and feature selection methods now span a wide spectrum of applications:
- Interpretable machine learning: Sparse neural additive models and group-sparse neural networks yield both high predictive accuracy and transparent variable importance, essential for regulatory or scientific insight (Xu et al., 2022, Lemhadri et al., 2019).
- Nonlinear interaction mining: Regularized factorization machines with sparse interaction regularization (e.g., triangle-inequality and Cauchy–Schwarz quasi-norms) now target interaction-level sparsity beyond basic feature selection (Atarashi et al., 2020).
- Multi-task and multi-output selection: Structured penalties (e.g., row-wise $\ell_{2,1}$-type norms) promote joint feature sharing across tasks, improving sample efficiency and support overlap in biometric, text, and genomics applications (Liang et al., 2011).
- Fast face recognition and real-time models: Nonconvex penalties and hierarchical feature pipelines enable ultra-fast, compressed-dictionary sparse representation, directly benefiting pattern recognition with high-dimensional signals (Han et al., 2014).
- Tree-based and ensemble selection: Weighted-lasso over tree-ensemble basis functions and best-subset MIQP in frameworks like ControlBurn create interpretable, feature-sparse nonlinear models with explicit tradeoffs between model complexity and prediction (Liu et al., 2022).
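A generic scikit-learn sketch of the tree-ensemble idea above (not the ControlBurn package's own API): fit a shallow forest, treat each tree's predictions as a basis function, run a LASSO over that basis, and read the selected features off the surviving trees. The forest depth and penalty value are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=400, n_features=20, n_informative=4, random_state=0)

# Shallow forest; each tree's in-sample prediction becomes one basis column.
forest = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0).fit(X, y)
B = np.column_stack([tree.predict(X) for tree in forest.estimators_])   # (n_samples, n_trees)

# LASSO over the tree basis: trees receiving zero weight are pruned from the ensemble.
lasso = Lasso(alpha=5.0, max_iter=10000).fit(B, y)
kept = np.flatnonzero(lasso.coef_)

# The selected feature set is the union of features split on by the surviving trees.
used = set()
for i in kept:
    tree = forest.estimators_[i].tree_
    used |= set(tree.feature[tree.feature >= 0].tolist())
print("trees kept:", len(kept), "features used:", sorted(used))
```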
Key limitations include increased nonconvexity (yielding only local minima), sensitivity to hyperparameter choice, need for careful initialization, and, in some cases, the necessity of user-specified cardinality hyperparameters (e.g., $k$ in top-$k$ regularization) (Wu et al., 2021, Sardy et al., 26 Nov 2024). Scalability can be limited by cubic complexity in matrix eigendecompositions for very high-dimensional data, though iterative and stochastic variants show promise for large-scale deployment (Li et al., 2020, Sun et al., 2020).
7. Practical Recommendations and Outlook
- Method Selection: For interpretability and support recovery, direct cardinality ($\ell_0$) constraints or nonconvex group penalties (MCP, SCAD, $\ell_p$) are preferred when computationally feasible. LASSO and elastic net remain baseline choices for large-scale screening or high-noise regimes (Bertsimas et al., 2019, Sun et al., 2020).
- Hyperparameter Tuning: Grid search over regularization weights, pathwise continuation, and validation-based selection are commonly recommended; the quantile universal threshold (QUT) is emerging as a principled, data-driven alternative (Sardy et al., 26 Nov 2024). A pathwise-tuning sketch follows this list.
- Initialization and Implementation: Warm starts, SVD-based initialization, and backward (dense-to-sparse) continuation yield stability in optimization. Efficient proximal, block, and coordinate algorithms are crucial for practical runtimes (Luo et al., 2023, Xu et al., 2022).
- Extensions: Future directions include automatic or Bayesian hyperparameter selection, scalable stochastic solvers, extension to structured/overlapping groups, interaction and higher-order feature selection, and backbone integration into automated machine learning pipelines.
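As a small example of pathwise tuning, scikit-learn's `lasso_path` computes the full coefficient path over a decreasing penalty grid with internal warm starts, from which a validation-selected model can be read off. The data, split, and grid size below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))
beta = np.zeros(50); beta[:5] = rng.uniform(1.0, 2.0, size=5)
y = X @ beta + 0.5 * rng.standard_normal(300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Full regularization path, computed over a decreasing grid of penalties with warm starts.
alphas, coefs, _ = lasso_path(X_tr, y_tr, n_alphas=50)          # coefs has shape (n_features, n_alphas)

# Choose the penalty with the best validation error and report its support size.
val_err = ((X_val @ coefs - y_val[:, None]) ** 2).mean(axis=0)
best = int(np.argmin(val_err))
print("selected alpha:", float(alphas[best]), "nonzeros:", int(np.count_nonzero(coefs[:, best])))
```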
Sparse regularization and feature selection will continue to be fundamental tools in high-dimensional modeling, providing rigorously grounded, interpretable, and computationally tractable strategies across supervised, unsupervised, and deep learning contexts (Luo et al., 2023, Xiu et al., 22 Dec 2024, Xu et al., 2022, Sun et al., 2020, Wu et al., 2021).