- The paper introduces new coordinate descent and local combinatorial optimization algorithms to efficiently solve L0-regularized best subset selection problems.
- Empirical results show the accompanying L0Learn toolkit is up to three times faster than glmnet and ncvreg, while the proposed L0-based estimators outperform competing methods in prediction, estimation, and variable selection.
- These methods make L0-based sparse learning practical for large-scale, high-dimensional datasets, providing computational and statistical advantages for interpretable models.
An Overview of Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms
The paper by Hazimeh and Mazumder presents innovative computational strategies for solving the L0-regularized least squares problem, commonly known as the best subset selection problem. This problem is pivotal in sparse statistical learning and has implications across statistics, machine learning, and optimization. Traditionally, addressing it with mixed integer optimization (MIO) incurs substantial computational costs, particularly as problem sizes grow. Hazimeh and Mazumder aim to advance computation in L0-regularized settings by combining the L0 penalty with additional convex (L1 or L2) penalties.
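For reference, the combined objective they study can be written (in standard notation, paraphrasing the paper's formulation) as

$$
\min_{\beta \in \mathbb{R}^p} \; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 \;+\; \lambda_0 \lVert \beta \rVert_0 \;+\; \lambda_1 \lVert \beta \rVert_1 \;+\; \lambda_2 \lVert \beta \rVert_2^2,
$$

where $\lVert \beta \rVert_0$ counts the nonzero coefficients and, in the variants studied, typically only one of $\lambda_1$ or $\lambda_2$ is nonzero at a time.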
The authors introduce new algorithms leveraging coordinate descent and local combinatorial optimization to efficiently tackle L0-based sparse learning problems. Their work proposes a hierarchy of necessary optimality conditions, corresponding to a hierarchy of local minima, to guide algorithmic development. Notably, the experiments identify regimes, especially low signal-to-noise settings, in which pure L0 estimators benefit from added convex regularization.
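To make the coordinate descent component concrete, here is a minimal Python sketch of cyclic coordinate descent for the L0 + L2 (ridge) variant, assuming the columns of X are scaled to unit L2 norm; the function name `cd_l0l2` and its details are illustrative, not the L0Learn implementation.

```python
import numpy as np

def cd_l0l2(X, y, lam0, lam2, beta=None, max_iter=100, tol=1e-8):
    """Cyclic coordinate descent for
    0.5*||y - X b||^2 + lam0*||b||_0 + lam2*||b||_2^2,
    assuming columns of X have unit L2 norm.
    Illustrative sketch, not the L0Learn implementation."""
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    r = y - X @ beta                      # current residual
    # a coordinate stays nonzero only if it pays for the L0 price lam0
    thresh = np.sqrt(2 * lam0 / (1 + 2 * lam2))
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            rho = X[:, j] @ r + beta[j]   # correlation with coordinate j removed
            b_new = rho / (1 + 2 * lam2)  # ridge-shrunk least-squares update
            if abs(b_new) <= thresh:      # hard-threshold: drop the coordinate
                b_new = 0.0
            if b_new != beta[j]:
                r += X[:, j] * (beta[j] - b_new)   # keep residual current in O(n)
                max_change = max(max_change, abs(b_new - beta[j]))
                beta[j] = b_new
        if max_change < tol:
            break
    return beta
```

Keeping the residual up to date makes each coordinate update O(n), which is what allows full sweeps over very large numbers of features.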
Numerical Results and Key Findings
- Empirical Performance: Extensive empirical assessments show that the proposed L0-based estimators often outperform existing state-of-the-art sparse learning techniques, demonstrating superiority across statistical metrics such as prediction, estimation, and variable selection, under diverse signal strength and feature correlation scenarios.
- Toolkit Development: The introduction of the open-source package L0Learn represents a significant contribution, offering up to three-fold speedups over comparable toolkits such as glmnet and ncvreg, particularly when the number of features (p) approaches 10^6.
- Algorithmic Insights: The paper elucidates the efficiency of coordinate descent and local combinatorial optimization, establishing their convergence properties and the value of a hierarchy of local minima for guiding optimization. Coordinate descent remains the workhorse, while the local combinatorial search is key to escaping weaker local minima and reaching higher-quality solutions (see the sketch after this list).
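To illustrate the combinatorial component, the following rough sketch layers a single-swap local search on top of the coordinate descent routine above (it reuses `cd_l0l2` and the unit-norm assumption); the helper names and the restriction to one swap at a time are simplifications for exposition, not the paper's exact procedure.

```python
import numpy as np

def objective(X, y, beta, lam0, lam2):
    """0.5*||y - X b||^2 + lam0*||b||_0 + lam2*||b||_2^2."""
    r = y - X @ beta
    return 0.5 * r @ r + lam0 * np.count_nonzero(beta) + lam2 * beta @ beta

def swap_search(X, y, beta, lam0, lam2, max_rounds=10):
    """Try zeroing one nonzero coordinate and activating one zero coordinate
    in its place; accept the first swap that lowers the objective, then
    polish with coordinate descent. Illustrative single-swap sketch."""
    best = objective(X, y, beta, lam0, lam2)
    for _ in range(max_rounds):
        improved = False
        support = np.flatnonzero(beta)
        outside = np.flatnonzero(beta == 0)
        for i in support:
            for j in outside:
                cand = beta.copy()
                cand[i] = 0.0
                r = y - X @ cand
                cand[j] = (X[:, j] @ r) / (1 + 2 * lam2)  # ridge-shrunk entry value
                cand = cd_l0l2(X, y, lam0, lam2, beta=cand)  # re-optimize locally
                val = objective(X, y, cand, lam0, lam2)
                if val < best - 1e-12:
                    beta, best, improved = cand, val, True
                    break
            if improved:
                break
        if not improved:   # no improving swap exists: a stronger local minimum
            break
    return beta
```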
Theoretical and Practical Implications
The research highlights the computational and statistical advantages of L0-based sparse learning, particularly spotlighting the nuanced benefits of integrating L1 or L2 regularization for stability in lower signal settings. The devised hierarchy of local minima enhances understanding of solution quality beyond traditional stationary points, suggesting pathways for achieving near-optimal solutions without prohibitive computation.
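Roughly speaking (a paraphrase of the paper's definitions rather than its exact statements), the solution classes nest as

$$
\{\text{swap-inescapable minima of order } k\} \;\subseteq\; \{\text{coordinate-wise minima}\} \;\subseteq\; \{\text{stationary solutions}\},
$$

where a coordinate-wise minimum cannot be improved by changing any single coefficient, and a swap-inescapable minimum of order $k$ additionally cannot be improved by exchanging up to $k$ coordinates between the support and its complement and re-optimizing. Higher levels of the hierarchy correspond to stronger necessary optimality conditions and, in the reported experiments, typically higher-quality solutions.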
In practice, as demand grows for interpretable models in high-dimensional spaces, the proposed methodologies offer compelling alternatives to sparse learning strategies that rely on L1 regularization alone. Moreover, their scalability addresses practical constraints in large-scale data environments, making sparse models more feasible to deploy in real-world settings that require high interpretability.
Speculations on Future AI Developments
Looking ahead, the integration of these methods into broader AI systems could yield advancements in fields requiring structured model selection under sparsity, particularly as datasets continue to grow in dimension and complexity. The interplay of optimization techniques such as MIO with sophisticated local search strategies may further push the frontiers of what is computationally tractable in machine learning, bridging gaps between model efficiency and interpretability at scale.
The paper by Hazimeh and Mazumder provides a substantial step forward in both the theory and the practice of best subset selection, offering a robust framework adaptable to various problem settings within sparse statistical learning.