High-dimensional nonconvex lasso-type $M$-estimators

Published 12 Apr 2022 in math.ST | (2204.05792v2)

Abstract: This paper proposes a theory for $\ell_1$-norm penalized high-dimensional $M$-estimators, with nonconvex risk and unrestricted domain. Under high-level conditions, the estimators are shown to attain the rate of convergence $s_0\sqrt{\log(nd)/n}$, where $s_0$ is the number of nonzero coefficients of the parameter of interest. Sufficient conditions for our main assumptions are then developed and finally used in several examples including robust linear regression, binary classification and nonlinear least squares.

Summary

  • The paper establishes that ℓ1-penalized M-estimators achieve a fast non-asymptotic rate of s₀√(log(nd)/n) in high dimensions, even with nonconvex loss functions.
  • It demonstrates that local strong convexity and uniform convergence over data-dependent ℓ1-balls suffice, bypassing the need for global convexity or compact domain constraints.
  • Applications in robust regression, binary classification, and nonlinear least squares illustrate the method’s broad utility and potential for sparse recovery.

High-dimensional Nonconvex Lasso-type $M$-estimators: Summary and Analysis

Problem Statement and Motivation

This paper provides a comprehensive theoretical analysis of high-dimensional $M$-estimators with $\ell_1$ regularization, where the empirical risk function $\widehat{R}(\theta)$ may be nonconvex and the parameter space $\Theta$ is unrestricted. The regime of interest is when the number of parameters $d$ grows with, and may substantially exceed, the sample size $n$: the canonical high-dimensional statistics scenario. Classical lasso-type results assume convexity and often impose constraints on $\Theta$ (e.g., restricting to $\ell_1$ or $\ell_2$ balls), which may not hold in modern robust or nonlinear supervised learning settings. This work addresses the fundamental question of whether $\ell_1$-regularized estimators can still enjoy fast rates of convergence for general nonconvex $M$-estimation problems without such convexity or domain restrictions.

Main Theoretical Contributions

The primary result establishes that penalized $M$-estimators of the form

$$\widehat{\theta} \in \operatorname*{arg\,min}_{\theta\in\Theta}\left\{\widehat{R}(\theta) + \lambda_n\|\theta\|_1\right\}$$

attain the non-asymptotic estimation rate $s_0\sqrt{\log(nd)/n}$ in $\ell_1$ norm under sparsity, where $s_0$ is the number of nonzero entries of the true parameter $\theta_0$. This rate matches the minimax optimal rate for high-dimensional parameters under standard conditions in convex settings.
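
As a concrete illustration, here is a minimal proximal-gradient sketch for this objective. It is our own sketch, not the paper's algorithm: in the nonconvex case a first-order method only reaches a stationary point, whereas the theory concerns the global minimizer. The function names and default step size are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: coordinatewise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_m_estimator(grad_risk, theta_init, lam, step=0.01, n_iter=2000):
    """Proximal gradient descent on R_hat(theta) + lam * ||theta||_1.

    grad_risk: callable returning the gradient of the (possibly nonconvex)
    empirical risk R_hat at theta. Returns a stationary point only; the
    rate s_0 * sqrt(log(nd)/n) is proved for the global minimizer.
    """
    theta = theta_init.copy()
    for _ in range(n_iter):
        theta = soft_threshold(theta - step * grad_risk(theta), step * lam)
    return theta
```

In line with the theory, the penalty level would be taken on the scale $\lambda_n \asymp \sqrt{\log(nd)/n}$.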

Critically, the analysis weakens the usual global convexity or restricted strong convexity requirements. Instead, local strong convexity of the population risk $R(\cdot)$ in an $\ell_2$-ball around $\theta_0$ suffices, and the empirical risk $\widehat{R}$ only needs to converge uniformly over a growing $\ell_1$-ball. The penalty ensures that, with high probability, the estimator $\widehat{\theta}$ lies within a data-driven $\ell_1$-ball whose radius grows only slowly with $n$ and $d$.

No explicit restrictions are needed on the domain $\Theta$. This generality is significant since the previous high-dimensional lasso literature, particularly for nonconvex empirical risk, typically requires optimization to be constrained to compact sets to guarantee uniform convergence and manage local minima.

High-Level Assumptions and Sufficient Conditions

The following high-level assumptions underpin the main results:

  • Identification: $R(\theta)$ achieves a unique minimum at $\theta_0$ and possesses a form of uniform local identifiability in a neighborhood of $\theta_0$.
  • Uniform Convergence: The empirical risk $\widehat{R}(\theta)$ uniformly approximates $R(\theta)$ over the relevant region (the $\ell_1$-ball).
  • Local Strong Convexity: $R(\cdot)$ is locally strongly convex near $\theta_0$; no global convexity is required (a numerical illustration follows this list).
  • Deviation Control: The increments $\widehat{\Delta}(\theta) - \widehat{\Delta}(\theta_0)$, where $\widehat{\Delta}$ denotes the centered empirical process $\widehat{R} - R$, scale in $\ell_1$ norm as $O\big(\sqrt{\log(nd)/n}\big)$ with high probability.
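
The local strong convexity condition can be checked numerically in specific models. The sketch below is a Monte Carlo illustration for the population logistic risk, under an assumed standard Gaussian design and a hypothetical $s_0$-sparse, unit-norm $\theta_0$; it illustrates the flavor of the assumption rather than verifying the paper's exact condition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s0, n_mc = 30, 5, 100_000

# hypothetical s0-sparse truth with unit ell_2 norm
theta0 = np.zeros(d)
theta0[:s0] = 1.0 / np.sqrt(s0)

# Monte Carlo estimate of the population Hessian of the logistic risk at
# theta0: H = E[ sigma'(x' theta0) x x' ], here with standard Gaussian x.
X = rng.standard_normal((n_mc, d))
s = 1.0 / (1.0 + np.exp(-(X @ theta0)))   # sigma(x' theta0)
w = s * (1.0 - s)                         # sigma'(x' theta0) > 0
H = (X * w[:, None]).T @ X / n_mc

# a strictly positive minimum eigenvalue indicates strong convexity at theta0
print("lambda_min(H):", np.linalg.eigvalsh(H).min())
```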

The authors systematically analyze when these high-level conditions are guaranteed in practice, providing explicit sufficient conditions on the risk and empirical process structure. Notably, the crucial deviation conditions are derived via contraction arguments and symmetrization, using empirical process theory adapted for dependent, high-dimensional parameter sets.
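
For intuition, the following display sketches the standard symmetrization-plus-contraction route for losses of the form $\ell(\theta; Z) = \varphi(X^\top\theta, Y)$ with $\varphi$ $L$-Lipschitz in its first argument; the paper's precise argument differs in its details. For i.i.d. Rademacher signs $\varepsilon_i$ and a bounded or sub-Gaussian design,

$$\mathbb{E}\sup_{\|\theta\|_1\le r}\big|\widehat{R}(\theta)-R(\theta)\big| \;\le\; 2\,\mathbb{E}\sup_{\|\theta\|_1\le r}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\,\ell(\theta;Z_i)\Big| \;\lesssim\; L\,r\,\mathbb{E}\max_{1\le j\le d}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_{ij}\Big| \;\lesssim\; L\,r\,\sqrt{\frac{\log d}{n}},$$

so the uniform deviation over an $\ell_1$-ball of radius $r$ grows linearly in $r$ but only logarithmically in the dimension.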

Applications

The findings are instantiated in several fundamental contexts:

  • Robust Regression: For robust estimators with a bounded, differentiable loss (e.g., Tukey's bisquare), when the design $X$ is sub-Gaussian and bounded, the estimator achieves $\ell_1$ estimation error $O_P(s_0\sqrt{\log(nd)/n})$ without restricting $\Theta$. Compared to prior work, the absence of domain constraints is highlighted (a simulation sketch follows this list).
  • Binary Classification: For models with a general link function $\sigma$ (e.g., logistic), as long as $\sigma$ is strictly increasing and bounded, similar rates are obtained. The key requirement is a bounded $\ell_2$ norm of $\theta_0$.
  • Nonlinear Least Squares: When the nonlinear function $f$ is differentiable, bounded, and strictly increasing, and the regression error is Gaussian, a matching rate is proven.
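
To make the robust-regression setting concrete, here is a hedged simulation sketch (our construction, not an experiment from the paper), reusing `soft_threshold` and `lasso_m_estimator` from the earlier sketch: a pilot squared-loss lasso warm start, then proximal gradient on the Tukey bisquare risk, with $\lambda_n$ on the theoretical $\sqrt{\log(nd)/n}$ scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s0, c = 200, 1000, 5, 4.685          # c: standard Tukey tuning constant

theta_true = np.zeros(d)
theta_true[:s0] = 1.0
X = rng.standard_normal((n, d))
y = X @ theta_true + rng.standard_t(df=2, size=n)   # heavy-tailed errors

def psi(r):
    # Tukey bisquare score: rho'(r) = r * (1 - (r/c)^2)^2 for |r| <= c, else 0
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)

lam = 0.5 * np.sqrt(np.log(n * d) / n)     # lambda_n on the theoretical scale
step = 0.5 / np.linalg.eigvalsh(X.T @ X / n).max()  # conservative step size

grad_ls = lambda th: -X.T @ (y - X @ th) / n        # squared-loss gradient
grad_tukey = lambda th: -X.T @ psi(y - X @ th) / n  # Tukey risk gradient

pilot = lasso_m_estimator(grad_ls, np.zeros(d), lam, step=step, n_iter=200)
theta_hat = lasso_m_estimator(grad_tukey, pilot, lam, step=step, n_iter=2000)

print("ell_1 error:", np.abs(theta_hat - theta_true).sum())
print("rate scale s0*sqrt(log(nd)/n):", s0 * np.sqrt(np.log(n * d) / n))
```

The warm start reflects the caveat above: proximal gradient on the nonconvex Tukey risk only finds a stationary point, so in practice one initializes near a reasonable pilot estimate.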

Each application section carefully verifies that all technical conditions are met in these cases and compares the assumptions to those in prior results, such as [loh2017statistical] and [mei2018landscape], demonstrating broader applicability regarding parameter domains at the cost of focusing on global minima only.

Implications and Future Directions

The results have strong practical and theoretical implications:

  • General Validity of Lasso in Nonconvex Settings: The work demonstrates that $\ell_1$-penalized $M$-estimation in high dimensions retains its statistical optimality under minimal convexity conditions, greatly expanding the class of admissible loss functions and thus broadening applicability to robust and nonlinear problems.
  • No Need for Artificial Domain Constraints: One can analyze penalized estimators on the entire ambient space, as opposed to previous analyses requiring constraint sets, provided the empirical process is well behaved in a data-dependent neighborhood.
  • Decoupling Optimization and Statistical Accuracy: While the estimator may not always be computable (due to possible nonconvex local minima), the statistical argument shows that if the global minimizer is obtained, optimal rates are achieved. Recent literature addresses optimization-statistics tradeoffs for local minima; this paper chooses to focus on global minima.
  • Pathway to Variable Selection: The study leaves open the variable selection problem (support recovery), which typically requires incoherence conditions for $\ell_1$ penalization. The analysis points to a possible benefit from nonconvex regularizers such as SCAD or MCP, as suggested by [loh2017support] (a sketch of these penalties follows this list).
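
For reference, the two nonconvex penalties mentioned above are sketched below, using their standard formulas and usual default tuning constants from the literature (this is illustrative code, not code from the paper):

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001), applied elementwise to |t|."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 lam**2 * (a + 1) / 2))

def mcp_penalty(t, lam, gamma=3.0):
    """Minimax concave penalty (MCP) of Zhang (2010), elementwise on |t|."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    gamma * lam**2 / 2)
```

Both penalties coincide with $\lambda|t|$ near the origin but flatten out for large $|t|$, which reduces the bias of the lasso on large coefficients and is what enables stronger support-recovery guarantees.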

Potential extensions include: investigating oracle inequalities for prediction error (not just estimation error), relaxing the uniform identifiability assumptions, extending the analysis to semiparametric $M$-estimators, and bridging to results for local minima in nonconvex landscapes.

Conclusion

This work rigorously establishes that $\ell_1$-regularized $M$-estimators, computed as global minimizers of possibly nonconvex empirical losses, exhibit optimal $\ell_1$ error rates in high dimensions, independent of constraints on the parameter space and under only local curvature assumptions. The approach broadens theoretical guarantees for sparse recovery well beyond the convex setting, and the results inform the design and analysis of high-dimensional estimators in robust and nonlinear modeling scenarios. Further research on optimization/statistical tradeoffs and variable selection in this setting is a promising direction.


Reference:

"High-dimensional nonconvex lasso-type MM-estimators" (2204.05792)
