Smooth Probabilistic Reformulation of ℓ0 Regression

Updated 20 September 2025
  • The paper introduces a smooth probabilistic reformulation of ℓ0 regression that leverages exact analytical gradients to eliminate high-variance estimators.
  • It employs Bernoulli masking to replace combinatorial subset selection with a differentiable surrogate, enabling efficient high-dimensional and nonlinear regression.
  • Empirical results show that the method achieves faster convergence and improved robustness compared to ℓ1 relaxations and Monte Carlo sampling approaches.

A smooth probabilistic reformulation of $\ell_0$-regularized regression refers to a set of methodologies which replace the nonconvex and combinatorial best subset selection problem with optimization procedures that combine smooth (differentiable) objectives and probabilistic surrogates for the $\ell_0$ “norm.” These approaches yield exact or analytic gradients, eliminate the need for high-variance Monte Carlo estimators, and enable efficient application to high-dimensional and nonlinear regression, as well as compressive sensing and neural network compression. The resulting methods exhibit favorable convergence, scalability, and superior statistical performance relative to traditional approaches using hard thresholding, greedy search, or $\ell_1$-based convex relaxations.

1. Classical $\ell_0$-Regularized Regression and Its Computational Barriers

The canonical $\ell_0$-regularized regression problem is

$$\min_{\theta \in \mathbb{R}^p} \left\{ \|y - F\theta\|_2^2 + \lambda \|\theta\|_0 \right\}$$

where $F \in \mathbb{R}^{n \times p}$ is the design matrix, $y \in \mathbb{R}^n$ is the response vector, and the $\ell_0$ “norm” $\|\theta\|_0$ counts the number of nonzero components of $\theta$. This best-subset selection problem is NP-hard and becomes combinatorially intractable for $p \gtrsim 40$. In the linear case, alternatives via mixed-integer optimization and relaxations exist, but these scale poorly for high-dimensional or nonlinear models.
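
To make the combinatorial cost concrete, the following minimal sketch (not taken from the cited papers; problem sizes, the noise level, and $\lambda$ are illustrative) solves the best-subset problem by exhaustive enumeration for a tiny $p$. The same loop over $2^p$ supports is already infeasible around $p \approx 40$.

```python
# Brute-force best-subset selection for a tiny problem (illustrative only).
# For each candidate support S, fit least squares on the selected columns and
# score it with the l0-penalized objective ||y - F theta||^2 + lambda * |S|.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 8, 0.5
F = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:2] = [3.0, -2.0]                      # a 2-sparse ground truth
y = F @ theta_true + 0.1 * rng.standard_normal(n)

best_obj, best_support = np.inf, ()
for k in range(p + 1):
    for S in itertools.combinations(range(p), k):
        theta = np.zeros(p)
        if S:
            idx = list(S)
            coef, *_ = np.linalg.lstsq(F[:, idx], y, rcond=None)
            theta[idx] = coef
        obj = np.sum((y - F @ theta) ** 2) + lam * k
        if obj < best_obj:
            best_obj, best_support = obj, S

# 2^8 = 256 subsets here; at p = 40 there are ~10^12, which is intractable.
print(best_support, best_obj)
```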

Moving beyond direct mixed-integer formulations, a wide body of research has sought smooth relaxations or surrogates. Early formulations replaced $\ell_0$ by $\ell_1$, leading to the lasso and its variants, but such proxies induce shrinkage bias and lack precise cardinality control, often yielding suboptimal support recovery in regimes with correlated designs or limited samples.

2. Probabilistic Reformulation: Bernoulli Masking and Expectation Surrogates

A key advance is to “lift” the optimization problem from a search over discrete support sets to a smooth, probabilistic space parameterized by Bernoulli random variables. Let $\theta = w \circ z$, where $w \in \mathbb{R}^p$ collects the magnitudes and $z \in \{0,1\}^p$ is a binary mask. Instead of optimizing over all $2^p$ choices of $z$, one defines a Bernoulli distribution $\pi_\gamma$ with $z_i \sim \operatorname{Bern}(\gamma_i)$, $\gamma_i \in [0,1]$.

The expected objective becomes

$$\min_{w,\,\gamma \in [0,1]^p} \; \mathbb{E}_{z\sim\pi_\gamma} \left[ \|y - F(w \circ z)\|_2^2 + \lambda \|z\|_0 \right]$$

For the quadratic loss, this expectation can be computed exactly:

$$\mathbb{E}_{z \sim \pi_\gamma} \|y - F(w \circ z)\|_2^2 = \|y - F(w \circ \gamma)\|_2^2 + \sum_{i,j} F_{ij}^2 w_j^2 \gamma_j(1-\gamma_j), \qquad \mathbb{E}_{z \sim \pi_\gamma} \|z\|_0 = \sum_j \gamma_j,$$

yielding a new smooth, piecewise-differentiable objective in $(w,\gamma)$:

$$\min_{w,\,\gamma \in [0,1]^p} \; \|y - F(w \circ \gamma)\|_2^2 + \sum_{i,j} F_{ij}^2 w_j^2 \gamma_j(1-\gamma_j) + \lambda \sum_j \gamma_j.$$

This smooth surrogate precisely captures the combinatorial expectation of the original objective, without introducing high-variance gradient estimators or Monte Carlo sampling as required in earlier stochastic relaxations (Barth et al., 18 Sep 2025, Barth et al., 23 May 2025).
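
The identity above is easy to verify numerically. The following minimal sketch (sizes and the regularization weight are illustrative, not taken from the papers) compares the closed-form expectation with a Monte Carlo average over Bernoulli masks.

```python
# Compare the analytic expectation of the masked objective with a Monte Carlo
# estimate over z ~ Bern(gamma); the two should agree up to sampling error.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 10, 0.1
F = rng.standard_normal((n, p))
y = rng.standard_normal(n)
w = rng.standard_normal(p)
gamma = rng.uniform(0.05, 0.95, size=p)

col_sq = (F ** 2).sum(axis=0)                     # column sums of F_ij^2
exact = (np.sum((y - F @ (w * gamma)) ** 2)
         + np.sum(col_sq * w ** 2 * gamma * (1 - gamma))
         + lam * gamma.sum())

Z = rng.random((100_000, p)) < gamma              # Bernoulli mask samples
resid = y - (w * Z) @ F.T                         # shape (samples, n)
mc = np.mean(np.sum(resid ** 2, axis=1) + lam * Z.sum(axis=1))

print(exact, mc)                                  # close up to Monte Carlo error
```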

3. Algorithmic Implementation and Exact Gradient Computation

By constructing a smooth objective in the product space $(w,\gamma)$, standard gradient-based optimizers (Adam, SGD, L-BFGS, etc.) can be employed efficiently. The closed-form expressions for the gradients with respect to both $w$ and $\gamma$ are derived directly from the analytic expectation; for example,

$$\frac{\partial}{\partial \gamma_k} \Big[ \|y - F(w \circ \gamma)\|_2^2 + \sum_{i,j} F_{ij}^2 w_j^2 \gamma_j(1-\gamma_j) + \lambda \sum_j \gamma_j \Big]$$

can be computed without sampling. This enables large-batch updates and deterministic optimizers to exploit hardware acceleration.
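
As a sanity check on these deterministic gradients, the sketch below (illustrative problem sizes; the `objective` and `gradients` helpers are hypothetical names, not an API from the papers) writes out the closed-form partial derivatives with respect to $w$ and $\gamma$ and compares one of them against a central finite difference. An optimizer such as Adam or L-BFGS can then be driven directly by these exact gradients.

```python
# Closed-form gradients of the smooth surrogate, verified by finite differences.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 40, 15, 0.1
F = rng.standard_normal((n, p))
y = rng.standard_normal(n)
w = rng.standard_normal(p)
gamma = rng.uniform(0.1, 0.9, size=p)
col_sq = (F ** 2).sum(axis=0)                     # s_j = sum_i F_ij^2

def objective(w, gamma):
    r = y - F @ (w * gamma)
    return r @ r + np.sum(col_sq * w ** 2 * gamma * (1 - gamma)) + lam * gamma.sum()

def gradients(w, gamma):
    r = y - F @ (w * gamma)
    Ftr = F.T @ r
    grad_w = -2 * gamma * Ftr + 2 * col_sq * w * gamma * (1 - gamma)
    grad_g = -2 * w * Ftr + col_sq * w ** 2 * (1 - 2 * gamma) + lam
    return grad_w, grad_g

eps = 1e-6                                        # central difference in gamma_0
gp, gm = gamma.copy(), gamma.copy()
gp[0] += eps
gm[0] -= eps
fd = (objective(w, gp) - objective(w, gm)) / (2 * eps)
print(fd, gradients(w, gamma)[1][0])              # the two values agree
```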

Implementation of this approach has been demonstrated in high-dimensional settings (e.g., $p \sim 10^4$), for linear regression, compressive sensing, and nonlinear neural network models such as convolutional networks and transformers (Barth et al., 18 Sep 2025, Barth et al., 23 May 2025). The resulting optimization is guaranteed to be smooth in $\gamma$ away from the boundary, eliminating the discontinuities and stochasticity inherent to MC-based surrogates such as REINFORCE, DisARM, or the BitFlip estimator.

4. Comparison with Alternative Strategies

| Approach | Differentiability | Sampling Required | Convergence Speed | Bias/Variance |
|---|---|---|---|---|
| $\ell_1$ relaxation | Yes | No | Fast | Severe shrinkage |
| Hard thresholding | Non-smooth | No | Slow (greedy) | Biased, unstable |
| MC expectation | Smooth (avg) | Yes | Slow/unstable | High variance |
| EGP/PMMP (analytic expectation) [Editor's term] | Yes | No | Fast | Minimal bias |

The smooth probabilistic reformulation strictly improves convergence rate and accuracy over iterative hard thresholding (IHT), lasso, and MC-based mask selection methods. Empirical studies show order-of-magnitude improvements in wall-clock convergence and accuracy, especially in high signal-to-noise and moderately underdetermined regimes (Barth et al., 18 Sep 2025, Barth et al., 23 May 2025).

5. Nonlinear and Neural Network Compression: Generalized Compressive Sensing

Extending the analytic-expectation principle to nonlinear models, such as neural networks, one replaces the loss $\|y - F(w \circ \gamma)\|_2^2$ with a general differentiable loss $\mathcal{L}(f_{w \circ \gamma}, x)$, where $f_{w \circ \gamma}$ is a nonlinear map (e.g., an MLP). In the linear-Gaussian case, the expectation over random maskings remains tractable and analytic; for generic nonlinearities, such as ReLU or tanh networks, explicit closed-form expressions may not always be available, but the same architectural relaxation applies and empirical evidence indicates substantially improved robustness and convergence in network pruning (Barth et al., 23 May 2025).
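
As an illustration of the nonlinear case, the sketch below (a hypothetical toy setup, not the papers' implementation) prunes a small two-layer ReLU regressor by optimizing weights jointly with per-weight mask probabilities. Since the expectation has no closed form here, the loss is simply evaluated at the mean mask $\gamma = \sigma(\phi)$, which is one possible relaxation and is assumed for this sketch.

```python
# Toy sketch: joint optimization of weights and Bernoulli mask probabilities
# for a 2-layer ReLU network; the loss is evaluated at the mean mask
# gamma = sigmoid(phi) (an assumption here, not necessarily the papers' scheme).
import torch

torch.manual_seed(0)
X = torch.randn(256, 20)
y = torch.randn(256, 1)

# Weights and per-weight mask logits (sizes are illustrative).
W1 = (0.1 * torch.randn(20, 64)).requires_grad_()
b1 = torch.zeros(64, requires_grad=True)
W2 = (0.1 * torch.randn(64, 1)).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)
phi1 = torch.full((20, 64), 2.0, requires_grad=True)   # sigmoid(2) ~ 0.88: start dense
phi2 = torch.full((64, 1), 2.0, requires_grad=True)

lam = 1e-3
opt = torch.optim.Adam([W1, b1, W2, b2, phi1, phi2], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    g1, g2 = torch.sigmoid(phi1), torch.sigmoid(phi2)   # gamma in (0, 1)
    h = torch.relu(X @ (W1 * g1) + b1)                  # first layer with masked weights
    pred = h @ (W2 * g2) + b2                           # second layer with masked weights
    loss = ((pred - y) ** 2).mean() + lam * (g1.sum() + g2.sum())  # fit + expected l0
    loss.backward()
    opt.step()

kept = (torch.sigmoid(phi1) > 0.5).float().mean().item()
print(f"fraction of first-layer weights kept: {kept:.2f}")
```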

Moreover, the related minimax formulations with Lagrangian or penalty-based enforcement (e.g., introducing constraints $\theta = w \circ \gamma$) allow the framework to absorb further model or data structure, enabling applications to minimum description length (MDL) learning, synthetic teacher-student setups, and structured dataset compression.

6. Theoretical Guarantees and Fundamental Distinctions: Linear vs. Nonlinear

For linear compressive sensing, the analytic smooth reformulation preserves the ability to recover the true support of $\theta$ under favorable design matrix conditions (e.g., restricted isometry), provided the global minimum is attained. For the nonlinear case, global optima in the infinite-data limit correspond to recovery up to model symmetries (neuron permutation, sign flips) by results from Fefferman and Markel (Barth et al., 18 Sep 2025). However, empirical studies reveal a notable “$\ell_2$ rebound effect”: as the model fit improves, parameter proximity between teacher and student networks can rebound, reflecting the fundamentally many-to-one mapping from parameters to functions in neural architectures. This indicates that function-level recovery and parameter recovery can diverge in nonlinear compressive sensing, even with strong $\ell_0$ regularization and smooth probabilistic surrogates.

7. Applications and Implications for Large-Scale and Structured Regression

The smooth probabilistic reformulation of $\ell_0$-regularized regression has enabled advances in:

  • Scalable sparse regression and signal recovery in high dimensions, both linear and nonlinear.
  • Sample-efficient neural network and dataset compression without reliance on surrogate convex losses or heuristics (Barth et al., 23 May 2025).
  • Improved robustness in low-sample or strongly correlated regimes versus $\ell_1$-based approaches.
  • The possibility of integrating minimum description length (MDL) principles and Solomonoff induction analogs, wherein the $\ell_0$ penalty enforces compression directly analogous to model code length minimization (Barth et al., 23 May 2025).

A plausible implication is that this family of methods can serve as a universal drop-in replacement for hard and soft thresholding, as well as MC-sampling-based stochastic surrogates, whenever a closed-form or efficiently computed expectation exists for the regularized objective. Moreover, the analytic gradient structure makes these methods particularly suitable for modern auto-diff frameworks and GPU acceleration.


The smooth probabilistic reformulation of $\ell_0$-regularized regression thus marks a significant conceptual and practical advance: by introducing deterministic, differentiable surrogates for subset selection and sparsity, it bridges the gap between the expressiveness of $\ell_0$ regularization and the computational tractability required for large-scale, complex architectures and data sources (Barth et al., 18 Sep 2025, Barth et al., 23 May 2025). The approach’s empirical superiority in both convergence and estimation error, together with its extensibility to nonlinear regressors and neural networks, secures its place as a preferred methodology for modern sparse modeling, compressive sensing, and scalable model compression.
