Smooth Optimization for Sparse Regularization
- Smooth optimization for sparse regularization approximates non-smooth ℓ₁/ℓ₀ penalties with differentiable surrogates to enable gradient-based methods with provable convergence.
- Methodologies such as iteratively reweighted least squares, majorization-minimization, and bilevel programming preserve global and local optimality structures while reducing computational cost.
- These approaches deliver scalable algorithms for high-dimensional regression, feature selection, and inverse problems with robust recovery and approximation guarantees.
Smooth optimization for sparse regularization refers to the body of methods that approximate, reformulate, or decouple non-smooth sparsity-inducing penalties—such as the ℓ₁ or ℓ₀ norms—using smooth objective functions, analyzable surrogates, or alternative parametrizations. This enables the use of efficient, off-the-shelf optimization methods with strong convergence guarantees and computational scalability in high-dimensional or structured optimization problems. Research in this area establishes the theoretical equivalence of certain smooth formulations to their non-smooth counterparts, generalizes recovery guarantees, and delivers new numerical procedures for sparse estimation, structured signal recovery, inverse problems, and beyond.
1. Smooth Optimization Principles in Sparse Regularization
Sparse optimization problems typically enforce sparsity through non-smooth terms (e.g., ℓ₁, ℓ₀ norms, mixed norms) in objectives such as

  min_x F(x) + λ‖x‖₁,

or, for exact cardinality,

  min_x F(x) subject to ‖x‖₀ ≤ k,

where F is a (possibly non-quadratic) loss. The resulting non-differentiability hinders the direct use of gradient-based methods and complicates both theory and computation.
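To make the contrast concrete, the toy sketch below (an illustrative example assumed for this article, not taken from any cited paper) replaces the ℓ₁ term in an ℓ₁-regularized least-squares objective with the smooth surrogate Σᵢ √(xᵢ² + ε²), whose gradient is defined everywhere, so plain gradient descent applies; the surrogate tends to the ℓ₁ norm as ε → 0.

```python
import numpy as np

# Toy l1-regularized least squares: 0.5*||A x - b||^2 + lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
lam, eps = 0.1, 1e-3

def smooth_l1(x):
    """Smooth surrogate sum_i sqrt(x_i^2 + eps^2); differentiable everywhere."""
    return np.sum(np.sqrt(x**2 + eps**2))

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * smooth_l1(x)

def gradient(x):
    # Unlike the l1 subdifferential, this gradient is defined at x_i = 0.
    return A.T @ (A @ x - b) + lam * x / np.sqrt(x**2 + eps**2)

# Plain gradient descent with a step based on a Lipschitz bound of the gradient.
step = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam / eps)
x = np.zeros(50)
for _ in range(2000):
    x -= step * gradient(x)

print("smoothed objective :", round(objective(x), 4))
print("exact l1 objective :", round(0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)), 4))
```

A fixed step derived from a Lipschitz bound keeps the iteration a plain descent method; smaller ε tightens the approximation of the ℓ₁ norm at the price of a larger Lipschitz constant and hence a smaller admissible step.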
Smooth optimization techniques address these obstacles by:
- Approximating non-smooth penalties with smooth surrogate functions, allowing for gradient or quasi-Newton methods with provable rates (Voronin et al., 2015, Benfenati et al., 2023, Akaishi et al., 24 Sep 2025).
- Reformulating sparse regularization as a smooth constrained or unconstrained problem through overparametrization, relaxation, or bilevel programming, preserving or improving global/local optimality structures (Kolb et al., 2023, Poon et al., 2021, Kanzow et al., 2022).
- Decoupling composite objectives (e.g., sparse plus smooth) to reduce computational complexity and leverage dedicated solvers for each term while maintaining theoretical equivalence to the original coupled problem (Jarret et al., 8 Mar 2024, Jarret et al., 27 Oct 2025).
These principles enable scalable algorithms for high-dimensional regression, structured learning, inverse problems, and have driven theoretical advances in understanding the empirical and worst-case performance of sparse regularization.
2. Theoretical Foundations and Guarantees
Smooth optimization approaches for sparse regularization rest on several key mathematical and statistical foundations:
- Fenchel Duality and Restricted Strong Convexity/Smoothness: Theoretical analysis of ℓ₁ and Group-ℓ₁ regularization in general convex settings demonstrates that, for strictly convex differentiable losses satisfying restricted strong convexity (RSC) and restricted smoothness (RSM), LASSO-type regularization achieves exact feature or group selection, paralleling greedy methods such as Orthogonal Matching Pursuit (OMP). The group entering at each point of the regularization path is the one on which the loss gradient has the largest group-wise norm (Axiotis et al., 2023).
- Equivalence under Optimization Transfer: Overparametrization via Hadamard or group-Hadamard products, with smooth surrogate (e.g., quadratic) penalties, guarantees that global and local minima of the surrogate problem match those of the original non-smooth objective. Losses and sublevel geometry are preserved, avoiding spurious solutions and enabling full compatibility with gradient-based optimizers (Kolb et al., 2023).
- Smooth Bilevel Programming and Variational Expressions: Many sparsity-inducing penalties can be written as the minimum over auxiliary smooth parameters (e.g., in quadratic variational form for block-separable regularizers), leading to bilevel smooth programs for which closed-form gradients and Hessians are computable. Notably, all second-order critical points correspond to global optima—no spurious minima arise and saddles are easily escapable (Poon et al., 2021).
- Complementarity-based Smoothing of ℓ₀: Reformulations introducing auxiliary variables and smooth complementarity constraints yield smooth nonlinear programs that are equivalent to the original ℓ₀ problems in the sense of local/global minima, with strong stationarity and second-order condition theory extending the Karush-Kuhn-Tucker framework to these settings (Kanzow et al., 2022).
These features collectively ensure that smooth optimization methods not only approximate but often exactly solve sparse regularization instances, sometimes outperforming classic non-smooth algorithms in both sparsity and objective values.
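A quick numerical check of these equivalences (a toy computation, not code from the cited papers) is given below: it verifies the coordinate-wise quadratic variational form of the ℓ₁ norm that underlies the bilevel formulation, and the matching Hadamard-product identity behind the overparametrized formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = rng.standard_normal(6)

# Quadratic variational ("eta-trick") form of the l1 norm:
#   ||beta||_1 = min_{eta > 0} 0.5 * sum_i (beta_i**2 / eta_i + eta_i),
# attained at eta_i = |beta_i| (coordinate-wise AM-GM).
def quad_variational(beta, eta):
    return 0.5 * np.sum(beta**2 / eta + eta)

eta_star = np.abs(beta)
print(np.sum(np.abs(beta)))                   # l1 norm
print(quad_variational(beta, eta_star))       # equals the l1 norm
print(quad_variational(beta, eta_star + 0.3)) # any other eta gives a larger value

# Hadamard-product counterpart: with beta = u * v,
#   ||beta||_1 = min_{u * v = beta} 0.5 * (||u||**2 + ||v||**2),
# attained at |u_i| = |v_i| = sqrt(|beta_i|).
u = np.sign(beta) * np.sqrt(np.abs(beta))
v = np.sqrt(np.abs(beta))
assert np.allclose(u * v, beta)
print(0.5 * (np.sum(u**2) + np.sum(v**2)))    # again equals the l1 norm
```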
3. Methodologies and Algorithmic Frameworks
Smooth Surrogate and Smoothing-based Methods
- Iteratively Reweighted Least Squares (IRLS): Approximates the non-smooth sparsity penalty with a differentiable majorizer governed by a smoothing parameter that is decreased on a schedule. Each step reduces to a weighted ℓ₂ minimization, with smoothing diminishing as the iterate approaches stationarity; monotonic objective reduction and convergence to optimality are established (Voronin et al., 2015). A minimal sketch follows this list.
- Ultra-discretization-based Smooth Penalties (ULPENS): Constructs smooth, non-convex, non-separable sparsity-inducing penalties interpolating between ℓ₁ and "min-abs" selectors via log-sum-exp smoothing. The resulting penalty is smooth everywhere, with closed-form gradients and provable Lipschitz constants, and achieves stronger support recovery than other non-convex or ℓ₁ relaxations (Akaishi et al., 24 Sep 2025).
- Majorization-Minimization (MM): Employs data-driven, smooth tangent majorizers for both the loss (e.g., squared hinge) and the regularizer (e.g., hyperbolic for ℓ₁, Welsh for ℓ₀), enabling fast updates by constant-step or (preconditioned) Newton iterations. Automatic sparsity emerges from curvature-majorized diagonal entries in the update (Benfenati et al., 2023).
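The sketch below gives a minimal IRLS loop in the spirit of the first bullet; the geometric smoothing schedule, the fixed iteration count, and the toy problem are simplifying assumptions for illustration, not the exact algorithm of Voronin et al. (2015).

```python
import numpy as np

def irls_l1(A, b, lam=0.1, n_iter=50, eps0=1.0, eps_min=1e-8):
    """Minimal IRLS sketch for 0.5*||A x - b||^2 + lam*||x||_1.

    Each |x_i| is majorized by 0.5*(x_i**2 / w_i + w_i) with weights
    w_i = sqrt(x_i**2 + eps**2) from the previous iterate, so every step
    is a weighted ridge (l2) solve; eps shrinks on a geometric schedule.
    """
    n = A.shape[1]
    x, eps = np.zeros(n), eps0
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(n_iter):
        w = np.sqrt(x**2 + eps**2)                       # smoothed weights
        x = np.linalg.solve(AtA + lam * np.diag(1.0 / w), Atb)
        eps = max(eps * 0.7, eps_min)                    # diminish the smoothing
    return x

# Toy sparse recovery problem.
rng = np.random.default_rng(2)
A = rng.standard_normal((80, 200))
x_true = np.zeros(200); x_true[:5] = [1.2, -0.8, 2.0, 1.5, -1.0]
b = A @ x_true + 0.01 * rng.standard_normal(80)

x_hat = irls_l1(A, b, lam=0.1)
print("largest recovered |x| entries:", np.round(np.sort(np.abs(x_hat))[-5:], 3))
```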
Overparametrization and Bilevel Approaches
- Hadamard Product and Group-Hadamard Parametrizations: Replace non-smooth (possibly non-convex) sparse penalties with smooth quadratic surrogates in expanded parameter spaces, preserving global/local optima and enabling standard stochastic gradient descent (SGD), Adam, or quasi-Newton optimization in differentiable frameworks. Post-processing or long runs yield (near-)exact sparsity (Kolb et al., 2023). A minimal gradient-descent sketch follows this list.
- Smooth Bilevel Programming: Encodes the sparsity penalty variationally, transforming the regularized regression into a smooth bilevel problem where the inner minimization is analytically tractable and the upper-level function is differentiable, uniquely characterized, and free of spurious minima or plateaus (Poon et al., 2021).
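Below is a minimal full-batch gradient-descent sketch of the Hadamard-product parametrization β = u ⊙ v with the smooth penalty (λ/2)(‖u‖² + ‖v‖²); the toy data, random initialization, step size, and iteration count are assumptions for illustration rather than the setup of Kolb et al. (2023).

```python
import numpy as np

# Toy regression problem.
rng = np.random.default_rng(3)
A = rng.standard_normal((50, 100))
beta_true = np.zeros(100); beta_true[:3] = [2.0, -1.5, 1.0]
b = A @ beta_true + 0.01 * rng.standard_normal(50)

# Hadamard parametrization beta = u * v with smooth quadratic penalty:
#   min_{u,v} 0.5*||A(u*v) - b||^2 + 0.5*lam*(||u||^2 + ||v||^2),
# whose minimal penalty over factorizations of a fixed beta is lam*||beta||_1.
lam, step = 0.1, 2e-4
u = 0.1 * rng.standard_normal(100)
v = 0.1 * rng.standard_normal(100)

for _ in range(20000):
    grad_beta = A.T @ (A @ (u * v) - b)      # data-fit gradient w.r.t. beta
    grad_u = grad_beta * v + lam * u         # chain rule through beta = u * v
    grad_v = grad_beta * u + lam * v
    u -= step * grad_u
    v -= step * grad_v

beta = u * v
print("largest |beta| entries:", np.round(np.sort(np.abs(beta))[-3:], 3))
print("median  |beta| entry  :", np.round(np.median(np.abs(beta)), 5))
# Exact zeros are obtained by longer runs or a final thresholding step.
```

In an autodiff framework the same objective can be handed to SGD or Adam unchanged, with the quadratic penalty implemented as ordinary weight decay; exact zeros then come from long runs or post-processing, as noted above.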
Decoupling of Composite Regularization
- Sparse-plus-Smooth Models (Decoupling): For problems involving a sparse (Banach-space) and a smooth (Hilbert-space) component, decoupling the objective via a representer theorem allows reduction to a sparse-only problem, with the smooth component recovered in closed form. This provides both theoretical guarantees (composite representer theorems) and numerical acceleration, especially in large-scale or structured settings (Jarret et al., 27 Oct 2025, Jarret et al., 8 Mar 2024).
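The following finite-dimensional sketch illustrates the decoupling pattern; the concrete model y = Ax + Bz with an ℓ₁ penalty on x and a quadratic penalty on z, and the ISTA solver for the reduced problem, are assumptions chosen for illustration, whereas the cited works operate in Banach/Hilbert settings via representer theorems.

```python
import numpy as np

# Toy finite-dimensional sparse-plus-smooth model (assumed for illustration):
#   min_{x,z}  0.5*||y - A x - B z||^2 + lam*||x||_1 + 0.5*mu*||z||^2.
rng = np.random.default_rng(4)
m, n_sparse, n_smooth = 60, 120, 10
A = rng.standard_normal((m, n_sparse))
B = rng.standard_normal((m, n_smooth))
x_true = np.zeros(n_sparse); x_true[:3] = [1.5, -1.0, 0.8]
z_true = 0.3 * rng.standard_normal(n_smooth)
y = A @ x_true + B @ z_true + 0.01 * rng.standard_normal(m)
lam, mu = 0.1, 1.0

# Decoupling step: minimize over z in closed form. With r = y - A x,
#   min_z 0.5*||r - B z||^2 + 0.5*mu*||z||^2 = 0.5 * r^T S r,  S = (I + B B^T / mu)^{-1},
# so the reduced problem in x alone is a weighted LASSO.
S = np.linalg.inv(np.eye(m) + B @ B.T / mu)

# Sparse-only solve by ISTA (proximal gradient) on the reduced problem.
L = np.linalg.norm(A.T @ S @ A, 2)                 # Lipschitz constant of the smooth part
x = np.zeros(n_sparse)
for _ in range(1000):
    grad = -A.T @ (S @ (y - A @ x))
    x = x - grad / L
    x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)   # soft thresholding

# Smooth component recovered in closed form from the sparse solution.
z = np.linalg.solve(B.T @ B + mu * np.eye(n_smooth), B.T @ (y - A @ x))

print("nonzeros in sparse part:", int(np.sum(x != 0)))
print("largest |x| entries    :", np.round(np.sort(np.abs(x))[-3:], 3))
print("smooth component error :", np.round(np.linalg.norm(z - z_true), 3))
```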
Table: Example Families of Smooth Sparse-regularizing Methods
| Approach | Surrogate/Parametrization | Optimization Domain |
|---|---|---|
| IRLS | Quadratic smoothing | Original variables |
| ULPENS | Log-sum-exp smoothing, non-sep. | Original variables |
| Hadamard Overparam. | Quadratic in expanded params | Overparam. variables |
| MM (SVM, regression) | Tangent majorant (local quad.) | Original variables |
| Bilevel Programming | Inner quadratic/outer smooth | Auxiliary + primal |
| Composite Decoupling | Weighted least-squares + prox | Reduced component |
4. Recovery Guarantees and Approximation Bounds
- Feature and Support Selection: For strictly convex losses with RSC/RSM, the path-following procedure for Group-LASSO selects the same groups/features as Group OMP, and both admit theoretical recovery and approximation guarantees, including bicriteria and worst-case rates as a function of the condition number (Axiotis et al., 2023).
- Approximation Error for Regularized OT: In applications such as optimal transport, smooth strongly convex regularizers (e.g., squared ℓ₂, group-LASSO) yield approximately sparse solutions with provable error bounds relative to the unregularized solution. The error decay and sparsity are parameterized explicitly by the smoothness parameter, with the squared ℓ₂ penalty often outperforming entropic regularization in terms of approximation error for fixed computational effort (Blondel et al., 2017). A toy illustration of this sparsity mechanism appears after this list.
- Composite Representer Theorems: For Banach-Hilbert composite sparse-plus-smooth regularization, the extreme points of the minimizer remain finitely supported (≤ the number of measurements), extending classical representer theorems to mixed-norm, infinite-dimensional, or measure-valued problems (Jarret et al., 27 Oct 2025, Bredies et al., 2023).
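To make the sparsity mechanism behind the optimal-transport bullet concrete, the toy sketch below (an illustration of the column-wise building block only, not the solver of Blondel et al., 2017) contrasts the softmax map induced by entropic regularization, which is always dense, with the Euclidean projection onto the simplex induced by a squared-ℓ₂ regularizer, which returns exact zeros.

```python
import numpy as np

def softmax(v):
    """Entropic smoothing: the resulting distribution is always fully dense."""
    e = np.exp(v - v.max())
    return e / e.sum()

def project_simplex(v):
    """Euclidean projection onto the probability simplex (squared-l2 smoothing);
    the result typically has exact zeros."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

# Toy "cost column": negative costs scaled by a regularization strength gamma.
cost = np.array([0.1, 0.4, 0.45, 0.9, 1.3])
gamma = 0.5
scores = -cost / gamma

print("entropic (softmax)   :", np.round(softmax(scores), 3))          # all entries > 0
print("squared-l2 (project) :", np.round(project_simplex(scores), 3))  # exact zeros appear
```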
5. Applications and Empirical Performance
- High-dimensional Regression and Feature Selection: Smooth surrogate methods, Hadamard overparametrization, and bilevel techniques demonstrate competitive or superior estimation error, test RMSE, and support recovery relative to standard LASSO or non-convex coordinate-pruning approaches. Deeper overparametrizations enhance support recovery and reduce false discoveries (Kolb et al., 2023, Poon et al., 2021).
- Inverse Problems and Signal Recovery: Sparse-plus-smooth decomposition, utilizing composite decoupling or infinite infimal convolution, enables effective recovery of both impulsive and smooth components, with enhanced runtime efficiency and interpretable sparsity patterns. For superresolution of Dirac impulses over smooth backgrounds, the decoupled algorithm achieves minimal ℓ₂ errors and significant acceleration over coupled approaches (Jarret et al., 27 Oct 2025, Jarret et al., 8 Mar 2024, Bredies et al., 2023).
- SVMs and Structured Learning: Application of smooth MM and surrogate penalties to sparse SVMs yields both explicit feature selection (many coefficients zero) and strong classification performance, with hybrid or subspace-accelerated variants offering significant reductions in iteration count and computational cost (Benfenati et al., 2023).
- Optimal Transport: Replacing entropy with smooth, strongly convex penalties leads to transport plans that are sparse or group-sparse, reduce approximation error, and yield scalable solvers with comparable or superior visual and statistical performance in tasks such as color transfer (Blondel et al., 2017).
6. Connections, Limitations, and Scope
Smooth optimization for sparse regularization unifies and generalizes several methodologies:
- It encompasses and explains the empirical success of standard LASSO and greedy selection methods under restricted convexity/smoothness, providing a unified submodularity-based performance bound (Axiotis et al., 2023).
- By leveraging decoupling, overparametrization, or smoothing, these frameworks are rendered compatible with modern deep learning pipelines and large-scale numerical computing (Kolb et al., 2023).
- Limitations include dependence on parameter tuning for smoothing, surrogate selection, or overparametrization depth. Attainment of strict sparsity may require post-processing or precise thresholding in smooth domains. The analysis of spurious local minima or slow convergence in certain non-convex cases remains nuanced (Kolb et al., 2023, Akaishi et al., 24 Sep 2025).
- Extensions encompass group and block sparsity, structured penalties, composite regularization, and “off-the-grid” infinite-dimensional settings. Future directions cited in recent research involve principled initialization, interactions with neural-network layers, and expansion to tensorized or variational-Bayes structures (Jarret et al., 27 Oct 2025, Kolb et al., 2023).
Smooth optimization for sparse regularization thus forms a rigorous and versatile toolkit, blending optimization theory, numerical analysis, and applied statistics to enable efficient, scalable, and theoretically-guaranteed sparse modeling in modern data science.