Hadamard-Based Weight Smoothing in Sparse Learning
- Hadamard-based weight smoothing is a framework that transforms non-smooth, sparsity-regularized problems into equivalent smooth ones using overparametrization with element-wise products.
- It introduces surrogate smooth penalties through auxiliary variables, allowing standard gradient-based optimizers to efficiently navigate sparse learning tasks.
- Empirical results demonstrate significant sparsity and model compression in high-dimensional regression and neural network training with minimal performance trade-offs.
Hadamard-based weight smoothing is a general framework for converting non-smooth, sparsity-regularized optimization problems into equivalent smooth problems via overparametrization with Hadamard (element-wise) products or powers. By introducing smooth surrogate variables and reparametrizations, this approach enables the use of standard gradient-based optimization algorithms for sparse learning tasks, while preserving the underlying minima structure of non-smooth objectives (Kolb et al., 2023).
1. Base Formulation and Motivation
Sparse regularization, as found in penalties such as the $\ell_1$ norm or group norms, is fundamental in high-dimensional regression and model compression. These penalties give rise to optimization problems of the form

$$\min_{\beta \in \mathbb{R}^p} \; L(\beta) + \lambda R(\beta),$$

where $L$ is a smooth loss (typically an empirical risk) and $R$ is a non-smooth, sparsity-promoting regularizer (e.g., $\lVert \beta \rVert_1$, or $\sum_g \lVert \beta_g \rVert_2$ for grouped weights). The non-differentiability and often non-convexity of $R$ (especially for $\ell_q$ penalties with $q < 1$) impede the application of standard SGD, leading to oscillatory dynamics near zero, poor sparsity, and slow convergence.
2. Hadamard Overparametrization: Construction and Variants
To address non-smooth regularization, auxiliary variables $\omega \in \mathbb{R}^P$ are introduced together with a smooth surjection $\pi: \mathbb{R}^P \to \mathbb{R}^p$ such that $\beta = \pi(\omega)$. The simplest case is the depth-2 Hadamard product parametrization (HPP):

$$\beta = u \odot v, \qquad \omega = (u, v) \in \mathbb{R}^{2p}.$$
More generally, the approach encompasses various structured parametrizations:
| Parametrization | Formula | Induced Regularization |
|---|---|---|
| HPP | $\beta = u \odot v$ | $\ell_1$: $\lambda \lVert \beta \rVert_1$ |
| GHPP (group) | $\beta_g = \omega_g v_g$ | group-$\ell_2$: $\lambda \sum_g \lVert \beta_g \rVert_2$ |
| GHPP (mixed group) | $\beta_g = \omega_g (u_g \odot v_g)$ | mixed within- and between-group sparsity |
| HPowP | $\beta = u \odot \lvert u \rvert^{\odot (d-1)}$ | $\ell_q$ quasi-norm with $q = 2/d$ |
Here, $\pi$ is always smooth and surjective, implying every $\beta \in \mathbb{R}^p$ has infinitely many preimages $\omega$ such that $\pi(\omega) = \beta$.
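As a concrete illustration, the following sketch shows the depth-2 HPP map and its non-injectivity; the variable names and the rescaling trick are illustrative, not taken from the paper.

```python
import numpy as np

def hpp(u, v):
    """Depth-2 Hadamard product parametrization: beta = u (element-wise *) v."""
    return u * v

beta = np.array([3.0, -0.5, 0.0])

# One convenient preimage: the AM-GM-balanced factorization with
# |u_j| = |v_j| = sqrt(|beta_j|), signs carried by u.
u = np.sign(beta) * np.sqrt(np.abs(beta))
v = np.sqrt(np.abs(beta))
assert np.allclose(hpp(u, v), beta)

# The map is far from injective: rescaling to (c*u, v/c) hits the same beta,
# so every beta has infinitely many preimages, as stated above.
c = 7.0
assert np.allclose(hpp(c * u, v / c), beta)
```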
3. Surrogate Smooth Penalties and Variational Equivalence
A key component is the smooth surrogate penalty $\tilde{R}(\omega)$ in the auxiliary space, usually a weighted $\ell_2$ norm. For HPP, $\tilde{R}(u, v) = \frac{\lambda}{2}\left(\lVert u \rVert_2^2 + \lVert v \rVert_2^2\right)$. By an elementary AM-GM argument,

$$\lvert u_j v_j \rvert \le \tfrac{1}{2}\left(u_j^2 + v_j^2\right), \quad \text{with equality iff } \lvert u_j \rvert = \lvert v_j \rvert,$$

so $\min_{u \odot v = \beta} \tilde{R}(u, v) = \lambda \lVert \beta \rVert_1$, meaning the surrogate penalty achieves the sparsity penalty exactly on the fiber $\pi^{-1}(\beta)$. More generally,

$$R(\pi(\omega)) \le \tilde{R}(\omega),$$
with equality for optimal $\omega \in \pi^{-1}(\beta)$. This construction generates closed-form surrogates for:
- $\ell_1$ via HPP
- group $\ell_2$ via GHPP
- mixed group/elementwise sparsity via deeper group products
- general $\ell_q$ for real $q = 2/d$ via HPowP
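The AM-GM equality can be checked numerically; the sketch below minimizes the HPP surrogate over the fiber $u_j v_j = \beta_j$ by a one-dimensional grid scan (the parametrization of the fiber and the grid are illustrative choices, not from the paper).

```python
import numpy as np

def min_surrogate_on_fiber(beta_j, grid=np.linspace(0.1, 10.0, 100001)):
    """Minimize (u^2 + v^2)/2 subject to u*v = beta_j by scanning u = t > 0."""
    return np.min((grid**2 + (beta_j / grid) ** 2) / 2)

# The constrained minimum of the smooth l2 surrogate equals the l1 penalty
# |beta_j|, attained at the balanced point |u_j| = |v_j| = sqrt(|beta_j|).
for b in [0.25, 1.0, 3.5]:
    assert abs(min_surrogate_on_fiber(b) - abs(b)) < 1e-4
```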
4. Equivalence of Minima and Theoretical Guarantees
The surrogate objective is defined as:

$$Q(\omega) = L(\pi(\omega)) + \tilde{R}(\omega).$$
Under two mild conditions—local openness of $\pi$ at optimal points and upper hemicontinuity of the minimizer map—the sets of local and global minima of $Q$ and of the original objective coincide:
- Any local minimum $\beta^*$ of the original objective corresponds to local minima $\omega^*$ of $Q$ with $\pi(\omega^*) = \beta^*$.
- Conversely, local minima of $Q$ project to local minima of the original objective via $\pi$.
- Global minima of the two problems match, since $\tilde{R}$ majorizes $R \circ \pi$ and achieves equality at constrained minima.
All parametrizations and surrogates are smooth except possibly at coordinate-wise zeros; the $\ell_2$-type surrogate regularizer handles these singularities. No spurious local minima are introduced.
5. Gradient-Based Optimization Algorithms
Hadamard-based weight smoothing enables the direct use of SGD or Adam for sparse regularized objectives. For $\ell_1$ regularization via HPP, gradient steps proceed as follows:
- Let $\beta = u \odot v$.
- Compute the loss gradient $g = \nabla_\beta L(\beta)$.
- Update:

$$u \leftarrow u - \eta \,(g \odot v + \lambda u), \qquad v \leftarrow v - \eta \,(g \odot u + \lambda v).$$
Deeper or group variants use the appropriate Jacobian factors and distribute the $\ell_2$ penalty terms across all factors. Initialization can use small random values or AM-GM-matched factors with $\lvert u_j \rvert = \lvert v_j \rvert = \sqrt{\lvert \beta_j \rvert}$. This approach requires only substituting the parameters with auxiliary variables and adding $\ell_2$ regularization, without the need for proximal operators or custom solvers.
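The update rule above can be sketched end to end for $\ell_1$-regularized least squares solved by plain full-batch gradient descent on the HPP surrogate; the problem sizes, step size, penalty strength, and iteration count below are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch: minimize ||X(u*v) - y||^2 / (2n) + (lam/2)(||u||^2 + ||v||^2)
# with plain gradient descent, instead of the non-smooth l1 objective.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # sparse ground truth
y = X @ beta_true + 0.01 * rng.standard_normal(n)

lam, lr = 0.1, 0.05
u = 0.1 * rng.standard_normal(p)          # small random initialization
v = 0.1 * rng.standard_normal(p)

for _ in range(5000):
    beta = u * v
    g = X.T @ (X @ beta - y) / n          # gradient of the smooth loss w.r.t. beta
    # Simultaneous HPP updates: u <- u - lr*(g*v + lam*u), v <- v - lr*(g*u + lam*v)
    u, v = u - lr * (g * v + lam * u), v - lr * (g * u + lam * v)

beta = u * v
# The smooth dynamics drive inactive coordinates to numerically exact zero,
# while active coordinates land near the (shrunken) Lasso solution.
assert np.sum(np.abs(beta) < 1e-6) >= p - 3
assert np.allclose(beta[:3], beta_true[:3], atol=0.2)
```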
6. Empirical Evaluation and Use Cases
Several empirical studies demonstrate the practicality of Hadamard-based weight smoothing:
- High-Dimensional Regression: On synthetic problems, HPP-SGD closely matches Lasso ($\ell_1$) regularization paths, produces exact sparsity, and converges reliably. Direct SGD on the non-smooth $\ell_1$ objective is inferior, failing to produce exact zeros due to oscillation around them.
- Sparse Neural Network Training: Fully connected LeNet-300-100 models trained with HPP on MNIST, using $\ell_2$ regularization on the factors $(u, v)$, yield weights with pronounced sparsity. After one-shot pruning, up to 99% parameter reduction is achievable with minimal loss increase. Deeper factorizations (depth-$d$ HPP with $d > 2$) further enhance sparsity.
- Filter-Sparse CNNs: Applying group Hadamard powers to convolution filter groups achieves up to 90% filter removal after training from scratch, with negligible accuracy degradation.
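The group variant underlying such filter-sparse training rests on the same AM-GM argument; a minimal numerical check, assuming the scalar-times-vector group factorization $\beta_g = \omega_g v_g$ described above (the fiber parametrization and grid are illustrative):

```python
import numpy as np

def min_group_surrogate(beta_g, grid=np.linspace(0.1, 10.0, 100001)):
    """Minimize (w^2 + ||v_g||^2)/2 subject to w * v_g = beta_g by scanning w = t > 0,
    so that v_g = beta_g / t on the fiber."""
    return np.min((grid**2 + np.sum(beta_g**2) / grid**2) / 2)

# The constrained minimum of the smooth surrogate equals the group-l2
# penalty ||beta_g||_2, which is what induces whole-group (filter) removal.
beta_g = np.array([3.0, 4.0])             # ||beta_g||_2 = 5
assert abs(min_group_surrogate(beta_g) - 5.0) < 1e-3
```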
7. Comparison with Existing Methodologies
Hadamard-based weight smoothing unifies various sparsity-inducing formulations:
- Universality: Accommodates $\ell_1$, $\ell_q$ ($0 < q < 1$), group $\ell_2$, and more within a single SGD-compatible framework.
- Theoretical Guarantees: Ensures equivalence of all minima between original and surrogate problems, with no spurious solutions.
- Practical Performance: Matches specialized solvers (e.g., glmnet, SGL) in high-dimensional regression; delivers substantial sparsity in neural networks using standard SGD.
- Implementation Simplicity: Replaces $\beta$ with surrogate variables; relies only on standard $\ell_2$ regularization and smooth optimization.
- Computational Overhead: Increases parameter count moderately; shallow models incur minimal extra compute. For deep/large models, parameter sharing and twin initializations can mitigate costs.
Hadamard-based weight smoothing offers a plug-and-play technique for achieving exact sparse regularization via smooth objectives, applicable across a diverse range of models and compatible with modern deep learning toolkits and optimizers (Kolb et al., 2023).