
Hadamard-Based Weight Smoothing in Sparse Learning

Updated 26 March 2026
  • Hadamard-based weight smoothing is a framework that transforms non-smooth, sparsity-regularized problems into equivalent smooth ones using overparametrization with element-wise products.
  • It introduces surrogate smooth penalties through auxiliary variables, allowing standard gradient-based optimizers to efficiently navigate sparse learning tasks.
  • Empirical results demonstrate significant sparsity and model compression in high-dimensional regression and neural network training with minimal performance trade-offs.

Hadamard-based weight smoothing is a general framework for converting non-smooth, sparsity-regularized optimization problems into equivalent smooth problems via overparametrization with Hadamard (element-wise) products or powers. By introducing smooth surrogate variables and reparametrizations, this approach enables the use of standard gradient-based optimization algorithms for sparse learning tasks, while preserving the underlying minima structure of non-smooth objectives (Kolb et al., 2023).

1. Base Formulation and Motivation

Sparse regularization, as found in penalties such as the $\ell_1$ norm or group norms, is fundamental in high-dimensional regression and model compression. These penalties give rise to optimization problems of the form

$$P(\psi, w) = L(\psi, w) + \lambda R(w),$$

where $L$ is a smooth loss (typically empirical risk) and $R$ is a non-smooth, sparsity-promoting regularizer (e.g., $R(w) = \|w\|_1$, or $\sum_j \sqrt{|G_j|}\,\|w_{G_j}\|_2$ for grouped weights). The non-differentiability and often non-convexity of $R$ (especially for $\ell_q$ with $q < 1$) impede the application of standard SGD, leading to oscillatory dynamics near zero, poor sparsity, and slow convergence.
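The oscillation can be seen concretely in a minimal one-dimensional sketch (illustrative values, not from the paper): constant-step subgradient descent on $f(w) = \tfrac{1}{2}w^2 + \lambda|w|$ never settles at the exact minimizer $w = 0$.

```python
import numpy as np

# Constant-step (sub)gradient descent on the 1-D objective
# f(w) = 0.5 * w**2 + lam * |w|, whose unique minimizer is w = 0.
lam, eta = 1.0, 0.1   # illustrative values
w = 0.05
trace = []
for _ in range(100):
    grad = w + lam * np.sign(w)   # subgradient of f at w (sign(0) = 0)
    w -= eta * grad
    trace.append(w)

# The iterate never reaches exactly zero: it keeps flipping sign with a
# magnitude on the order of eta * lam.
print(trace[-2], trace[-1])
```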

2. Hadamard Overparametrization: Construction and Variants

To address non-smooth regularization, auxiliary variables $\xi$ are introduced together with a smooth surjection $K: \xi \mapsto w$. The simplest case is the depth-2 Hadamard product parametrization (HPP):

$$w = u \odot v, \qquad u, v \in \mathbb{R}^d.$$

More generally, the approach encompasses various structured parametrizations:

| Parametrization | Formula | Induced Regularization |
| --- | --- | --- |
| HPP$_k$ | $w = u_1 \odot \cdots \odot u_k$ | $\ell_{2/k}$ |
| GHPP (group) | $w_{G_j} = u_{G_j} \nu_j$ | group $\ell_{2,1}$ |
| GHPP$_{k_1, k_1+k_2}$ (mixed group) | $w_{G_j} = \left(\odot_{t=1}^{k_1}\mu_{jt}\right)\left(\odot_{r=1}^{k_2}\nu_{jr}\right)$ | $\ell_{2/k_1,\,2/(k_1+k_2)}$ |
| HPowP$_k$ | $w = u \odot \lvert v\rvert^{k-1}$ | $\ell_{2/k}$ |

Here, $K$ is always smooth and surjective, implying every $w$ has infinitely many preimages $\xi$ such that $K(\xi) = w$.
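A quick numerical illustration of this non-uniqueness (the values are arbitrary): rescaling one HPP factor by any $c \neq 0$ and the other by $1/c$ leaves $K(u, v) = u \odot v$ unchanged.

```python
import numpy as np

# The depth-2 map K(u, v) = u * v (elementwise) is surjective, and every w
# has infinitely many preimages: (c*u, v/c) maps to the same w for c != 0.
rng = np.random.default_rng(0)
w = rng.standard_normal(5)

u = np.sign(w) * np.sqrt(np.abs(w))   # one "balanced" preimage
v = np.sqrt(np.abs(w))
assert np.allclose(u * v, w)

for c in (0.5, 2.0, -3.0):
    assert np.allclose((c * u) * (v / c), w)  # same w, different xi = (u, v)
```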

3. Surrogate Smooth Penalties and Variational Equivalence

A key component is the smooth surrogate penalty $S(\xi)$ in the auxiliary space, typically a (weighted) squared $\ell_2$ norm. For HPP, $S(u, v) = \sum_j (u_j^2 + v_j^2)$. By an elementary AM-GM argument,

$$\min_{u \odot v = \beta}\, \left(u^2 + v^2\right) = 2\lvert\beta\rvert,$$

meaning the surrogate penalty achieves the sparsity penalty exactly on the fiber $K^{-1}(\beta)$. More generally,
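The AM-GM identity is easy to verify numerically; the following sketch scans factorizations $u = t$, $v = \beta/t$ for an arbitrary scalar $\beta$.

```python
import numpy as np

# Scan factorizations u = t, v = beta / t of a fixed scalar beta and
# evaluate the surrogate u**2 + v**2; its minimum should equal 2*|beta|,
# attained at |u| = |v| = sqrt(|beta|).
beta = -0.7   # arbitrary example value
t = np.linspace(0.05, 5.0, 2001)
surrogate = t**2 + (beta / t) ** 2
print(surrogate.min(), 2 * abs(beta))   # both close to 1.4
```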

$$R(w) = \min_{\xi: K(\xi) = w} S(\xi), \qquad S(\xi) \geq R(K(\xi)),$$

with equality for optimal $\xi$. This construction generates closed-form surrogates for:

  • $\ell_{2/k}$ via HPP$_k$
  • group $\ell_{2,1}$ via GHPP
  • mixed $\ell_{2,2/k}$ via deeper group products
  • general $\ell_{2/k}$ for real $k > 1$ via HPowP

4. Equivalence of Minima and Theoretical Guarantees

The surrogate objective is defined as:

$$Q(\psi, \xi) = L(\psi, K(\xi)) + \lambda S(\xi).$$

Under two mild conditions (local openness of $K$ at optimal $\xi$, and upper hemicontinuity of the minimizer map $w \mapsto \arg\min_{\xi: K(\xi) = w} S(\xi)$), the sets of local and global minima of $Q$ and $P$ coincide:

  • Any local minimum $(\psi^*, w^*)$ of $P$ corresponds to local minima $(\psi^*, \xi^*)$ of $Q$ with $K(\xi^*) = w^*$.
  • Conversely, local minima of $Q$ project to local minima of $P$ via $K$.
  • Global minima of $P$ and $Q$ match, since $S$ majorizes $R$ and achieves equality at constrained minima.

All parametrizations and surrogates are $C^\infty$ except possibly at coordinate-wise zeros; the regularizer handles these singularities. No spurious local minima are introduced.
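A one-dimensional sanity check of this equivalence (a sketch with an assumed quadratic loss; the AM-GM identity makes the induced penalty $2\lambda|w|$, so the comparison problem is soft thresholding at strength $2\lambda$): minimizing $Q$ over a grid of factorizations recovers the same optimal value as minimizing $P$ directly.

```python
import numpy as np

# Q(u, v) = L(u*v) + lam*(u**2 + v**2) versus P(w) = L(w) + 2*lam*|w|,
# here with the assumed quadratic loss L(w) = 0.5*(w - a)**2.
a, lam = 3.0, 0.5

# Minimize P directly: soft thresholding gives w* = sign(a)*max(|a| - 2*lam, 0).
w_star = np.sign(a) * max(abs(a) - 2 * lam, 0.0)
P_star = 0.5 * (w_star - a) ** 2 + 2 * lam * abs(w_star)

# Minimize Q over a grid of (u, v) factorizations.
g = np.linspace(-3, 3, 601)
U, V = np.meshgrid(g, g)
Q = 0.5 * (U * V - a) ** 2 + lam * (U**2 + V**2)
print(Q.min(), P_star)   # the two optimal values agree up to grid error
```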

5. Gradient-Based Optimization Algorithms

Hadamard-based weight smoothing enables the direct use of SGD or Adam for sparsity-regularized objectives. For $\ell_1$ via HPP, gradient steps proceed as follows:

  • Let $w^{(t)} = u^{(t)} \odot v^{(t)}$.
  • Compute $g = \partial L / \partial w$ at $w^{(t)}$.
  • Update:
    • $\nabla_u Q = g \odot v^{(t)} + 2\lambda u^{(t)}$
    • $\nabla_v Q = g \odot u^{(t)} + 2\lambda v^{(t)}$
    • $u^{(t+1)} = u^{(t)} - \eta \nabla_u Q$
    • $v^{(t+1)} = v^{(t)} - \eta \nabla_v Q$

Deeper or group variants use appropriate Jacobian factors and distribute the $\lambda S$ terms accordingly. Initialization can use small random values or AM-GM-matched factors. This approach requires only substituting the $w$ parameters with auxiliary variables under $\ell_2$ regularization, with no need for proximal operators or custom solvers.
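The update rules above can be sketched end to end for $\ell_1$-regularized least squares. This is an illustrative implementation only: the problem sizes and hyperparameters ($n$, $d$, $\lambda$, $\eta$) are assumptions chosen for the demo, not the setup from the cited paper.

```python
import numpy as np

# HPP-SGD for l1-regularized least squares, L(w) = 0.5/n * ||X w - y||^2,
# following the update rules listed in this section.
rng = np.random.default_rng(1)
n, d, lam, eta = 200, 50, 0.1, 0.05   # assumed demo values
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 2.0 * rng.standard_normal(5)   # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)

u = 0.1 * rng.standard_normal(d)   # small random initialization
v = 0.1 * rng.standard_normal(d)
for _ in range(2000):
    w = u * v                       # w^(t) = u^(t) (*) v^(t), elementwise
    g = X.T @ (X @ w - y) / n       # g = dL/dw at w^(t)
    # Both gradients are evaluated at the time-t iterates before updating.
    u, v = u - eta * (g * v + 2 * lam * u), v - eta * (g * u + 2 * lam * v)

w = u * v
print(int(np.sum(np.abs(w) > 1e-6)), "of", d, "coordinates effectively nonzero")
```

Note that no proximal step or thresholding is applied: the $2\lambda$ weight-decay terms on $(u, v)$ alone drive inactive coordinates of $w = u \odot v$ to numerical zero.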

6. Empirical Evaluation and Use Cases

Several empirical studies demonstrate the practicality of Hadamard-based weight smoothing:

  • High-Dimensional Regression: On synthetic problems ($n = 500$, $d = 1000$, $s = 10$), HPP-SGD closely matches Lasso regularization paths, produces exact sparsity, and converges reliably. Direct SGD on the non-smooth $\ell_1$ objective is inferior, failing to produce zeros due to oscillation.
  • Sparse Neural Network Training: Fully connected LeNet-300-100 models trained with HPP$_k$ on MNIST, using $\ell_2$ regularization on $(u, v)$, yield weights $w$ with pronounced $\ell_1$ sparsity. After one-shot pruning, up to 99% parameter reduction is achievable with minimal loss increase. Deeper factorizations (HPP$_k$ for $k > 2$) further enhance sparsity.
  • Filter-Sparse CNNs: Applying group Hadamard powers to convolution filter groups achieves up to 90% filter removal after training from scratch, with negligible accuracy degradation.
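The group construction behind such filter-level results can be sanity-checked numerically (a sketch with an arbitrary example group, not data from the cited experiments): for GHPP, minimizing the surrogate over factorizations $w_{G_j} = u_{G_j}\nu_j$ recovers twice the group-lasso norm, mirroring the scalar AM-GM identity.

```python
import numpy as np

# Group variant (GHPP) check with an arbitrary 3-vector group: over
# factorizations w_G = u_G * nu (scalar nu), the minimum of the surrogate
# ||u_G||^2 + nu^2 is 2 * ||w_G||_2, a smooth stand-in for the
# group-lasso penalty.
w_G = np.array([0.3, -1.2, 0.5])
nu = np.linspace(0.05, 3.0, 5000)
# For each nu, the feasible u_G is w_G / nu.
surrogate = (w_G @ w_G) / nu**2 + nu**2
print(surrogate.min(), 2 * np.linalg.norm(w_G))   # both close
```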

7. Comparison with Existing Methodologies

Hadamard-based weight smoothing unifies various sparsity-inducing formulations:

  • Universality: Accommodates $\ell_1$, $\ell_p$ ($p < 1$), $\ell_{2,1}$, and more within a single SGD-compatible framework.
  • Theoretical Guarantees: Ensures equivalence of all minima between original and surrogate problems, with no spurious solutions.
  • Practical Performance: Matches specialized solvers (e.g., glmnet, SGL) in high-dimensional regression; delivers substantial sparsity in neural networks using standard SGD.
  • Implementation Simplicity: Replaces $w$ with surrogate variables; relies only on standard $\ell_2$ regularization and smooth optimization.
  • Computational Overhead: Increases parameter count moderately; shallow models incur minimal extra compute. For deep/large models, parameter sharing and twin initializations can mitigate costs.

Hadamard-based weight smoothing offers a plug-and-play technique for achieving exact sparse regularization via smooth objectives, applicable across a diverse range of models and compatible with modern deep learning toolkits and optimizers (Kolb et al., 2023).

