
Hadamard-Based Weight Smoothing in Sparse Learning

Updated 26 March 2026
  • Hadamard-based weight smoothing is a framework that transforms non-smooth, sparsity-regularized problems into equivalent smooth ones using overparametrization with element-wise products.
  • It introduces surrogate smooth penalties through auxiliary variables, allowing standard gradient-based optimizers to efficiently navigate sparse learning tasks.
  • Empirical results demonstrate significant sparsity and model compression in high-dimensional regression and neural network training with minimal performance trade-offs.

Hadamard-based weight smoothing is a general framework for converting non-smooth, sparsity-regularized optimization problems into equivalent smooth problems via overparametrization with Hadamard (element-wise) products or powers. By introducing smooth surrogate variables and reparametrizations, this approach enables the use of standard gradient-based optimization algorithms for sparse learning tasks, while preserving the underlying minima structure of non-smooth objectives (Kolb et al., 2023).

1. Base Formulation and Motivation

Sparse regularization, as found in penalties such as the $\ell_1$ norm or group norms, is fundamental in high-dimensional regression and model compression. These penalties give rise to optimization problems of the form

$$P(\psi, w) = L(\psi, w) + \lambda R(w),$$

where $L$ is a smooth loss (typically empirical risk) and $R$ is a non-smooth, sparsity-promoting regularizer (e.g., $R(w) = \|w\|_1$, or $\sum_j \sqrt{|G_j|}\,\|w_{G_j}\|_2$ for grouped weights). The non-differentiability and often non-convexity of $R$ (especially for $\ell_q$ with $q < 1$) impede the application of standard SGD, leading to oscillatory dynamics near zero, poor sparsity, and slow convergence.
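The oscillation can be seen concretely in a minimal one-dimensional sketch (illustrative values, not from the paper): constant-step subgradient descent on $f(w) = \tfrac{1}{2}w^2 + \lambda|w|$ never settles at the exact minimizer $w = 0$.

```python
import numpy as np

# Constant-step (sub)gradient descent on the 1-D objective
# f(w) = 0.5 * w**2 + lam * |w|, whose unique minimizer is w = 0.
lam, eta = 1.0, 0.1   # illustrative values
w = 0.05
trace = []
for _ in range(100):
    grad = w + lam * np.sign(w)   # subgradient of f at w (sign(0) = 0)
    w -= eta * grad
    trace.append(w)

# The iterate never reaches exactly zero: it keeps flipping sign with a
# magnitude on the order of eta * lam.
print(trace[-2], trace[-1])
```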

2. Hadamard Overparametrization: Construction and Variants

To address non-smooth regularization, auxiliary variables $\xi$ are introduced together with a smooth surjection $K: \xi \mapsto w$. The simplest case is the depth-2 Hadamard product parametrization (HPP):

$$w = u \odot v, \qquad u, v \in \mathbb{R}^d.$$

More generally, the approach encompasses various structured parametrizations:

| Parametrization | Formula | Induced Regularization |
| --- | --- | --- |
| HPP$_k$ | $w = u_1 \odot \cdots \odot u_k$ | $\ell_{2/k}$ |
| GHPP (group) | $w_{G_j} = u_{G_j} \nu_j$ | group $\ell_{2,1}$ |
| GHPP$_{k_1, k_1+k_2}$ (mixed group) | $w_{G_j} = \left(\odot_{t=1}^{k_1}\mu_{jt}\right)\left(\odot_{r=1}^{k_2}\nu_{jr}\right)$ | $\ell_{2/k_1,\,2/(k_1+k_2)}$ |
| HPowP$_k$ | $w = u \odot \lvert v\rvert^{k-1}$ | $\ell_{2/k}$ |

Here, $K$ is always smooth and surjective, implying every $w$ has infinitely many preimages $\xi$ such that $K(\xi) = w$.
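A quick numerical illustration of this non-uniqueness (the values are arbitrary): rescaling one HPP factor by any $c \neq 0$ and the other by $1/c$ leaves $K(u, v) = u \odot v$ unchanged.

```python
import numpy as np

# The depth-2 map K(u, v) = u * v (elementwise) is surjective, and every w
# has infinitely many preimages: (c*u, v/c) maps to the same w for c != 0.
rng = np.random.default_rng(0)
w = rng.standard_normal(5)

u = np.sign(w) * np.sqrt(np.abs(w))   # one "balanced" preimage
v = np.sqrt(np.abs(w))
assert np.allclose(u * v, w)

for c in (0.5, 2.0, -3.0):
    assert np.allclose((c * u) * (v / c), w)  # same w, different xi = (u, v)
```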

3. Surrogate Smooth Penalties and Variational Equivalence

A key component is the smooth surrogate penalty $S(\xi)$ in the auxiliary space, typically a (weighted) squared $\ell_2$ norm. For HPP, $S(u, v) = \sum_j (u_j^2 + v_j^2)$. By an elementary AM-GM argument,

$$\min_{u \odot v = \beta}\, \left(u^2 + v^2\right) = 2\lvert\beta\rvert,$$

meaning the surrogate penalty achieves the sparsity penalty exactly on the fiber $K^{-1}(\beta)$. More generally,
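The AM-GM identity is easy to verify numerically; the following sketch scans factorizations $u = t$, $v = \beta/t$ for an arbitrary scalar $\beta$.

```python
import numpy as np

# Scan factorizations u = t, v = beta / t of a fixed scalar beta and
# evaluate the surrogate u**2 + v**2; its minimum should equal 2*|beta|,
# attained at |u| = |v| = sqrt(|beta|).
beta = -0.7   # arbitrary example value
t = np.linspace(0.05, 5.0, 2001)
surrogate = t**2 + (beta / t) ** 2
print(surrogate.min(), 2 * abs(beta))   # both close to 1.4
```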

$$R(w) = \min_{\xi: K(\xi) = w} S(\xi), \qquad S(\xi) \geq R(K(\xi)),$$

with equality for optimal $\xi$. This construction generates closed-form surrogates for:

  • $\ell_{2/k}$ via HPP$_k$
  • group $\ell_{2,1}$ via GHPP
  • mixed $\ell_{2,2/k}$ via deeper group products
  • general $\ell_{2/k}$ for real $k > 1$ via HPowP

4. Equivalence of Minima and Theoretical Guarantees

The surrogate objective is defined as:

$$Q(\psi, \xi) = L(\psi, K(\xi)) + \lambda S(\xi).$$

Under two mild conditions (local openness of $K$ at optimal $\xi$, and upper hemicontinuity of the minimizer map $w \mapsto \arg\min_{\xi: K(\xi) = w} S(\xi)$), the sets of local and global minima of $Q$ and $P$ coincide:

  • Any local minimum $(\psi^*, w^*)$ of $P$ corresponds to local minima $(\psi^*, \xi^*)$ of $Q$ with $K(\xi^*) = w^*$.
  • Conversely, local minima of $Q$ project to local minima of $P$ via $K$.
  • Global minima of $P$ and $Q$ match, since $S$ majorizes $R$ and achieves equality at constrained minima.

All parametrizations and surrogates are $C^\infty$ except possibly at coordinate-wise zeros; the regularizer handles these singularities. No spurious local minima are introduced.
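A one-dimensional sanity check of this equivalence (a sketch with an assumed quadratic loss; the AM-GM identity makes the induced penalty $2\lambda|w|$, so the comparison problem is soft thresholding at strength $2\lambda$): minimizing $Q$ over a grid of factorizations recovers the same optimal value as minimizing $P$ directly.

```python
import numpy as np

# Q(u, v) = L(u*v) + lam*(u**2 + v**2) versus P(w) = L(w) + 2*lam*|w|,
# here with the assumed quadratic loss L(w) = 0.5*(w - a)**2.
a, lam = 3.0, 0.5

# Minimize P directly: soft thresholding gives w* = sign(a)*max(|a| - 2*lam, 0).
w_star = np.sign(a) * max(abs(a) - 2 * lam, 0.0)
P_star = 0.5 * (w_star - a) ** 2 + 2 * lam * abs(w_star)

# Minimize Q over a grid of (u, v) factorizations.
g = np.linspace(-3, 3, 601)
U, V = np.meshgrid(g, g)
Q = 0.5 * (U * V - a) ** 2 + lam * (U**2 + V**2)
print(Q.min(), P_star)   # the two optimal values agree up to grid error
```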

5. Gradient-Based Optimization Algorithms

Hadamard-based weight smoothing enables the direct use of SGD or Adam for sparsity-regularized objectives. For $\ell_1$ via HPP, gradient steps proceed as follows:

  • Let $w^{(t)} = u^{(t)} \odot v^{(t)}$.
  • Compute $g = \partial L / \partial w$ at $w^{(t)}$.
  • Update:
    • $\nabla_u Q = g \odot v^{(t)} + 2\lambda u^{(t)}$
    • $\nabla_v Q = g \odot u^{(t)} + 2\lambda v^{(t)}$
    • $u^{(t+1)} = u^{(t)} - \eta \nabla_u Q$
    • $v^{(t+1)} = v^{(t)} - \eta \nabla_v Q$

Deeper or group variants use appropriate Jacobian factors and distribute the $\lambda S$ terms accordingly. Initialization can use small random values or AM-GM-matched factors. This approach requires only substituting the $w$ parameters with auxiliary variables under $\ell_2$ regularization, with no need for proximal operators or custom solvers.
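The update rules above can be sketched end to end for $\ell_1$-regularized least squares. This is an illustrative implementation only: the problem sizes and hyperparameters ($n$, $d$, $\lambda$, $\eta$) are assumptions chosen for the demo, not the setup from the cited paper.

```python
import numpy as np

# HPP-SGD for l1-regularized least squares, L(w) = 0.5/n * ||X w - y||^2,
# following the update rules listed in this section.
rng = np.random.default_rng(1)
n, d, lam, eta = 200, 50, 0.1, 0.05   # assumed demo values
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 2.0 * rng.standard_normal(5)   # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)

u = 0.1 * rng.standard_normal(d)   # small random initialization
v = 0.1 * rng.standard_normal(d)
for _ in range(2000):
    w = u * v                       # w^(t) = u^(t) (*) v^(t), elementwise
    g = X.T @ (X @ w - y) / n       # g = dL/dw at w^(t)
    # Both gradients are evaluated at the time-t iterates before updating.
    u, v = u - eta * (g * v + 2 * lam * u), v - eta * (g * u + 2 * lam * v)

w = u * v
print(int(np.sum(np.abs(w) > 1e-6)), "of", d, "coordinates effectively nonzero")
```

Note that no proximal step or thresholding is applied: the $2\lambda$ weight-decay terms on $(u, v)$ alone drive inactive coordinates of $w = u \odot v$ to numerical zero.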

6. Empirical Evaluation and Use Cases

Several empirical studies demonstrate the practicality of Hadamard-based weight smoothing:

  • High-Dimensional Regression: On synthetic problems ($n = 500$, $d = 1000$, $s = 10$), HPP-SGD closely matches Lasso regularization paths, produces exact sparsity, and converges reliably. Direct SGD on the non-smooth $\ell_1$ objective is inferior, failing to produce zeros due to oscillation.
  • Sparse Neural Network Training: Fully connected LeNet-300-100 models trained with HPP$_k$ on MNIST, using $\ell_2$ regularization on $(u, v)$, yield weights $w$ with pronounced $\ell_1$ sparsity. After one-shot pruning, up to 99% parameter reduction is achievable with minimal loss increase. Deeper factorizations (HPP$_k$ for $k > 2$) further enhance sparsity.
  • Filter-Sparse CNNs: Applying group Hadamard powers to convolution filter groups achieves up to 90% filter removal after training from scratch, with negligible accuracy degradation.
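The group construction behind such filter-level results can be sanity-checked numerically (a sketch with an arbitrary example group, not data from the cited experiments): for GHPP, minimizing the surrogate over factorizations $w_{G_j} = u_{G_j}\nu_j$ recovers twice the group-lasso norm, mirroring the scalar AM-GM identity.

```python
import numpy as np

# Group variant (GHPP) check with an arbitrary 3-vector group: over
# factorizations w_G = u_G * nu (scalar nu), the minimum of the surrogate
# ||u_G||^2 + nu^2 is 2 * ||w_G||_2, a smooth stand-in for the
# group-lasso penalty.
w_G = np.array([0.3, -1.2, 0.5])
nu = np.linspace(0.05, 3.0, 5000)
# For each nu, the feasible u_G is w_G / nu.
surrogate = (w_G @ w_G) / nu**2 + nu**2
print(surrogate.min(), 2 * np.linalg.norm(w_G))   # both close
```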

7. Comparison with Existing Methodologies

Hadamard-based weight smoothing unifies various sparsity-inducing formulations:

  • Universality: Accommodates $\ell_1$, $\ell_p$ ($p < 1$), $\ell_{2,1}$, and more within a single SGD-compatible framework.
  • Theoretical Guarantees: Ensures equivalence of all minima between original and surrogate problems, with no spurious solutions.
  • Practical Performance: Matches specialized solvers (e.g., glmnet, SGL) in high-dimensional regression; delivers substantial sparsity in neural networks using standard SGD.
  • Implementation Simplicity: Replaces $w$ with surrogate variables; relies only on standard $\ell_2$ regularization and smooth optimization.
  • Computational Overhead: Increases parameter count moderately; shallow models incur minimal extra compute. For deep/large models, parameter sharing and twin initializations can mitigate costs.

Hadamard-based weight smoothing offers a plug-and-play technique for achieving exact sparse regularization via smooth objectives, applicable across a diverse range of models and compatible with modern deep learning toolkits and optimizers (Kolb et al., 2023).

