Sparse Adam Optimization
- Sparse Adam is an optimization algorithm that adapts learning rates on a per-coordinate basis to handle sparse gradients and inactive neural channels.
- Explicit variants like AdaBreg and Group Adam use proximal and group-wise updates to enforce structured sparsity with minimal accuracy loss.
- Empirical findings show that Sparse Adam methods can achieve up to 70–85% sparsity in neural networks while maintaining competitive performance relative to dense models.
Sparse Adam refers to a class of stochastic optimization algorithms, descending from Adam, that are specifically designed or empirically observed to promote, exploit, or operate effectively under sparse neural network parameterizations or sparse gradients. This area includes methods that leverage Adam’s inherent handling of sparse gradients, specialized algorithms that induce explicit weight sparsity (parameter-level, channel-level, or group-level), and proximal-gradient generalizations for structured sparsity in deep learning. The following sections review the mathematical principles, theoretical mechanisms, practical algorithms, and empirical findings in this field.
1. Adam and Sparse Gradients
Adam is a first-order optimization algorithm that combines elementwise adaptive learning rates with bias-corrected moment estimates. For sparse gradients, Adam’s per-coordinate adaptation ensures that updates are automatically rescaled according to each coordinate’s recent gradient history. When a coordinate receives gradients infrequently (i.e., most of its gradient entries are zero), its second-moment accumulator $v_t$ decays, and the next nonzero gradient produces a step of size $\alpha\,\hat m_t/(\sqrt{\hat v_t}+\epsilon)$ that is relatively large, promoting efficient convergence in the presence of extreme sparsity (Kingma & Ba, 2014).
Adam’s memory and computational complexity are $O(d)$ in the parameter dimension $d$, dominated by its $m_t$ and $v_t$ accumulators. For extremely high-dimensional sparse problems, a lazy update strategy that tracks per-coordinate last-modified timestamps permits per-step update costs proportional to the number of nonzero gradient entries.
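To make the lazy-update idea concrete, here is a minimal NumPy sketch; the class and method names (`LazyAdam`, `step`) are illustrative, not taken from any cited implementation. Each call touches only the coordinates with nonzero gradients and first applies the moment decay those coordinates skipped since they were last updated, so the per-step cost scales with the number of nonzero entries.

```python
import numpy as np

class LazyAdam:
    """Sketch of Adam with lazy per-coordinate updates for sparse gradients.

    Moments of untouched coordinates are not decayed on every step; the
    pending decay is applied only when a coordinate next receives a nonzero
    gradient, so each step costs O(nnz) rather than O(d).
    """

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.w = np.zeros(dim)
        self.m = np.zeros(dim)          # first-moment estimates
        self.v = np.zeros(dim)          # second-moment estimates
        self.last = np.zeros(dim, int)  # step at which each coordinate was last touched
        self.t = 0
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps

    def step(self, idx, grad):
        """idx: indices with nonzero gradient; grad: their gradient values."""
        self.t += 1
        skipped = self.t - 1 - self.last[idx]       # steps with zero gradient
        self.m[idx] *= self.b1 ** skipped           # catch up the pending decay
        self.v[idx] *= self.b2 ** skipped
        self.m[idx] = self.b1 * self.m[idx] + (1 - self.b1) * grad
        self.v[idx] = self.b2 * self.v[idx] + (1 - self.b2) * grad ** 2
        m_hat = self.m[idx] / (1 - self.b1 ** self.t)   # bias correction
        v_hat = self.v[idx] / (1 - self.b2 ** self.t)
        self.w[idx] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        self.last[idx] = self.t
```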
2. Implicit Channel Sparsity Induced by Adam
In deep rectifier networks trained with Adam and $\ell_2$ weight decay, an implicit form of group sparsity emerges at the channel (neuron/feature output) level. Specifically, in feed-forward networks of the form
$$x^{(l+1)} = \mathrm{ReLU}\!\left(W^{(l)} x^{(l)} + b^{(l)}\right)$$
with ReLU activations and Adam optimization, it is commonly observed that many output-channel weight vectors (rows of $W^{(l)}$) converge to exactly zero (Yaguchi et al., 2018).
The mechanistic explanation is as follows: when an output channel becomes “silent” (i.e., its pre-activation is negative on nearly all data), the data-driven gradient on that channel vanishes, leaving only the $\ell_2$ weight-decay shrinkage term $\lambda w$. Adam’s adaptive update then rapidly drives these silent channels to machine-precision zeros, at a rate described as doubly exponential (on the order of $\rho^{2^t}$ for some $0<\rho<1$).
This is much faster than the linear (geometric) decay $(1-\eta\lambda)^t$ obtained with (momentum) SGD and weight decay. In practice, a post-training pruning step with a small magnitude threshold eliminates these zeroed channels, yielding highly compact networks.
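As an illustration of this regime, the following PyTorch sketch trains a ReLU MLP with Adam and coupled $\ell_2$ weight decay, then counts hidden units whose incoming weight rows have collapsed. The architecture, hyperparameters, and threshold are illustrative choices, not the settings used by Yaguchi et al.

```python
import torch
import torch.nn as nn

# Minimal sketch: train a ReLU MLP with Adam + coupled L2 weight decay, then
# count output channels (rows of a weight matrix) whose norm has collapsed.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))
# torch.optim.Adam applies weight decay as an L2 term inside the adaptive
# update (not decoupled as in AdamW), which is the regime discussed above.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def train_step(x, y):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def count_dead_channels(layer, threshold=1e-3):
    # A hidden unit is "dead" when the incoming weight row (its output channel)
    # has collapsed to (near) zero; these rows can be removed after training.
    row_norms = layer.weight.norm(dim=1)
    return int((row_norms < threshold).sum())
```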
3. Explicit Sparse Adam-Like Algorithms
Beyond implicit sparsity, explicit sparse-Adam variants incorporate regularization or proximal steps to enforce parameter-level, group-level, or mixed sparsity structures during optimization.
AdaBreg: AdaBreg (Bungert et al., 2021) generalizes Adam via Bregman-proximal (mirror descent) steps tailored to sparsifying regularizers, especially the elastic net. The update involves
- Bregman-mirrored moment estimates,
- A dual variable updated via Adam-type preconditioning,
- Recovery of sparsified parameters via the proximal map of the sparsifying regularizer (for the elastic net, a scaling combined with elementwise soft-thresholding).
AdaBreg naturally supports “inverse scale space” regrowth, wherein parameters are zero-initialized and become active only once their dual-accumulated gradients exceed the regularizer threshold. Empirical studies demonstrate that AdaBreg finds models retaining only a few percent of the total parameters, with only a minor accuracy loss compared to dense Adam on standard computer vision benchmarks.
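The NumPy sketch below mirrors the dual/primal structure described above: an Adam-preconditioned update of a dual variable followed by proximal recovery of the sparse primal parameters. It is a simplified illustration under these assumptions, not the authors' reference implementation; the elastic-net prox is realized here as a scaling plus soft-thresholding, and all names and constants are illustrative.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal map of lam * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

class AdaBregSketch:
    """Hedged sketch of an AdaBreg-style update (not the published algorithm).

    A dual variable `v` accumulates Adam-preconditioned gradient steps; the
    primal parameters are recovered through an elastic-net-style proximal map,
    so a weight becomes nonzero only once its dual entry clears the L1
    threshold ("inverse scale space" regrowth from a zero initialization).
    """

    def __init__(self, dim, lr=1e-3, lam=0.1, delta=1.0,
                 beta1=0.9, beta2=0.999, eps=1e-8):
        self.v = np.zeros(dim)      # dual / mirror variable
        self.m = np.zeros(dim)      # first-moment estimate of the gradient
        self.s = np.zeros(dim)      # second-moment estimate of the gradient
        self.t = 0
        self.lr, self.lam, self.delta = lr, lam, delta
        self.b1, self.b2, self.eps = beta1, beta2, eps

    def step(self, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.s = self.b2 * self.s + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        s_hat = self.s / (1 - self.b2 ** self.t)
        self.v -= self.lr * m_hat / (np.sqrt(s_hat) + self.eps)  # dual update
        # primal recovery via the proximal map: scaling + soft-thresholding
        return self.delta * soft_threshold(self.v, self.lam)
```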
Group Adam: Group Adam (Yue et al., 2021) extends Adam to directly optimize objectives with sparse-group Lasso penalties, inducing both elementwise ($\ell_1$) and groupwise ($\ell_2$ norm over parameter groups) sparsity. Its key algorithmic steps include updating an FTRL-style dual variable, elementwise soft-thresholding, and a closed-form per-group shrinkage (group soft-thresholding, which zeroes out an entire group when its norm falls below the group-penalty threshold).
This approach yields highly sparse models during training without the need for post hoc pruning.
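A hedged sketch of the closed-form sparse-group shrinkage step is shown below, assuming the standard sparse-group-Lasso proximal operator (elementwise soft-thresholding followed by group soft-thresholding). Parameter names and the $\sqrt{|g|}$ group weighting are illustrative, not taken verbatim from Yue et al.

```python
import numpy as np

def sparse_group_shrinkage(z, groups, lam1, lam2):
    """Closed-form prox of a sparse-group-Lasso penalty applied to a vector z.

    groups: list of index arrays partitioning z. lam1 controls elementwise
    (L1) sparsity; lam2 controls whole-group (L2-norm) sparsity.
    """
    # elementwise soft-thresholding (L1 part)
    u = np.sign(z) * np.maximum(np.abs(z) - lam1, 0.0)
    out = np.zeros_like(z)
    for g in groups:
        norm = np.linalg.norm(u[g])
        if norm > 0.0:
            # group soft-thresholding: shrink the whole group toward zero,
            # zeroing it out when its norm falls below lam2 * sqrt(|g|)
            scale = max(0.0, 1.0 - lam2 * np.sqrt(len(g)) / norm)
            out[g] = scale * u[g]
    return out
```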
4. Theoretical Analysis of Sparse Adam Dynamics
The doubly-exponential vanishing of inactive channel weights under Adam (with ReLU activations and $\ell_2$ weight decay) is formalized by analyzing the scalar recurrence obtained when the only remaining gradient on a weight is the weight-decay term $\lambda w_t$, passed through Adam's preconditioned update. The only stable fixed point of this recurrence is $w = 0$, and the doubly-exponential convergence assures rapid reduction below any practical pruning threshold.
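These dynamics can be reproduced with a toy scalar simulation: a single "silent" weight whose only gradient is the weight-decay term, updated by Adam and, for comparison, by plain SGD with weight decay. All constants are illustrative; the printout lets one inspect how quickly $|w|$ collapses in each case.

```python
import math

# Toy simulation of a single "silent" weight: the data gradient is zero, so
# the only gradient Adam sees is the weight-decay term lam * w. Constants are
# illustrative. Plain SGD + weight decay contracts by (1 - lr*lam) per step.

def adam_silent_weight(w0=0.5, lam=1e-4, lr=1e-3,
                       b1=0.9, b2=0.999, eps=1e-8, steps=20000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = lam * w                      # weight-decay gradient only
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if t % 2000 == 0:
            print(f"step {t:6d}  |w| = {abs(w):.3e}")
    return w

def sgd_silent_weight(w0=0.5, lam=1e-4, lr=1e-3, steps=20000):
    w = w0
    for _ in range(steps):
        w -= lr * lam * w                # geometric decay (1 - lr*lam)^t
    return w

if __name__ == "__main__":
    print("Adam final |w|:", abs(adam_silent_weight()))
    print("SGD  final |w|:", abs(sgd_silent_weight()))
```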
For explicit sparse Adam variants, convergence guarantees are established under convexity and smoothness assumptions. AdaBreg and Group Adam attain $O(\sqrt{T})$ regret bounds, preserving the convergence properties of the classical Adam analysis in the presence of structured sparsity regularizers.
5. Empirical Results and Model Compression
Empirical benchmarks consistently demonstrate the efficacy of Adam-based sparsity. In fully connected and convolutional neural networks trained on MNIST and CIFAR-10 (Yaguchi et al., 2018):
- Adam with ReLU and $\ell_2$ weight decay yields high hidden-unit sparsity on MNIST and high channel sparsity on VGG-style CIFAR-10 convnets, with negligible accuracy loss.
- RMSProp and Adam (with $\ell_2$ decay) show similar channel sparsification; using SGD, or removing weight decay, abolishes this effect.
- Post-training channel pruning achieves drastic reductions in parameter count and FLOPs on both MNIST and CIFAR-10.
- Compared to explicit group-sparsity methods, Adam’s implicit mechanism achieves equivalent or better accuracy-versus-compression Pareto curves with much less hyperparameter tuning.
For Group Adam (Yue et al., 2021), high sparsity levels (retaining only a small fraction of the embedding parameters) are reached in extreme-scale CTR models, with AUCs matching or exceeding dense Adam plus magnitude pruning.
6. Implementation Guidelines and Practical Considerations
Practical recommendations for exploiting Adam-induced or Adam-regularized sparsity include:
- Use Adam (or RMSProp) with a moderate $\ell_2$ weight-decay coefficient.
- Prefer ReLU/ELU activations; leaky ReLU suppresses channel sparsity.
- Prune channels or groups based on their $\ell_2$ norm after convergence, using a small post hoc magnitude threshold (see the sketch after this list).
- No additional fine-tuning is usually needed, although a short fine-tuning pass can recover marginal accuracy if desired.
- For explicit sparsity, deploy AdaBreg or Group Adam with suitable regularization hyperparameters tuned via grid search.
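As a concrete form of the pruning recommendation above, the following PyTorch sketch builds keep-masks for convolutional output channels from their filter $\ell_2$ norms; the threshold value and the toy architecture are illustrative.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def channel_prune_masks(model, threshold=1e-3):
    """Build per-layer boolean masks marking conv output channels to keep.

    A channel is dropped when the L2 norm of its filter (one output channel of
    a Conv2d weight tensor) has collapsed below `threshold` after training
    with Adam + L2 weight decay. The threshold is illustrative.
    """
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW)
            norms = module.weight.flatten(1).norm(dim=1)
            masks[name] = norms >= threshold
    return masks

# Example usage on a small VGG-style block (architecture is illustrative):
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
masks = channel_prune_masks(model)
for name, keep in masks.items():
    print(f"{name}: keep {int(keep.sum())}/{keep.numel()} channels")
```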
7. Comparative Summary of Sparse Adam Variants
| Algorithm | Implicit/Explicit | Sparsity Mechanism | Empirical Outcome |
|---|---|---|---|
| Adam + ReLU + $\ell_2$ decay | Implicit | Channel sparsity via fast decay of silent neurons | Large fraction of exactly-zero channels, state-of-the-art accuracy |
| AdaBreg | Explicit | Bregman-proximal steps for elastic-net/group sparsity | Few percent of params retained, minor acc. loss vs. dense Adam |
| Group Adam | Explicit | Sparse-group Lasso via closed-form group update | Highly sparse embeddings, SOTA AUC, no post hoc pruning needed |
The observed phenomena unify adaptive optimization, ReLU-induced inactivity, and $\ell_2$ weight decay into a comprehensive framework for model slimming and resource-efficient network design, with theoretical and practical advantages over both classical gradient methods and explicit sparsity-inducing regularizers (Kingma & Ba, 2014; Yaguchi et al., 2018; Bungert et al., 2021; Yue et al., 2021).