AutoSparse: Gradient Annealing & Learnable Thresholds

Updated 5 March 2026

The paper introduces AutoSparse, which leverages gradient annealing and learnable thresholds to stabilize training while achieving high sparsity.
It employs dynamic, per-layer thresholds that adapt masking boundaries to allocate non-uniform sparsity based on parameter importance.
The method uses a non-linear annealing schedule for gradients, enabling weight recoverability and significantly reducing computational costs with minimal accuracy loss.

Gradient Annealing and Learnable Thresholds (AutoSparse) refers to an automated sparse training regime for deep neural networks. Network parameters are selectively zeroed by masks whose thresholds are themselves learned, while the gradients of masked weights are annealed via non-linear schedules. This yields stable, high-quality sparse models without additional sparsity-inducing regularization (Kundu et al., 2023).

1. Motivation and Sparse Training Paradigm

Sparse training seeks to reduce the computational cost of both training and inference by masking out a substantial fraction of weights via pruning. Traditional approaches typically enforce uniform sparsity or rely on static rules, but recent advances leverage two central building blocks:

Learnable thresholds: Per-layer (or per-group) trainable parameters that dynamically determine masking boundaries, supporting non-uniform, adaptively allocated sparsity distributions that reflect the variable importance of distinct network regions.
Gradient Annealing (GA): A proxy gradient mechanism that replaces the zero gradient for masked weights with a scaled version, allowing continued (though suppressed) optimization for pruned weights.

The combination manages a trade-off between inducing high sparsity and maintaining accuracy. In particular, GA mitigates the “runaway sparsity” problem, where weights pruned early on cannot recover, by enabling their gradients to support reactivation if later found important. The schedule for gradient scaling, annealed to zero over training, helps the optimization process strike a balance: high sparsity early with weight recoverability, converging eventually to strict sparsity.

2. Mathematical Foundations

Let $w_i$ denote a scalar parameter in layer $\ell$ at iteration $t$ . The masking process is governed by:

Binary mask: $m_i(t) = 1_{[|w_i| \geq \tau_\ell(t)]}$ , where $\tau_\ell(t)$ is the (learnable) threshold for layer $\ell$ .
Masked weight: $\tilde w_i(t) = \mathrm{sign}(w_i(t)) \cdot h_\alpha(|w_i(t)| - \tau_\ell(t))$ , where $h_\alpha(x) = x$ if $x > 0$ , $0$ otherwise.

During backpropagation, the proxy gradient for masked weights is $\frac{\partial h_\alpha}{\partial x} = \begin{cases} 1 & x > 0 \\ \alpha & x \leq 0 \end{cases}$ with $\alpha \in [0,1]$ controlling the masked-weight gradient scale.
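The masking rule and its proxy derivative can be illustrated on a handful of scalars. The following is a minimal sketch in plain Python, not the authors' implementation; the function names (`h_alpha`, `h_alpha_grad`, `masked_weight`, `mask`) and the example values of `tau` and `alpha` are chosen here for illustration only.

```python
def h_alpha(x):
    """Forward pass of h_alpha: identity for x > 0, zero otherwise."""
    return x if x > 0 else 0.0


def h_alpha_grad(x, alpha):
    """Proxy derivative used in the backward pass: 1 for x > 0, alpha otherwise."""
    return 1.0 if x > 0 else alpha


def mask(w, tau):
    """Binary mask m_i = 1 if |w| >= tau, else 0."""
    return 1 if abs(w) >= tau else 0


def masked_weight(w, tau):
    """Masked weight sign(w) * h_alpha(|w| - tau)."""
    sign = (w > 0) - (w < 0)
    return sign * h_alpha(abs(w) - tau)


# Illustrative values (assumptions, not taken from the paper).
tau = 0.1
alpha = 0.25
weights = [0.5, -0.03, -0.4, 0.08]

masks = [mask(w, tau) for w in weights]
masked = [masked_weight(w, tau) for w in weights]
proxy = [h_alpha_grad(abs(w) - tau, alpha) for w in weights]

print(masks)   # which weights survive the threshold
print(masked)  # surviving weights are shrunk by tau, the rest are zero
print(proxy)   # masked weights still receive a gradient scaled by alpha
```

With `alpha > 0`, the two weights below the threshold keep a nonzero derivative, which is what allows them to grow back above `tau` later in training.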

Annealing schedule: $\alpha$ is replaced by a time-varying function $s(t)$ that decays nonlinearly with training progress $t \in [0, T]$, where $T$ is the total training length. Sigmoid, cosine, and combined sigmoid–cosine decay schedules are used:

Cosine: $s_\text{cos}(t) = \frac{1}{2}[1 + \cos(\pi t / T)]$
Sigmoid: $s_\text{sigm}(t) = 1 - \mathrm{sigmoid}(L_0 + (L_1 - L_0)t/T)$
Combined: $s(t) = \max\{ s_\text{cos}(t), s_\text{sigm}(t) \}$

A slow initial decay followed by a steeper decline late in training avoids abrupt sparsity transitions. For masked weights ( $m_i(t)=0$ ), the gradient update is $\nabla_{w_i}^\text{GA} = s(t)\, \nabla_{w_i}^\text{orig}$ .
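The three schedules are simple enough to write out directly. The sketch below assumes $t$ and $T$ are measured in the same unit (steps or epochs) and uses $L_0 = -6$, $L_1 = +6$ as defaults; the function names and the helper `annealed_grad` are illustrative, not taken from the paper's code.

```python
import math


def s_cos(t, T):
    """Cosine decay from 1 at t = 0 to 0 at t = T."""
    return 0.5 * (1.0 + math.cos(math.pi * t / T))


def s_sigm(t, T, L0=-6.0, L1=6.0):
    """Sigmoid decay: 1 - sigmoid(L0 + (L1 - L0) * t / T)."""
    z = L0 + (L1 - L0) * t / T
    return 1.0 - 1.0 / (1.0 + math.exp(-z))


def s_combined(t, T, L0=-6.0, L1=6.0):
    """Combined schedule: pointwise maximum of the cosine and sigmoid decays."""
    return max(s_cos(t, T), s_sigm(t, T, L0, L1))


def annealed_grad(grad, mask_bit, t, T):
    """Scale the gradient of a masked weight (mask_bit == 0) by s(t)."""
    return grad if mask_bit == 1 else s_combined(t, T) * grad


T = 100
for t in (0, 25, 50, 75, 100):
    print(t, round(s_cos(t, T), 4), round(s_sigm(t, T), 4), round(s_combined(t, T), 4))
```

Printing the three columns side by side makes it easy to see where each schedule dominates the maximum for a given choice of $L_0$ and $L_1$.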

3. Learnable Thresholds for Adaptive Sparsity

Each layer $\ell$ has a scalar trainable score $s_\ell$ . The masking threshold is $\tau_\ell = g(s_\ell)$ , typically with $g(\cdot) = \mathrm{sigmoid}(\cdot)$ . Masking follows $m_i = 1_{|w_i| \geq \tau_\ell}$ . The smoothness of $g$ permits gradient-based optimization with the chain rule:

$\frac{\partial \mathrm{Loss}}{\partial s_\ell} = -g'(s_\ell) \sum_{i \in \ell} \mathrm{sign}(w_i) \frac{\partial \mathrm{Loss}}{\partial \tilde w_i} \cdot I[|w_i| \approx \tau_\ell]$

No auxiliary $L_1$ or $L_0$ regularizers are required; the inherent regularization of GA steers the sparsity–accuracy balance.

4. AutoSparse Algorithmic Synthesis

The AutoSparse algorithm orchestrates gradient annealing and learnable thresholds in a unified workflow. Let $\theta = \{W, \{s_\ell\}\}$ be all parameters. The overall loss is

$L(\theta) = \ell_\text{data}(S_{h, \mathrm{sigmoid}}(W, s); D) + \lambda \|W\|^2 + \lambda \|s\|^2$

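where $S_{h, \mathrm{sigmoid}}(W, s)$ denotes the masked weights and weight decay is the only regularizer applied to the threshold scores $s$ . The core steps include: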

For each layer, compute $\tau_\ell = \mathrm{sigmoid}(s_\ell)$ .
Compute masked weights $\tilde{W}_\ell = \mathrm{sign}(W_\ell) \cdot h_\alpha(|W_\ell| - \tau_\ell)$ .
Run the forward and backward passes with the masked weights, applying the chain rule for $s_\ell$ and scaling the gradients of masked weights by the annealing value $s(t)$ .
Apply SGD updates to $W$ and $s$ .
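The steps above can be sketched for a single layer with scalar weights. This is a toy illustration in plain Python, not the reference implementation. Two choices here are assumptions: the threshold gradient is accumulated only over unmasked weights (the ReLU-style derivative of the soft threshold), and weight decay is added as the gradient of $\lambda \|\cdot\|^2$. The function name `autosparse_step` and the argument `anneal` (the current value of $s(t)$) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sign(x):
    return (x > 0) - (x < 0)


def autosparse_step(W, s, upstream, anneal, lr=0.1, weight_decay=0.0):
    """One SGD step for a single layer (toy sketch).

    W        : list of dense weights
    s        : trainable threshold score of the layer
    upstream : dLoss/d(masked weight) for each weight
    anneal   : current annealing value s(t) in [0, 1]
    """
    tau = sigmoid(s)                 # step 1: threshold from score
    dtau_ds = tau * (1.0 - tau)      # derivative of the sigmoid
    grad_W, grad_s = [], 0.0
    for w, g in zip(W, upstream):
        active = abs(w) - tau > 0
        # Weight gradient: full for active weights, annealed for masked ones.
        dh = 1.0 if active else anneal
        grad_W.append(g * dh + 2.0 * weight_decay * w)
        # Threshold gradient (assumption: only active weights contribute),
        # using d(masked weight)/d(tau) = -sign(w) for active weights.
        if active:
            grad_s += -sign(w) * g * dtau_ds
    grad_s += 2.0 * weight_decay * s
    new_W = [w - lr * gw for w, gw in zip(W, grad_W)]
    new_s = s - lr * grad_s
    return new_W, new_s


W, s = [0.5, -0.03], -2.0            # illustrative values
new_W, new_s = autosparse_step(W, s, upstream=[1.0, 1.0], anneal=0.5)
print(new_W, new_s)
```

Calling the same function with `anneal=0.0` leaves the masked weight untouched, which mirrors the end-of-training regime in which pruned weights no longer recover.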

As $\alpha \to 0$ at the end of training, the model becomes strictly sparse, without further recovery for masked weights.

5. Implementation Details and Hyperparameters

Experiments utilize ResNet50 and MobileNetV1 on ImageNet-1K. Key settings include:

Optimizer: SGD with momentum 0.875
Batch size: 256; epochs: 100; 5-epoch warm-up
Initial learning rate: 0.256; cosine decay
Weight decay: $\lambda = 3.0518 \times 10^{-5}$
Threshold initialization: $s_\ell(0) = -5$ (so $\tau_\ell \approx 0.0067$ )
Gradient annealing: Sigmoid–cosine schedule with $L_0 = -6, L_1 = +6$ , $\alpha(0) = 1$ , $\alpha(T) = 0$
Optionally, set $\alpha = 0$ after epoch 70–90 to fully exploit backward sparsity

This configuration enables early achievement and maintenance of high sparsity without explicit sparsity regularization.
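A quick numerical check of the values listed above, written as a small sketch: it evaluates the initial threshold $\mathrm{sigmoid}(-5)$ and the endpoints of the sigmoid term of the annealing schedule for $L_0 = -6$, $L_1 = +6$. The variable names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Threshold initialization: tau = sigmoid(s) with s = -5.
tau_init = sigmoid(-5.0)

# Endpoints of the sigmoid annealing term 1 - sigmoid(L0 + (L1 - L0) * t / T).
L0, L1 = -6.0, 6.0
s_sigm_start = 1.0 - sigmoid(L0)   # value at t = 0
s_sigm_end = 1.0 - sigmoid(L1)     # value at t = T

print(round(tau_init, 4))
print(round(s_sigm_start, 4), round(s_sigm_end, 4))
```

The sigmoid term on its own starts slightly below 1 and ends slightly above 0, so the stated endpoints $\alpha(0) = 1$ and $\alpha(T) = 0$ hold approximately for that term; at $t = 0$ the cosine term supplies the exact value 1 in the combined schedule.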

6. Performance Evaluation

AutoSparse was benchmarked against learnable-threshold methods (STR [SoftThreshold 2020], DST [DynamicSparse 2020]), sparse-to-sparse methods (RigL, TopKAST, MEST), and structured pruning baselines on standard architectures/datasets.

ResNet50 on ImageNet-1K, 80% Sparsity

Method	Top-1 (sparse)	Accuracy Drop (%)	Sparsity (%)	Train FLOPS	Inference FLOPS
RigL (80%)	74.6	2.9	80.0	0.33	0.22
TopKAST (u80)	75.7	0.9	80.0	0.48	0.22
STR (learned)	76.19	1.06	79.6	0.54	0.18
AutoSparse	76.77	0.31	79.7	0.51	0.14

At comparable sparsity rates, AutoSparse yields only a 0.31% Top-1 drop, reducing inference FLOPS by 86% and training FLOPS by 49%. Relative to MEST (uniform-80%), AutoSparse matches accuracy with 12% fewer training FLOPS and 50% fewer inference FLOPS.
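The percentage reductions follow directly from the AutoSparse row of the table, assuming the FLOPS columns are expressed as fractions of the dense baseline (an interpretation consistent with the figures quoted above, though the table does not state its units). A short arithmetic sketch:

```python
# AutoSparse row of the ResNet50 table, read as fractions of dense FLOPS (assumption).
train_flops = 0.51
inference_flops = 0.14

train_saving = round((1.0 - train_flops) * 100)
inference_saving = round((1.0 - inference_flops) * 100)

print(f"training FLOPS reduction:  {train_saving}%")
print(f"inference FLOPS reduction: {inference_saving}%")
```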

MobileNetV1 on ImageNet-1K

Method	Top-1 (sparse)	Accuracy Drop (%)	Sparsity (%)	Train FLOPS	Inference FLOPS
STR	68.35	5.00	75.3	0.43	0.18
AutoSparse	70.10	2.57	75.1	0.53	0.21

AutoSparse demonstrates superior Top-1 accuracy relative to STR at nearly identical sparsity.

7. Discussion, Limitations, and Future Directions

AutoSparse leverages gradient annealing to eliminate the requirement for explicit sparsity regularizers. The annealing schedule autonomously determines pruning timing and extent. Learnable thresholds facilitate non-uniform sparsity allocation, optimizing inference efficiency. GA stabilizes training by permitting recovery of mis-pruned weights, thereby suppressing “runaway sparsity”.

For extreme sparsity (>90%), sparse-to-sparse frameworks like RigL and MEST can yield matching accuracy but incur significantly greater training costs (2–5x epochs, thus higher FLOPS). AutoSparse achieves and sustains target sparsity early, maintaining efficiency throughout.

Limitations center on the requirement for new hyperparameters (initial $\alpha_0$ and the annealing schedule). Automated selection of these, potentially via warm-start reference losses, remains open for investigation. Further research directions include the interaction between gradient annealing and learning rate schedules, as well as extensions to structured sparsity and transformer architectures.

In sum, AutoSparse constitutes a fully automated, end-to-end sparse training scheme combining gradient annealing and adaptive, learnable masking thresholds to efficiently discover high-quality sparse models with significant reductions in both training and inference computational demand, incurring minimal accuracy degradation (Kundu et al., 2023).

References (1)

Kundu et al. (2023). AutoSparse: Towards Automated Sparse Training of Deep Neural Networks.
