AutoSparse: Gradient Annealing & Learnable Thresholds

Updated 5 March 2026
  • The paper introduces AutoSparse, which leverages gradient annealing and learnable thresholds to stabilize training while achieving high sparsity.
  • It employs dynamic, per-layer thresholds that adapt masking boundaries to allocate non-uniform sparsity based on parameter importance.
  • The method uses a non-linear annealing schedule for gradients, enabling weight recoverability and significantly reducing computational costs with minimal accuracy loss.

Gradient Annealing and Learnable Thresholds (AutoSparse) refers to an automated sparse training regime for deep neural networks in which network parameters are selectively zeroed by masking mechanisms with thresholds that are themselves learned, while the gradients of masked weights are annealed via non-linear schedules, thus enabling stable, high-quality sparse models without additional sparsity-inducing regularization (Kundu et al., 2023).

1. Motivation and Sparse Training Paradigm

Sparse training seeks to reduce the computational cost of both training and inference by masking out a substantial fraction of weights via pruning. Traditional approaches typically enforce uniform sparsity or rely on static rules, but recent advances leverage two central building blocks:

  • Learnable thresholds: Per-layer (or per-group) trainable parameters that dynamically determine masking boundaries, supporting non-uniform, adaptively allocated sparsity distributions that reflect the variable importance of distinct network regions.
  • Gradient Annealing (GA): A proxy gradient mechanism that replaces the zero gradient for masked weights with a scaled version, allowing continued (though suppressed) optimization for pruned weights.

The combination manages a trade-off between inducing high sparsity and maintaining accuracy. In particular, GA mitigates the “runaway sparsity” problem, where weights pruned early on cannot recover, by enabling their gradients to support reactivation if later found important. The schedule for gradient scaling, annealed to zero over training, helps the optimization process strike a balance: high sparsity early with weight recoverability, converging eventually to strict sparsity.

2. Mathematical Foundations

Let $w_i$ denote a scalar parameter in layer $\ell$ at iteration $t$. The masking process is governed by:

  • Binary mask: $m_i(t) = \mathbf{1}_{[|w_i| \geq \tau_\ell(t)]}$, where $\tau_\ell(t)$ is the (learnable) threshold for layer $\ell$.
  • Masked weight: $\tilde w_i(t) = \mathrm{sign}(w_i(t)) \cdot h_\alpha(|w_i(t)| - \tau_\ell(t))$, where $h_\alpha(x) = x$ if $x > 0$, and $0$ otherwise.

During backpropagation, the proxy-gradient for masked weights is $\frac{\partial h_\alpha}{\partial x} = \begin{cases} 1 & x > 0 \\ \alpha & x \leq 0 \end{cases}$, with $\alpha \in [0,1]$ controlling the masked-weight gradient scale.
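The forward mask and its proxy derivative can be illustrated on a handful of scalars. The following is a minimal sketch in plain Python, not the authors' implementation; the function names and the sample values of `tau` and `alpha` are chosen here purely for illustration.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def masked_weight(w: float, tau: float) -> float:
    """Forward pass: sign(w) * h(|w| - tau), with h(x) = x for x > 0, else 0."""
    x = abs(w) - tau
    return sign(w) * x if x > 0 else 0.0


def proxy_grad_h(w: float, tau: float, alpha: float) -> float:
    """Backward pass: dh/dx is 1 for surviving weights and alpha for masked ones."""
    return 1.0 if abs(w) - tau > 0 else alpha


# Illustrative values only (not taken from the paper).
weights = [0.50, -0.02, 0.08, -0.30]
tau, alpha = 0.10, 0.25

masked = [masked_weight(w, tau) for w in weights]
grads = [proxy_grad_h(w, tau, alpha) for w in weights]
```

With these values the second and third weights fall below the threshold: their forward value is zero, yet they still receive a gradient scaled by `alpha`, which is what allows a pruned weight to grow back.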

Annealing schedule: $\alpha$ is replaced by a time-varying function $s(t)$, decaying nonlinearly with training progress $t$. Sigmoid, cosine, and combined sigmoid–cosine decay schedules are used:

  • Cosine: $s_\text{cos}(t) = \frac{1}{2}[1 + \cos(\pi t / T)]$
  • Sigmoid: $s_\text{sigm}(t) = 1 - \mathrm{sigmoid}(L_0 + (L_1 - L_0)t/T)$
  • Combined: $s(t) = \max\{ s_\text{cos}(t), s_\text{sigm}(t) \}$

A slow initial decay followed by a steeper decline late in training avoids abrupt sparsity transitions. For masked weights ($m_i(t) = 0$), the gradient update is $\nabla_{w_i}^\text{GA} = s(t)\, \nabla_{w_i}^\text{orig}$.
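The three schedules are straightforward to write down. The sketch below uses only the Python standard library; the defaults `L0 = -6` and `L1 = +6` are the values reported in the hyperparameter section, while the function names and the helper `annealed_grad` are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def s_cos(t: float, T: float) -> float:
    """Cosine decay from 1 at t = 0 to 0 at t = T."""
    return 0.5 * (1.0 + math.cos(math.pi * t / T))


def s_sigm(t: float, T: float, L0: float = -6.0, L1: float = 6.0) -> float:
    """Sigmoid decay; the sigmoid argument moves linearly from L0 to L1."""
    return 1.0 - sigmoid(L0 + (L1 - L0) * t / T)


def s_combined(t: float, T: float, L0: float = -6.0, L1: float = 6.0) -> float:
    """Pointwise maximum of the cosine and sigmoid schedules."""
    return max(s_cos(t, T), s_sigm(t, T, L0, L1))


def annealed_grad(grad: float, is_masked: bool, t: float, T: float) -> float:
    """Scale the gradient of a masked weight by s(t); leave others untouched."""
    return s_combined(t, T) * grad if is_masked else grad


# Sample the combined schedule at a few points of a 100-step run.
samples = {t: s_combined(t, 100) for t in (0, 25, 50, 75, 100)}
```

Note that with these endpoints the sigmoid branch does not reach exactly zero at $t = T$, so the combined value at the end of training is small but positive.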

3. Learnable Thresholds for Adaptive Sparsity

Each layer $\ell$ has a scalar trainable score $s_\ell$. The masking threshold is $\tau_\ell = g(s_\ell)$, typically with $g(\cdot) = \mathrm{sigmoid}(\cdot)$. Masking follows $m_i = \mathbf{1}_{[|w_i| \geq \tau_\ell]}$. The smoothness of $g$ permits gradient-based optimization with the chain rule:

$$\frac{\partial \mathrm{Loss}}{\partial s_\ell} = -g'(s_\ell) \sum_{i \in \ell} \mathrm{sign}(w_i) \frac{\partial \mathrm{Loss}}{\partial \tilde w_i} \cdot I[|w_i| \approx \tau_\ell]$$

No auxiliary $L_1$ or $L_0$ regularizers are required; the regularization inherent to GA steers the sparsity–accuracy balance.
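The chain rule for the threshold score can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions rather than the paper's code: it differentiates the soft-threshold forward pass from Section 2 directly, so the per-weight factor is the derivative of $h_\alpha$ (1 for surviving weights, $\alpha$ for masked ones) in place of the indicator written in the formula above. The quadratic `toy_loss`, the sample values, and all function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def masked_weights(w, s):
    """Soft-threshold every weight with tau = sigmoid(s)."""
    tau = sigmoid(s)
    return [math.copysign(max(abs(wi) - tau, 0.0), wi) for wi in w]


def toy_loss(w, s, targets):
    """Hypothetical quadratic loss on the masked weights."""
    return 0.5 * sum((m - y) ** 2 for m, y in zip(masked_weights(w, s), targets))


def threshold_grad(w, s, targets, alpha=0.0):
    """dLoss/ds = -g'(s) * sum_i sign(w_i) * dLoss/dw_tilde_i * h'_i."""
    tau = sigmoid(s)
    g_prime = tau * (1.0 - tau)  # derivative of the sigmoid at s
    total = 0.0
    for wi, m, y in zip(w, masked_weights(w, s), targets):
        dh = 1.0 if abs(wi) > tau else alpha
        total += math.copysign(1.0, wi) * (m - y) * dh
    return -g_prime * total


# Illustrative values: s = 0 gives tau = 0.5, so two weights survive.
w, s, targets = [0.9, -0.7, 0.2, -0.1], 0.0, [1.0, -1.0, 0.0, 0.0]
analytic = threshold_grad(w, s, targets)  # alpha = 0: exact derivative

# Central finite difference as an independent check.
eps = 1e-5
numeric = (toy_loss(w, s + eps, targets) - toy_loss(w, s - eps, targets)) / (2 * eps)
```

With `alpha = 0` the analytic value agrees with the finite-difference estimate. Setting `alpha > 0` adds the contribution of masked weights, which is what lets the threshold move in response to pruned weights that the loss still wants.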

4. AutoSparse Algorithmic Synthesis

The AutoSparse algorithm orchestrates gradient annealing and learnable thresholds in a unified workflow. Let $\theta = \{W, \{s_\ell\}\}$ be all parameters. The overall loss is

$$L(\theta) = \ell_\text{data}(S_{h, \mathrm{sigmoid}}(W, s); D) + \lambda \|W\|^2 + \lambda \|s\|^2$$

where $S_{h, \mathrm{sigmoid}}(W, s)$ denotes the masked weights and weight decay is the only regularizer applied to $s$. The core steps include:

  1. For each layer, compute $\tau_\ell = \mathrm{sigmoid}(s_\ell)$.
  2. Compute masked weights $\tilde{W}_\ell = \mathrm{sign}(W_\ell) \cdot h_\alpha(|W_\ell| - \tau_\ell)$.
  3. Run the forward and backward passes using the masked weights and the chain rule for $s_\ell$, scaling gradients according to the mask and $s(t)$.
  4. Apply SGD updates to $W$ and $s$.
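The four steps can be traced on a two-weight toy problem. This is a minimal sketch in plain Python, not the published training loop: the quadratic loss, the learning rate, the sample weights and targets, and every function name are assumptions made for illustration, and momentum is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def anneal(t, T, L0=-6.0, L1=6.0):
    """Combined sigmoid-cosine schedule s(t) from Section 2."""
    s_cos = 0.5 * (1.0 + math.cos(math.pi * t / T))
    s_sigm = 1.0 - sigmoid(L0 + (L1 - L0) * t / T)
    return max(s_cos, s_sigm)


def mask_weights(w, s):
    """Steps 1-2: tau = sigmoid(s), then soft-threshold every weight."""
    tau = sigmoid(s)
    return [math.copysign(max(abs(wi) - tau, 0.0), wi) for wi in w]


def autosparse_step(w, s, targets, t, T, lr=0.1, wd=0.0):
    """One SGD step on a toy loss 0.5 * sum((w_tilde - target)^2)."""
    tau = sigmoid(s)
    alpha = anneal(t, T)
    w_tilde = mask_weights(w, s)
    # Step 3: backward pass. dh is the proxy derivative of h:
    # 1 for surviving weights, the annealed alpha for masked ones.
    dloss = [m - y for m, y in zip(w_tilde, targets)]
    dh = [1.0 if abs(wi) > tau else alpha for wi in w]
    grad_w = [g * d + 2.0 * wd * wi for g, d, wi in zip(dloss, dh, w)]
    grad_s = -tau * (1.0 - tau) * sum(
        math.copysign(1.0, wi) * g * d for wi, g, d in zip(w, dloss, dh)
    ) + 2.0 * wd * s
    # Step 4: plain SGD update of weights and threshold score.
    w_new = [wi - lr * gw for wi, gw in zip(w, grad_w)]
    return w_new, s - lr * grad_s


# Toy run: the second weight starts below the threshold (tau = 0.5)
# but has a large target, so the loss "wants" it back.
w, s, targets, T = [0.9, 0.1], 0.0, [0.4, 0.8], 100
w1, s1 = autosparse_step(w, s, targets, t=0, T=T)  # a single step, for inspection
for t in range(50):
    w, s = autosparse_step(w, s, targets, t, T)
recovered = mask_weights(w, s)
```

In this run the initially masked weight keeps receiving a scaled gradient, crosses the threshold after a few steps, and ends up active. With the proxy gradient fixed at zero it would stay pruned, since its weight would never be updated.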

As $\alpha \to 0$ at the end of training, the model becomes strictly sparse, without further recovery for masked weights.

5. Implementation Details and Hyperparameters

Experiments utilize ResNet50 and MobileNetV1 on ImageNet-1K. Key settings include:

  • Optimizer: SGD with momentum 0.875
  • Batch size: 256; epochs: 100; 5-epoch warm-up
  • Initial learning rate: 0.256; cosine decay
  • Weight decay: $\lambda = 3.0518 \times 10^{-5}$
  • Threshold initialization: $s_\ell(0) = -5$ (so $\tau_\ell \approx 0.0067$)
  • Gradient annealing: Sigmoid–cosine schedule with $L_0 = -6$, $L_1 = +6$, $\alpha(0) = 1$, $\alpha(T) = 0$
  • Optionally, set $\alpha = 0$ after epoch 70–90 to fully exploit backward sparsity
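The settings above can be gathered into a single configuration object, which also makes it easy to confirm the quoted initial threshold. The class and field names below are hypothetical (they do not come from the paper or from any library); only the numeric values are taken from the list.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AutoSparseConfig:
    """Hypothetical container for the reported hyperparameters."""
    momentum: float = 0.875
    batch_size: int = 256
    epochs: int = 100
    warmup_epochs: int = 5
    base_lr: float = 0.256            # decayed with a cosine schedule
    weight_decay: float = 3.0518e-5
    threshold_init: float = -5.0      # initial score s_l(0) for every layer
    anneal_l0: float = -6.0
    anneal_l1: float = 6.0
    # Optional epoch after which alpha is forced to 0 (reported range: 70-90).
    alpha_cutoff_epoch: Optional[int] = None


def initial_threshold(cfg: AutoSparseConfig) -> float:
    """tau_l(0) = sigmoid(s_l(0))."""
    return 1.0 / (1.0 + math.exp(-cfg.threshold_init))


cfg = AutoSparseConfig()
tau0 = initial_threshold(cfg)  # close to the 0.0067 quoted in the list
```

A near-zero starting threshold means training begins almost dense, with sparsity emerging as the scores $s_\ell$ move during optimization.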

This configuration enables early achievement and maintenance of high sparsity without explicit sparsity regularization.

6. Performance Evaluation

AutoSparse was benchmarked against learnable-threshold methods (STR [SoftThreshold 2020], DST [DynamicSparse 2020]), sparse-to-sparse methods (RigL, TopKAST, MEST), and structured pruning baselines on standard architectures/datasets.

ResNet50 on ImageNet-1K, 80% Sparsity

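| Method | Top-1 (sparse, %) | Accuracy drop (%) | Sparsity (%) | Train FLOPS (× dense) | Inference FLOPS (× dense) |
|---|---|---|---|---|---|
| RigL (80%) | 74.6 | 2.9 | 80.0 | 0.33 | 0.22 |
| TopKAST (u80) | 75.7 | 0.9 | 80.0 | 0.48 | 0.22 |
| STR (learned) | 76.19 | 1.06 | 79.6 | 0.54 | 0.18 |
| AutoSparse | 76.77 | 0.31 | 79.7 | 0.51 | 0.14 |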

At comparable sparsity rates, AutoSparse yields only a 0.31% Top-1 drop, reducing inference FLOPS by 86% and training FLOPS by 49%. Relative to MEST (uniform-80%), AutoSparse matches accuracy with 12% less training and 50% less inference FLOPS.
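The quoted reductions follow directly from the AutoSparse row of the table, assuming (as the percentages suggest) that the FLOPS columns are fractions of the dense baseline. A short arithmetic check, with hypothetical variable names:

```python
# FLOPS as a fraction of the dense baseline, copied from the AutoSparse row.
autosparse = {"train_flops": 0.51, "inference_flops": 0.14}


def reduction_pct(fraction_of_dense: float) -> float:
    """Percentage saved relative to a dense model (fraction 1.0)."""
    return (1.0 - fraction_of_dense) * 100.0


inference_saving = round(reduction_pct(autosparse["inference_flops"]))
train_saving = round(reduction_pct(autosparse["train_flops"]))
```

Here `inference_saving` and `train_saving` correspond to the 86% and 49% figures stated in the text.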

MobileNetV1 on ImageNet-1K

| Method | Top-1 (sparse, %) | Accuracy drop (%) | Sparsity (%) | Train FLOPS (× dense) | Test FLOPS (× dense) |
|---|---|---|---|---|---|
| STR | 68.35 | 5.00 | 75.3 | 0.43 | 0.18 |
| AutoSparse | 70.10 | 2.57 | 75.1 | 0.53 | 0.21 |

AutoSparse reaches 70.10% Top-1 accuracy versus 68.35% for STR at nearly identical sparsity (75.1% vs. 75.3%), although in this comparison it uses somewhat more training and test FLOPS.

7. Discussion, Limitations, and Future Directions

AutoSparse leverages gradient annealing to eliminate the requirement for explicit sparsity regularizers. The annealing schedule autonomously determines pruning timing and extent. Learnable thresholds facilitate non-uniform sparsity allocation, optimizing inference efficiency. GA stabilizes training by permitting recovery of mis-pruned weights, thereby suppressing “runaway sparsity”.

For extreme sparsity (>90%), sparse-to-sparse frameworks like RigL and MEST can yield matching accuracy but incur significantly greater training costs (2–5x epochs, thus higher FLOPS). AutoSparse achieves and sustains target sparsity early, maintaining efficiency throughout.

Limitations center on the requirement for new hyperparameters (initial $\alpha_0$ and the annealing schedule). Automated selection of these, potentially via warm-start reference losses, remains open for investigation. The interaction between gradient annealing and learning rate schedules, as well as extensions to structured sparsity and transformer architectures, represent further research directions.

In sum, AutoSparse constitutes a fully automated, end-to-end sparse training scheme combining gradient annealing and adaptive, learnable masking thresholds to efficiently discover high-quality sparse models with significant reductions in both training and inference computational demand, incurring minimal accuracy degradation (Kundu et al., 2023).
