AutoSparse: Gradient Annealing & Learnable Thresholds
- The paper introduces AutoSparse, which leverages gradient annealing and learnable thresholds to stabilize training while achieving high sparsity.
- It employs dynamic, per-layer thresholds that adapt masking boundaries to allocate non-uniform sparsity based on parameter importance.
- The method uses a non-linear annealing schedule for gradients, enabling weight recoverability and significantly reducing computational costs with minimal accuracy loss.
Gradient Annealing and Learnable Thresholds (AutoSparse) is an automated sparse training regime for deep neural networks. Network parameters are zeroed by masks whose thresholds are themselves learned, while the gradients of masked weights are annealed on a non-linear schedule. Together these enable stable, high-quality sparse models without additional sparsity-inducing regularization (Kundu et al., 2023).
1. Motivation and Sparse Training Paradigm
Sparse training seeks to reduce the computational cost of both training and inference by masking out a substantial fraction of weights via pruning. Traditional approaches typically enforce uniform sparsity or rely on static rules, but recent advances leverage two central building blocks:
- Learnable thresholds: Per-layer (or per-group) trainable parameters that dynamically determine masking boundaries, supporting non-uniform, adaptively allocated sparsity distributions that reflect the variable importance of distinct network regions.
- Gradient Annealing (GA): A proxy gradient mechanism that replaces the zero gradient for masked weights with a scaled version, allowing continued (though suppressed) optimization for pruned weights.
The combination manages a trade-off between inducing high sparsity and maintaining accuracy. In particular, GA mitigates the “runaway sparsity” problem, where weights pruned early on cannot recover, by enabling their gradients to support reactivation if later found important. The schedule for gradient scaling, annealed to zero over training, helps the optimization process strike a balance: high sparsity early with weight recoverability, converging eventually to strict sparsity.
2. Mathematical Foundations
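Let $w$ denote a scalar parameter in layer $l$ at iteration $t$. The masking process is governed by:
- Binary mask: $m = \mathbb{1}[\,|w| > T_l\,]$, where $T_l$ is the (learnable) threshold for layer $l$.
- Masked weight: $\hat{w} = \operatorname{sign}(w)\, h(|w| - T_l)$, where $h(x) = x$ if $x > 0$, $0$ otherwise.
During backpropagation, the proxy-gradient for masked weights is $\partial h / \partial x = \alpha$ (rather than $0$), with $\alpha \in [0, 1]$ controlling the masked-weight gradient scale.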
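The forward mask and the proxy gradient can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the masked weight has the form sign(w) · h(|w| − T) with h zero for non-positive inputs; the function names `masked_weight` and `proxy_grad` are hypothetical and not taken from the paper's code.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def masked_weight(w: float, threshold: float) -> float:
    """Forward pass: shrink |w| by the threshold; weights at or below it become 0."""
    x = abs(w) - threshold
    return sign(w) * x if x > 0 else 0.0


def proxy_grad(w: float, threshold: float, alpha: float) -> float:
    """Backward pass: derivative of the masked weight w.r.t. w.

    Active weights get 1.0; masked weights get alpha instead of the true 0.
    """
    x = abs(w) - threshold
    return 1.0 if x > 0 else alpha


# Illustrative values only (not from the paper).
weights = [0.9, -0.05, 0.2, -0.6]
T = 0.3
sparse = [masked_weight(w, T) for w in weights]
grads = [proxy_grad(w, T, alpha=0.5) for w in weights]
```

With `alpha=0.5`, the two weights below the threshold are zero in the forward pass but still receive half of the upstream gradient, which is what lets them grow back above the threshold later.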
Annealing schedule: the fixed scale $\alpha$ is replaced by a time-varying function $\alpha(t)$ that decays nonlinearly with training progress $t$. Sigmoid, cosine, and combined sigmoid–cosine decay schedules are used:
- Cosine:
- Sigmoid:
- Combined:
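A slow initial decay followed by a steeper decline late in training avoids abrupt sparsity transitions. For masked weights ($|w| \le T_l$, with $T_l$ the layer threshold), the gradient update is $w \leftarrow w - \eta\,\alpha(t)\,\partial\mathcal{L}/\partial\hat{w}$, where $\eta$ is the learning rate and $\hat{w}$ is the masked weight.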
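To make the shape of such schedules concrete, the sketch below implements three decays in plain Python. The functional forms are assumptions chosen for illustration (a half-cosine, a logistic curve, and their product); they are not the paper's exact formulas, and the parameter names `alpha0`, `steepness`, and `midpoint` are hypothetical.

```python
import math


def cosine_decay(p: float, alpha0: float = 1.0) -> float:
    """Half-cosine decay from alpha0 at p=0 to 0 at p=1 (p = fraction of training done)."""
    return alpha0 * 0.5 * (1.0 + math.cos(math.pi * p))


def sigmoid_decay(p: float, alpha0: float = 1.0,
                  steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Logistic decay: near alpha0 early, near 0 late, steepest at `midpoint`."""
    return alpha0 * (1.0 - 1.0 / (1.0 + math.exp(-steepness * (p - midpoint))))


def sigmoid_cosine_decay(p: float, alpha0: float = 1.0,
                         steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """ASSUMED combination: product of the two decays, so it is exactly 0 at p=1."""
    return alpha0 * cosine_decay(p) * sigmoid_decay(p, 1.0, steepness, midpoint)


# Print the three schedules at a few points of training progress.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, cosine_decay(p), sigmoid_decay(p), sigmoid_cosine_decay(p))
```

All three are non-increasing in `p`. The cosine and product forms reach exactly zero at the end of training, whereas the logistic form alone only approaches zero, which is one reason a combined schedule can be convenient when strict sparsity is required at the end.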
3. Learnable Thresholds for Adaptive Sparsity
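Each layer $l$ has a scalar trainable score $s_l$. The masking threshold is $T_l = g(s_l)$, typically with $g$ the sigmoid function. Masking follows $\hat{w} = \operatorname{sign}(w)\, h(|w| - g(s_l))$, where $h(x) = x$ for $x > 0$ and $0$ otherwise. The smoothness of $g$ permits gradient-based optimization with the chain rule:
$$\frac{\partial \mathcal{L}}{\partial s_l} = g'(s_l) \sum_i \frac{\partial \mathcal{L}}{\partial \hat{w}_i}\,\frac{\partial \hat{w}_i}{\partial T_l} = -\,g'(s_l) \sum_i \frac{\partial \mathcal{L}}{\partial \hat{w}_i}\,\operatorname{sign}(w_i)\,\frac{\partial h}{\partial x}\bigg|_{x = |w_i| - T_l}$$
No auxiliary sparsity regularizers are required; the regularization inherent in GA steers the sparsity–accuracy balance.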
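The chain rule for the threshold score can be checked numerically with a small sketch. It assumes a sigmoid threshold function and the proxy gradient described earlier (1 for active weights, `alpha` for masked ones); `threshold_grad` and its argument names are hypothetical.

```python
import math


def sigmoid(s: float) -> float:
    """Logistic function, used here as the assumed threshold function g."""
    return 1.0 / (1.0 + math.exp(-s))


def threshold_grad(weights, upstream_grads, s, alpha):
    """Gradient of the loss w.r.t. one layer's score s via the chain rule.

    weights        : raw weights w_i of the layer
    upstream_grads : dL/d(w_hat_i), gradients w.r.t. the masked weights
    s              : scalar score; the threshold is T = sigmoid(s)
    alpha          : proxy-gradient scale used for masked weights
    """
    T = sigmoid(s)
    dT_ds = T * (1.0 - T)  # derivative of the sigmoid
    total = 0.0
    for w, g in zip(weights, upstream_grads):
        sgn = (w > 0) - (w < 0)
        dh = 1.0 if abs(w) - T > 0 else alpha  # proxy gradient of h
        total += g * (-sgn * dh)               # d(w_hat)/dT = -sign(w) * h'
    return total * dT_ds


# One active weight (0.9) and one masked weight (-0.05) at T = sigmoid(-1) ~ 0.27.
grad_s = threshold_grad([0.9, -0.05], [1.0, 1.0], s=-1.0, alpha=0.5)
```

Setting `alpha=0.0` removes the masked weights from the sum, so only active weights influence the threshold; a non-zero `alpha` lets masked weights contribute as well.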
4. AutoSparse Algorithmic Synthesis
The AutoSparse algorithm orchestrates gradient annealing and learnable thresholds in a unified workflow. Let be all parameters. The overall loss is
where only weight decay applies to . The core steps include:
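- For each layer, compute the threshold $T_l = g(s_l)$ from the layer's trainable score $s_l$.
- Compute masked weights $\hat{w} = \operatorname{sign}(w)\, h(|w| - T_l)$.
- Forward and backward passes using masked weights and chain rule for $s_l$, scaling gradients according to mask and $\alpha(t)$.
- SGD updates for $w$ and $s_l$.
As $\alpha(t) \to 0$ at the end of training, the model becomes strictly sparse, without further recovery for masked weights.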
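The steps above can be put together in a single-layer sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the threshold function is a sigmoid, the loss is a quadratic toward a hypothetical `target`, weight decay is omitted, and the annealing schedule is a plain cosine. The names `autosparse_step` and `quad_grad` are invented for this example.

```python
import math


def autosparse_step(weights, s, grad_wrt_masked, alpha, lr):
    """One SGD step for a single layer; returns (new_weights, new_s, masked_weights).

    grad_wrt_masked maps the masked weights to dL/d(w_hat).
    """
    T = 1.0 / (1.0 + math.exp(-s))                       # threshold T = sigmoid(s)
    sgn = [(w > 0) - (w < 0) for w in weights]
    active = [abs(w) - T > 0 for w in weights]           # binary mask
    w_hat = [sg * (abs(w) - T) if a else 0.0
             for w, sg, a in zip(weights, sgn, active)]  # masked weights
    g_hat = grad_wrt_masked(w_hat)                       # backward through the loss
    dh = [1.0 if a else alpha for a in active]           # proxy gradient of h
    g_w = [g * d for g, d in zip(g_hat, dh)]             # dL/dw
    g_s = T * (1.0 - T) * sum(-g * sg * d
                              for g, sg, d in zip(g_hat, sgn, dh))  # dL/ds
    new_weights = [w - lr * g for w, g in zip(weights, g_w)]
    return new_weights, s - lr * g_s, w_hat


# Toy problem: quadratic loss 0.5 * sum((w_hat - target)^2); values are illustrative.
target = [1.0, 0.0, -0.8, 0.0]
quad_grad = lambda w_hat: [w - t for w, t in zip(w_hat, target)]

w, s = [0.5, 0.01, -0.4, -0.02], -3.0
steps = 200
for t in range(steps):
    alpha = 0.5 * (1.0 + math.cos(math.pi * t / steps))  # assumed cosine annealing
    w, s, w_hat = autosparse_step(w, s, quad_grad, alpha, lr=0.1)
```

Because `alpha` multiplies only the gradients of masked weights, setting it to zero freezes every weight currently below the threshold, which corresponds to the strictly sparse regime reached at the end of training.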
5. Implementation Details and Hyperparameters
Experiments utilize ResNet50 and MobileNetV1 on ImageNet-1K. Key settings include:
- Optimizer: SGD with momentum 0.875
- Batch size: 256; epochs: 100; 5-epoch warm-up
- Initial learning rate: 0.256; cosine decay
- Weight decay:
- Threshold initialization: (so )
- Gradient annealing: Sigmoid–cosine schedule with , ,
- Optionally, set $\alpha = 0$ after epoch 70–90 so that masked weights receive no gradient and backward-pass sparsity can be fully exploited
This configuration enables early achievement and maintenance of high sparsity without explicit sparsity regularization.
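The learning-rate settings listed above (initial rate 0.256, 100 epochs, 5-epoch warm-up, cosine decay) can be written as a small schedule function. The linear shape of the warm-up and the decay to exactly zero are assumptions made for this sketch; the name `lr_at` is hypothetical.

```python
import math

BASE_LR = 0.256      # initial learning rate from the settings above
EPOCHS = 100         # total training epochs
WARMUP_EPOCHS = 5    # warm-up length


def lr_at(epoch: float) -> float:
    """Learning rate at a (possibly fractional) epoch."""
    if epoch < WARMUP_EPOCHS:
        # ASSUMPTION: linear ramp from 0 up to BASE_LR during warm-up.
        return BASE_LR * epoch / WARMUP_EPOCHS
    # Cosine decay over the remaining epochs (ASSUMPTION: decays to 0).
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for e in (0, 2.5, 5, 50, 100):
    print(e, lr_at(e))
```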
6. Performance Evaluation
AutoSparse was benchmarked against learnable-threshold methods (STR [SoftThreshold 2020], DST [DynamicSparse 2020]), sparse-to-sparse methods (RigL, TopKAST, MEST), and structured pruning baselines on standard architectures/datasets.
ResNet50 on ImageNet-1K, 80% Sparsity
| Method | Top-1 (sparse, %) | Accuracy Drop (%) | Sparsity (%) | Train FLOPS (× dense) | Inference FLOPS (× dense) |
|---|---|---|---|---|---|
| RigL (80%) | 74.6 | 2.9 | 80.0 | 0.33 | 0.22 |
| TopKAST (u80) | 75.7 | 0.9 | 80.0 | 0.48 | 0.22 |
| STR (learned) | 76.19 | 1.06 | 79.6 | 0.54 | 0.18 |
| AutoSparse | 76.77 | 0.31 | 79.7 | 0.51 | 0.14 |
At comparable sparsity (79.7%), AutoSparse incurs only a 0.31% Top-1 drop while reducing inference FLOPS by 86% and training FLOPS by 49% relative to the dense model (0.14× and 0.51× in the table). Relative to MEST (uniform 80%), AutoSparse matches accuracy with 12% fewer training FLOPS and 50% fewer inference FLOPS.
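The percentage reductions quoted above follow directly from the FLOPS columns of the table, assuming those columns are fractions of the dense model's FLOPS:

```python
# Fractions of dense-model FLOPS from the AutoSparse row of the ResNet50 table.
# ASSUMPTION: the FLOPS columns are normalized to the dense model (dense = 1.0).
train_frac, infer_frac = 0.51, 0.14

train_reduction = round((1 - train_frac) * 100)  # percent of training FLOPS saved
infer_reduction = round((1 - infer_frac) * 100)  # percent of inference FLOPS saved
print(train_reduction, infer_reduction)          # prints: 49 86
```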
MobileNetV1 on ImageNet-1K
| Method | Top-1 (sparse, %) | Accuracy Drop (%) | Sparsity (%) | Train FLOPS (× dense) | Inference FLOPS (× dense) |
|---|---|---|---|---|---|
| STR | 68.35 | 5.00 | 75.3 | 0.43 | 0.18 |
| AutoSparse | 70.10 | 2.57 | 75.1 | 0.53 | 0.21 |
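At nearly identical sparsity (75.1% vs. 75.3%), AutoSparse reaches 70.10% Top-1 accuracy against 68.35% for STR, roughly halving the accuracy drop (2.57% vs. 5.00%). This comes with somewhat higher training and inference FLOPS (0.53 vs. 0.43 and 0.21 vs. 0.18, respectively).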
7. Discussion, Limitations, and Future Directions
AutoSparse uses gradient annealing to eliminate the need for explicit sparsity regularizers: the annealing schedule itself governs when and how much the network is pruned. Learnable thresholds allow non-uniform sparsity allocation across layers, which improves inference efficiency. GA stabilizes training by permitting recovery of mis-pruned weights, thereby suppressing "runaway sparsity".
For extreme sparsity (>90%), sparse-to-sparse frameworks like RigL and MEST can yield matching accuracy but incur significantly greater training costs (2–5x epochs, thus higher FLOPS). AutoSparse achieves and sustains target sparsity early, maintaining efficiency throughout.
Limitations center on the new hyperparameters the method introduces (the initial gradient scale $\alpha$ and the annealing schedule). Automated selection of these, potentially via warm-start reference losses, remains open for investigation. Further research directions include the interaction between gradient annealing and learning rate schedules, as well as extensions to structured sparsity and transformer architectures.
In sum, AutoSparse constitutes a fully automated, end-to-end sparse training scheme combining gradient annealing and adaptive, learnable masking thresholds to efficiently discover high-quality sparse models with significant reductions in both training and inference computational demand, incurring minimal accuracy degradation (Kundu et al., 2023).