- The paper introduces Selective Weight Decay (SWD) as a novel regularization technique to gradually induce sparsity during training.
- It utilizes Lagrangian smoothing and an exponential increase in penalization to minimize performance drops at high pruning targets.
- Experiments show that SWD outperforms traditional pruning methods at aggressive pruning targets, removes the need for post-pruning fine-tuning, and adapts to diverse network architectures.
Rethinking Weight Decay for Efficient Neural Network Pruning
The paper "Rethinking Weight Decay For Efficient Neural Network Pruning" provides a novel approach to neural network pruning, emphasizing a technique coined Selective Weight Decay (SWD). This method leverages weight decay principles, enhancing them for efficient and continuous pruning during training. The work is grounded in the concept of Lagrangian smoothing, allowing for versatile application across tasks, networks, and pruning structures.
SWD Methodology
Principle of SWD
Selective Weight Decay (SWD) is a differentiable regularization technique designed to induce sparsity by progressively penalizing certain weights during training according to predefined criteria. The key criterion used in the paper is weight magnitude, a well-established proxy for parameter contribution to a network’s performance.
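Concretely, SWD can be summarized as augmenting the usual training objective with a second decay term restricted to the weights currently targeted for removal. The following formulation is a sketch of that idea in the notation used elsewhere in this summary; the paper's exact symbols and coefficients may differ:

L_SWD = L(f(x; W), y) + (mu / 2) * ||W||_2^2 + (a * mu / 2) * ||W*||_2^2

where W denotes all weights, W* the subset of weights selected for pruning (here, those of smallest magnitude given the pruning target), mu the standard weight-decay coefficient, and a the selective penalization factor.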
- Lagrangian Smoothing: SWD acts as a Lagrangian smoothing mechanism, driving targeted parameters gradually toward negligible values without significant performance drops. In practice, this means adding a second, stronger weight-decay term that applies only to the weights falling below a threshold determined by the pruning target.
- Exponential Increase: The penalization factor increases exponentially during the training process, which ensures that sparsity is induced only after allowing the network to learn sufficiently. This strategy minimizes the adverse impacts of sudden weight changes on learning efficiency.
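One simple way to realize such an exponential ramp-up (the paper's exact schedule may differ) is a geometric interpolation between the two bounds:

a(t) = a_{min} * (a_{max} / a_{min})^(t / T)

where t is the current training step and T is the total number of training steps, so a equals a_{min} at the start of training and reaches a_{max} at the end.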
Implementation Steps
SWD involves the following steps:
- Define the initial and maximum penalization strengths, a_{min} and a_{max}.
- Regularize the network with weight decay, applying stronger decay selectively to weights below the dynamic threshold.
- Continuously increase the penalization factor from a_{min} to a_{max} over the course of training (a minimal code sketch of these steps follows below).
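A minimal PyTorch sketch of these steps is given below. It assumes the pruning criterion is global weight magnitude; the helper names (swd_penalty, a_schedule) and the masking details are illustrative and are not taken from the authors' reference implementation:

```python
import torch

def swd_penalty(model, pruning_target, a):
    """Selective weight-decay term (illustrative sketch, not the authors' code):
    penalize only the weights that would currently be pruned, i.e. the
    smallest-magnitude fraction of weights given the pruning target."""
    params = [p for p in model.parameters() if p.dim() > 1]  # weight matrices / conv kernels
    with torch.no_grad():
        flat = torch.cat([p.reshape(-1).abs() for p in params])
        k = int(pruning_target * flat.numel())               # number of weights to prune
        if k == 0:
            return params[0].new_zeros(())
        threshold = flat.kthvalue(k).values                  # dynamic magnitude threshold
    penalty = params[0].new_zeros(())
    for p in params:
        mask = (p.detach().abs() <= threshold).float()       # select weights below the threshold
        penalty = penalty + (mask * p).pow(2).sum()
    return 0.5 * a * penalty

def a_schedule(step, total_steps, a_min, a_max):
    """Exponential increase of the penalization factor from a_min to a_max."""
    return a_min * (a_max / a_min) ** (step / total_steps)

# Usage inside an ordinary training loop (model, criterion, optimizer, data defined elsewhere;
# the a_min/a_max values here are arbitrary examples):
#   a = a_schedule(step, total_steps, a_min=1e-1, a_max=1e4)
#   loss = criterion(model(x), y) + swd_penalty(model, pruning_target=0.9, a=a)
#   loss.backward(); optimizer.step()
```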
Adaptability and Flexibility
SWD is versatile: it can be applied at different structural granularities (e.g., individual weights or whole channels) and to network types beyond classic CNNs, such as Graph Convolutional Networks.
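For structured pruning, the same selection logic can operate at a coarser granularity. The sketch below is an illustrative adaptation (again, not the paper's code) that penalizes entire convolutional filters whose L2 norm falls within the fraction to be pruned; it assumes the model contains standard Conv2d layers:

```python
import torch
import torch.nn as nn

def structured_swd_penalty(model, pruning_target, a):
    """Channel-level variant (illustrative sketch): apply the selective decay to whole
    convolutional filters whose L2 norm falls in the bottom fraction given by the target."""
    filters, norms = [], []
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight                                      # shape: (out_channels, in, kH, kW)
            filters.append(w)
            norms.append(w.detach().reshape(w.size(0), -1).norm(dim=1))
    all_norms = torch.cat(norms)
    k = int(pruning_target * all_norms.numel())               # number of filters to prune
    if k == 0:
        return all_norms.new_zeros(())
    threshold = all_norms.kthvalue(k).values                  # filter-norm threshold
    penalty = all_norms.new_zeros(())
    for w, n in zip(filters, norms):
        mask = (n <= threshold).float().view(-1, 1, 1, 1)     # select whole filters
        penalty = penalty + (mask * w).pow(2).sum()
    return 0.5 * a * penalty
```

Penalizing whole filters in this way keeps the resulting network dense and hardware-friendly, at the cost of a coarser selection criterion than weight-level pruning.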
Practical Implementation Considerations
Computational Cost
SWD increases computation time by approximately 40-50% per epoch, a moderate cost compared to the substantial time saved by skipping the post-pruning fine-tuning phases that traditional methods require.
Hyperparameter Sensitivity
The efficacy of SWD is significantly influenced by the choice of its hyperparameters, a_{min} and a_{max}, which regulate the penalization intensity. Thorough hyperparameter tuning is advised to accommodate specific datasets and network architectures.
Experimental Evaluation
The experiments demonstrate that SWD performs particularly well at high pruning targets. The paper compares SWD with several baseline pruning methods, including magnitude pruning and LR-Rewinding, on standard benchmark datasets such as CIFAR-10 and ImageNet.
Significant findings include:
- SWD's consistent outperformance at aggressive pruning levels, where other methods suffer from abrupt performance drops.
- No need for fine-tuning after pruning, unlike iterative traditional methodologies.
- Successful application to diverse network architectures, including GCNs on non-visual tasks, affirming SWD's broad applicability.
Conclusion
Selective Weight Decay offers both a theoretical and a practical advance in neural network pruning, promoting efficiency and adaptability with minimal post-processing overhead. Its integration into broader deep learning practice promises improved scalability and flexibility. Future work could refine methods for determining its hyperparameters and extend its use to emerging neural architectures and new domains.