Rethinking Weight Decay For Efficient Neural Network Pruning (2011.10520v4)

Published 20 Nov 2020 in cs.NE

Abstract: Introduced in the late 1980s for generalization purposes, pruning has now become a staple for compressing deep neural networks. Despite many innovations in recent decades, pruning approaches still face core issues that hinder their performance or scalability. Drawing inspiration from early work in the field, and especially the use of weight decay to achieve sparsity, we introduce Selective Weight Decay (SWD), which carries out efficient, continuous pruning throughout training. Our approach, theoretically grounded on Lagrangian smoothing, is versatile and can be applied to multiple tasks, networks, and pruning structures. We show that SWD compares favorably to state-of-the-art approaches, in terms of performance-to-parameters ratio, on the CIFAR-10, Cora, and ImageNet ILSVRC2012 datasets.

Citations (22)

Summary

  • The paper introduces Selective Weight Decay (SWD) as a novel regularization technique to gradually induce sparsity during training.
  • It utilizes Lagrangian smoothing and an exponential increase in penalization to minimize performance drops at high pruning targets.
  • Experiments show that SWD compares favorably to traditional methods at aggressive pruning targets, while removing the need for post-pruning fine-tuning and adapting to diverse network architectures.

Rethinking Weight Decay for Efficient Neural Network Pruning

The paper "Rethinking Weight Decay For Efficient Neural Network Pruning" provides a novel approach to neural network pruning, emphasizing a technique coined Selective Weight Decay (SWD). This method leverages weight decay principles, enhancing them for efficient and continuous pruning during training. The work is grounded in the concept of Lagrangian smoothing, allowing for versatile application across tasks, networks, and pruning structures.

SWD Methodology

Principle of SWD

Selective Weight Decay (SWD) is a differentiable regularization technique designed to induce sparsity by progressively penalizing certain weights during training according to predefined criteria. The key criterion used in the paper is weight magnitude, a well-established proxy for parameter contribution to a network’s performance.

  1. Lagrangian Smoothing: SWD effectively acts as a Lagrangian smoothing mechanism, enabling gradual parameter reduction to negligible values without significant performance drops. This involves modifying the regular weight decay to apply selectively to weights below a specific threshold defined by the pruning target.
  2. Exponential Increase: The penalization factor increases exponentially during the training process, which ensures that sparsity is induced only after allowing the network to learn sufficiently. This strategy minimizes the adverse impacts of sudden weight changes on learning efficiency.
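
To make these two ideas concrete, the following is a minimal PyTorch-style sketch, not the authors' reference implementation: the helper names, the quantile-based threshold, and the exact form of the schedule are illustrative assumptions. The penalization factor grows exponentially from a_{min} to a_{max}, and an extra decay term is applied only to the weights that magnitude pruning would remove at the current target.

```python
import torch

def a_schedule(step, total_steps, a_min, a_max):
    """Exponentially interpolate the penalization factor from a_min to a_max."""
    return a_min * (a_max / a_min) ** (step / total_steps)

def swd_penalty(model, prune_ratio, a_t):
    """Extra decay on the weights that magnitude pruning would remove
    at the current target ratio (unstructured case)."""
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_weights, prune_ratio)   # dynamic magnitude threshold
    penalty = 0.0
    for p in model.parameters():
        mask = (p.detach().abs() < threshold).float()      # weights below the threshold
        penalty = penalty + (mask * p).pow(2).sum()
    return a_t * penalty
```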

Implementation Steps

SWD involves the following steps:

  • Define the initial and maximum penalization strengths, a_{min} and a_{max}.
  • Regularize the network with weight decay, applying stronger decay selectively to weights below the dynamic threshold.
  • Exponentially increase the penalization factor from a_{min} to a_{max} over the course of training, as sketched below.
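
A minimal end-to-end training-loop sketch follows, assuming the a_schedule and swd_penalty helpers from the previous snippet; the toy model, data, and hyperparameter values are placeholders rather than settings from the paper.

```python
import torch
import torch.nn as nn

# Toy model and data, purely for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)

a_min, a_max = 1e-1, 1e4       # illustrative penalization bounds
prune_ratio = 0.9              # target fraction of weights to prune
epochs, steps_per_epoch = 10, 100
total_steps, step = epochs * steps_per_epoch, 0

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
        a_t = a_schedule(step, total_steps, a_min, a_max)
        loss = criterion(model(x), y) + swd_penalty(model, prune_ratio, a_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1

# After training, the selected weights have been driven close to zero and can
# be removed in a single magnitude-pruning pass, without post-pruning fine-tuning.
```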

Adaptability and Flexibility

SWD is versatile: it applies at different levels of network structure (e.g., individual weights, channels) and accommodates network types beyond classic CNNs, such as Graph Convolutional Networks (GCNs).
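
For the structured (channel-level) case, the selection criterion can move from individual weights to whole convolution filters. The sketch below is an assumption about how such a variant might look (the helper name and the L2-norm criterion are not taken from the paper): it penalizes every output filter whose norm falls below the quantile implied by the pruning target.

```python
import torch

def swd_filter_penalty(conv_layers, prune_ratio, a_t):
    """Structured variant sketch: extra decay on whole output filters whose
    L2 norm falls below the quantile implied by the pruning target."""
    norms = torch.cat([layer.weight.detach().flatten(1).norm(dim=1)
                       for layer in conv_layers])
    threshold = torch.quantile(norms, prune_ratio)
    penalty = 0.0
    for layer in conv_layers:
        w = layer.weight                                   # shape: (out, in, kh, kw)
        mask = (w.detach().flatten(1).norm(dim=1) < threshold).float()
        penalty = penalty + (mask.view(-1, 1, 1, 1) * w).pow(2).sum()
    return a_t * penalty
```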

Practical Implementation Considerations

Computational Cost

SWD increases computation time by approximately 40-50% per epoch, a moderate cost when weighed against the time saved by skipping the post-pruning fine-tuning phases that traditional methods require.

Hyperparameter Sensitivity

The efficacy of SWD is significantly influenced by the choice of its hyperparameters, a_{min} and a_{max}, which regulate the penalization intensity. Thorough hyperparameter tuning is advised to accommodate specific datasets and network architectures.

Experimental Evaluation

The experiments demonstrate SWD's superior performance, notably under high pruning targets. The paper highlights comparisons with several baseline pruning methodologies, including magnitude pruning and LR-Rewinding, using standardized datasets like CIFAR-10 and ImageNet.

Significant findings include:

  • SWD's consistent outperformance at aggressive pruning levels, where other methods suffer from abrupt performance drops.
  • No need for fine-tuning after pruning, unlike traditional iterative methodologies.
  • Successful application to diverse network architectures, including GCNs on non-visual tasks, affirming SWD's broad applicability.

Conclusion

Selective Weight Decay offers a theoretical and practical advance in neural network pruning, promoting efficiency and adaptability with minimal post-processing overhead. Its integration into broader deep learning practice promises improved scalability and flexibility. Future work could refine methods for determining its hyperparameters and extend its utility to emerging neural architectures and new domains.
