Rethinking Weight Decay For Efficient Neural Network Pruning

Published 20 Nov 2020 in cs.NE | (2011.10520v4)

Abstract: Introduced in the late 1980s for generalization purposes, pruning has now become a staple for compressing deep neural networks. Despite many innovations in recent decades, pruning approaches still face core issues that hinder their performance or scalability. Drawing inspiration from early work in the field, and especially the use of weight decay to achieve sparsity, we introduce Selective Weight Decay (SWD), which carries out efficient, continuous pruning throughout training. Our approach, theoretically grounded on Lagrangian smoothing, is versatile and can be applied to multiple tasks, networks, and pruning structures. We show that SWD compares favorably to state-of-the-art approaches, in terms of performance-to-parameters ratio, on the CIFAR-10, Cora, and ImageNet ILSVRC2012 datasets.

Summary

The paper introduces Selective Weight Decay (SWD) as a novel regularization technique to gradually induce sparsity during training.
It utilizes Lagrangian smoothing and an exponential increase in penalization to minimize performance drops at high pruning targets.
Experiments show that SWD compares favorably to baseline pruning methods, particularly at high pruning targets, without requiring a separate fine-tuning phase, and that it adapts to diverse network architectures.

Rethinking Weight Decay for Efficient Neural Network Pruning

The paper "Rethinking Weight Decay For Efficient Neural Network Pruning" proposes Selective Weight Decay (SWD), a pruning method that builds on weight decay to prune a network efficiently and continuously during training. The method is grounded in Lagrangian smoothing and can be applied across tasks, networks, and pruning structures.

SWD Methodology

Principle of SWD

Selective Weight Decay (SWD) is a differentiable regularization technique designed to induce sparsity by progressively penalizing certain weights during training according to predefined criteria. The key criterion used in the study is weight magnitude, a well-established proxy for parameter contribution to a network’s performance.

Lagrangian Smoothing: SWD acts as a Lagrangian smoothing of the pruning constraint. Rather than removing weights abruptly, it drives them gradually toward negligible values, which avoids significant performance drops. Concretely, it modifies regular weight decay so that a stronger decay applies selectively to the weights that fall below a threshold defined by the pruning target.
Exponential Increase: The penalization factor increases exponentially over the course of training, so strong sparsity pressure is applied only after the network has had time to learn. This limits the adverse impact of sudden weight changes on learning.
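
The selective penalty described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `swd_penalty`, the base decay coefficient `mu`, and the exact form of the penalty (ordinary squared-norm decay on every weight plus a term scaled by `a` on the targeted weights) are assumptions made for the example.

```python
def swd_penalty(weights, prune_fraction, mu, a):
    """Weight decay on all weights plus a stronger decay on pruning candidates.

    Assumed form: mu * sum(w^2) over all weights
                + a * mu * sum(w^2) over the targeted (smallest-magnitude) weights.
    """
    k = int(len(weights) * prune_fraction)  # how many weights the target removes
    # Rank weights by magnitude; the k smallest are the pruning candidates.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    targeted = set(order[:k])
    base = mu * sum(w * w for w in weights)
    selective = a * mu * sum(weights[i] ** 2 for i in targeted)
    return base + selective, targeted


# Usage: with a 50% pruning target, the two smallest-magnitude weights
# (indices 1 and 3) receive the extra decay.
penalty, targeted = swd_penalty([0.5, -0.1, 2.0, 0.05], 0.5, mu=0.1, a=10.0)
```

Because the targeted set is recomputed from the current magnitudes, a weight can leave or re-enter it as training proceeds, which is what makes the threshold dynamic.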

Implementation Steps

SWD involves the following steps:

Define the initial and maximum penalization strengths, a_{min} and a_{max}.
Regularize the network with weight decay, applying stronger decay selectively to weights below the dynamic threshold.
Continuously update the penalization factor between a_{min} and a_{max} over the training lifecycle.
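
The steps above can be sketched on a toy problem. The sketch below is hedged in several ways: the geometric interpolation in `swd_coefficient` is one natural reading of "exponential increase" between a_{min} and a_{max}, and the quadratic objective, the base decay `mu`, the learning rate, and all function names are hypothetical choices made only to keep the example self-contained.

```python
def swd_coefficient(step, total_steps, a_min, a_max):
    """Grow the penalization factor geometrically from a_min to a_max."""
    progress = step / max(total_steps - 1, 1)  # 0.0 at the start, 1.0 at the end
    return a_min * (a_max / a_min) ** progress


def smallest_indices(weights, prune_fraction):
    """Indices of the smallest-magnitude weights under the pruning target."""
    k = int(len(weights) * prune_fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    return set(order[:k])


def train_with_swd(targets, prune_fraction, mu, a_min, a_max, lr, total_steps):
    """Gradient descent on sum((w - t)^2) with the assumed SWD penalty."""
    weights = [0.0] * len(targets)
    for step in range(total_steps):
        a = swd_coefficient(step, total_steps, a_min, a_max)
        targeted = smallest_indices(weights, prune_fraction)  # dynamic threshold
        for i, (w, t) in enumerate(zip(weights, targets)):
            grad = 2.0 * (w - t) + 2.0 * mu * w  # task loss + plain weight decay
            if i in targeted:
                grad += 2.0 * a * mu * w  # stronger, selective decay
            weights[i] = w - lr * grad
    return weights


# Toy run: a 50% pruning target on four weights.
final = train_with_swd([1.0, -0.8, 0.1, 0.05], 0.5,
                       mu=0.01, a_min=0.1, a_max=1e4, lr=0.005, total_steps=2000)
```

In this toy setting the two targeted weights end close to zero while the other two stay near their targets, so zeroing the targeted weights at the end changes the loss very little. That is the intuition behind pruning continuously instead of pruning abruptly and then fine-tuning.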

Adaptability and Flexibility

SWD is versatile, applying to different levels of network structures (e.g., individual weights, channels) and accommodating different network types beyond classic CNNs, such as Graph Convolutional Networks.
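
A structured variant can be sketched by ranking whole channels instead of individual weights. This is only an illustration: ranking channels by their L2 norm is an assumption made here (the source does not state which criterion is used for channels), and the nested-list layout and function names are hypothetical.

```python
import math


def targeted_channels(channels, prune_fraction):
    """Pick the channels with the smallest L2 norm as pruning candidates."""
    norms = [math.sqrt(sum(w * w for w in ch)) for ch in channels]
    k = int(len(channels) * prune_fraction)
    order = sorted(range(len(channels)), key=lambda c: norms[c])
    return set(order[:k])


def structured_swd_penalty(channels, prune_fraction, mu, a):
    """Plain decay on every weight, stronger decay on whole targeted channels."""
    targeted = targeted_channels(channels, prune_fraction)
    base = mu * sum(w * w for ch in channels for w in ch)
    selective = a * mu * sum(w * w for c in targeted for w in channels[c])
    return base + selective


# Usage: four channels of two weights each, 50% of channels targeted.
channels = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [0.0, 0.5]]
chosen = targeted_channels(channels, 0.5)
penalty = structured_swd_penalty(channels, 0.5, mu=0.1, a=10.0)
```

Only the selection step changes between the unstructured and structured cases; the penalty itself keeps the same shape, which is one way to read the claim that the method carries over to different pruning structures.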

Practical Implementation Considerations

Computational Cost

SWD increases computation time by approximately 40-50% per epoch. This cost is moderate compared with the time saved by skipping the additional post-pruning fine-tuning phases that traditional methods commonly require.
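
The trade-off can be made concrete with a back-of-the-envelope calculation. Apart from the 40-50% overhead range quoted above, every number below is hypothetical (the epoch counts and the number of prune-and-fine-tune rounds are not taken from the paper), and cost is measured in units of one ordinary training epoch.

```python
def swd_cost(train_epochs, overhead):
    """Total cost of one SWD training run, in plain-epoch equivalents."""
    return train_epochs * (1.0 + overhead)


def prune_finetune_cost(train_epochs, finetune_epochs_per_round, rounds):
    """Total cost of training once, then pruning and fine-tuning in rounds."""
    return train_epochs + finetune_epochs_per_round * rounds


def break_even_finetune_epochs(train_epochs, overhead):
    """Total fine-tuning epochs at which both approaches cost the same."""
    return train_epochs * overhead


# Hypothetical scenario: 100 training epochs, 50% SWD overhead,
# versus 3 rounds of 30 fine-tuning epochs for an iterative baseline.
swd_total = swd_cost(100, 0.5)
baseline_total = prune_finetune_cost(100, 30, 3)
break_even = break_even_finetune_epochs(100, 0.5)
```

Under these assumed numbers SWD comes out ahead, but the break-even point shows the comparison depends on how much fine-tuning the baseline actually needs: a baseline with fewer total fine-tuning epochs than `break_even` would be cheaper.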

Hyperparameter Sensitivity

The efficacy of SWD is significantly influenced by the choice of its hyperparameters, a_{min} and a_{max}, which regulate the penalization intensity. Thorough hyperparameter tuning is advised to accommodate specific datasets and network architectures.

Experimental Evaluation

The experiments show that SWD compares favorably to the baselines, most notably at high pruning targets. The paper compares it with several baseline pruning methods, including magnitude pruning and LR-Rewinding, on standard benchmark datasets such as CIFAR-10 and ImageNet.

Significant findings include:

SWD's consistent outperformance at aggressive pruning levels, where other methods suffer from abrupt performance drops.
No need for fine-tuning after pruning, unlike iterative traditional methodologies.
Successful application to diverse network architectures, including GCNs on non-visual tasks, affirming SWD's broad applicability.

Conclusion

Selective Weight Decay offers both a theoretical and a practical advance in neural network pruning, combining efficiency and adaptability with minimal post-processing overhead. Future work could refine how its hyperparameters are chosen and extend it to emerging neural architectures and new domains.
