Picking Winning Tickets Before Training by Preserving Gradient Flow (2002.07376v2)

Published 18 Feb 2020 in cs.LG, cs.CV, and stat.ML

Abstract: Overparameterization has been shown to benefit both the optimization and generalization of neural networks, but large networks are resource hungry at both training and test time. Network pruning can reduce test-time resource requirements, but is typically applied to trained networks and therefore cannot avoid the expensive training process. We aim to prune networks at initialization, thereby saving resources at training time as well. Specifically, we argue that efficient training requires preserving the gradient flow through the network. This leads to a simple but effective pruning criterion we term Gradient Signal Preservation (GraSP). We empirically investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet, using VGGNet and ResNet architectures. Our method can prune 80% of the weights of a VGG-16 network on ImageNet at initialization, with only a 1.6% drop in top-1 accuracy. Moreover, our method achieves significantly better performance than the baseline at extreme sparsity levels.

Understanding "Picking Winning Tickets Before Training by Preserving Gradient Flow"

The paper “Picking Winning Tickets Before Training by Preserving Gradient Flow” by Wang et al. presents a novel approach to network pruning. It addresses the challenge of reducing the computational cost of training overparameterized models without sacrificing their generalization capabilities.

Pruning at Initialization

A key contribution of this work is a pruning method that is applied before training rather than after a network has been trained. Pruning is typically applied post-training, which negates any potential savings in training computation. By contrast, the proposed method, Gradient Signal Preservation (GraSP), prunes networks at initialization, thereby reducing resource consumption during training as well.

GraSP as a Pruning Criterion

The central idea of GraSP is to maintain, or even increase, the gradient flow through the network after pruning. The method identifies and preserves the connections that are crucial for efficient training dynamics, specifically the weights that sustain the network's gradient signal. GraSP evaluates a weight's importance by how much its removal would reduce the gradient flow, relying on automatic differentiation to compute a Hessian-gradient product efficiently; this product yields the per-weight score used for pruning decisions.
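To make the criterion concrete, below is a minimal PyTorch sketch of this scoring scheme under simplifying assumptions (it scores every trainable parameter, including biases, on a single batch; the function names grasp_scores and grasp_masks are illustrative, not the authors' reference implementation). The Hessian-gradient product Hg is obtained by a second backward pass through gᵀ·stop_grad(g), each weight is scored as S(θ) = -θ ⊙ Hg, and the highest-scoring weights are removed because their removal reduces gradient flow the least.

```python
# Minimal sketch of the GraSP scoring idea (not the authors' reference code).
# Score each weight by S(theta) = -theta * (H g), where g is the loss gradient
# and H g is a Hessian-gradient product obtained via double backpropagation.
import torch
import torch.nn as nn

def grasp_scores(model, loss_fn, inputs, targets):
    """Per-parameter GraSP scores; a larger score means the weight is safer to remove."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First-order gradient g, keeping the graph so we can differentiate again.
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Differentiate g^T stop_grad(g) to obtain the Hessian-gradient product H g.
    gdotg = sum((g * g.detach()).sum() for g in grads)
    Hg = torch.autograd.grad(gdotg, params)

    # GraSP score: -theta * (H g). Removing high-score weights hurts gradient flow least.
    return [-(p.data * hg) for p, hg in zip(params, Hg)]

def grasp_masks(scores, sparsity):
    """Binary masks that drop the `sparsity` fraction of weights with the highest scores."""
    flat = torch.cat([s.flatten() for s in scores])
    k = int(sparsity * flat.numel())
    threshold = torch.topk(flat, k, largest=True).values[-1]
    return [(s < threshold).float() for s in scores]  # ties at the threshold are also dropped

# Toy usage: prune a small MLP to 80% sparsity at initialization.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
masks = grasp_masks(grasp_scores(model, nn.CrossEntropyLoss(), x, y), sparsity=0.8)
for p, m in zip([p for p in model.parameters() if p.requires_grad], masks):
    p.data.mul_(m)  # zero out pruned weights before training begins
```

In practice the resulting masks would also be reapplied after every optimizer step (or the masked weights excluded from updates) so that the network stays sparse throughout training.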

Empirical Validation

The authors carried out extensive experiments with VGGNet and ResNet architectures on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. Notably, GraSP pruned 80% of VGG-16's weights on ImageNet at initialization with only a 1.6% drop in top-1 accuracy. Comparisons indicate that GraSP significantly surpasses baseline approaches at extreme sparsity levels, highlighting its robustness and effectiveness.

Theoretical Implications

A noteworthy theoretical underpinning of this paper is its connection to the Neural Tangent Kernel (NTK) framework, which provides a lens through which to view and justify the proposed pruning strategy. By relating gradient-flow preservation to NTK dynamics, the authors argue that a preserved gradient norm yields more efficient training dynamics, offering deeper theoretical insight into how sparsification interacts with neural network optimization.
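As a rough illustration of this link (assuming a squared-error loss for simplicity; this is a standard identity rather than a quotation from the paper), the gradient norm that GraSP seeks to preserve is exactly the NTK-weighted norm of the residual, which governs the rate of loss decrease under continuous-time gradient descent:

```latex
% e = f(x;\theta) - y : residual,   J = \partial f / \partial \theta : Jacobian,
% \Theta = J J^{\top} : empirical Neural Tangent Kernel (Gram matrix on the data).
\nabla_{\theta} L = J^{\top} e,
\qquad
\frac{dL}{dt} = -\,\|\nabla_{\theta} L\|^{2} = -\, e^{\top} \Theta\, e .
```

Preserving the gradient norm after pruning therefore amounts to preserving e^T Θ e, i.e., keeping the pruned network's NTK informative along the current residual direction.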

Implications and Future Directions

The implications of this paper extend beyond resource efficiency in training. Pruning at initialization, as enabled by GraSP, opens the door to training models too large to fit on existing hardware, expanding the feasible scope of neural architectures. It also suggests new research directions, notably the development of optimizers that can follow paths from sparse initializations to the high-performing solutions found by traditional post-training pruning.

In conclusion, the research presented by Wang et al. offers substantial advancements in network pruning. By optimizing networks at initialization through a gradient-preserving criterion, they address computational cost concerns while maintaining competitive performance, pointing towards more resource-effective and scalable neural networks for the future.

Authors (3)
  1. Chaoqi Wang (16 papers)
  2. Guodong Zhang (41 papers)
  3. Roger Grosse (68 papers)
Citations (558)