Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training (2102.02887v3)

Published 4 Feb 2021 in cs.LG, cs.AI, and cs.CV

Abstract: In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training and dense training. We further use ITOP to understand the underlying mechanism of Dynamic Sparse Training (DST) and indicate that the benefits of DST come from its ability to consider across time all possible parameters when searching for the optimal sparse connectivity. As long as there are sufficient parameters that have been reliably explored during training, DST can outperform the dense neural network by a large margin. We present a series of experiments to support our conjecture and achieve the state-of-the-art sparse training performance with ResNet-50 on ImageNet. More impressively, our method achieves dominant performance over the overparameterization-based sparse methods at extreme sparsity levels. When trained on CIFAR-100, our method can match the performance of the dense model even at an extreme sparsity (98%). Code can be found https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.

In-Time Over-Parameterization in Sparse Training: A Detailed Analysis

The research paper "Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training" introduces In-Time Over-Parameterization (ITOP) as a novel perspective on training deep neural networks. The concept challenges the traditional reliance on dense over-parameterization for achieving state-of-the-art performance in deep learning. The paper critiques current methods' heavy dependence on dense architectures and extensive computational resources, and presents ITOP as an efficient, sparse-training-based alternative.

Background and Motivation

Over-parameterization is widely regarded as a crucial factor in the exceptional performance of deep neural networks, despite the inherent non-convex and non-smooth nature of their training objectives. However, the substantial resources required to train and deploy such over-parameterized models remain a bottleneck, especially as models grow in size, as evidenced by the costly training regimes of models such as GPT-3 and Vision Transformers. Sparse training methods aim to remove this barrier by maintaining competitive performance with considerably reduced parameter counts.

In-Time Over-Parameterization (ITOP)

ITOP emerges from the need to close the expressibility gap between sparse and dense training. It leverages the training period itself to dynamically explore sparse connectivities, so that the parameter search, accumulated over time, becomes comparable in coverage to that of a dense network. Unlike pruning approaches, which traditionally start from a dense, often pre-trained model, ITOP begins with a random sparse network and continuously adjusts its connectivity during training.
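
To make the idea concrete, the connectivity update at the heart of this dynamic exploration can be written as a simple prune-and-regrow step on a binary weight mask. The sketch below is a minimal NumPy illustration under our own assumptions: the function names, the drop fraction, and the SET-style random regrowth are illustrative choices rather than the paper's exact procedure (gradient-based regrowth, as in RigL, would slot into the same place).

```python
import numpy as np

def init_mask(shape, sparsity, rng):
    """Random sparse mask: keep a (1 - sparsity) fraction of weights, chosen uniformly."""
    mask = np.zeros(int(np.prod(shape)), dtype=bool)
    n_keep = int(round((1.0 - sparsity) * mask.size))
    mask[rng.choice(mask.size, size=n_keep, replace=False)] = True
    return mask.reshape(shape)

def prune_and_regrow(weights, mask, drop_frac, rng):
    """One connectivity update: drop the smallest-magnitude active weights,
    then activate the same number of currently inactive connections at random."""
    active = np.flatnonzero(mask)
    n_drop = int(round(drop_frac * active.size))
    drop = active[np.argsort(np.abs(weights.flat[active]))[:n_drop]]
    mask.flat[drop] = False                     # prune
    inactive = np.flatnonzero(~mask)
    grow = rng.choice(inactive, size=n_drop, replace=False)
    mask.flat[grow] = True                      # regrow elsewhere
    weights.flat[grow] = 0.0                    # new connections start from zero
    return grow                                 # flat indices activated this update

# Toy setup: one layer kept at 90% sparsity from the very first step.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
mask = init_mask(W.shape, sparsity=0.9, rng=rng)
# ...gradient steps on the masked weights (W * mask) run between connectivity updates...
newly_grown = prune_and_regrow(W, mask, drop_frac=0.3, rng=rng)
```

Because every update deactivates some connections and activates others, the set of parameters that have been used at some point keeps growing even though the layer stays sparse at every instant; this is the over-parameterization in the space-time manifold that the abstract refers to.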

Dynamic Sparse Training (DST) and ITOP

The concept of ITOP is instrumental in understanding the mechanics of Dynamic Sparse Training (DST). The core advantage of DST is its ability to traverse many possible parameter configurations over the course of training while searching for an optimal sparse connectivity. Experimental results indicate that DST, analyzed through the lens of ITOP, can outperform dense networks by substantial margins, especially at extreme sparsity levels; for example, ResNet-34 trained on CIFAR-100 matched the performance of its dense counterpart even at 98% sparsity.
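
One way to make "sufficient reliable exploration" concrete is to track how many distinct connections have been active at least once during training, i.e., the fraction of all possible parameters visited in time. The snippet below continues the hypothetical toy setup from the previous sketch and illustrates that bookkeeping; it is our own illustration, not the authors' reference implementation, and it omits the gradient updates as well as any criterion for what counts as "reliable" exploration.

```python
# Continuing the toy setup above: reset the layer and record every connection
# that has ever been active while the mask is repeatedly updated.
W = rng.standard_normal((256, 128))
mask = init_mask(W.shape, sparsity=0.9, rng=rng)
explored = set(np.flatnonzero(mask).tolist())          # active at initialization

for update in range(200):
    # ...a few mini-batch gradient steps on (W * mask) would run here...
    explored.update(prune_and_regrow(W, mask, drop_frac=0.3, rng=rng).tolist())

r_s = len(explored) / W.size                            # fraction ever activated
print(f"in-time exploration rate: {r_s:.2f} of all parameters visited")
```

Although the layer never holds more than 10% of its weights at any single moment, this exploration rate climbs toward 1 as training proceeds, which is the sense in which the sparse model becomes over-parameterized in time rather than in space.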

Implications and Future Research Directions

  1. Enhanced Sparse Expressibility: ITOP significantly bridges the expressibility gap inherent in sparse training. It offers a viable pathway to match and in some cases exceed dense network performance with fewer parameters.
  2. Cost Reduction: By eliminating the need for dense over-parameterization, ITOP reduces training and inference costs, making state-of-the-art models accessible to a broader community.
  3. Understanding Sparse Dynamics: ITOP serves as a basis for exploring the sparse learning landscape, offering insights into the interplay between connectivity and performance.
  4. Generalization and Overfitting: ITOP also improves generalization and curbs overfitting; sparse models trained under ITOP continue to excel even with extended training durations, further supporting its efficacy.

Methodological Highlights

  • Sparse Exploration: ITOP makes a compelling case for dynamic exploration over static pruning, hypothesizing that the exploration of parameters throughout training accounts for DST's superior performance.
  • Practical Experiments: The researchers ran extensive experiments across different neural architectures (e.g., MLP, VGG-16, ResNet-34), demonstrating ITOP's ability to reach high accuracy at various sparsity levels with fewer FLOPs than traditional dense models; a rough sense of the scale of these savings is sketched below.
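
For a rough sense of the parameter and FLOP savings at the extreme sparsity levels discussed above, the following back-of-the-envelope calculation uses assumed, roughly ResNet-34-scale numbers rather than figures reported in the paper.

```python
# Illustrative parameter/FLOP accounting for a 98%-sparse model
# (assumed numbers, not results from the paper).
dense_params = 21_000_000                    # roughly ResNet-34 scale
sparsity = 0.98
active_params = int(dense_params * (1 - sparsity))
print(f"active parameters: {active_params:,} of {dense_params:,} "
      f"({100 * (1 - sparsity):.0f}%)")
# If sparse kernels can exploit the mask, per-example multiply-accumulates
# shrink by roughly the same (1 - sparsity) factor.
```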

Conclusion

This paper fundamentally challenges the notion that dense over-parameterization is the only pathway to high-performing neural networks. With ITOP, sparse training can achieve competitive and often superior results, highlighting a pivotal shift in resource-efficient deep learning methodologies. Future work may delve into optimizing ITOP for broader neural architectures and integrating it with other sparsity-inducing strategies to further enhance performance and reduce computational overhead.

Authors (4)
  1. Shiwei Liu (76 papers)
  2. Lu Yin (85 papers)
  3. Decebal Constantin Mocanu (52 papers)
  4. Mykola Pechenizkiy (118 papers)
Citations (117)