- The paper presents the AC/DC method that enforces sparsity early in training by alternating between compressed and decompressed model phases.
- It establishes a theoretical framework using the Concentrated PL condition and demonstrates superior accuracy with reduced computational cost compared to leading sparse training methods.
- The method delivers both sparse and dense models, offering practical benefits for resource-constrained applications and insights into the memorization and robustness of sparse networks.
An Analytical and Practical Approach to Sparse Deep Neural Network Training with the AC/DC Method
This essay presents an expert analysis of the paper "AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks," which introduces a novel method for sparse training of deep neural networks (DNNs), a critical area within efficiency-focused neural network research. The text covers both the theoretical underpinnings and the practical implementation of the AC/DC method, a strategy that reduces the computational burden of DNN training by enforcing a sparsity constraint early in the training process, alternating between phases in which the model is compressed (pruned) and phases in which it trains densely.
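To make the alternation concrete, the following PyTorch-style sketch illustrates one way such a schedule could be driven. The phase lengths, sparsity level, mask granularity, and helper names here are illustrative assumptions for the sketch, not the paper's exact hyperparameters or implementation.

```python
import torch

def topk_magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude entries of `weight`."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

def run_acdc_schedule(model, optimizer, loss_fn, loader,
                      epochs=100, phase_len=5, sparsity=0.9):
    """Alternate compressed (masked) and decompressed (dense) phases.

    Illustrative schedule: even-numbered phases are sparse, odd-numbered
    phases are dense. Real schedules typically warm up dense and end sparse;
    this is only a sketch of the alternation itself.
    """
    masks = {}
    for epoch in range(epochs):
        compressed = (epoch // phase_len) % 2 == 0
        if compressed and not masks:
            # Entering a compressed phase: prune by magnitude and remember masks.
            for name, p in model.named_parameters():
                if p.dim() > 1:  # prune weight tensors, not biases
                    masks[name] = topk_magnitude_mask(p.data, sparsity)
                    p.data.mul_(masks[name])
        elif not compressed and masks:
            # Entering a decompressed phase: drop masks so pruned weights can regrow.
            masks.clear()

        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
            # Keep pruned coordinates at zero during compressed phases.
            if masks:
                with torch.no_grad():
                    for name, p in model.named_parameters():
                        if name in masks:
                            p.mul_(masks[name])
```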
Theoretical Contributions of AC/DC
The AC/DC method builds upon the classic Iterative Hard Thresholding (IHT) approach, a well-known algorithmic family from the compressed sensing literature. The paper equips IHT with stochastic updates suitable for DNNs, navigating the nontrivial challenge of ensuring convergence under the noise inherent in stochastic gradient methods. It achieves this through a convergence proof based on a specialized form of the Polyak-Łojasiewicz (PL) condition, termed the Concentrated PL (CPL) condition. The CPL condition assumes that the gradient magnitude concentrates on a sparse subset of weights, thereby providing theoretical guarantees of convergence to a sparse solution that meets or exceeds a target sparsity level.
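To anchor these ideas, the classical IHT update and the standard PL inequality can be written as below. The final inequality is only a hedged paraphrase of the concentration requirement described above, with an illustrative constant gamma; it is not the paper's formal CPL statement.

```latex
% Classical IHT step: a gradient step followed by hard thresholding, where
% T_k keeps the k largest-magnitude coordinates and zeroes the rest.
\[
  \theta_{t+1} \;=\; T_k\!\bigl(\theta_t - \eta\,\nabla f(\theta_t)\bigr)
\]

% Standard Polyak-Lojasiewicz (PL) inequality with constant \mu > 0:
\[
  \tfrac{1}{2}\,\lVert \nabla f(\theta) \rVert^2 \;\ge\; \mu\,\bigl(f(\theta) - f^\ast\bigr)
\]

% Concentration idea behind CPL (paraphrase only): a fixed fraction of the
% gradient's squared norm is carried by its k largest-magnitude entries.
\[
  \lVert T_k\!\bigl(\nabla f(\theta)\bigr) \rVert^2 \;\ge\; \gamma\,\lVert \nabla f(\theta) \rVert^2,
  \qquad \gamma \in (0,1].
\]
```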
Empirical Validation and Comparison
In its empirical evaluation, the paper reports comparative results across benchmark datasets (CIFAR-100, ImageNet) and architectures (ResNet50, MobileNetV1, Transformer-XL) against leading sparse training methods such as RigL and Top-KAST. AC/DC shows superior accuracy and computational efficiency, particularly at high sparsity levels, where it frequently exceeds the performance of state-of-the-art post-training pruning approaches such as WoodFisher. The paper argues convincingly for AC/DC's competitive edge in environments with constrained computational budgets, citing approximately 0.53× the training FLOPs of the dense baseline at ultra-high sparsity on ResNet50, while the dense models it produces match or exceed baseline accuracy after fine-tuning.
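As a rough illustration of where such savings come from, one can estimate relative training FLOPs from the fraction of epochs spent in each phase and the per-step cost of a sparse epoch. The numbers below are hypothetical and chosen only to land near the reported figure; this is not the paper's FLOP accounting.

```python
def relative_training_flops(dense_epoch_frac: float,
                            sparse_epoch_frac: float,
                            sparse_step_cost: float) -> float:
    """Back-of-envelope estimate of training FLOPs relative to a fully dense run.

    dense_epoch_frac  -- fraction of epochs trained with all weights active
    sparse_epoch_frac -- fraction of epochs trained on the pruned model
    sparse_step_cost  -- cost of a sparse step relative to a dense one
                         (depends on sparsity level and which layers are pruned)
    """
    assert abs(dense_epoch_frac + sparse_epoch_frac - 1.0) < 1e-9
    return dense_epoch_frac * 1.0 + sparse_epoch_frac * sparse_step_cost

# Hypothetical numbers: 40% dense epochs and sparse steps at 0.2x dense cost
# give 0.4 + 0.6 * 0.2 = 0.52x the dense training FLOPs, in the same
# ballpark as the ~0.53x figure cited for ResNet50 above.
print(relative_training_flops(0.4, 0.6, 0.2))
```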
Practical Implications and Insights
A salient characteristic of the AC/DC approach is that a single training run yields both a sparse and a dense model. This is of practical relevance for applications requiring resource-constrained inference, and it simultaneously provides insight into the dynamics of sparsity and memorization within DNNs. Interestingly, the AC/DC co-training regimen highlights the relative inability of sparse networks to memorize random labels compared to their dense counterparts, suggesting potential robustness advantages of sparse models, a subject of ongoing investigation into how DNNs learn and memorize.
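A minimal sketch of such a random-label memorization probe, assuming a generic PyTorch classifier and an in-memory dataset, could look like the following; it is not the paper's exact experimental protocol.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def random_label_fit(model, inputs, num_classes, epochs=50, lr=0.1):
    """Train `model` on randomly relabeled inputs and return training accuracy.

    A dense network can usually drive this accuracy close to 100%, while a
    heavily sparsified one often cannot, which is the memorization gap
    discussed above.
    """
    labels = torch.randint(0, num_classes, (inputs.size(0),))  # discard true labels
    loader = DataLoader(TensorDataset(inputs, labels), batch_size=128, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```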
Moreover, AC/DC’s adaptability was validated through experiments with semi-structured sparsity patterns, such as the 2:4 pattern supported by modern GPU architectures, demonstrating that the method can keep pace with hardware advances.
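For reference, a 2:4 pattern keeps at most two nonzero weights in every contiguous group of four. The snippet below sketches a magnitude-based projection onto that pattern; it is illustrative and not tied to the paper's implementation or to any vendor library.

```python
import torch

def project_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the two smallest-magnitude entries in each group of four.

    Assumes the number of elements is divisible by 4, as required by the
    2:4 semi-structured sparsity pattern accelerated on recent GPUs.
    """
    groups = weight.reshape(-1, 4)                 # view weights in groups of 4
    idx = groups.abs().topk(2, dim=1).indices      # keep the top-2 magnitudes
    mask = torch.zeros_like(groups).scatter_(1, idx, 1.0)
    return (groups * mask).reshape(weight.shape)

# Example: a 1x8 weight row is pruned to two nonzeros per group of four.
w = torch.tensor([[0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.03, 0.6]])
print(project_2_4(w))
```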
Future Perspective
As the findings indicate, AC/DC sits at the influential intersection of efficient training protocols and hardware utilization. Future work could optimize the lengths and frequency of the sparse and dense phases, potentially unlocking additional computational savings without degrading model accuracy. Additionally, scaling the method to larger, more complex models in other modalities, including natural language processing, presents a ripe avenue for further sparsity-driven DNN research.
In summary, the paper substantiates AC/DC as a theoretically grounded and empirically validated advance in sparse neural network training, delivering efficiency and accuracy advantages that align well with real-world deployment in computationally constrained settings.