Accelerating Sparse Deep Neural Networks (2104.08378v1)

Published 16 Apr 2021 in cs.LG, cs.AI, and cs.AR

Abstract: As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity - encouraging zero values in parameters that can then be discarded from storage or computations. While most research focuses on high levels of sparsity, there are challenges in universally maintaining model accuracy as well as achieving significant speedups over modern matrix-math hardware. To make sparsity adoption practical, the NVIDIA Ampere GPU architecture introduces sparsity support in its matrix-math units, Tensor Cores. We present the design and behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern that leads to twice the math throughput of dense matrix units. We also describe a simple workflow for training networks that both satisfy 2:4 sparsity pattern requirements and maintain accuracy, verifying it on a wide range of common tasks and model architectures. This workflow makes it easy to prepare accurate models for efficient deployment on Sparse Tensor Cores.

Accelerating Sparse Deep Neural Networks: A Technical Overview

The paper "Accelerating Sparse Deep Neural Networks" addresses a critical challenge in the field of deep learning: the efficient execution of expansive neural network models that often contain hundreds of billions of parameters and demand trillions of computations per input sample. The authors explore sparsity as a means to mitigate these computational burdens by focusing on how zero values in network parameters can be leveraged to reduce storage and computational costs. Particularly, the work concentrates on introducing a specific kind of structured sparsity—known as 2:4 sparsity—and its practical implementation through Sparse Tensor Cores in NVIDIA's Ampere GPU architecture.

Sparse Tensor Cores and 2:4 Sparsity Pattern

The 2:4 sparsity pattern specified in the paper mandates that within any group of four consecutive weights, at least two of them are zero. This configuration results in a structured 50% sparsity across the neural network parameters, allowing for effective compression and storage while retaining model accuracy. Sparse Tensor Cores, part of the NVIDIA Ampere architecture, capitalize on this pattern by doubling math throughput for matrix operations—a significant boon considering that matrix multiplication lies at the heart of neural network computations such as convolutions and linear layers.
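
To make the pattern concrete, here is a minimal PyTorch sketch (not the paper's implementation) that builds a 2:4 mask by keeping the two largest-magnitude weights in each group of four consecutive elements along a row; the helper name and shapes are illustrative assumptions.

```python
import torch

def make_2to4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Binary mask keeping the 2 largest-magnitude values in every group
    of 4 consecutive elements along the last dimension.
    Assumes a 2-D weight whose last dimension is divisible by 4."""
    out_features, in_features = weight.shape
    groups = weight.abs().reshape(out_features, in_features // 4, 4)
    topk = groups.topk(2, dim=-1).indices   # positions of the 2 kept values
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(out_features, in_features)

# Example: prune a 4x8 weight matrix to the 2:4 pattern.
w = torch.randn(4, 8)
sparse_w = w * make_2to4_mask(w)   # exactly two nonzeros per group of four
```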

A central claim of the work is that Sparse Tensor Cores deliver up to a 2x speedup over dense matrix operations by exploiting this structured sparsity. The realized speedup depends on factors such as arithmetic intensity and the specific GEMM dimensions. The 2:4 compressed format also reduces storage needs and memory bandwidth, with reported storage savings of approximately 44% for 16-bit operands and 38% for 8-bit operands.
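
The reported savings follow from simple bookkeeping: for every four dense values, the compressed format stores the two kept values plus per-value index metadata (assumed here to be 2 bits per kept value, which reproduces the figures above).

```python
# Back-of-the-envelope size estimate for the 2:4 compressed format,
# assuming 2 bits of index metadata per kept value (an assumption,
# not the exact hardware encoding).
def compressed_fraction(bits_per_value: int, meta_bits: int = 2) -> float:
    dense_bits = 4 * bits_per_value                # four dense values
    kept_bits = 2 * (bits_per_value + meta_bits)   # two values + indices
    return kept_bits / dense_bits

print(1 - compressed_fraction(16))  # ~0.44 savings for 16-bit operands
print(1 - compressed_fraction(8))   # ~0.38 savings for 8-bit operands
```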

Practical Workflow for Sparse Network Deployment

To ensure the usability of 2:4 sparsity across diverse neural network architectures and tasks, the paper presents a straightforward three-step workflow (a minimal code sketch follows the list):

  1. Dense Training: Train the network without any sparsity constraints to baseline performance.
  2. Pruning: Apply the 2:4 sparsity pattern by eliminating weights according to predefined criteria such as weight magnitude.
  3. Sparse Retraining: Retrain the pruned network while enforcing the sparsity pattern, repeating the original training schedule and hyper-parameters to recover any lost accuracy.
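
A minimal PyTorch-style sketch of the dense-train / prune / retrain loop is shown below. The model, data loader, and training loop are toy placeholders, and the masking reuses the make_2to4_mask helper sketched earlier rather than any vendor tooling; only the three-step structure mirrors the paper.

```python
import torch

# Toy placeholders standing in for a real model and dataset.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))
loader = [(torch.randn(8, 16), torch.randint(0, 10, (8,))) for _ in range(10)]

def compute_masks(model):
    """One-shot magnitude pruning: a fixed 2:4 mask per Linear layer
    (uses the make_2to4_mask helper from the earlier sketch)."""
    return {m: make_2to4_mask(m.weight.detach())
            for m in model.modules() if isinstance(m, torch.nn.Linear)}

def apply_masks(masks):
    """Zero out weights that fall outside their fixed masks."""
    with torch.no_grad():
        for module, mask in masks.items():
            module.weight.mul_(mask)

def train(model, loader, epochs, lr=1e-3, masks=None):
    """Generic training loop (a placeholder, not the paper's exact recipe).
    If masks are given, re-apply them after every step so pruned weights stay zero."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            if masks is not None:
                apply_masks(masks)

# Step 1: dense training with the original hyper-parameters.
train(model, loader, epochs=90)
# Step 2: one-shot magnitude pruning to the 2:4 pattern (static masks).
masks = compute_masks(model)
apply_masks(masks)
# Step 3: sparse retraining, repeating the original schedule with masks enforced.
train(model, loader, epochs=90, masks=masks)
```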

While it roughly doubles total training time, this approach provides a streamlined and generalizable process for producing sparse networks deployable on Sparse Tensor Cores without any hyper-parameter adjustments. The authors empirically validate the workflow's effectiveness across a wide range of network architectures and tasks, from image classification to natural language processing.

Empirical Insights and Implications

Through rigorous empirical evaluation, the authors demonstrate that the proposed sparsity workflow maintains accuracy across various architectures and tasks, including ResNet, VGG, and BERT, among others. Interestingly, networks with smaller parameter counts benefit from applying permutations before pruning, which helps recover the accuracy of the initial dense model. The paper also notes that this structured sparsity can be combined with quantization techniques to further improve inference efficiency.
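
The permutation idea can be illustrated with a toy search: reordering a layer's input channels changes which weights share a group of four, so a permutation that lets the 2:4 mask retain more total magnitude tends to lose less accuracy. The random-search sketch below is purely illustrative and far simpler than what the paper evaluates; it again reuses the make_2to4_mask helper.

```python
import torch

def retained_magnitude(weight: torch.Tensor) -> float:
    """Total |weight| kept after 2:4 pruning (uses make_2to4_mask from above)."""
    return float((weight.abs() * make_2to4_mask(weight)).sum())

def search_permutation(weight: torch.Tensor, trials: int = 200) -> torch.Tensor:
    """Toy random search over input-channel permutations that maximizes
    the magnitude retained by the 2:4 mask (illustrative only)."""
    best_perm = torch.arange(weight.shape[1])
    best_score = retained_magnitude(weight)
    for _ in range(trials):
        perm = torch.randperm(weight.shape[1])
        score = retained_magnitude(weight[:, perm])
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm

w = torch.randn(64, 64)
perm = search_permutation(w)
print(retained_magnitude(w), retained_magnitude(w[:, perm]))
```

Note that in a real network, permuting a layer's input channels must be matched by a corresponding permutation of the preceding layer's output channels so that the computed function is unchanged.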

Future Research Directions

The authors suggest several avenues for future research, including the development of reduced training schedules that still leverage sparse tensors, and improvements to mask-finding algorithms that could accelerate training. Additional research could explore dynamic mask strategies as opposed to the static masks used in this paper. Finally, the work encourages investigation of sparsity in activations, moving toward a more comprehensive approach to network pruning beyond weights alone.

In conclusion, "Accelerating Sparse Deep Neural Networks" contributes a valuable, practical perspective on realizing sparse deep learning models that are not only computationally efficient but also maintain competitive accuracy, a crucial requirement for broader adoption in real-world applications.

Authors (8)
  1. Asit Mishra (8 papers)
  2. Jorge Albericio Latorre (2 papers)
  3. Jeff Pool (11 papers)
  4. Darko Stosic (7 papers)
  5. Dusan Stosic (12 papers)
  6. Ganesh Venkatesh (14 papers)
  7. Chong Yu (17 papers)
  8. Paulius Micikevicius (9 papers)
Citations (182)