Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks
The paper presents a provable and efficient method for accelerating sparse neural network training with N:M transposable fine-grained sparsity masks. Existing N:M schemes accelerate only inference; to accelerate training as well, the backward pass, which multiplies by the transposed weight matrix, must also satisfy the N:M pattern. Transposable masks meet the constraint in both orientations, so both forward and backward matrix multiplications can exploit the sparse tensor cores of modern hardware.
Key Contributions
- Mask Diversity Metric: The paper introduces "mask diversity," a metric for comparing sparsity constraints by counting the number of mask configurations each one admits. Across several structured pruning schemes, higher mask diversity correlates with better preserved accuracy (a small enumeration sketch appears after this list).
- N:M Transposable Fine-grained Sparsity: To benefit both inference and training, the paper proposes transposable sparsity masks: the weight matrix and its transpose must satisfy the same N:M constraint, so the backward multiplication is accelerated as well as the forward one. Finding the transposable mask that retains the most weight magnitude is formulated as a minimum-cost flow problem, which can be solved exactly under the sparsity constraint. Two solution approaches are proposed:
- An exact min-cost flow algorithm, suited to one-shot restructuring of pretrained models.
- A fast 2-approximation algorithm with near-linear running time, cheap enough to be applied repeatedly during training (a greedy sketch in this spirit appears after the list).
- Experimental Verification and Comparison:
- Static Model Adjustments: Transposable masks are computed for pretrained dense models (e.g., ResNet-50, BERT), yielding efficient dense-to-sparse conversions that are compared against NVIDIA's ASP method.
- Dynamic Training Scenario: The approximation algorithm allows the transposable mask to be updated during training, removing the need for a pretrained dense model. This is demonstrated on computer vision and language processing tasks, matching or surpassing existing benchmarks at moderate resource budgets.
- AdaPrune Method: The paper also proposes AdaPrune, a conversion technique for adapting models between sparsity constraints, improving portability across hardware. AdaPrune converts existing unstructured sparse models into N:M formats with negligible accuracy loss and without full retraining, which empirically improves hardware compatibility (a simplified per-layer sketch appears after this list).
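To make the mask diversity metric concrete, the sketch below brute-forces a single 4x4 tile and counts how many masks satisfy plain 2:4 sparsity (exactly 2 nonzeros in every group of 4 along a row) versus transposable 2:4 sparsity (the 2:4 constraint holds along both rows and columns). The tile-level enumeration and the specific counts are illustrative only; the paper defines mask diversity per layer and compares more mask structures than are shown here.

```python
from itertools import product
from math import comb

M, N = 4, 2  # N:M sparsity: keep N weights out of every group of M

def rows_ok(mask):
    """Every row of the M x M tile keeps exactly N nonzeros."""
    return all(sum(row) == N for row in mask)

def cols_ok(mask):
    """Every column of the M x M tile keeps exactly N nonzeros."""
    return all(sum(col) == N for col in zip(*mask))

# Enumerate all binary M x M tiles (2**(M*M) = 65,536 for M = 4).
tiles = (tuple(zip(*[iter(bits)] * M)) for bits in product((0, 1), repeat=M * M))

plain = transposable = 0
for mask in tiles:
    if rows_ok(mask):
        plain += 1
        if cols_ok(mask):
            transposable += 1

print("2:4 masks per 4x4 tile:         ", plain)         # comb(4, 2)**4 = 1296
print("transposable 2:4 masks per tile:", transposable)  # 90
print("sanity check, comb(M, N)**M =   ", comb(M, N) ** M)
```

The gap between the two counts shows how the transposability constraint reduces mask diversity; the paper uses this kind of count to compare sparsity constraints against the accuracy they preserve.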
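The fast approximation in the second contribution can be conveyed with a simple greedy rule on one M x M tile: visit entries in order of decreasing magnitude and keep an entry only while its row and its column still have fewer than N kept entries. Greedy selection under row/column budgets is a well-known 1/2-approximation for maximum-weight matching-type problems; it is shown here only to illustrate the idea and is not the paper's exact algorithm (the exact optimum comes from the min-cost flow formulation).

```python
import numpy as np

def transposable_mask_tile(tile: np.ndarray, n: int = 2) -> np.ndarray:
    """Greedy transposable N:M mask for one square M x M tile.

    Entries are kept in decreasing order of magnitude as long as neither
    their row nor their column already holds n kept entries.  Because every
    (row, column) pair is considered, the result always has exactly n
    nonzeros per row and per column.
    """
    m = tile.shape[0]
    assert tile.shape == (m, m), "expects a square M x M tile"
    mask = np.zeros_like(tile, dtype=bool)
    row_cnt = np.zeros(m, dtype=int)
    col_cnt = np.zeros(m, dtype=int)
    for flat in np.argsort(-np.abs(tile), axis=None):  # largest magnitude first
        r, c = divmod(int(flat), m)
        if row_cnt[r] < n and col_cnt[c] < n:
            mask[r, c] = True
            row_cnt[r] += 1
            col_cnt[c] += 1
    return mask

# Usage: prune one 4x4 tile of weights to transposable 2:4 sparsity.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
mask = transposable_mask_tile(w, n=2)
assert (mask.sum(axis=0) == 2).all() and (mask.sum(axis=1) == 2).all()
print(w * mask)  # both this matrix and its transpose are 2:4 sparse
```

Applied tile by tile over a full weight matrix, the same rule produces a mask that is valid for both the forward multiplication with W and the backward multiplication with its transpose.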
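For AdaPrune, the sketch below captures the per-layer idea on a single linear layer: fix an N:M mask, then re-fit only the surviving weights so that the sparse layer reproduces the dense layer's outputs on a small batch of calibration activations. The closed-form least-squares fit and the magnitude-based mask used here are simplifications chosen for brevity; the paper's AdaPrune optimizes the weights on calibration data rather than using this exact formulation.

```python
import numpy as np

def adaprune_linear(W: np.ndarray, mask: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Re-fit the kept weights of a linear layer under a fixed sparsity mask.

    W    : (out_features, in_features) dense weights
    mask : boolean array, same shape as W, True where a weight is kept
    X    : (num_samples, in_features) calibration activations

    Each output unit is fit independently so that the sparse layer's output
    matches the dense layer's output in the least-squares sense, using only
    the input features the mask keeps for that unit.
    """
    Y = X @ W.T                              # reference outputs of the dense layer
    W_new = np.zeros_like(W)
    for i in range(W.shape[0]):
        keep = mask[i]
        coef, *_ = np.linalg.lstsq(X[:, keep], Y[:, i], rcond=None)
        W_new[i, keep] = coef
    return W_new

# Usage: adapt a dense layer to a magnitude-based 2:4 mask.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(128, 16))
# Keep the 2 largest-magnitude weights in every group of 4 along each row.
groups = np.abs(W).reshape(8, 4, 4)
ranks = np.argsort(np.argsort(-groups, axis=-1), axis=-1)
mask = (ranks < 2).reshape(8, 16)
W_sparse = adaprune_linear(W, mask, X)
print("output mismatch after AdaPrune-style re-fit:",
      np.linalg.norm(X @ W.T - X @ W_sparse.T))
```

Because each layer is adapted independently against a small calibration set, the conversion stays far cheaper than retraining the whole network.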
Implications and Future Work
This research has significant implications for the computational efficiency of neural network training, reducing both memory and runtime without compromising accuracy. Transposable masks, and the resulting acceleration of both forward and backward matrix multiplications, offer a practical path forward for deep learning infrastructure, particularly in resource-constrained or hardware-accelerated environments.
Moving forward, extending the set of supported mask structures and the associated optimization algorithms could improve performance across a wider range of hardware architectures. Combining the approach with neural architecture search (NAS) could offer a dual benefit: refining architectures while enforcing efficient computation through sparse training.
Overall, the paper establishes a theoretically grounded and practical framework for improving the efficiency of neural network training, with reported throughput gains that matter for a broad range of AI applications.