Accelerating Transformer Pre-training with 2:4 Sparsity (2404.01847v3)
Abstract: Training large transformers is slow, but recent innovations in GPU architecture give us an advantage: NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In light of this property, we comprehensively investigate the feasibility of accelerating the feed-forward networks (FFNs) of transformers during pre-training. First, we define a ``flip rate'' to monitor the stability of a 2:4 training process. Guided by this metric, we propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator by applying a masked decay term to the gradients, determining a feasible decay factor during the warm-up stage, and enhancing model quality with a dense fine-tuning procedure near the end of pre-training. In addition, we devise two techniques for practical training acceleration: computing transposable 2:4 masks by convolution, and accelerating gated activation functions by reducing GPU L2 cache misses. Experiments show that our 2:4 sparse training algorithm converges comparably to dense training on several transformer pre-training tasks, while delivering measurable speedups across different transformer block shapes. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
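To make the two central notions of the abstract concrete, below is a minimal sketch of (a) a magnitude-based 2:4 mask (keep the 2 largest-magnitude weights in every group of 4) and (b) one plausible reading of the ``flip rate'' as the fraction of mask entries that change between consecutive training steps. This is illustrative only: the function names are hypothetical, the paper's actual method additionally computes transposable masks by convolution and uses a sparse-refined straight-through estimator with masked decay, none of which is shown here.

```python
import torch


def mask_2to4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every contiguous group of 4
    along the last dimension (a simple magnitude-based 2:4 mask sketch;
    assumes the number of elements is divisible by 4)."""
    grouped = weight.reshape(-1, 4)
    top2 = grouped.abs().topk(2, dim=-1).indices       # positions of the 2 largest magnitudes
    mask = torch.zeros_like(grouped, dtype=torch.bool)
    mask.scatter_(-1, top2, True)                      # mark kept positions
    return mask.reshape(weight.shape)


def flip_rate(prev_mask: torch.Tensor, curr_mask: torch.Tensor) -> float:
    """Fraction of mask entries that differ between two consecutive steps
    (an assumed, illustrative definition of the paper's flip rate)."""
    return (prev_mask ^ curr_mask).float().mean().item()


if __name__ == "__main__":
    w = torch.randn(8, 16)
    m0 = mask_2to4(w)
    w = w + 0.1 * torch.randn_like(w)   # stand-in for one optimizer update
    m1 = mask_2to4(w)
    print(f"flip rate after one step: {flip_rate(m0, m1):.3f}")
```

A high flip rate under this definition would indicate that the sparse topology is still churning, which is the kind of instability the paper's masked decay and warm-up schedule are designed to control; for the authors' actual implementation, refer to the linked toolkit.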