Accelerating Transformer Pre-training with 2:4 Sparsity (2404.01847v3)

Published 2 Apr 2024 in cs.LG

Abstract: Training large transformers is slow, but recent innovations in GPU architecture give us an advantage: NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In light of this property, we comprehensively investigate the feasibility of accelerating the feed-forward networks (FFNs) of transformers during pre-training. First, we define a "flip rate" to monitor the stability of a 2:4 training process. Using this metric, we propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator by applying the masked decay term to gradients, determining a feasible decay factor during the warm-up stage, and improving model quality with a dense fine-tuning procedure near the end of pre-training. In addition, we devise two techniques to accelerate training in practice: computing transposable 2:4 masks by convolution, and speeding up gated activation functions by reducing GPU L2 cache misses. Experiments show that our 2:4 sparse training algorithm converges similarly to dense training on several transformer pre-training tasks, while delivering measurable speedups across different transformer block shapes. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
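
The abstract mentions three ingredients that are easy to make concrete: computing a 2:4 mask (keep the two largest-magnitude weights in every group of four), training through that mask with a straight-through estimator plus a masked decay term, and tracking a "flip rate" that measures how often mask entries change between sparsification steps. The PyTorch sketch below illustrates these ideas under stated assumptions; it is not the authors' released toolkit (see the linked 2by4-pretrain repository for that), and the function names, the decay factor, and the exact placement of the decay term are illustrative.

# Minimal sketch of 2:4 sparse training ingredients described in the abstract.
# Not the authors' implementation; names, the decay factor, and the placement
# of the masked decay term are illustrative assumptions.
import torch


def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in each group of 4 along the last dim."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.abs().reshape(rows, cols // 4, 4)
    top2 = groups.topk(2, dim=-1).indices          # positions of the 2 kept weights
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, top2, 1.0)
    return mask.reshape(rows, cols).bool()


class SparseLinearSTE(torch.autograd.Function):
    """Forward with the 2:4-masked weight; backward passes gradients straight through."""

    @staticmethod
    def forward(ctx, x, weight, mask):
        ctx.save_for_backward(x, weight, mask)
        return x @ (weight * mask).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, mask = ctx.saved_tensors
        grad_x = grad_out @ (weight * mask)
        grad_w = grad_out.t() @ x                  # straight-through to the dense weight
        # SR-STE-style masked decay on the pruned entries; the paper modifies how
        # this term is applied, and 2e-4 is only a placeholder decay factor.
        grad_w = grad_w + 2e-4 * weight * (~mask)
        return grad_x, grad_w, None


def flip_rate(prev_mask: torch.Tensor, curr_mask: torch.Tensor) -> float:
    """Fraction of mask positions that changed between two sparsification steps."""
    return (prev_mask ^ curr_mask).float().mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(8, 16, requires_grad=True)     # a small FFN-like weight matrix
    x = torch.randn(4, 16)
    m0 = two_four_mask(w)
    y = SparseLinearSTE.apply(x, w, m0)
    y.sum().backward()
    with torch.no_grad():
        m1 = two_four_mask(w - 0.1 * w.grad)       # mask after a hypothetical SGD step
    print("flip rate:", flip_rate(m0, m1))

A high flip rate early in training signals an unstable mask, which is the kind of signal the paper uses to choose a feasible decay factor during warm-up; the transposable-mask computation and the L2-cache-aware kernel optimizations mentioned in the abstract are hardware-level details not reproduced in this sketch.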
