Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (2309.10285v1)

Published 19 Sep 2023 in cs.DC, cs.AR, and cs.LG

Abstract: With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models, as they typically require large GPU memory consumption and massive computation. Unstructured model pruning is a common approach to reduce both the GPU memory footprint and the overall computation while retaining good model accuracy. However, existing solutions do not provide highly efficient support for handling unstructured sparsity on modern GPUs, especially on the highly structured Tensor Core hardware. We therefore propose Flash-LLM, which enables low-cost and highly efficient large generative model inference with sophisticated support for unstructured sparsity on high-performance but highly restrictive Tensor Cores. Based on our key observation that the main bottleneck of generative model inference lies in several skinny matrix multiplications, for which Tensor Cores are significantly under-utilized due to low computational intensity, we propose a general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse matrix multiplication. The basic insight is to address the significant memory-bandwidth bottleneck while tolerating redundant computations that are not critical to end-to-end performance on Tensor Cores. On this basis, we design an effective software framework for Tensor Core based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and for overlapping computation with memory access. At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively. At the end-to-end framework level on OPT-30B/66B/175B models, measured in tokens per GPU-second, Flash-LLM achieves up to 3.8x and 3.6x improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.
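To make the Load-as-Sparse and Compute-as-Dense idea concrete, the following CUDA sketch processes a single 16x16 weight tile: the tile is fetched in compressed form (saving memory bandwidth), expanded to a dense tile in shared memory, and then multiplied on the Tensor Cores as an ordinary dense operand. This is a minimal illustration under assumed names and a simplified sparse layout (the SparseTile struct, the 16x16x16 wmma shape, and the kernel name are inventions for clarity); it is not Flash-LLM's actual sparse encoding or its pipelined kernel, which additionally overlaps sparse-data extraction with computation and memory access.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
#include <cstdint>
using namespace nvcuda;

// Hypothetical compressed tile: `values` holds the nonzeros of a 16x16
// half-precision weight tile and `indices` their flattened positions
// (row * 16 + col). This is an illustrative layout, not the paper's format.
struct SparseTile {
    const half*     values;
    const uint16_t* indices;
    int             nnz;
};

// Load-as-Sparse, Compute-as-Dense for one 16x16x16 Tensor Core MMA.
// Launch with a single warp, e.g. spmm_tile_demo<<<1, 32>>>(w, act, out).
__global__ void spmm_tile_demo(SparseTile w, const half* act, float* out) {
    __shared__ half dense_w[16 * 16];          // decompressed weight tile

    // 1) Load-as-Sparse: zero the shared tile, then scatter the nonzeros
    //    that were read from global memory in compressed form.
    for (int i = threadIdx.x; i < 16 * 16; i += blockDim.x)
        dense_w[i] = __float2half(0.0f);
    __syncthreads();
    for (int i = threadIdx.x; i < w.nnz; i += blockDim.x)
        dense_w[w.indices[i]] = w.values[i];
    __syncthreads();

    // 2) Compute-as-Dense: issue a regular dense MMA on the Tensor Cores,
    //    tolerating the redundant multiplications by the zero entries.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, dense_w, 16);   // sparse weights, now dense
    wmma::load_matrix_sync(b_frag, act, 16);       // 16x16 activation tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(out, c_frag, 16, wmma::mem_row_major);
}
```

A host driver would copy the compressed values/indices, a 16x16 half activation tile, and a 16x16 float output buffer to the GPU and launch the kernel with one warp (compiled for sm_70 or newer). The point of the sketch is the division of labor: sparsity is exploited only on the memory path, while the compute path stays on the dense, highly structured Tensor Core primitives.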

References (65)
  1. DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 1–15.
  2. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models. https://arxiv.org/abs/2204.06745
  3. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  4. Efficient tensor core-based GPU kernels for structured sparsity under reduced precision. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
  5. EleutherAI. 2022. GPT-NeoX-20B. https://huggingface.co/EleutherAI/gpt-neox-20b
  6. Hugging Face. 2023. Model Parallelism. https://huggingface.co/docs/transformers/v4.15.0/parallelism
  7. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 431–445.
  8. Elias Frantar and Dan Alistarh. 2023. Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv preprint arXiv:2301.00774 (2023).
  9. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323 (2022).
  10. Sparse GPU kernels for deep learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14.
  11. Sputnik GitHub repository. https://github.com/google-research/sputnik/
  12. Learning sparse networks using targeted dropout. arXiv preprint arXiv:1905.13678 (2019).
  13. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 3 (2017), 2.
  14. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
  15. Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28 (2015).
  16. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res. 22, 241 (2021), 1–124.
  17. Adaptive sparse tiling for sparse matrix multiplication. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 300–314.
  18. Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning. In Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco, California) (DAC ’22). Association for Computing Machinery, New York, NY, USA, 1153–1158. https://doi.org/10.1145/3489517.3530588
  19. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019).
  20. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022, Jiri Schindler and Noa Zilberman (Eds.). USENIX Association, 673–688.
  21. Beyond Data and Model Parallelism for Deep Neural Networks. Proceedings of Machine Learning and Systems 1 (2019), 1–13.
  22. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
  23. The Tensor Algebra Compiler. Proc. ACM Program. Lang. 1, OOPSLA, Article 77 (oct 2017), 29 pages. https://doi.org/10.1145/3133901
  24. CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers. Proceedings of the VLDB Endowment 12, 11 (2019).
  25. Efficient quantized sparse matrix operations on tensor cores. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 1–15.
  26. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of the VLDB Endowment 13, 12 (2020).
  27. Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers. Proc. VLDB Endow. 15, 11 (jul 2022), 2747–2760.
  28. 1xN pattern for pruning convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  29. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 (2018).
  30. Heterogeneity-aware distributed machine learning training via partial reduce. In Proceedings of the 2021 International Conference on Management of Data. 2262–2270.
  31. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Proc. VLDB Endow. 16, 3 (nov 2022), 470–479.
  32. Accelerating Sparse Deep Neural Networks. arXiv:2104.08378 [cs.LG]
  33. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11264–11272.
  34. Can Foundation Models Wrangle Your Data? Proc. VLDB Endow. 16, 4 (dec 2022), 738–746.
  35. FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. arXiv preprint arXiv:2304.03946 (2023).
  36. NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
  37. NVIDIA. 2022a. NVIDIA Faster-Transformer. https://github.com/NVIDIA/FasterTransformer
  38. NVIDIA. 2022b. NVIDIA H100 Tensor Core GPU Architecture. https://www.hpctech.co.jp/catalog/gtc22-whitepaper-hopper_v1.01.pdf
  39. NVIDIA. 2023a. cuBLAS Docs. https://docs.nvidia.com/cuda/cublas/index.html
  40. NVIDIA. 2023b. cuSPARSE Library. https://docs.nvidia.com/cuda/cusparse/index.html
  41. NVIDIA. 2023c. cuSPARSELt Library. https://docs.nvidia.com/cuda/cusparselt/
  42. NVIDIA. 2023d. CUTLASS 3.2. https://github.com/NVIDIA/cutlass
  43. NVIDIA. 2023e. Nsight Compute Profiling Guide. https://docs.nvidia.com/nsight-compute/ProfilingGuide/#introduction
  44. NVIDIA. 2023f. Nsight System. https://developer.nvidia.com/nsight-systems
  45. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  46. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
  47. Optimizing Tensor Programs on Flexible Storage. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–27.
  48. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865 (2023).
  49. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
  50. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv:2201.11990 [cs.CL]
  51. A Simple and Effective Pruning Approach for Large Language Models. arXiv:2306.11695 [cs.CL]
  52. Immanuel Trummer. 2022. From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management. Proc. VLDB Endow. 15, 12 (aug 2022), 3770–3773.
  53. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data. Proc. VLDB Endow. 15, 6 (feb 2022), 1201–1214.
  54. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008 (2017).
  55. Attention is all you need. Advances in neural information processing systems 30 (2017).
  56. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32 (2019).
  57. TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs. arXiv:2112.02052 [cs.LG]
  58. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65–76.
  59. SparseTIR: Composable abstractions for sparse compilation in deep learning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 660–678.
  60. Optimizing distributed training deployment in heterogeneous GPU clusters. In CoNEXT ’20: The 16th International Conference on emerging Networking EXperiments and Technologies, Barcelona, Spain, December, 2020, Dongsu Han and Anja Feldmann (Eds.). ACM, 93–107.
  61. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 [cs.CL]
  62. MiCS: near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment 16, 1 (2022), 37–50.
  63. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX Association, 559–578.
  64. Ningxin Zheng. 2022. SparTA GitHub repository. https://github.com/microsoft/SparTA/tree/sparta_artifact
  65. SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 213–232.
Authors (9)
  1. Haojun Xia (4 papers)
  2. Zhen Zheng (39 papers)
  3. Yuchao Li (24 papers)
  4. Donglin Zhuang (4 papers)
  5. Zhongzhu Zhou (7 papers)
  6. Xiafei Qiu (5 papers)
  7. Yong Li (628 papers)
  8. Wei Lin (207 papers)
  9. Shuaiwen Leon Song (35 papers)
Citations (5)