An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (2306.16601v1)

Published 28 Jun 2023 in cs.LG, cs.AI, and cs.CL

Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

This paper introduces an efficient sparse inference software accelerator designed to enhance the performance of Transformer-based LLMs on CPUs. As Transformer models have scaled to billions of parameters, the resource requirements for inference have become a significant bottleneck. Model compression techniques such as pruning are employed to alleviate these constraints, particularly through structured sparsity.

Core Contributions

The authors propose a structured sparsity pattern with a constant block size of 4x1 and leverage Intel® Deep Learning Boost to optimize sparse matrix-dense matrix multiplication (SpMM) on CPUs. The performance claims are substantiated by numerical results showing the SpMM kernel outperforming existing sparse libraries such as oneMKL, TVM, and LIBXSMM by an order of magnitude across a wide range of generalized matrix-matrix multiplication (GEMM) shapes and sparsity ratios (70% to 90%).
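
To make the constant 4x1 block size concrete, here is a minimal NumPy sketch of how a pruned weight matrix can be packed so that only nonzero 4x1 blocks are stored together with their positions. This is an illustration under stated assumptions, not the authors' implementation: the function and variable names are hypothetical, and the block orientation (4 consecutive rows within a single column) is assumed.

```python
import numpy as np

BLOCK_M = 4  # block height: 4 consecutive rows within one column (orientation assumed)

def to_block_sparse(weight):
    """Pack a matrix pruned with constant 4x1 blocks into a block-compressed format.

    Returns (values, row_block_idx, col_ptr):
      values        -- retained nonzero 4x1 blocks, stacked as a (num_blocks, 4) array
      row_block_idx -- for each stored block, which 4-row band of the matrix it occupies
      col_ptr       -- offsets delimiting the stored blocks of each column (CSC-style)
    """
    rows, cols = weight.shape
    assert rows % BLOCK_M == 0, "row count must be a multiple of the block height"

    values, row_block_idx, col_ptr = [], [], [0]
    for c in range(cols):
        for rb in range(rows // BLOCK_M):
            block = weight[rb * BLOCK_M:(rb + 1) * BLOCK_M, c]
            if np.any(block != 0):  # keep only blocks the pruner left nonzero
                values.append(block)
                row_block_idx.append(rb)
        col_ptr.append(len(values))
    return np.asarray(values), np.asarray(row_block_idx), np.asarray(col_ptr)
```

Because every stored block has the same 4x1 shape, the packed values are contiguous and load cleanly into fixed-width SIMD registers, which is the property a vectorized SpMM kernel can exploit.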

Performance Highlights:

  • The SpMM kernel demonstrates up to 5x speedup over oneDNN's dense GEMM kernel.
  • The full inference solution offers up to 1.5x speedup over Neural Magic's DeepSparse under the same configuration, and up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon instances under latency constraints.

Methodological Innovations

  • Structured Sparsity Pattern: The paper investigates block size configurations and selects a 4x1 pattern as well suited to CPU architectures, yielding efficient cache utilization and higher computational throughput via AVX-512 instructions.
  • JIT Compilation and Kernel Tuning: Just-in-time compilation generates machine code tailored to specific GEMM shapes, maximizing hardware utilization and significantly boosting the performance of the sparse kernels.
  • Sparse Transformer Attention: Dense Linear operators are replaced with sparse counterparts, integrating sparse GEMM into the attention mechanism with fused post-operators to minimize overhead (see the reference loop sketched after this list).
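
As a rough companion to the packing sketch above, the reference loop below shows how such a kernel could consume the 4x1 block format and fuse simple post-operators (a bias-add and ReLU are used here purely as stand-ins). This is a minimal NumPy sketch under the same assumptions as before, not the library's actual API or kernel.

```python
import numpy as np

def spmm_fused(values, row_block_idx, col_ptr, x, bias, relu=True):
    """Reference y = W @ x for W stored in the 4x1 block format sketched earlier,
    with a fused bias-add and optional ReLU standing in for the post-operators.

    x is the dense activation of shape (cols_of_W, n); bias has shape (rows_of_W,).
    """
    rows, n = bias.shape[0], x.shape[1]
    y = np.zeros((rows, n))
    for c in range(len(col_ptr) - 1):                # walk the columns of W
        for b in range(col_ptr[c], col_ptr[c + 1]):  # stored 4x1 blocks in column c
            r0 = row_block_idx[b] * 4
            # rank-1 update: one 4x1 weight block times one row of the activation
            y[r0:r0 + 4, :] += np.outer(values[b], x[c, :])
    y += bias[:, None]                               # fused bias post-operator
    return np.maximum(y, 0.0) if relu else y         # fused activation post-operator
```

A production kernel would instead tile the dense activation so each 4x1 weight block maps onto AVX-512/VNNI lanes, and the JIT step described above would specialize the loop nest for each GEMM shape; the sketch only shows the data flow onto which the fused post-operators attach.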

Experimental Setup and Results

The benchmarks utilize Intel's Xeon platforms to assess both kernel-level and high-level model performance across multiple Transformer architectures, including BERT-Base and BERT-Large. Results indicate substantial performance gains in throughput and latency compared to existing state-of-the-art solutions. These improvements suggest that structured sparsity patterns can be effectively leveraged without sacrificing model accuracy, which remains within a 1% margin relative to dense models.

Implications and Future Directions

The implications of this work are twofold:

  1. Practical: It enables more cost-effective deployment of large Transformer models in production environments where CPUs are prevalent, such as cloud-based services.
  2. Theoretical: It provides a framework for exploring further optimizations in sparse neural network operations, potentially influencing future CPU architecture design tailored for sparse computations.

Future work suggested by the authors includes expanding support to non-Intel architectures, such as ARM, and contributing to open-source Transformer libraries to facilitate broader industry adoption. The performance-per-dollar metric stands as an interesting avenue for further investigation, offering users practical benchmarks for cost-efficient deployment in cloud settings.

In summary, this paper systematically addresses sparse neural network inference on CPUs and sets a robust precedent for performance optimization via structured sparsity. The thorough empirical evaluation affirms the efficacy of the proposed techniques and sets the stage for further exploration in sparse model deployments.

Authors (12)
  1. Haihao Shen
  2. Hengyu Meng
  3. Bo Dong
  4. Zhe Wang
  5. Ofir Zafrir
  6. Yi Ding
  7. Yu Luo
  8. Hanwen Chang
  9. Qun Gao
  10. Ziheng Wang
  11. Guy Boudoukh
  12. Moshe Wasserblat