An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Published 28 Jun 2023 in cs.LG, cs.AI, and cs.CL (arXiv:2306.16601v1)

Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To bridge this gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including BERT-Mini, DistilBERT, BERT-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon instances on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.


Summary

  • The paper introduces a structured sparsity method using a constant 4x1 block design and JIT compilation to optimize sparse matrix multiplication on CPUs.
  • It achieves up to 5x speedup over oneDNN and 345x over PyTorch on Xeon platforms while maintaining model accuracy within 1% of dense implementations.
  • The accelerator integrates seamlessly with Transformer attention mechanisms, offering a cost-effective solution for deploying large models in production environments.

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

This paper introduces an efficient sparse inference software accelerator designed to enhance the performance of Transformer-based language models on CPUs. As Transformer models have scaled to billions of parameters, the resource requirements for inference have become a significant bottleneck. Model compression techniques such as pruning are employed to alleviate these constraints, particularly through structured sparsity.

Core Contributions

The authors propose a structured sparsity pattern with a constant block size of 4x1, taking advantage of Intel Deep Learning Boost to optimize sparse matrix-dense matrix multiplication (SpMM) on CPUs. The performance improvements are substantiated by numerical results showing the SpMM kernel outperforming existing sparse libraries such as oneMKL, TVM, and LIBXSMM by an order of magnitude across a wide range of generalized matrix-matrix multiplication (GEMM) shapes and sparsity ratios (70% to 90%).
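The constant-block layout is easy to picture with a small reference implementation. The following is a minimal NumPy sketch, not the authors' JIT-generated AVX-512 kernel: it stores only the non-zero 4x1 blocks of a pruned weight matrix and reproduces the dense matmul result. The orientation of the blocks (4 consecutive elements along the output dimension) and all names in the snippet are illustrative assumptions.

```python
import numpy as np

def to_block_sparse_4x1(W, block=4):
    """Keep only the non-zero 4x1 blocks of W (shape: out_dim x in_dim).
    Returns the block values and their (row, col) starting coordinates."""
    out_dim, in_dim = W.shape
    assert out_dim % block == 0
    vals, rows, cols = [], [], []
    for c in range(in_dim):
        for r in range(0, out_dim, block):
            blk = W[r:r + block, c]
            if np.any(blk != 0):  # whole 4x1 blocks are kept or dropped, not single weights
                vals.append(blk.copy())
                rows.append(r)
                cols.append(c)
    return np.array(vals), np.array(rows), np.array(cols)

def spmm_reference(vals, rows, cols, X, out_dim, block=4):
    """Reference SpMM: block-sparse weight (out_dim x in_dim) times dense X
    (in_dim x batch). Each kept 4x1 block scales one activation row and
    accumulates into 4 consecutive output rows."""
    Y = np.zeros((out_dim, X.shape[1]), dtype=X.dtype)
    for blk, r, c in zip(vals, rows, cols):
        Y[r:r + block, :] += np.outer(blk, X[c, :])
    return Y

# Consistency check against dense matmul on a ~90% block-sparse weight.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
block_mask = rng.random((64 // 4, 128)) < 0.9  # drop ~90% of the 4x1 blocks
W[np.repeat(block_mask, 4, axis=0)] = 0.0
X = rng.standard_normal((128, 32)).astype(np.float32)
Y = spmm_reference(*to_block_sparse_4x1(W), X, out_dim=64)
assert np.allclose(Y, W @ X, atol=1e-4)
```

Because whole 4x1 blocks are either kept or dropped, each kept block touches one activation row and four contiguous output rows, which is what makes this pattern amenable to vectorized CPU kernels.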

Performance Highlights:

  • The SpMM kernel demonstrates up to 5x speedup over oneDNN's dense GEMM kernel.
  • The proposed solution offers up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations, and up to 37x over ONNX Runtime and 345x over PyTorch on Xeon instances under proxy production latency constraints.

Methodological Innovations

  • Structured Sparsity Pattern: The paper investigates block size configurations and selects a constant 4x1 pattern well suited to CPU architectures, enabling efficient cache utilization and higher computation throughput through AVX-512 instructions.
  • JIT Compilation and Kernel Tuning: Just-in-time compilation is used to generate machine code tailored for specific GEMM shapes, maximizing hardware utilization. This approach significantly boosts the performance of the sparse kernels.
  • Sparse Transformer Attention: By replacing dense Linear operators with sparse counterparts, the authors integrate sparse GEMM into the attention mechanism with fused post-operators to minimize overhead (see the sketch after this list).
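To make the fused post-operators concrete, here is a small illustrative sketch in the same NumPy style (not the library's actual API): the bias add and GELU activation are applied immediately after the sparse product, rather than as separate framework operators that would write the intermediate tensor to memory and read it back. The block-sparse arguments follow the same layout as the earlier sketch, and the function names are assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, a common post-operator in BERT feed-forward layers."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sparse_linear_fused(vals, rows, cols, X, bias, out_dim, block=4):
    """Sparse Linear with a fused epilogue: bias add and activation run right
    after the sparse product instead of as separate framework operators."""
    Y = np.zeros((out_dim, X.shape[1]), dtype=X.dtype)
    for blk, r, c in zip(vals, rows, cols):
        Y[r:r + block, :] += np.outer(blk, X[c, :])
    # Fused post-ops; an optimized kernel would presumably apply these per
    # output tile while the tile is still in registers or cache.
    return gelu(Y + bias[:, None])
```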

Experimental Setup and Results

The benchmarks utilize Intel's Xeon platforms to assess both kernel-level and high-level model performance across multiple Transformer architectures, including BERT-Base and BERT-Large. Results indicate substantial performance gains in throughput and latency compared to existing state-of-the-art solutions. These improvements suggest that structured sparsity patterns can be effectively leveraged without sacrificing model accuracy, which remains within a 1% margin relative to dense models.

Implications and Future Directions

The implications of this work are twofold:

  1. Practical: It enables more cost-effective deployment of large Transformer models in production environments where CPUs are prevalent, such as cloud-based services.
  2. Theoretical: It provides a framework for exploring further optimizations in sparse neural network operations, potentially influencing future CPU architecture design tailored for sparse computations.

Future work suggested by the authors includes expanding support to non-Intel architectures, such as ARM, and contributing to open-source Transformer libraries to facilitate broader industry adoption. Performance per dollar also stands out as an avenue for further investigation, offering users practical benchmarks for cost-efficient deployment in cloud settings.

In summary, this study systematically addresses sparse neural network inference on CPUs and establishes a strong precedent for performance optimization via structured sparsity. The thorough empirical evaluation affirms the efficacy of the proposed techniques and sets the stage for further exploration of sparse model deployments.
