An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (2306.16601v1)

Published 28 Jun 2023 in cs.LG, cs.AI, and cs.CL

Abstract: In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

This paper introduces an efficient sparse inference software accelerator designed to enhance the performance of Transformer-based LLMs on CPUs. As Transformer models have scaled to billions of parameters, the resource requirements for inference have become a significant bottleneck. Model compression techniques such as pruning are employed to alleviate these constraints, particularly through structured sparsity.

Core Contributions

The authors propose a structured sparsity pattern with a constant block size of 4x1 and leverage Intel® Deep Learning Boost to optimize sparse matrix-dense matrix multiplication (SpMM) on CPUs. The performance claims are substantiated by numerical results showing the SpMM kernel outperforming existing sparse libraries such as oneMKL, TVM, and LIBXSMM by an order of magnitude across a wide range of generalized matrix-matrix multiplication (GEMM) shapes and sparsity ratios (70% to 90%).
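
To make the constant 4x1 block size concrete, here is a minimal NumPy sketch of how a pruned weight matrix can be packed so that only nonzero 4x1 blocks are stored together with their positions. This is an illustration under stated assumptions, not the authors' implementation: the function and variable names are hypothetical, and the block orientation (4 consecutive rows within a single column) is assumed.

```python
import numpy as np

BLOCK_M = 4  # block height: 4 consecutive rows within one column (orientation assumed)

def to_block_sparse(weight):
    """Pack a matrix pruned with constant 4x1 blocks into a block-compressed format.

    Returns (values, row_block_idx, col_ptr):
      values        -- retained nonzero 4x1 blocks, stacked as a (num_blocks, 4) array
      row_block_idx -- for each stored block, which 4-row band of the matrix it occupies
      col_ptr       -- offsets delimiting the stored blocks of each column (CSC-style)
    """
    rows, cols = weight.shape
    assert rows % BLOCK_M == 0, "row count must be a multiple of the block height"

    values, row_block_idx, col_ptr = [], [], [0]
    for c in range(cols):
        for rb in range(rows // BLOCK_M):
            block = weight[rb * BLOCK_M:(rb + 1) * BLOCK_M, c]
            if np.any(block != 0):  # keep only blocks the pruner left nonzero
                values.append(block)
                row_block_idx.append(rb)
        col_ptr.append(len(values))
    return np.asarray(values), np.asarray(row_block_idx), np.asarray(col_ptr)
```

Because every stored block has the same 4x1 shape, the packed values are contiguous and load cleanly into fixed-width SIMD registers, which is the property a vectorized SpMM kernel can exploit.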

Performance Highlights:

  • The SpMM kernel demonstrates up to 5x speedup over oneDNN's dense GEMM kernel.
  • The full inference solution offers up to 1.5x speedup over Neural Magic's DeepSparse under the same configuration, and up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon instances under latency constraints.

Methodological Innovations

  • Structured Sparsity Pattern: The paper investigates block size configurations and selects a 4x1 pattern as well suited to CPU architectures, yielding efficient cache utilization and higher computational throughput via AVX-512 instructions.
  • JIT Compilation and Kernel Tuning: Just-in-time compilation generates machine code tailored to specific GEMM shapes, maximizing hardware utilization and significantly boosting the performance of the sparse kernels.
  • Sparse Transformer Attention: Dense Linear operators are replaced with sparse counterparts, integrating sparse GEMM into the attention mechanism with fused post-operators to minimize overhead (see the reference loop sketched after this list).
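
As a rough companion to the packing sketch above, the reference loop below shows how such a kernel could consume the 4x1 block format and fuse simple post-operators (a bias-add and ReLU are used here purely as stand-ins). This is a minimal NumPy sketch under the same assumptions as before, not the library's actual API or kernel.

```python
import numpy as np

def spmm_fused(values, row_block_idx, col_ptr, x, bias, relu=True):
    """Reference y = W @ x for W stored in the 4x1 block format sketched earlier,
    with a fused bias-add and optional ReLU standing in for the post-operators.

    x is the dense activation of shape (cols_of_W, n); bias has shape (rows_of_W,).
    """
    rows, n = bias.shape[0], x.shape[1]
    y = np.zeros((rows, n))
    for c in range(len(col_ptr) - 1):                # walk the columns of W
        for b in range(col_ptr[c], col_ptr[c + 1]):  # stored 4x1 blocks in column c
            r0 = row_block_idx[b] * 4
            # rank-1 update: one 4x1 weight block times one row of the activation
            y[r0:r0 + 4, :] += np.outer(values[b], x[c, :])
    y += bias[:, None]                               # fused bias post-operator
    return np.maximum(y, 0.0) if relu else y         # fused activation post-operator
```

A production kernel would instead tile the dense activation so each 4x1 weight block maps onto AVX-512/VNNI lanes, and the JIT step described above would specialize the loop nest for each GEMM shape; the sketch only shows the data flow onto which the fused post-operators attach.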

Experimental Setup and Results

The benchmarks utilize Intel's Xeon platforms to assess both kernel-level and high-level model performance across multiple Transformer architectures, including BERT-Base and BERT-Large. Results indicate substantial performance gains in throughput and latency compared to existing state-of-the-art solutions. These improvements suggest that structured sparsity patterns can be effectively leveraged without sacrificing model accuracy, which remains within a 1% margin relative to dense models.

Implications and Future Directions

The implications of this work are twofold:

  1. Practical: It enables more cost-effective deployment of large Transformer models in production environments where CPUs are prevalent, such as cloud-based services.
  2. Theoretical: It provides a framework for exploring further optimizations in sparse neural network operations, potentially influencing future CPU architecture design tailored for sparse computations.

Future work suggested by the authors includes expanding support to non-Intel architectures, such as ARM, and contributing to open-source Transformer libraries to facilitate broader industry adoption. The performance-per-dollar metric stands as an interesting avenue for further investigation, offering users practical benchmarks for cost-efficient deployment in cloud settings.

In summary, this paper systematically addresses sparse neural network inference on CPUs and sets a robust precedent for performance optimization via structured sparsity. The thorough empirical evaluation affirms the efficacy of the proposed techniques and sets the stage for further exploration in sparse model deployments.

Authors (12)
  1. Haihao Shen
  2. Hengyu Meng
  3. Bo Dong
  4. Zhe Wang
  5. Ofir Zafrir
  6. Yi Ding
  7. Yu Luo
  8. Hanwen Chang
  9. Qun Gao
  10. Ziheng Wang
  11. Guy Boudoukh
  12. Moshe Wasserblat