An Efficient Sparse Inference Software Accelerator for Transformer-based LLMs on CPUs
This paper introduces an efficient sparse inference software accelerator designed to speed up Transformer-based large language models (LLMs) on CPUs. As Transformer models have scaled to billions of parameters, the resource requirements for inference have become a significant bottleneck. Model compression techniques such as pruning, particularly with structured sparsity, are employed to relieve these constraints.
Core Contributions
The authors propose a structured sparsity pattern with a constant block size of 4x1, taking advantage of Intel® Deep Learning Boost to optimize sparse matrix-dense matrix multiplication (SpMM) on CPUs. The performance improvements are substantiated by numerical results showing the SpMM kernel outperforming existing libraries such as oneMKL and TVM by an order of magnitude across various generalized matrix-matrix multiplication (GEMM) shapes and sparsity ratios (70% to 90%).
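To make the 4x1 idea concrete, below is a minimal NumPy sketch of block-wise pruning and block-sparse multiplication. The function names, the block orientation (4 consecutive output rows per column), and the magnitude-based pruning criterion are illustrative assumptions; the actual kernel is JIT-generated with AVX-512/VNNI intrinsics rather than written in Python.

```python
import numpy as np

def prune_4x1(weight, sparsity=0.9, block=(4, 1)):
    """Prune a dense weight matrix into a 4x1 block-sparse pattern.

    Blocks of 4 consecutive rows in one column (an assumed orientation)
    are kept or zeroed as a unit, so surviving weights stay contiguous
    and friendly to vector loads.
    """
    rows, cols = weight.shape
    bh, bw = block
    # Score each block by its L1 norm and zero out the lowest-scoring ones.
    blocks = weight.reshape(rows // bh, bh, cols // bw, bw)
    scores = np.abs(blocks).sum(axis=(1, 3))              # one score per block
    threshold = np.quantile(scores, sparsity)
    mask = (scores > threshold)[:, None, :, None]         # broadcast back to blocks
    return (blocks * mask).reshape(rows, cols)

def block_sparse_matmul(sparse_w, x, block_rows=4):
    """Multiply a 4x1 block-sparse weight with a dense activation.

    Only non-zero 4x1 blocks contribute, which is the work-skipping idea
    behind the SpMM kernel; the real kernel uses a compressed block index
    and vector instructions instead of Python loops.
    """
    out = np.zeros((sparse_w.shape[0], x.shape[1]), dtype=sparse_w.dtype)
    for r in range(0, sparse_w.shape[0], block_rows):
        for c in range(sparse_w.shape[1]):
            blk = sparse_w[r:r + block_rows, c]
            if blk.any():                                  # skip pruned blocks
                out[r:r + block_rows] += np.outer(blk, x[c])
    return out

# Sanity check against a dense reference on a small example.
rng = np.random.default_rng(0)
w = prune_4x1(rng.standard_normal((64, 64)).astype(np.float32), sparsity=0.9)
x = rng.standard_normal((64, 8)).astype(np.float32)
assert np.allclose(block_sparse_matmul(w, x), w @ x, atol=1e-4)
```

Pruning in 4x1 groups rather than element by element keeps the surviving weights contiguous in memory, which is what allows a vectorized kernel to load and multiply an entire block with a single instruction.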
Performance Highlights:
- The SpMM kernel demonstrates up to 5x speedup over oneDNN's dense GEMM kernel.
- The proposed solution offers a 1.5x speedup over Neural Magic's DeepSparse on the same infrastructure and up to 345x over PyTorch on Xeon instances under latency constraints.
Methodological Innovations
- Structured Sparsity Pattern: The paper investigates block size configurations, choosing a 4x1 pattern optimal for CPU architectures, leading to efficient cache utilization and enhanced computation throughput through AVX-512 instructions.
- JIT Compilation and Kernel Tuning: Just-in-time compilation is used to generate machine code tailored for specific GEMM shapes, maximizing hardware utilization. This approach significantly boosts the performance of the sparse kernels.
- Sparse Transformer Attention: By replacing dense Linear operators with sparse counterparts, the authors integrate sparse GEMM into the attention mechanism with fused post-operators to minimize overhead (a minimal sketch follows this list).
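As a rough illustration of the last point, the sketch below shows a hypothetical SparseLinear stand-in that stores a 4x1 block-pruned weight and applies bias add and activation as fused post-ops inside the same call as the matmul. The class name, constructor arguments, and the GELU post-op are assumptions for illustration, not the paper's API; a dense NumPy matmul stands in for the real sparse kernel.

```python
import numpy as np

def gelu(x):
    """Tanh-approximation GELU, a typical post-op fused after the GEMM."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class SparseLinear:
    """Conceptual stand-in for a dense Linear operator in a Transformer block.

    The key idea it mirrors: the weight is stored 4x1 block-sparse, and the
    bias add plus activation run as fused post-ops in the same call as the
    sparse GEMM, so no intermediate tensor is written back to memory.
    """
    def __init__(self, weight, bias, sparsity=0.9, post_op=gelu):
        # Zero out whole 4x1 blocks (4 consecutive output rows per column)
        # with the smallest magnitude, keeping the top (1 - sparsity) share.
        rows, cols = weight.shape
        blocks = weight.reshape(rows // 4, 4, cols)
        keep = (np.abs(blocks).sum(axis=1) >
                np.quantile(np.abs(blocks).sum(axis=1), sparsity))[:, None, :]
        self.weight = (blocks * keep).reshape(rows, cols)
        self.bias = bias
        self.post_op = post_op

    def __call__(self, x):
        # In the real kernel the sparse GEMM skips zero blocks and the
        # post-ops operate on registers; here the fusion is only logical.
        return self.post_op(self.weight @ x + self.bias[:, None])

# Example: swap a dense projection for its sparse counterpart.
rng = np.random.default_rng(0)
proj = SparseLinear(rng.standard_normal((64, 64)).astype(np.float32),
                    rng.standard_normal(64).astype(np.float32))
out = proj(rng.standard_normal((64, 16)).astype(np.float32))
print(out.shape)  # (64, 16)
```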
Experimental Setup and Results
The benchmarks utilize Intel's Xeon platforms to assess both kernel-level and end-to-end model performance across multiple Transformer architectures, including BERT-Base and BERT-Large. Results indicate substantial gains in throughput and latency over existing state-of-the-art solutions, while accuracy stays within 1% of the dense baselines, suggesting that structured sparsity can be leveraged without sacrificing model quality.
Implications and Future Directions
The implications of this work are twofold:
- Practical: It enables more cost-effective deployment of large Transformer models in production environments where CPUs are prevalent, such as cloud-based services.
- Theoretical: It provides a framework for exploring further optimizations in sparse neural network operations, potentially influencing future CPU architecture design tailored for sparse computations.
Future work suggested by the authors includes extending support to non-Intel architectures, such as ARM, and contributing to open-source Transformer libraries to encourage broader industry adoption. The performance-per-dollar metric remains an interesting avenue for further investigation, offering users a practical benchmark for cost-efficient deployment in cloud settings.
In summary, this paper systematically addresses sparse neural network inference on CPUs and establishes a strong precedent for performance optimization via structured sparsity. The thorough empirical evaluation affirms the efficacy of the proposed techniques and lays the groundwork for further exploration of sparse model deployment.