
Fast DistilBERT on CPUs (2211.07715v2)

Published 27 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Transformer-based LLMs have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximizing throughput under certain latency constraints, which prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model with minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and report throughput results under typical production constraints and environments. Our results outperform the state-of-the-art Neural Magic DeepSparse runtime by up to 50% and achieve up to a 4.1x speedup over ONNX Runtime. Source code is publicly available at https://github.com/intel/intel-extension-for-transformers.

Fast DistilBERT on CPUs: An Overview

The paper "Fast DistilBERT on CPUs" by Shen et al. demonstrates a comprehensive approach to improving the efficiency of Transformer models on CPUs through novel hardware-aware model compression techniques. The authors address the computational inefficiencies associated with large Transformer models, which are commonly employed for NLP but are often not feasible for deployment in production environments due to latency constraints.

Key Contributions

The core contributions of this research involve several innovative strategies:

  1. Hardware-aware Model Compression: The authors extend existing compression methods by combining block-wise structured sparsity, knowledge distillation, and post-training quantization, achieving extreme compression while maintaining accuracy. Specifically, the paper focuses on creating a highly efficient DistilBERT model for CPU inference (a minimal sketch of the pruning and quantization steps follows this list).
  2. Transformer Inference Engine: The paper also introduces a dedicated Transformer inference engine that effectively handles sparse and quantized models. This engine includes advanced runtime optimizations like memory allocation improvements and weight sharing, alongside optimized sparse General Matrix Multiplication (GEMM) operators for CPUs.
  3. State-of-the-art Performance: The authors present empirical results demonstrating that their techniques achieve significant performance gains over established frameworks such as Neural Magic's DeepSparse and ONNX Runtime. Notably, they report up to a 50% improvement over DeepSparse and a 4.1x speedup over ONNX Runtime under realistic production settings.
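
To make the compression steps in item 1 concrete, here is a minimal, self-contained sketch, not the authors' pipeline or the intel-extension-for-transformers toolkit, of block-wise structured magnitude pruning followed by post-training dynamic quantization applied to a Hugging Face DistilBERT checkpoint with plain PyTorch. The block size, sparsity ratio, and checkpoint name are illustrative assumptions, and the knowledge-distillation stage is omitted.

```python
import torch
from transformers import AutoModelForQuestionAnswering

BLOCK = 4  # illustrative block size; the paper's block shape is chosen to
           # match the target CPU kernels (hardware-aware), not this value

def block_prune_(linear: torch.nn.Linear, sparsity: float = 0.8) -> None:
    """Zero out whole BLOCK x BLOCK weight blocks with the lowest L1 norm."""
    w = linear.weight.data
    out_f, in_f = w.shape
    # View the weight matrix as a grid of BLOCK x BLOCK tiles.
    blocks = w.reshape(out_f // BLOCK, BLOCK, in_f // BLOCK, BLOCK)
    scores = blocks.abs().sum(dim=(1, 3))            # one score per tile
    k = int(scores.numel() * sparsity)               # number of tiles to drop
    idx = torch.topk(scores.flatten(), k, largest=False).indices
    mask = torch.ones_like(scores.flatten())
    mask[idx] = 0.0
    mask = mask.reshape(scores.shape)[:, None, :, None]
    linear.weight.data = (blocks * mask).reshape(out_f, in_f)

model = AutoModelForQuestionAnswering.from_pretrained(
    "distilbert-base-uncased-distilled-squad")

# Block-prune every linear layer whose dimensions divide evenly into blocks.
for module in model.modules():
    if (isinstance(module, torch.nn.Linear)
            and module.weight.shape[0] % BLOCK == 0
            and module.weight.shape[1] % BLOCK == 0):
        block_prune_(module, sparsity=0.8)

# Post-training dynamic quantization of the remaining dense weights to INT8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```

Note that zeroed blocks only translate into speedups when the runtime provides sparse GEMM kernels, as the authors' dedicated inference engine does; plain PyTorch still executes these weight matrices densely.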

Experimental Evaluation

The authors evaluate their methods on a popular NLP task, the question-answering benchmark SQuADv1.1, utilizing a sparsified and quantized DistilBERT model. They achieve competitive F1 scores while adhering to stringent hardware constraints typical of production environments.

The experimental setup compares maximum throughput and latency performance across different scenarios. The authors underscore performance improvements on AWS EC2 instances, highlighting the general applicability of their techniques across standard CPU configurations.
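
As a rough illustration of the throughput-and-latency measurement described above (not the authors' benchmarking harness), the sketch below times an exported INT8 DistilBERT model with ONNX Runtime on the CPU. The model path, input names, batch size, sequence length, and latency budget are all assumptions for illustration.

```python
import time
import numpy as np
import onnxruntime as ort

MODEL_PATH = "distilbert-squad-int8.onnx"   # hypothetical exported model
BATCH, SEQ_LEN = 32, 384                    # typical SQuAD-style settings
LATENCY_BUDGET_MS = 100.0                   # example production constraint

sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])

# Input names depend on how the model was exported; these are assumptions.
inputs = {
    "input_ids": np.random.randint(0, 30522, size=(BATCH, SEQ_LEN), dtype=np.int64),
    "attention_mask": np.ones((BATCH, SEQ_LEN), dtype=np.int64),
}

# Warm up, then time repeated runs.
for _ in range(5):
    sess.run(None, inputs)

latencies = []
for _ in range(50):
    start = time.perf_counter()
    sess.run(None, inputs)
    latencies.append((time.perf_counter() - start) * 1000.0)

p99 = float(np.percentile(latencies, 99))
throughput = BATCH / (np.mean(latencies) / 1000.0)  # samples per second
print(f"p99 latency: {p99:.1f} ms (budget {LATENCY_BUDGET_MS} ms), "
      f"throughput: {throughput:.1f} samples/s")
```

In practice, the batch size would be tuned to the largest value whose tail latency stays within the budget, which mirrors the throughput-under-latency-constraint setting the authors evaluate.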

Implications and Future Directions

Practically, this research has significant implications for deploying NLP models in hardware- and latency-constrained environments. The proposed compression pipeline and inference engine allow enterprises to serve Transformer-based language models on commodity CPUs without specialized accelerators.

Theoretically, this work paves the way for further exploration into the synergistic effect of model compression techniques and hardware-aware optimizations. Future developments could expand these strategies to larger models and other Transformer architectures, broadening the scope of AI applications feasible for CPU deployment.

Continued research could also explore the integration of similar techniques into other domains, potentially enhancing the efficiency and reach of AI capabilities across diverse applications.

Conclusion

This paper offers a detailed roadmap for achieving efficient, high-performance Transformer model inference on CPUs. By pushing the boundaries of model compression and optimized inference, the authors contribute valuable insights for both academia and industry. Their open-source code availability encourages further experimentation and collaboration within the research community, fostering continued advancements in AI deployment infrastructure.

Authors (10)
  1. Haihao Shen (11 papers)
  2. Ofir Zafrir (5 papers)
  3. Bo Dong (50 papers)
  4. Hengyu Meng (7 papers)
  5. Xinyu Ye (6 papers)
  6. Zhe Wang (574 papers)
  7. Yi Ding (92 papers)
  8. Hanwen Chang (4 papers)
  9. Guy Boudoukh (5 papers)
  10. Moshe Wasserblat (22 papers)
Citations (2)