Efficient LLM Inference on CPUs (2311.00502v2)

Published 1 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical number of model parameters, which demands large memory capacity and high memory bandwidth. In this paper, we propose an effective approach that makes the deployment of LLMs more efficient. We support an automatic INT4 weight-only quantization flow and design a special LLM runtime with highly optimized kernels to accelerate LLM inference on CPUs. We demonstrate the general applicability of our approach on popular LLMs, including Llama2, Llama, and GPT-NeoX, and showcase the extreme inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.

Efficient LLM Inference on CPUs

The paper "Efficient LLM Inference on CPUs" addresses a critical challenge in the deployment of LLMs: the significant computational demands resulting from their vast parameter sizes. The authors propose a focused approach centered on the automatic quantization of model weights to the INT4 format and the design of a specialized runtime environment for CPU inference. This strategy aims to significantly reduce memory usage and enhance processing efficiency without substantial accuracy loss.

The research outlines two primary contributions: an automatic INT4 quantization flow and an efficient LLM runtime. The quantization flow leverages the Intel Neural Compressor to convert weights to INT4 while retaining precision. Because it quantizes only the weights and keeps activations at higher precision (e.g., FP16), it sidesteps the activation-outlier issues that limit traditional INT8 quantization. The efficacy of this process is underlined by the negligible accuracy loss observed across various model architectures, remaining within 1% of the FP32 baseline.
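To make the weight-only scheme concrete, below is a minimal NumPy sketch of symmetric, group-wise INT4 quantization and dequantization (group size 32 chosen for illustration). It mirrors the general idea rather than Intel Neural Compressor's actual API; the function names and the choice of a symmetric scheme are assumptions for illustration.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 32):
    """Symmetric, weight-only INT4 quantization with one FP32 scale per group.

    w: 1-D FP32 weight vector whose length is a multiple of group_size.
    Returns (q, scales) with q holding integers in [-8, 7].
    """
    groups = w.reshape(-1, group_size)
    # Scale each group so its max-magnitude weight maps to +/-7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_int4_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return (q.astype(np.float32) * scales).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)
    q, s = quantize_int4_groupwise(w)
    print("mean abs error:", np.abs(w - dequantize_int4_groupwise(q, s)).mean())
```

In a full implementation the 4-bit integers would also be packed two per byte, and asymmetric variants with zero points are possible; the sketch omits both for brevity.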

The proposed runtime, designed specifically for CPUs, incorporates a comprehensive tensor library and dispatches to the instruction sets available on the host, including AVX2, AVX512, and AMX. This design ensures compatibility with existing hardware features, particularly those in Intel's Xeon processors. The runtime achieves notable efficiency, with per-token generation latencies between 20 ms and 80 ms for models ranging from 3B to 20B parameters.
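As a rough illustration of what such kernels compute, the snippet below performs a matrix-vector product against INT4-quantized weights with per-group FP32 scales, dequantizing on the fly. The real runtime operates on packed 4-bit storage with AVX2, AVX512, or AMX intrinsics rather than NumPy; this is only a sketch of the arithmetic.

```python
import numpy as np

def int4_gemv(q, scales, x, group_size=32):
    """y = W @ x, with W stored as INT4 groups plus per-group FP32 scales.

    q:      (out_features, n_groups, group_size) int8 values in [-8, 7]
    scales: (out_features, n_groups, 1) float32
    x:      (in_features,) float32 activations, kept in full precision
    """
    x_groups = x.reshape(-1, group_size)        # (n_groups, group_size)
    w = q.astype(np.float32) * scales           # dequantize on the fly
    return np.einsum("ogk,gk->o", w, x_groups)  # accumulate per output row

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    out_f, in_f, g = 8, 128, 32
    w_fp32 = rng.standard_normal((out_f, in_f)).astype(np.float32)
    x = rng.standard_normal(in_f).astype(np.float32)
    # Quantize each row group-wise, using the same scheme as the earlier sketch.
    groups = w_fp32.reshape(out_f, -1, g)
    scales = np.abs(groups).max(axis=2, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    print("max abs diff vs FP32:", np.abs(int4_gemv(q, scales, x) - w_fp32 @ x).max())
```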

In terms of empirical evaluation, the paper demonstrates the general applicability of this approach across several LLMs, including Llama2, Llama, and GPT-NeoX. The results affirm the performance gains, with the INT4-enabled CPU inference exhibiting up to a 1.6x speed advantage over ggml-based solutions. This positions the introduced approach as a robust alternative to traditional GPU-based deployments, offering a practical solution for scenarios where CPU usage is preferred or necessary.
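For context, running 4-bit weight-only inference through the released intel-extension-for-transformers package looks roughly like the following. The class and argument names (notably load_in_4bit) follow the repository's published examples but are version-dependent, so treat them as assumptions rather than a fixed API.

```python
# Sketch based on the intel-extension-for-transformers examples; API names
# (AutoModelForCausalLM, load_in_4bit) may differ between releases.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # any supported causal LM
prompt = "Once upon a time"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit triggers the INT4 weight-only quantization flow for CPU inference.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=TextStreamer(tokenizer), max_new_tokens=64)
```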

The accuracy evaluations performed on datasets such as LAMBADA and HellaSwag support the conclusion that INT4 quantization keeps model performance close to the FP32 baselines. The performance assessments reveal the runtime's substantial improvements over existing solutions, emphasizing its applicability in real-world settings.

Notably, the runtime avoids memory reallocations during inference through effective KV cache optimizations. These modifications prevent unnecessary computational overhead and streamline operations, which is critical for rapid generation.
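The idea behind those cache optimizations can be shown with a toy, preallocated KV cache: buffers for the maximum sequence length are reserved once, so each decoding step writes in place instead of reallocating and copying. This is a simplified sketch, not the runtime's actual data structure.

```python
import numpy as np

class PreallocatedKVCache:
    """Toy per-layer KV cache that never reallocates during decoding."""

    def __init__(self, max_len: int, n_heads: int, head_dim: int):
        # Reserve space for the longest supported sequence up front.
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.length = 0

    def append(self, k_step: np.ndarray, v_step: np.ndarray) -> None:
        """Write one decoding step's keys/values in place."""
        self.k[self.length] = k_step
        self.v[self.length] = v_step
        self.length += 1

    def view(self):
        """Return the populated prefix consumed by the attention kernel."""
        return self.k[: self.length], self.v[: self.length]

if __name__ == "__main__":
    cache = PreallocatedKVCache(max_len=2048, n_heads=32, head_dim=128)
    for _ in range(4):  # four decoding steps
        cache.append(np.zeros((32, 128), np.float32), np.zeros((32, 128), np.float32))
    k, v = cache.view()
    print(k.shape, v.shape)  # (4, 32, 128) (4, 32, 128)
```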

Future work suggested by the authors includes further enhancements to the CPU tensor library and community contributions to extend this capability within open-source ecosystems like Hugging Face. Moreover, the paper opens pathways for broader adoption across personal computing platforms, underlining the growing versatility and accessibility of AI technologies on commodity hardware.

In summary, this paper provides a well-grounded and empirically validated framework for deploying LLMs efficiently on CPUs. By leveraging INT4 quantization and optimized runtime environments, it contributes significantly to the ongoing discourse on making AI more accessible, cost-effective, and energy-efficient. The implications for practical AI applications are extensive, particularly in scenarios where computational resources are constrained or GPU availability is limited.

References (29)
  1. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
  2. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  3. Teq: Trainable equivalent transformation for quantization of llms. arXiv preprint arXiv:2310.10944, 2023a.
  4. Optimize weight rounding via signed gradient descent for the quantization of llms. arXiv preprint arXiv:2309.05516, 2023b.
  5. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
  6. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  7. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
  8. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, 2015. URL https://arxiv.org/abs/1510.00149.
  9. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  10. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.
  11. Fp8 quantization: The power of the exponent. Advances in Neural Information Processing Systems, 35:14651–14662, 2022.
  12. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  13. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
  14. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022.
  15. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
  16. Lower numerical precision deep learning inference and training. Intel White Paper, 3(1):19, 2018.
  17. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  18. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  19. Efficient post-training quantization with fp8 formats, 2023.
  20. Hybrid 8-bit floating point (hfp8) training and inference for deep neural networks. Advances in neural information processing systems, 32, 2019.
  21. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  22. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  23. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
  24. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  25. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145, 2023.
  26. Zeroquant-fp: A leap forward in llms post-training w4a8 quantization using floating-point formats. arXiv preprint arXiv:2307.09782, 2023.
  27. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
  28. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  29. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Authors (5)
  1. Haihao Shen (11 papers)
  2. Hanwen Chang (4 papers)
  3. Bo Dong (50 papers)
  4. Yu Luo (143 papers)
  5. Hengyu Meng (7 papers)
Citations (11)