Efficient LLM Inference on CPUs (2311.00502v2)

Published 1 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical number of model parameters, which demands large memory capacity and high memory bandwidth. In this paper, we propose an effective approach that makes the deployment of LLMs more efficient. We support an automatic INT4 weight-only quantization flow and design a special LLM runtime with highly optimized kernels to accelerate LLM inference on CPUs. We demonstrate the general applicability of our approach on popular LLMs, including Llama2, Llama, and GPT-NeoX, and showcase the extreme inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.

Efficient LLM Inference on CPUs

The paper "Efficient LLM Inference on CPUs" addresses a critical challenge in the deployment of LLMs: the significant computational demands resulting from their vast parameter sizes. The authors propose a focused approach centered on the automatic quantization of model weights to the INT4 format and the design of a specialized runtime environment for CPU inference. This strategy aims to significantly reduce memory usage and enhance processing efficiency without substantial accuracy loss.

The research outlines two primary contributions: an automatic INT4 quantization flow and an efficient LLM runtime. The quantization flow leverages the Intel Neural Compressor to convert weights to INT4 while retaining precision. Because it quantizes only the weights and keeps activations at higher precision (e.g., FP16), it sidesteps the activation-outlier issues that limit traditional INT8 quantization. The efficacy of this process is underlined by the negligible accuracy loss observed across various model architectures, remaining within 1% of the FP32 baseline.
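To make the weight-only scheme concrete, below is a minimal NumPy sketch of symmetric, group-wise INT4 quantization and dequantization (group size 32 chosen for illustration). It mirrors the general idea rather than Intel Neural Compressor's actual API; the function names and the choice of a symmetric scheme are assumptions for illustration.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 32):
    """Symmetric, weight-only INT4 quantization with one FP32 scale per group.

    w: 1-D FP32 weight vector whose length is a multiple of group_size.
    Returns (q, scales) with q holding integers in [-8, 7].
    """
    groups = w.reshape(-1, group_size)
    # Scale each group so its max-magnitude weight maps to +/-7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_int4_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return (q.astype(np.float32) * scales).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)
    q, s = quantize_int4_groupwise(w)
    print("mean abs error:", np.abs(w - dequantize_int4_groupwise(q, s)).mean())
```

In a full implementation the 4-bit integers would also be packed two per byte, and asymmetric variants with zero points are possible; the sketch omits both for brevity.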

The proposed runtime, designed specifically for CPUs, incorporates a comprehensive tensor library and dispatches to the instruction sets available on the host, including AVX2, AVX512, and AMX. This design ensures compatibility with existing hardware features, particularly those in Intel's Xeon processors. The runtime achieves notable efficiency, with per-token generation latencies between 20 ms and 80 ms for models ranging from 3B to 20B parameters.
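As a rough illustration of what such kernels compute, the snippet below performs a matrix-vector product against INT4-quantized weights with per-group FP32 scales, dequantizing on the fly. The real runtime operates on packed 4-bit storage with AVX2, AVX512, or AMX intrinsics rather than NumPy; this is only a sketch of the arithmetic.

```python
import numpy as np

def int4_gemv(q, scales, x, group_size=32):
    """y = W @ x, with W stored as INT4 groups plus per-group FP32 scales.

    q:      (out_features, n_groups, group_size) int8 values in [-8, 7]
    scales: (out_features, n_groups, 1) float32
    x:      (in_features,) float32 activations, kept in full precision
    """
    x_groups = x.reshape(-1, group_size)        # (n_groups, group_size)
    w = q.astype(np.float32) * scales           # dequantize on the fly
    return np.einsum("ogk,gk->o", w, x_groups)  # accumulate per output row

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    out_f, in_f, g = 8, 128, 32
    w_fp32 = rng.standard_normal((out_f, in_f)).astype(np.float32)
    x = rng.standard_normal(in_f).astype(np.float32)
    # Quantize each row group-wise, using the same scheme as the earlier sketch.
    groups = w_fp32.reshape(out_f, -1, g)
    scales = np.abs(groups).max(axis=2, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    print("max abs diff vs FP32:", np.abs(int4_gemv(q, scales, x) - w_fp32 @ x).max())
```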

In terms of empirical evaluation, the paper demonstrates the general applicability of this approach across several LLMs, including Llama2, Llama, and GPT-NeoX. The results affirm the performance gains, with the INT4-enabled CPU inference exhibiting up to a 1.6x speed advantage over ggml-based solutions. This positions the introduced approach as a robust alternative to traditional GPU-based deployments, offering a practical solution for scenarios where CPU usage is preferred or necessary.
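For context, running 4-bit weight-only inference through the released intel-extension-for-transformers package looks roughly like the following. The class and argument names (notably load_in_4bit) follow the repository's published examples but are version-dependent, so treat them as assumptions rather than a fixed API.

```python
# Sketch based on the intel-extension-for-transformers examples; API names
# (AutoModelForCausalLM, load_in_4bit) may differ between releases.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # any supported causal LM
prompt = "Once upon a time"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit triggers the INT4 weight-only quantization flow for CPU inference.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=TextStreamer(tokenizer), max_new_tokens=64)
```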

The accuracy evaluations performed on datasets such as LAMBADA and HellaSwag support the conclusion that INT4 quantization keeps model performance close to the FP32 baselines. The performance assessments reveal the runtime's substantial improvements over existing solutions, emphasizing its applicability in real-world settings.

Notably, the runtime avoids memory reallocations during inference through effective KV cache optimizations. These modifications prevent unnecessary computational overhead and streamline operations, which is critical for rapid generation.
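The idea behind those cache optimizations can be shown with a toy, preallocated KV cache: buffers for the maximum sequence length are reserved once, so each decoding step writes in place instead of reallocating and copying. This is a simplified sketch, not the runtime's actual data structure.

```python
import numpy as np

class PreallocatedKVCache:
    """Toy per-layer KV cache that never reallocates during decoding."""

    def __init__(self, max_len: int, n_heads: int, head_dim: int):
        # Reserve space for the longest supported sequence up front.
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=np.float32)
        self.length = 0

    def append(self, k_step: np.ndarray, v_step: np.ndarray) -> None:
        """Write one decoding step's keys/values in place."""
        self.k[self.length] = k_step
        self.v[self.length] = v_step
        self.length += 1

    def view(self):
        """Return the populated prefix consumed by the attention kernel."""
        return self.k[: self.length], self.v[: self.length]

if __name__ == "__main__":
    cache = PreallocatedKVCache(max_len=2048, n_heads=32, head_dim=128)
    for _ in range(4):  # four decoding steps
        cache.append(np.zeros((32, 128), np.float32), np.zeros((32, 128), np.float32))
    k, v = cache.view()
    print(k.shape, v.shape)  # (4, 32, 128) (4, 32, 128)
```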

Future work suggested by the authors includes further enhancements to the CPU tensor library and community contributions to extend this capability within open-source ecosystems like Hugging Face. Moreover, the paper opens pathways for broader adoption across personal computing platforms, underlining the growing versatility and accessibility of AI technologies on commodity hardware.

In summary, this paper provides a well-grounded and empirically validated framework for deploying LLMs efficiently on CPUs. By leveraging INT4 quantization and optimized runtime environments, it contributes significantly to the ongoing discourse on making AI more accessible, cost-effective, and energy-efficient. The implications for practical AI applications are extensive, particularly in scenarios where computational resources are constrained or GPU availability is limited.

References (29)
  1. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
  2. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  3. Teq: Trainable equivalent transformation for quantization of llms. arXiv preprint arXiv:2310.10944, 2023a.
  4. Optimize weight rounding via signed gradient descent for the quantization of llms. arXiv preprint arXiv:2309.05516, 2023b.
  5. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
  6. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  7. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
  8. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, 2015. URL https://arxiv.org/abs/1510.00149.
  9. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  10. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.
  11. Fp8 quantization: The power of the exponent. Advances in Neural Information Processing Systems, 35:14651–14662, 2022.
  12. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  13. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
  14. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022.
  15. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
  16. Lower numerical precision deep learning inference and training. Intel White Paper, 3(1):19, 2018.
  17. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  18. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  19. Efficient post-training quantization with fp8 formats, 2023.
  20. Hybrid 8-bit floating point (hfp8) training and inference for deep neural networks. Advances in neural information processing systems, 32, 2019.
  21. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  22. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  23. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
  24. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  25. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145, 2023.
  26. Zeroquant-fp: A leap forward in llms post-training w4a8 quantization using floating-point formats. arXiv preprint arXiv:2307.09782, 2023.
  27. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
  28. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  29. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Authors (5)
  1. Haihao Shen (11 papers)
  2. Hanwen Chang (4 papers)
  3. Bo Dong (50 papers)
  4. Yu Luo (143 papers)
  5. Hengyu Meng (7 papers)
Citations (11)