
Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs (2501.00032v1)

Published 23 Dec 2024 in cs.LG, cs.AI, cs.AR, and cs.CL

Abstract: LLMs have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering them unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. This, along with the introduction of an optimized interleaved group data layout for weights and decompression path optimizations to reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to a LLaMA.cpp-based solution. The optimized kernels are available at https://github.com/ggerganov/llama.cpp.

Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs

The paper "Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs" addresses the challenges of deploying LLMs for inference on Arm-based CPUs. Given the ever-increasing size and complexity of LLMs, their inference demands significant computational and memory resources, which limits their deployment to high-performance computing environments. However, this research seeks to expand their applicability to more resource-constrained devices, such as smartphones and other edge devices, by developing efficient inference methods tailored for Arm CPU architectures.

The paper introduces a suite of optimized kernels designed for Arm CPUs that leverage vector and matrix-multiply instructions to exploit the full potential of these processors. The kernels target the compute overheads of the group quantization formats widely used for sub-byte quantization of LLMs, in which weights must be unpacked and rescaled at run time before any useful multiply-accumulate work is done. By optimizing the matrix-vector (GEMV) and matrix-matrix (GEMM) multiplications at the heart of LLM inference, and by amortizing operand loads and weight unpacking across multiple output rows, the paper demonstrates substantial improvements in inference throughput.
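To make the amortization idea concrete, the scalar sketch below shows a 4-bit group-quantized GEMM in which each group of a row's packed weights is dequantized once and then reused for every column of the input batch, so the unpacking cost is paid per group rather than per multiply-accumulate. This is an illustrative sketch under assumed conventions, not the paper's kernels: the block layout (`QBlock`), group size, and loop structure are invented for exposition, and the real kernels additionally use Arm vector and matrix-multiply instructions together with an interleaved weight layout.

```cpp
#include <cstdint>
#include <cstddef>

constexpr int GROUP = 32;  // weights per quantization group (assumed)

// Packed block: one float scale per group, two 4-bit weights per byte (assumed format).
struct QBlock {
    float   scale;
    uint8_t qs[GROUP / 2];
};

// C[r][c] += dot(W[r, :], X[:, c]) for `rows` output rows and `cols` input columns.
// W is stored row-major as blocks[r * nGroups + g]; X is column-major.
void gemm_q4(const QBlock* blocks, size_t nGroups,
             const float* X, size_t cols,
             float* C, size_t rows) {
    const size_t k = nGroups * GROUP;  // inner (reduction) dimension
    for (size_t r = 0; r < rows; ++r) {
        for (size_t g = 0; g < nGroups; ++g) {
            const QBlock& b = blocks[r * nGroups + g];
            // Unpack and rescale this group's weights once...
            float w[GROUP];
            for (int i = 0; i < GROUP / 2; ++i) {
                w[2 * i]     = float((b.qs[i] & 0x0F) - 8) * b.scale;
                w[2 * i + 1] = float((b.qs[i] >> 4)   - 8) * b.scale;
            }
            // ...then reuse the unpacked group for every input column, so the
            // dequantization cost is amortized over the whole batch.
            for (size_t c = 0; c < cols; ++c) {
                const float* xg = X + c * k + g * GROUP;
                float acc = 0.0f;
                for (int i = 0; i < GROUP; ++i) acc += w[i] * xg[i];
                C[r * cols + c] += acc;
            }
        }
    }
}
```

In a batch-1 GEMV the same reuse is not available across columns, which is why the paper instead amortizes operand loads and unpacking across multiple output rows; the sketch only illustrates the general principle.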

A significant portion of the research is devoted to refining how the model's weights are quantized and unpacked. Conventional group quantization formats incur excessive runtime cost because the weights must be dequantized group by group during inference. To address this, the authors propose an interleaved group data layout for the weights and optimized decompression paths that remove unnecessary operations. They also introduce a group-wise, non-uniform codebook-based quantization scheme that better matches the non-uniform patterns in LLM weight distributions, pushing quantization to ultra-low bit widths (as low as 2 bits per weight) without substantial loss in generation quality.
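As a rough illustration of the group-wise codebook idea, the sketch below stores, for each group of weights, a small table of non-uniform float levels and a 2-bit index per weight, so dequantization becomes a table lookup rather than a uniform scale-and-offset. The structure names, group size, and codebook size are assumptions for exposition; they do not reproduce the paper's exact format or how the codebooks are fitted (e.g., by clustering each group's weights).

```cpp
#include <cstdint>

constexpr int GROUP    = 32;  // weights per group (assumed)
constexpr int CODEBOOK = 4;   // 2-bit indices -> 4 levels per group (assumed)

// Per-group codebook block: non-uniform levels fit to this group's weight
// distribution, plus four 2-bit indices packed into each byte.
struct CodebookBlock {
    float   centroids[CODEBOOK];
    uint8_t idx[GROUP / 4];
};

// Dequantize one group: a lookup per weight instead of scale * integer math,
// letting the reconstruction levels follow the non-uniform weight distribution.
void dequant_group(const CodebookBlock& b, float* out) {
    for (int i = 0; i < GROUP; ++i) {
        uint8_t packed = b.idx[i / 4];
        uint8_t code   = (packed >> (2 * (i % 4))) & 0x03;
        out[i] = b.centroids[code];
    }
}
```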

The experimental results are noteworthy: the optimized kernels deliver 3 to 3.2 times faster prompt processing and up to 2 times faster autoregressive decoding on a single Arm CPU core compared to the conventional kernels used in the LLaMA.cpp framework. In addition, the group-wise, non-uniform codebook quantization method achieves better (lower) perplexity than current state-of-the-art quantization methods.

The implications of this research are multifaceted. Practically, the ability to efficiently run LLMs on ubiquitous Arm CPUs opens up avenues for deploying these models on billions of end-user devices, enhancing applications from natural language processing to real-time language translation. Theoretically, this work contributes to the ongoing discourse on model compression and inference efficiency, particularly in navigating the trade-offs between computational precision and performance.

Looking forward, future research might build on this foundation by exploring other potential quantization schemes and algorithmic optimizations that are compatible with emerging Arm architectures. Additionally, extending these methods to other types of accelerators within the edge computing landscape could further enhance the mobility and applicability of LLMs, thereby widening access to advanced AI tools and technologies.

Authors (4)
  1. Dibakar Gope (17 papers)
  2. David Mansell (1 paper)
  3. Danny Loh (4 papers)
  4. Ian Bratt (1 paper)