Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs
The paper "Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs" addresses the challenges of deploying LLMs for inference on Arm-based CPUs. Given the ever-increasing size and complexity of LLMs, their inference demands significant computational and memory resources, which limits their deployment to high-performance computing environments. However, this research seeks to expand their applicability to more resource-constrained devices, such as smartphones and other edge devices, by developing efficient inference methods tailored for Arm CPU architectures.
The paper introduces a suite of optimized kernels designed specifically for Arm CPUs that leverage vector and matrix-multiply instructions to exploit the full capabilities of these processors. The kernels aim to mitigate the computational overheads of the group quantization formats widely used for sub-byte quantization of LLM weights. By optimizing the matrix-vector (GEMV) and matrix-matrix (GEMM) multiplication operations that dominate LLM inference, the paper demonstrates substantial gains in inference speed.
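To make the role of these instructions concrete, the following is a minimal sketch of one GEMV row computation built around the AArch64 SDOT dot-product intrinsic (vdotq_s32). It assumes int8 weights and activations with a single combined scale; it is not the paper's kernel, which operates directly on the interleaved sub-byte formats described below, but it illustrates how vector dot-product instructions accelerate the inner loop.

```c
// Hypothetical sketch: one GEMV output element via the AArch64 SDOT intrinsic.
// Compile with e.g.:  gcc -O3 -march=armv8.2-a+dotprod gemv_sketch.c
// Assumes n is a multiple of 16 and a single combined scale per row;
// the paper's kernels instead decode grouped sub-byte weights on the fly.
#include <arm_neon.h>
#include <stdint.h>

float gemv_row_s8(const int8_t *w_row,  // one row of quantized weights
                  const int8_t *x,      // quantized activations
                  int n,                // row length (multiple of 16 assumed)
                  float scale)          // combined weight * activation scale
{
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t wv = vld1q_s8(w_row + i);
        int8x16_t xv = vld1q_s8(x + i);
        // SDOT: four parallel 4-way int8 dot products accumulated into int32 lanes.
        acc = vdotq_s32(acc, wv, xv);
    }
    return (float)vaddvq_s32(acc) * scale;  // horizontal sum, then rescale
}
```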
A significant portion of the research is devoted to refining how the model's quantized weights are stored and decoded. Conventional formats handle weights in small groups, and unpacking those groups at runtime incurs excessive overhead. To address this, the authors propose an interleaved group data layout and optimized decompression paths that eliminate unnecessary operations. These advances target very low bit-width quantization (as low as 2 bits per weight) without substantial loss in the model's generation quality.
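For illustration, here is a minimal sketch of group-wise dequantization for a simple 4-bit block format: one scale per group of 32 weights, two weights packed per byte. The group size, layout, and uniform mapping are assumptions made for the example; the paper's interleaved layout and non-uniform, codebook-based decoding differ. The sketch only shows the per-group decompression work that sits on the inference critical path and that the optimized kernels aim to minimize.

```c
// Hypothetical sketch of group-wise dequantization for a simple 4-bit block
// format. Not the paper's interleaved or codebook-based format; it illustrates
// the per-group unpacking cost that grouped sub-byte quantization introduces.
#include <stdint.h>
#include <stddef.h>

#define GROUP_SIZE 32  // assumed group size for the example, not necessarily the paper's

typedef struct {
    float   scale;               // per-group scale factor
    uint8_t qs[GROUP_SIZE / 2];  // 32 weights packed as 4-bit values, 2 per byte
} q4_group_t;

// Unpack one group into fp32: w = scale * (q - 8), mapping the unsigned
// nibble range [0, 15] onto a symmetric signed range.
static void dequant_group_q4(const q4_group_t *g, float *out)
{
    for (size_t i = 0; i < GROUP_SIZE / 2; ++i) {
        uint8_t byte = g->qs[i];
        out[2 * i]     = g->scale * (float)((byte & 0x0F) - 8);
        out[2 * i + 1] = g->scale * (float)((byte >> 4)   - 8);
    }
}
```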
The experimental results are noteworthy: the optimized kernels deliver 3 to 3.2 times faster prompt processing and up to 2 times faster autoregressive decoding on a single Arm CPU core compared with conventional approaches such as those used in the LLaMA.cpp framework. In addition, the group-wise, non-uniform (codebook-based) quantization method improves output quality, achieving better perplexity than current state-of-the-art quantization methods.
The implications of this research are multifaceted. Practically, the ability to efficiently run LLMs on ubiquitous Arm CPUs opens up avenues for deploying these models on billions of end-user devices, enhancing applications from natural language processing to real-time language translation. Theoretically, this work contributes to the ongoing discourse on model compression and inference efficiency, particularly in navigating the trade-offs between computational precision and performance.
Looking forward, future research might build on this foundation by exploring other potential quantization schemes and algorithmic optimizations that are compatible with emerging Arm architectures. Additionally, extending these methods to other types of accelerators within the edge computing landscape could further enhance the mobility and applicability of LLMs, thereby widening access to advanced AI tools and technologies.