Overview of "1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs"
The paper presents optimizations for 1-bit LLMs, focusing on BitNet b1.58, that improve inference speed and energy consumption and thereby ease the deployment of LLMs across a wide range of devices. Its key contribution is the introduction of bitnet.cpp, a software framework tailored for fast and lossless inference of ternary BitNet b1.58 models on CPUs.
Framework Design and Implementation
The framework, bitnet.cpp, consists of optimized kernels that enable efficient inference on both x86 and ARM CPU architectures. The paper discusses several kernel optimizations for 1.58-bit models:
- I2_S Kernel: Converts the full-precision weights into a 2-bit ternary representation, reducing memory footprint and bandwidth usage during inference (see the packing sketch after this list).
- TL1 and TL2 Kernels: Compress groups of weights into lookup-table indices so that most multiply-adds are replaced by table lookups (see the lookup-table sketch after this list). TL2 compresses more weights per index than TL1, achieving a higher compression ratio that makes it well suited to memory-constrained environments.
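To make the I2_S idea concrete, below is a minimal sketch of packing ternary weights into 2-bit codes and computing a dot product against int8 activations. The encoding, function names, and layout are illustrative assumptions for this summary, not the actual bitnet.cpp kernel.

```cpp
// Sketch of 2-bit ternary weight packing (I2_S-style idea).
// Encoding and names are assumptions, not the actual bitnet.cpp code.
#include <cstdint>
#include <cstdio>
#include <vector>

// Map a ternary weight {-1, 0, +1} to a 2-bit code {0, 1, 2} and back.
static uint8_t encode(int8_t w) { return static_cast<uint8_t>(w + 1); }
static int8_t  decode(uint8_t c) { return static_cast<int8_t>(c) - 1; }

// Pack four ternary weights into each byte.
std::vector<uint8_t> pack_ternary(const std::vector<int8_t>& w) {
    std::vector<uint8_t> out((w.size() + 3) / 4, 0);
    for (size_t i = 0; i < w.size(); ++i)
        out[i / 4] |= encode(w[i]) << (2 * (i % 4));
    return out;
}

// Dot product of packed ternary weights with int8 activations,
// unpacking 2-bit codes on the fly (the "multiply" is just add/sub/skip).
int32_t dot_packed(const std::vector<uint8_t>& packed,
                   const std::vector<int8_t>& act) {
    int32_t acc = 0;
    for (size_t i = 0; i < act.size(); ++i) {
        int8_t w = decode((packed[i / 4] >> (2 * (i % 4))) & 0x3);
        acc += static_cast<int32_t>(w) * act[i];
    }
    return acc;
}

int main() {
    std::vector<int8_t> w = {-1, 0, 1, 1, -1, -1, 0, 1};
    std::vector<int8_t> a = { 3, 5, -2, 7,  4, -6, 1, 2};
    // Prints dot = 6 for this toy example; weights occupy 2 bits each.
    printf("dot = %d\n", dot_packed(pack_ternary(w), a));
    return 0;
}
```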
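Similarly, the following sketch illustrates the TL1-style lookup-table idea under the assumption that weights are grouped in pairs: each pair of ternary weights becomes an index into a small table of precomputed partial sums over the corresponding activation pair. The grouping, table layout, and names are assumptions for illustration only; in a real GEMV the per-pair tables are shared across many weight rows, which is where the savings over per-element multiply-adds come from.

```cpp
// Sketch of a TL1-style lookup-table dot product.
// Grouping, table layout, and names are assumptions, not the bitnet.cpp kernel.
#include <cstdint>
#include <cstdio>
#include <vector>

// Encode a pair of ternary weights {-1,0,+1}^2 as an index in [0, 9).
static uint8_t encode_pair(int8_t w0, int8_t w1) {
    return static_cast<uint8_t>((w0 + 1) * 3 + (w1 + 1));
}

// Offline: compress a ternary weight row into per-pair indices.
std::vector<uint8_t> compress_weights(const std::vector<int8_t>& w) {
    std::vector<uint8_t> idx(w.size() / 2);
    for (size_t i = 0; i < idx.size(); ++i)
        idx[i] = encode_pair(w[2 * i], w[2 * i + 1]);
    return idx;
}

// Online: for each activation pair, precompute all 9 possible partial sums
// once, then replace multiply-adds with table lookups indexed by the weights.
// In a full GEMV, each table would be reused across every output row.
int32_t dot_tl1(const std::vector<uint8_t>& idx, const std::vector<int8_t>& act) {
    int32_t acc = 0;
    for (size_t i = 0; i < idx.size(); ++i) {
        const int32_t a0 = act[2 * i], a1 = act[2 * i + 1];
        int32_t lut[9];
        for (int s0 = -1; s0 <= 1; ++s0)
            for (int s1 = -1; s1 <= 1; ++s1)
                lut[(s0 + 1) * 3 + (s1 + 1)] = s0 * a0 + s1 * a1;
        acc += lut[idx[i]];
    }
    return acc;
}

int main() {
    std::vector<int8_t> w = {-1, 0, 1, 1, -1, -1, 0, 1};
    std::vector<int8_t> a = { 3, 5, -2, 7,  4, -6, 1, 2};
    // Prints dot = 6, matching the plain ternary dot product above.
    printf("dot = %d\n", dot_tl1(compress_weights(w), a));
    return 0;
}
```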
Performance and Energy Efficiency
Extensive experiments validate the framework's performance: bitnet.cpp outperforms llama.cpp with speedups from 1.37x to 6.46x, depending on CPU architecture and model size, and larger models (13B and above) benefit the most from these optimizations.
Energy consumption is also reduced substantially: by up to 70% on ARM CPUs and by as much as 82.2% on x86. These improvements highlight bitnet.cpp's potential for practical LLM deployment in resource-limited settings.
Inference Accuracy
The paper highlights bitnet.cpp's ability to perform exact, lossless inference. Tests with a 700M BitNet b1.58 model confirmed 100% matching outputs, showing that the efficiency gains come without any loss of model precision.
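As a rough illustration of what such a losslessness check amounts to (not the paper's actual test harness), one can run the optimized and the reference inference paths on the same prompts with greedy decoding and require token-for-token agreement:

```cpp
// Minimal sketch of an exactness check between two inference runs.
// The token values below are stand-ins; in practice they would come from
// running the optimized kernels and a reference path on the same prompt.
#include <cstdio>
#include <vector>

using TokenSeq = std::vector<int>;

// Returns true iff both runs produced exactly the same token sequence.
bool lossless(const TokenSeq& reference, const TokenSeq& optimized) {
    return reference == optimized;
}

int main() {
    TokenSeq ref = {17, 4021, 88, 930};
    TokenSeq opt = {17, 4021, 88, 930};
    printf("lossless: %s\n", lossless(ref, opt) ? "yes" : "no");
    return 0;
}
```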
Implications and Future Directions
The development of bitnet.cpp has significant implications for both theoretical and practical advancements:
- Theoretical: This work pushes the boundaries of low-bit LLM inference, showing that substantial gains in speed and energy efficiency can be achieved without sacrificing accuracy.
- Practical: By enabling efficient local inference on CPUs, this framework facilitates the deployment of large models in diverse environments, including mobile and edge devices.
Future work, as outlined in the paper, will extend bitnet.cpp to additional platforms such as NPUs and GPUs and will further explore optimizations for 1-bit LLM training, pointing to continued co-design of hardware and software infrastructure for low-bit models.
In conclusion, the paper makes a substantial contribution to 1-bit LLM optimization, demonstrating clear gains in inference efficiency that broaden the applicability of LLMs to a wider range of devices and applications.