T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge (2407.00088v2)

Published 25 Jun 2024 in cs.DC and cs.AI

Abstract: The deployment of LLMs on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token-generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like the Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC .

T-MAC: Enhancing Low-Bit LLM Deployment on Edge Devices Using Table Lookup

The deployment of LLMs on edge devices is a growing area of interest, driven by the promise of improved on-device intelligence and reduced response latency. The paper presents T-MAC, a system that optimizes inference of low-bit, weight-quantized LLMs on CPUs by replacing conventional multiply-accumulate arithmetic with a lookup table (LUT)-based approach.

Weight Quantization and Mixed Precision Challenges

Weight quantization is crucial for reducing the memory footprint of LLMs on edge devices. However, inference then requires multiplying low-bit weights by high-precision activations, i.e., mixed-precision General Matrix Multiplication (mpGEMM), an operation that current hardware does not support natively. Existing systems instead dequantize the weights back to high precision before computing, an indirect path that introduces significant overhead and performance bottlenecks.
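
To illustrate this indirect path, here is a minimal Python sketch (an illustration, not llama.cpp's actual kernel) of a dot product with uniformly quantized b-bit weights: each weight must be dequantized back to floating point before every multiply-accumulate, so the low-bit format saves memory but adds work to the compute loop.

```python
def dot_dequantize(w_q, scale, zero_point, x):
    """Dot product of b-bit quantized weights w_q with fp activations x,
    via the conventional dequantize-then-multiply path."""
    acc = 0.0
    for wq, xi in zip(w_q, x):
        w = scale * (wq - zero_point)   # dequantize: extra work per element
        acc += w * xi                   # high-precision multiply-accumulate
    return acc

w_q = [3, 0, 2, 1, 3, 3, 0, 1]          # 2-bit weights, values 0..3
x   = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]
print(dot_dequantize(w_q, scale=0.1, zero_point=2, x=x))  # 0.5
```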

T-MAC Approach

The T-MAC framework introduces a LUT-based methodology that performs mpGEMM directly, without dequantization, and in doing so removes multiplications from the inner loop. The key idea is to transform traditional data-type-centric multiplication into bit-wise table lookups, yielding a single, unified kernel whose cost scales linearly with the weight bit-width rather than requiring a separate kernel for each weight/activation precision pairing.
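
To make this concrete, below is a minimal Python sketch of the bit-wise lookup idea, mirroring the dequantization baseline above. It assumes the same uniform quantization (w = scale * (w_q - zero_point)); the group size g=4 with 16-entry tables follows the paper's description, but the function itself (`dot_lut`) is illustrative, not T-MAC's optimized SIMD kernel.

```python
def dot_lut(w_q, scale, zero_point, x, bits=2, g=4):
    """Dot product of b-bit weights with fp activations, T-MAC style:
    bit-wise table lookup instead of per-element dequantize-and-multiply."""
    assert len(w_q) == len(x) and len(x) % g == 0

    # Precompute, per group of g activations, all 2**g subset sums.
    # Each later "multiplication" by a g-bit weight pattern is one lookup.
    tables = [[sum(x[base + k] for k in range(g) if p >> k & 1)
               for p in range(1 << g)]
              for base in range(0, len(x), g)]

    # Bit-serial accumulation: one lookup per (bit-plane, group) pair;
    # the inner loop contains no floating-point multiplications.
    acc = 0.0
    for j in range(bits):                        # weight bit-planes, LSB first
        plane = 0.0
        for gi, base in enumerate(range(0, len(w_q), g)):
            idx = 0                              # pack g weight bits -> LUT index
            for k in range(g):
                idx |= (w_q[base + k] >> j & 1) << k
            plane += tables[gi][idx]
        acc += plane * (1 << j)                  # weight of this bit-plane

    # Fold scale and zero-point back in once, outside the hot loop.
    return scale * (acc - zero_point * sum(x))

w_q = [3, 0, 2, 1, 3, 3, 0, 1]                   # 2-bit weights, values 0..3
x   = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]
print(dot_lut(w_q, scale=0.1, zero_point=2, x=x))  # 0.5, matches the baseline
```

Note that the table-build cost is amortized: in a full GEMM, each activation group's 16-entry table is reused across every row of the weight matrix, which is where the savings over per-element dequantization come from.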

Performance Evaluation

T-MAC was evaluated on low-bit Llama and BitNet models, exhibiting substantial improvements over state-of-the-art implementations such as llama.cpp: up to a 4x increase in throughput and a 70% reduction in energy consumption. On BitNet-b1.58-3B, T-MAC achieved token-generation throughputs of 30 tokens/s on a single core and 71 tokens/s on eight cores of an M2-Ultra, and still delivered 11 tokens/s on resource-constrained platforms like the Raspberry Pi 5, above the average adult reading speed.

Technical Innovations

  • Table Lookup Optimization: By precomputing the partial sums of activations for every possible low-bit weight pattern and storing them in LUTs, T-MAC substitutes traditional multiplication with lookup operations, supported by optimized memory layouts and register management.
  • Data Layout and Reduction Techniques: Techniques such as axis reordering and LUT-centric tiling allow data to be processed more efficiently, reducing redundant operations and optimizing both speed and memory footprint (a packing sketch follows this list).
  • System Implementation: The T-MAC system is implemented across various processors and edge devices, providing open-source access for further development and integration, underscoring its versatility and practical applicability.
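
As a sketch of the layout idea in the second bullet: the bit extraction can be hoisted out of the inference loop entirely by repacking weights offline, bit-plane-major, with each entry a ready-made LUT index. The function below is a hypothetical illustration consistent with the `dot_lut` sketch above; T-MAC's actual layout additionally tiles for cache lines and SIMD registers, which this omits.

```python
def pack_bitplanes(w_q, bits=2, g=4):
    """Offline repacking ("axis reordering"): b-bit weights -> bit-plane-major
    groups, where planes[j][group] is the LUT index (0..2**g - 1) that the
    inference inner loop can gather directly, with no per-element bit math."""
    assert len(w_q) % g == 0
    planes = []
    for j in range(bits):
        idxs = []
        for base in range(0, len(w_q), g):
            idx = 0
            for k in range(g):
                idx |= (w_q[base + k] >> j & 1) << k
            idxs.append(idx)
        planes.append(idxs)
    return planes

print(pack_bitplanes([3, 0, 2, 1, 3, 3, 0, 1]))  # [[9, 11], [5, 3]]
```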

Implications and Future Directions

The implications of T-MAC are twofold:

  1. Practically, it provides a framework for running low-bit LLMs on edge devices with competitive efficiency without relying heavily on GPUs.
  2. Theoretically, it opens avenues for research into more efficient computational paradigms and hardware architectures based on the LUT paradigm, potentially influencing future hardware designs tailored to AI operations.

Future research could explore further optimization of LUT methods, alternative quantization strategies, and the applicability of T-MAC to other model architectures beyond LLMs. Additionally, adaptable versions for different hardware architectures could broaden the utility of T-MAC, reinforcing its relevance in the evolving landscape of edge computing and AI deployment.

Authors (7)
  1. Jianyu Wei
  2. Shijie Cao
  3. Ting Cao
  4. Lingxiao Ma
  5. Lei Wang
  6. Yanyong Zhang
  7. Mao Yang