
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design (2401.14112v2)

Published 25 Jan 2024 in cs.LG, cs.AI, and cs.AR

Abstract: Six-bit quantization (FP6) can effectively reduce the size of LLMs and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with irregular bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of floating-point weights for various quantization bit-widths. We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_LLM.

Introduction

LLMs have become central to numerous natural language processing tasks thanks to their ability to understand and generate human-like text. Deploying them, however, remains a significant challenge, principally because of their extensive memory requirements and computational costs. Conventional deployments store weights in larger-than-necessary data types such as FP16 during inference, which exacerbates these costs. Six-bit floating-point quantization (FP6) has been gaining recognition as a promising alternative, since it can balance inference cost and model quality.
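
To make the memory argument concrete, the sketch below compares weight-only memory footprints for a 70B-parameter model at FP16 and FP6. The 80 GB capacity figure (an A100/H100-class GPU) and the omission of KV-cache and activation memory are simplifying assumptions for illustration, not numbers taken from the paper.

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
# Illustrative only: ignores KV-cache, activations, and framework overhead.
params = 70e9

fp16_gb = params * 16 / 8 / 1e9   # 2 bytes per weight
fp6_gb  = params * 6  / 8 / 1e9   # 0.75 bytes per weight

print(f"FP16 weights: {fp16_gb:.1f} GB")   # ~140 GB: does not fit on one 80 GB GPU
print(f"FP6 weights:  {fp6_gb:.1f} GB")    # ~52.5 GB: leaves headroom on one GPU
```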

6-bit Quantization Challenges and FP6-Centric Solution

The paper observes that existing systems provide no Tensor Core support for the FP6 data type and therefore struggle to turn its memory savings into practical speedups. Two main hurdles stand in the way: unfriendly memory access patterns caused by the irregular 6-bit width of model weights, and the runtime overhead of de-quantizing weights back to FP16. To address these, the authors propose TC-FPx, a full-stack GPU kernel design scheme that is the first to provide unified Tensor Core support for floating-point weights across various quantization bit-widths. Integrating this kernel into their inference system, dubbed FP6-LLM, gives quantized LLM inference a better trade-off between inference cost and model quality.
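
To give a flavor of the de-quantization the kernel must perform at runtime, the scalar Python sketch below reconstructs an FP16 value from a 6-bit floating-point code using only bit placement and one multiply. The 1-sign/3-exponent/2-mantissa (E3M2) layout is an assumption for illustration; the actual vectorized CUDA kernels live in the linked repository.

```python
import numpy as np

def fp6_e3m2_to_fp16(x6: int) -> np.float16:
    """Illustrative FP6 (1 sign / 3 exponent / 2 mantissa, bias 3) -> FP16
    de-quantization: drop the payload bits into an FP16 word, then correct
    the exponent-bias mismatch with a single power-of-two multiply."""
    sign    = (x6 >> 5) & 0x1
    exp_man = x6 & 0x1F                       # 3 exponent bits + 2 mantissa bits

    # Shift left by 8 so the FP6 exponent lands in FP16 bits [12:10]
    # and the FP6 mantissa lands in FP16 bits [9:8].
    bits16 = np.uint16((sign << 15) | (exp_man << 8))
    raw = bits16.view(np.float16)

    # The placement implicitly used FP16's exponent bias (15) instead of
    # FP6's (3), so the value is low by 2^(15-3); one multiply fixes it.
    return np.float16(raw * np.float16(4096.0))

# Example: sign 0, exponent 0b011 (=3), mantissa 0b00 -> 2^(3-3) * 1.0 = 1.0
assert fp6_e3m2_to_fp16(0b001100) == np.float16(1.0)
```

In the paper's kernels, the analogous reconstruction happens in registers for many weights at once, so its cost can largely be hidden behind the Tensor Core matrix math.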

FP6-LLM Empirical Advantages

Empirical benchmarks show that FP6-LLM, built on the TC-FPx kernels, can serve models such as LLaMA-70b on a single GPU while raising normalized inference throughput by 1.69x-2.65x over the FP16 baseline. Crucially, these gains stem from design choices such as ahead-of-time bit-level pre-packing of weights and a SIMT-efficient de-quantization runtime, which together avoid irregular GPU memory accesses and reduce the de-quantization overhead during inference.
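
The pre-packing idea is that all of the awkward 6-bit alignment work is done once, offline, so the runtime kernel only ever issues aligned 32-bit (or wider) loads. The toy sketch below packs 6-bit codes densely into 32-bit words; the real TC-FPx layout is more elaborate (weights are also reordered to match the Tensor Core fragment layout), so treat this as the concept rather than the paper's exact format.

```python
import numpy as np

def prepack_6bit(codes: np.ndarray) -> np.ndarray:
    """Toy ahead-of-time packing: concatenate 6-bit weight codes into a dense
    little-endian bit stream stored as 32-bit words (16 codes -> 3 words)."""
    assert codes.ndim == 1 and len(codes) % 16 == 0
    bitbuf, nbits, words = 0, 0, []
    for c in codes:
        bitbuf |= (int(c) & 0x3F) << nbits    # append the next 6 bits
        nbits += 6
        while nbits >= 32:                    # flush every completed 32-bit word
            words.append(bitbuf & 0xFFFFFFFF)
            bitbuf >>= 32
            nbits -= 32
    return np.array(words, dtype=np.uint32)

packed = prepack_6bit(np.arange(16) % 64)
print(packed)   # 16 six-bit codes occupy exactly three uint32 words
```

Because the packing happens ahead of time, none of this bit shuffling appears on the inference critical path; the GPU kernel simply streams aligned words and reconstructs FP16 operands in registers.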

Conclusion

The paper concludes that FP6-LLM can efficiently serve LLMs through algorithm-system co-design. By extending the GPU kernel infrastructure with Tensor Core support for FP6, it opens the door to wider adoption of low-bit quantization strategies. FP6-LLM thus stands as a promising route to deploying large, computationally demanding LLMs more broadly, improving their practicality and accessibility.

Authors (13)
  1. Haojun Xia (4 papers)
  2. Zhen Zheng (39 papers)
  3. Xiaoxia Wu (30 papers)
  4. Shiyang Chen (23 papers)
  5. Zhewei Yao (64 papers)
  6. Stephen Youn (4 papers)
  7. Arash Bakhtiari (5 papers)
  8. Michael Wyatt (6 papers)
  9. Donglin Zhuang (4 papers)
  10. Zhongzhu Zhou (7 papers)
  11. Olatunji Ruwase (20 papers)
  12. Yuxiong He (59 papers)
  13. Shuaiwen Leon Song (35 papers)