A Speed Odyssey for Deployable Quantization of LLMs (2311.09550v1)

Published 16 Nov 2023 in cs.LG and cs.CL

Abstract: The LLM era demands faster and less costly inference. Prior model compression work on LLMs tends to take a software-centric approach focused primarily on simulated quantization performance. By neglecting deployment feasibility, these approaches are often unusable in practice: they drastically push down the quantization bit width to reduce computation in ways mainstream hardware may not support, or they involve sophisticated algorithms that introduce extra computation or memory-access overhead. We argue that a hardware-centric approach to constructing quantization algorithms is crucial. Accordingly, we build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments demonstrate the superiority of our W4A8 method, which delivers actual speedups of up to 4× over Hugging Face FP16 inference, 2.23× over the state-of-the-art inference engine TensorRT-LLM in FP16, and 1.45× over TensorRT-LLM in INT8, without substantially harming model quality.
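
To make the W4A8 scheme concrete, here is a minimal sketch of what "4-bit weights, 8-bit activations" means at the arithmetic level. This is a generic per-tensor symmetric quantization illustration, not the paper's OdysseyLLM recipe or its FastGEMM kernel; the function and variable names are hypothetical, and a real implementation would use per-channel scales and hardware INT8 GEMM instructions.

```python
import numpy as np

def quantize_symmetric(x, n_bits):
    """Symmetric per-tensor quantization to signed n_bits integers.

    Illustrative only; assumes a single scale per tensor, whereas
    deployable schemes typically use finer-grained scales.
    """
    qmax = 2 ** (n_bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)   # toy weight matrix
A = rng.standard_normal((8, 2)).astype(np.float32)   # toy activations

qW, sW = quantize_symmetric(W, 4)   # W4: 4-bit weights
qA, sA = quantize_symmetric(A, 8)   # A8: 8-bit activations

# Integer matmul with a single floating-point rescale at the end;
# keeping the inner loop in integers is what lets W4A8 kernels
# exploit INT8 GEMM throughput on mainstream hardware.
Y = (qW @ qA).astype(np.float32) * (sW * sA)

print(np.abs(Y - W @ A).max())  # quantization error on the toy example
```

The speedup reported in the abstract comes from executing the integer matmul on hardware integer units; the sketch above only shows the numerics, not the kernel-level packing of 4-bit weights that FastGEMM performs.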

Authors (9)
  1. Qingyuan Li (11 papers)
  2. Ran Meng (3 papers)
  3. Yiduo Li (7 papers)
  4. Bo Zhang (633 papers)
  5. Liang Li (297 papers)
  6. Yifan Lu (38 papers)
  7. Xiangxiang Chu (62 papers)
  8. Yerui Sun (4 papers)
  9. Yuchen Xie (12 papers)
Citations (7)