
DB-LLM: Accurate Dual-Binarization for Efficient LLMs (2402.11960v1)

Published 19 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have significantly advanced the field of natural language processing, but their expensive memory and computation consumption impedes practical deployment. Quantization emerges as one of the most effective methods for improving the computational efficiency of LLMs. However, existing ultra-low-bit quantization always causes severe accuracy drops. In this paper, we empirically reveal the micro and macro characteristics of ultra-low-bit quantization and present a novel dual-binarization method for LLMs, namely DB-LLM. At the micro level, we take both the accuracy advantage of 2-bit width and the efficiency advantage of binarization into account, introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized weights into two independent sets of binaries, FDB ensures the accuracy of representations and introduces flexibility, utilizing the efficient bitwise operations of binarization while retaining the inherent high sparsity of ultra-low-bit quantization. At the macro level, we find a distortion in the predictions of the quantized LLM, specified as deviations related to the ambiguity of samples. We propose the Deviation-Aware Distillation (DAD) method, enabling the model to focus differently on various samples. Comprehensive experiments show that our DB-LLM not only significantly surpasses the current state-of-the-art (SoTA) in ultra-low-bit quantization (e.g., perplexity decreased from 9.64 to 7.23), but also achieves an additional 20% reduction in computational consumption compared to the SoTA method under the same bit-width. Our code will be released soon.
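The dual-binarization idea in the abstract can be illustrated with a minimal sketch: approximate a weight matrix as the sum of two scaled binary matrices, W ≈ α₁B₁ + α₂B₂, so that matrix products reduce to cheap bitwise/sign operations. The greedy residual split and per-row L1-optimal scales below are illustrative assumptions, not the paper's exact FDB formulation.

```python
import torch

def dual_binarize(W: torch.Tensor):
    """Approximate W as alpha1 * B1 + alpha2 * B2 with B1, B2 in {-1, +1}.
    This is a generic residual-binarization sketch, not the paper's FDB."""
    # First binary component: sign of W with an L1-optimal per-row scale.
    alpha1 = W.abs().mean(dim=1, keepdim=True)
    B1 = torch.sign(W)
    # Second binary component fits the remaining residual the same way.
    residual = W - alpha1 * B1
    alpha2 = residual.abs().mean(dim=1, keepdim=True)
    B2 = torch.sign(residual)
    return alpha1, B1, alpha2, B2

# Usage: reconstruct the weights and inspect the approximation error.
W = torch.randn(8, 16)
a1, B1, a2, B2 = dual_binarize(W)
W_hat = a1 * B1 + a2 * B2
print("mean abs error:", (W - W_hat).abs().mean().item())
```

Because each component is purely binary with a floating-point scale, the two-term decomposition keeps roughly 2-bit expressiveness while remaining compatible with binary (XNOR/popcount-style) kernels, which is the efficiency argument the abstract makes for FDB.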

Authors (11)
  1. Hong Chen (230 papers)
  2. Chengtao Lv (7 papers)
  3. Liang Ding (158 papers)
  4. Haotong Qin (60 papers)
  5. Xiabin Zhou (3 papers)
  6. Yifu Ding (28 papers)
  7. Xuebo Liu (54 papers)
  8. Min Zhang (630 papers)
  9. Jinyang Guo (28 papers)
  10. Xianglong Liu (128 papers)
  11. Dacheng Tao (826 papers)
Citations (16)