
Accurate Block Quantization in LLMs with Outliers (2403.20137v1)

Published 29 Mar 2024 in cs.AI, cs.AR, cs.NA, and math.NA

Abstract: The demand for inference on extremely large-scale LLMs has grown enormously in recent months, exposing a colossal shortage of dedicated hardware capable of efficiently and quickly processing the required compute and memory movement. The problem is aggravated by the explosive rise in the lengths of the sequences being processed, which require efficient on-chip storage of a KV-cache whose size is proportional to the sequence length. To make the required compute feasible and fit the involved data into available memory, numerous quantization techniques have been proposed that allow accurate quantization of both weights and activations. One of the main recent breakthroughs in this direction was the introduction of the family of Block Floating Point (BFP) formats, characterized by a block of mantissas with a shared scale factor. These enable memory-, power-, and compute-efficient hardware support of tensor operations and provide very good quantization accuracy. The main issue preventing widespread adoption of block formats is the presence of outliers in weights and activations, since they degrade the accuracy of the other values in the same block. In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach that enables the use of low-precision BFP formats without compromising the resulting model accuracy. We exploit the common channel-wise patterns exhibited by the outliers and rearrange the channels so that their quantization quality is significantly improved. The methodology yields 2x savings in memory footprint without significant degradation of the model's accuracy. Importantly, the channel rearrangement happens at compile time and thus has no impact on inference latency.
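
The two ideas summarized above can be illustrated with a short sketch: (i) BFP quantization, where every value in a block shares one power-of-two scale, so a single outlier coarsens the grid for its whole block, and (ii) a compile-time channel permutation that groups channels of similar dynamic range so outliers land in the same blocks. The following is a minimal NumPy sketch under assumed parameters (block size 16, 4-bit signed mantissas, and sorting channels by their maximum absolute value on calibration data); it is an illustration of the general technique, not the authors' implementation.

```python
import numpy as np

def bfp_quantize(block: np.ndarray, mantissa_bits: int = 4) -> np.ndarray:
    """Quantize a 1-D block with a single shared power-of-two scale.

    All values in the block share one scale derived from the block maximum,
    so one outlier forces every other value onto a coarse grid -- the
    accuracy problem described in the abstract.
    """
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return block.copy()
    # Scale chosen so the block maximum fits the signed mantissa range.
    max_mant = 2 ** (mantissa_bits - 1) - 1
    scale = 2.0 ** np.ceil(np.log2(max_abs / max_mant))
    mantissas = np.clip(np.round(block / scale), -max_mant - 1, max_mant)
    return mantissas * scale

def quantize_in_blocks(x: np.ndarray, block_size: int = 16,
                       mantissa_bits: int = 4) -> np.ndarray:
    """Apply BFP quantization block-wise along the last (channel) axis."""
    out = np.empty_like(x)
    for i in range(0, x.shape[-1], block_size):
        sl = slice(i, i + block_size)
        out[..., sl] = np.apply_along_axis(
            bfp_quantize, -1, x[..., sl], mantissa_bits)
    return out

def outlier_aware_permutation(calib: np.ndarray) -> np.ndarray:
    """Hypothetical compile-time reordering: sort channels by their maximum
    magnitude on calibration data so channels with similar dynamic range
    (outlier channels in particular) end up in the same blocks."""
    return np.argsort(np.max(np.abs(calib), axis=0))

# Toy demo: a KV-cache-like activation matrix with a few outlier channels.
rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64)).astype(np.float32)
acts[:, [3, 17, 42]] *= 50.0                      # channel-wise outliers

perm = outlier_aware_permutation(acts)            # decided once, offline
naive = quantize_in_blocks(acts)
reordered = quantize_in_blocks(acts[:, perm])[:, np.argsort(perm)]

print("MSE, original channel order :", np.mean((acts - naive) ** 2))
print("MSE, outlier-aware reorder  :", np.mean((acts - reordered) ** 2))
```

In this toy example the reordering confines the three outlier channels to a single block instead of spreading them across three, which is the mechanism by which a channel rearrangement of this kind improves quantization quality. Because the permutation is fixed offline, it adds no work at inference time, matching the latency claim in the abstract.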

Authors (2)
  1. Nikita Trukhanov (2 papers)
  2. Ilya Soloveychik (29 papers)
Citations (3)