"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization (2411.02355v1)

Published 4 Nov 2024 in cs.LG and cs.AI
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Abstract: Despite the popularity of LLM quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.

Performance-Accuracy Trade-Offs in LLM Quantization

The ongoing evolution of LLMs has been accompanied by significant computational and operational challenges, particularly at inference time. The paper "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization addresses this challenge by examining model quantization as a means to improve inference efficiency without compromising accuracy. This empirical study evaluates popular quantization formats (FP8, INT8, and INT4) across a broad spectrum of academic and real-world benchmarks using the Llama-3.1 model family.

Central to the paper is the exploration of the accuracy-performance trade-offs inherent in model quantization. Drawing on more than 500,000 individual evaluations, the study yields several key insights:

  1. FP8 Quantization Efficacy: FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, retaining the original model's accuracy while reducing memory and compute requirements at inference time.
  2. INT8 Performance: Properly tuned INT8 weight and activation quantization (W8A8-INT) incurs a surprisingly small accuracy degradation of only 1-3% on average. This is noteworthy because earlier reports suggested significant losses when quantizing activations to INT8.
  3. Competitive INT4 Quantization: INT4 weight-only quantization (W4A16-INT) is competitive with the 8-bit integer format, challenging earlier claims that such low bit-widths entail considerable accuracy sacrifices. (A minimal sketch of the weight quantization underlying both integer formats follows this list.)
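
To make the weight side of these formats concrete, below is a minimal sketch of symmetric round-to-nearest quantization in PyTorch. It illustrates generic per-channel (INT8) and group-wise (INT4) weight quantization only; the paper's actual recipe additionally involves calibration-based tuning and, for W8A8, activation quantization, and all names in the snippet are illustrative.

```python
# Minimal sketch: symmetric round-to-nearest weight quantization.
# This is a generic illustration, not the paper's exact method.
import torch

def quantize_weights(w, n_bits=8, group_size=None):
    """Quantize weights per output channel, or per group if group_size is set."""
    qmax = 2 ** (n_bits - 1) - 1                    # 127 for INT8, 7 for INT4
    if group_size is not None:                      # group-wise scales, as in W4A16-INT
        w = w.reshape(-1, group_size)
    scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # dequantize as q.float() * scale

w = torch.randn(4096, 4096)                               # toy weight matrix
q8, s8 = quantize_weights(w, n_bits=8)                     # weight side of W8A8-INT
q4, s4 = quantize_weights(w, n_bits=4, group_size=128)     # W4A16-INT, group size 128
print("mean abs INT8 reconstruction error:", (q8.float() * s8 - w).abs().mean().item())
```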

Beyond accuracy evaluations, the paper analyzes inference performance using the open-source vLLM framework across a range of GPU architectures. This analysis shows that the preferred format depends on the deployment environment: W4A16 offers the best cost-efficiency for synchronous deployments and for asynchronous deployments on mid-tier GPUs, while W8A8 formats excel in asynchronous "continuous batching" deployments of mid- and large-size models on high-end GPUs.
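
As an illustration of such a deployment, the snippet below sketches offline inference with vLLM's Python API. The model identifier and quantization argument are illustrative; which quantization options are available depends on the vLLM version, and pre-quantized W4A16/W8A8 checkpoints typically declare their scheme in the model config, in which case no explicit argument is needed.

```python
# Minimal sketch: offline generation with vLLM on a (possibly quantized) checkpoint.
# Model name and quantization setting are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in a quantized W4A16/W8A8 variant
    quantization="fp8",                        # request on-the-fly FP8 (W8A8-FP), if supported
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize the trade-offs between W4A16 and W8A8 quantization."], params)
print(outputs[0].outputs[0].text)
```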

By bridging the gap between benchmark accuracy and practical deployment, the paper distills a set of practical guidelines for deploying quantized LLMs. The key takeaway is that, with carefully chosen quantization strategies, significant computational savings can be realized without compromising the quality of model outputs.

Implications and Future Directions

The findings underscore the potential of model quantization for broad applications, especially in democratizing access to LLM capabilities by reducing inference costs. The demonstrated efficacy of these quantization approaches could inspire further advancements in inference acceleration and reduced resource consumption, likely stimulating new research into compression algorithms.

Future work may explore more complex deployment scenarios, emphasizing multi-modal tasks and hardware beyond GPUs. Furthermore, as LLMs continue to grow in size and breadth of application, more nuanced quantization strategies may be needed, ones that adapt to task-specific requirements or switch between precision levels dynamically depending on context.

In summary, this paper provides a comprehensive benchmark of quantization methodologies, offering a detailed reference that practitioners and researchers can leverage to optimize LLM deployments. By doing so, it also lays a foundation for future works aimed at improving quantization techniques and expanding their applicability across various machine learning and artificial intelligence domains.

Authors (5)
  1. Eldar Kurtic (20 papers)
  2. Alexandre Marques (6 papers)
  3. Shubhra Pandit (2 papers)
  4. Mark Kurtz (6 papers)
  5. Dan Alistarh (133 papers)
