LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit (2405.06001v3)
Abstract: Recent advancements in LLMs are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, their substantial computational and memory requirements limit widespread adoption. Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization, but their quantization configurations differ and therefore cannot be fairly compared. In this paper, we present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. LLMC integrates dozens of algorithms, models, and hardware backends, offering high extensibility from integer to floating-point quantization, from LLMs to vision-language models (VLMs), from fixed-bit to mixed precision, and from quantization to sparsification. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats, providing novel insights and detailed analyses for further research, as well as practical guidance for users. Our toolkit is available at https://github.com/ModelTC/LLMc.
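For readers unfamiliar with the underlying mechanism, the sketch below illustrates symmetric per-tensor integer weight quantization, the simplest of the integer formats that such a toolkit benchmarks. This is a minimal, illustrative example, not LLMC's actual API; the function name and parameters (e.g., quantize_weights, n_bits) are hypothetical.

```python
# Minimal sketch of symmetric per-tensor integer weight quantization.
# Illustrative only; not LLMC's API. Names here are hypothetical.
import torch

def quantize_weights(w: torch.Tensor, n_bits: int = 8):
    """Quantize a weight tensor to signed integers and return the
    dequantized ("fake-quantized") tensor plus the mean absolute error."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 127 for INT8
    scale = w.abs().max() / qmax                  # single per-tensor scale
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_dequant = w_int * scale                     # simulated quantization
    error = (w - w_dequant).abs().mean()
    return w_dequant, error

if __name__ == "__main__":
    w = torch.randn(1024, 1024)
    w_q, err = quantize_weights(w, n_bits=4)
    print(f"mean absolute quantization error: {err:.6f}")
```

Lower bit widths (e.g., 4-bit instead of 8-bit) shrink memory and speed up inference but enlarge this rounding error, which is exactly the accuracy-versus-efficiency trade-off the benchmark evaluates across calibration data, algorithms, and data formats.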