Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs (2402.10517v4)
Abstract: Recently, considerable efforts have been directed towards compressing LLMs, which showcase groundbreaking capabilities across diverse applications but entail significant deployment costs due to their large sizes. Meanwhile, much less attention has been given to mitigating the costs of deploying multiple LLMs of varying sizes, despite its practical significance. Thus, this paper introduces *any-precision LLM*, extending the concept of any-precision DNN to LLMs. Addressing the challenges of any-precision LLM, we propose a lightweight method for any-precision quantization of LLMs, leveraging a post-training quantization framework, and develop a specialized software engine for its efficient serving. As a result, our solution significantly reduces the high costs of deploying multiple, different-sized LLMs by overlaying LLMs quantized to varying bit-widths, such as 3, 4, ..., $n$ bits, into a memory footprint comparable to a single $n$-bit LLM. All the supported LLMs with varying bit-widths demonstrate state-of-the-art model quality and inference throughput, making our solution a compelling option for deploying multiple, different-sized LLMs. Our code is open-sourced and available online.
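To make the overlay idea concrete, here is a minimal Python sketch, not the authors' implementation: it assumes a toy uniform quantizer rather than the paper's post-training method, and shows how, once weights are stored as $n$-bit indices, any $k$-bit model ($k \le n$) can be read off by keeping only the $k$ most significant bits, so every supported bit-width shares a single $n$-bit memory footprint. The function names and the `N_BITS` constant are illustrative assumptions.

```python
import numpy as np

# Toy illustration of the "overlay" idea from the abstract (not the authors' code):
# quantize weights once at the highest precision (n bits), then serve any lower
# precision k by keeping only the k most significant bits of each stored index.

N_BITS = 8  # hypothetical maximum precision

def quantize_to_n_bits(weights: np.ndarray, n_bits: int = N_BITS) -> np.ndarray:
    """Toy uniform quantizer: map float weights to unsigned n-bit indices."""
    lo, hi = weights.min(), weights.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    return np.round((weights - lo) / scale).astype(np.uint8)

def extract_k_bit_model(indices_n_bit: np.ndarray, k: int, n_bits: int = N_BITS) -> np.ndarray:
    """Derive a k-bit model by truncating to the k most significant bits.
    All bit-widths therefore share the one n-bit memory footprint."""
    return indices_n_bit >> (n_bits - k)

w = np.random.randn(4, 4).astype(np.float32)
q8 = quantize_to_n_bits(w)          # stored once at n = 8 bits
q3 = extract_k_bit_model(q8, k=3)   # 3-bit view, no extra storage
q4 = extract_k_bit_model(q8, k=4)   # 4-bit view, no extra storage
```

In the paper itself this prefix property is obtained through a post-training, non-uniform quantization scheme and a specialized serving engine; the sketch above only captures why the bit-widths 3, 4, ..., $n$ can coexist in roughly the memory of the single $n$-bit model.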