Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design (2409.01990v2)
Published 3 Sep 2024 in cs.DC and cs.LG
Abstract: As LLMs grow in popularity, so does the need for efficient model design for LLM-based systems. While LLMs produce impressive outputs, contemporary models still suffer from slow inference and large memory footprints. This paper surveys modern efficient inference techniques for LLMs from two perspectives: model design and system design. These methodologies optimize LLM inference from different angles to save computational resources, making LLMs more efficient, affordable, and accessible.
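As a concrete illustration of the model-design side, the sketch below shows symmetric per-tensor INT8 weight quantization, a standard model-compression technique in this literature. It is a minimal, hypothetical example rather than the paper's method; the function names and layer size are assumptions made for illustration.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization of a weight matrix.

    Returns the quantized weights and the scale needed to dequantize.
    """
    # Map the largest weight magnitude to 127 (guard against all-zero weights).
    scale = max(np.abs(w).max(), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 1024)).astype(np.float32)  # a hypothetical layer
    q, scale = quantize_int8(w)
    w_hat = dequantize_int8(q, scale)
    # INT8 storage is 4x smaller than float32, at the cost of a small reconstruction error.
    print("max abs error:", np.abs(w - w_hat).max())
    print("memory ratio :", q.nbytes / w.nbytes)  # ~0.25
```

In practice, methods surveyed by the paper refine this basic idea (e.g., per-channel scales or activation-aware weight selection) to keep accuracy while shrinking memory and speeding up inference.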
Authors: Dong Liu, Zhixin Lai, Yite Wang, Jing Wu, Yanxuan Yu, Zhongwei Wan, Benjamin Lengerich, Ying Nian Wu