
Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

Published 17 Apr 2024 in cs.CL and cs.AI | arXiv:2404.11502v1

Abstract: In real-world settings, LLMs serve as assistants that help users accomplish their tasks and also support the development of advanced applications. For the broad deployment of LLMs, inference efficiency is an essential concern; it has been widely studied in existing work, and numerous optimization algorithms and code libraries have been proposed to improve it. Nonetheless, users still find it challenging to compare the effectiveness of all these methods and to understand the underlying mechanisms. In this work, we perform a detailed coarse-to-fine analysis of the inference performance of various code libraries. To evaluate overall effectiveness, we examine four usage scenarios within two practical applications. We further provide both theoretical and empirical fine-grained analyses of each module in the Transformer architecture. Our experiments yield comprehensive results that are valuable for researchers evaluating code libraries and improving inference strategies.
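As a concrete illustration of the coarse-grained side of such an evaluation, the sketch below times the two phases that dominate autoregressive LLM inference, prompt prefill and per-token decoding, and reports latency and throughput separately. This is a minimal sketch against a toy engine: the `measure_inference`, `toy_prefill`, and `toy_decode_step` names and the two-function engine interface are illustrative assumptions, not the paper's actual benchmark harness or any specific library's API.

```python
import time

def measure_inference(prefill, decode_step, prompt, max_new_tokens):
    """Coarse-grained metrics for a two-phase (prefill + decode) engine.

    prefill(prompt) -> state             # processes the whole prompt at once
    decode_step(state) -> (state, tok)   # generates one token autoregressively
    """
    # Phase 1: prefill latency (time to first token is dominated by this).
    t0 = time.perf_counter()
    state = prefill(prompt)
    prefill_s = time.perf_counter() - t0

    # Phase 2: decode latency, measured over the whole generation loop.
    t1 = time.perf_counter()
    generated = []
    for _ in range(max_new_tokens):
        state, tok = decode_step(state)
        generated.append(tok)
    decode_s = time.perf_counter() - t1

    return {
        "prefill_latency_s": prefill_s,
        "per_token_latency_s": decode_s / max_new_tokens,
        "throughput_tok_per_s": max_new_tokens / decode_s,
        "generated": generated,
    }

# Toy stand-in for a real engine: "state" is just the running token list.
def toy_prefill(prompt):
    return list(prompt)

def toy_decode_step(state):
    nxt = (state[-1] + 1) % 50000  # dummy next-token rule
    return state + [nxt], nxt

if __name__ == "__main__":
    metrics = measure_inference(toy_prefill, toy_decode_step, [101, 102, 103], 16)
    print({k: v for k, v in metrics.items() if k != "generated"})
```

Reporting the two phases separately matters because they stress hardware differently: prefill is compute-bound (one large batched pass over the prompt), while decoding is typically memory-bandwidth-bound (one token at a time against a growing KV cache), so a single aggregate number can hide which phase a given library actually optimizes.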

