
APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding (2401.06761v1)

Published 12 Jan 2024 in cs.CL

Abstract: The massive adoption of LLMs demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges to achieving efficient serving. In this work, we introduce a parallel auto-regressive generation method. By instruct-tuning on general-domain data that contains hierarchical structures, we enable LLMs to independently plan their generation process and perform auto-parallel auto-regressive (APAR) generation, significantly reducing the number of generation steps. APAR alone achieves up to a 2x speed-up, and when combined with speculative decoding the speed-up reaches up to 4x. In addition, APAR reduces key-value cache consumption and attention computation during generation, yielding a 20-70% throughput increase and a 20-35% latency reduction in high-throughput scenarios compared to state-of-the-art serving frameworks.
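
To make the mechanism concrete, below is a minimal Python sketch of the idea described in the abstract, not the authors' implementation: the model's planned response structure is represented by a hypothetical [FORK] control token, and each fork is decoded as an independent auto-regressive branch that reuses the shared prefix, so the number of sequential generation steps shrinks roughly with the number of branches. All names in the sketch (FORK, EOS, decode_branch, apar_generate) are illustrative assumptions rather than the paper's actual interface.

# Minimal, self-contained sketch of APAR-style parallel decoding (illustrative only).
from concurrent.futures import ThreadPoolExecutor

FORK = "[FORK]"  # hypothetical control token marking an independently decodable branch
EOS = "[EOS]"    # hypothetical end-of-branch token

def decode_branch(shared_prefix: str, branch_id: int) -> str:
    # Stand-in for auto-regressive decoding of one branch. A real serving
    # system would run this as a separate decoding sequence that attends to
    # (and reuses) the key-value cache of `shared_prefix`, avoiding redundant
    # attention computation over the prefix.
    return f"{branch_id + 1}. <tokens for branch {branch_id} given '{shared_prefix}'>{EOS}"

def apar_generate(prompt: str, num_branches: int) -> str:
    # 1) Decode sequentially until the model has planned its response
    #    structure; here that plan is faked with `num_branches` fork markers.
    skeleton = f"{prompt} ->" + FORK * num_branches
    shared_prefix = skeleton.replace(FORK, "")

    # 2) Each fork point spawns an auto-regressive thread. Branches do not
    #    attend to each other, so they can be decoded concurrently, which is
    #    what reduces the number of sequential generation steps.
    with ThreadPoolExecutor(max_workers=num_branches) as pool:
        branches = list(pool.map(lambda i: decode_branch(shared_prefix, i),
                                 range(num_branches)))

    # 3) Stitch the branches back together in their original order.
    return "\n".join(b.removesuffix(EOS) for b in branches)

if __name__ == "__main__":
    print(apar_generate("List three fruits.", num_branches=3))

In a real deployment the parallel branches would be scheduled by the serving engine as additional sequences sharing the prefix's key-value cache, which is also why APAR reduces cache consumption relative to regenerating the full context per branch.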
