APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding (2401.06761v1)
Abstract: The massive adoption of LLMs demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, makes efficient serving challenging. In this work, we introduce a parallel auto-regressive generation method. By instruction-tuning on general-domain data that contains hierarchical structures, we enable LLMs to independently plan their generation process and perform auto-parallel auto-regressive (APAR) generation, significantly reducing the number of generation steps. APAR alone can achieve up to a 2x speed-up, and when combined with speculative decoding, the speed-up can reach up to 4x. In addition, APAR reduces key-value cache consumption and attention computation during generation. This leads to a throughput increase of 20-70% and a latency reduction of 20-35% in high-throughput scenarios, compared to state-of-the-art serving frameworks.
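To make the idea concrete, below is a minimal, hypothetical sketch of auto-parallel auto-regressive decoding as described in the abstract: the model is assumed to emit a special fork token at hierarchical boundaries (e.g., before list items), which spawns a sibling decoding branch that advances in the same batched step as the original. The `apar_decode` scheduler, the `step_fn` model interface, and the `<fork>`/`<child>`/`<eos>` token names are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

FORK = "<fork>"  # hypothetical control token: spawn a sibling branch here
EOS = "<eos>"    # hypothetical end-of-branch token


def apar_decode(step_fn, prompt_tokens, max_rounds=100):
    """Sketch of APAR-style decoding with branch forking.

    step_fn(context) -> next token; it stands in for one forward pass of the
    LLM restricted to that branch's visible context. Every live branch
    advances by one token per round, i.e., sibling branches share a batched
    decoding step instead of being generated strictly one after another.
    Returns (finished_branches, number_of_rounds).
    """
    branches = deque([list(prompt_tokens)])  # live branches: shared prefix + own tokens
    finished = []
    rounds = 0
    while branches and rounds < max_rounds:
        rounds += 1
        for _ in range(len(branches)):        # one batched step over all live branches
            ctx = branches.popleft()
            tok = step_fn(ctx)
            if tok == EOS:                    # this branch is done
                finished.append(ctx)
            elif tok == FORK:                 # spawn a sibling; both keep decoding
                branches.append(ctx + ["<child>"])
                ctx.append(FORK)
                branches.append(ctx)
            else:                             # ordinary auto-regressive step
                ctx.append(tok)
                branches.append(ctx)
    return finished, rounds


# Toy scripted "model" for demonstration only: it forks once after a list
# header, then each branch emits its own item and stops.
def make_step_fn():
    script = {
        "Plan:": FORK,
        "Plan: <fork>": "do-A",
        "Plan: <fork> do-A": EOS,
        "Plan: <child>": "do-B",
        "Plan: <child> do-B": EOS,
    }
    return lambda ctx: script[" ".join(ctx)]


if __name__ == "__main__":
    finished, rounds = apar_decode(make_step_fn(), ["Plan:"])
    # Both items finish after 3 rounds; generating them sequentially would
    # take more steps, which is the source of the claimed speed-up.
    print(rounds, finished)
```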