SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference (2307.02628v1)
Abstract: Autoregressive LLMs have made remarkable progress in various natural language generation tasks, but they incur high computation cost and latency because of token-by-token generation. To address this, several early-exit strategies have been proposed that reduce computational cost by not applying the full computation graph to every token, enabling faster text generation. While existing token-level early-exit methods show promising results for online inference, they cannot be readily applied to batch inference and Key-Value (KV) caching: computation can only stop once the last token in a batch has exited, which severely limits the practical benefit of such techniques. In this paper, we propose SkipDecode, a simple and effective token-level early-exit method designed to work seamlessly with batch inference and KV caching. It overcomes prior constraints by setting a single exit point for every token in a batch at each sequence position, and it guarantees a monotonic decrease in exit points across positions, eliminating the need to recompute KV caches for preceding tokens. Rather than terminating computation prematurely as in prior work, our approach bypasses the lower to middle layers and devotes most of the computational budget to the upper layers, allowing later tokens to benefit from the compute spent on earlier tokens. Our experiments show that SkipDecode obtains 2x to 5x inference speedups with negligible regression across a variety of tasks, using OPT models with 1.3 billion and 6.7 billion parameters, while remaining directly compatible with batching and KV caching optimizations.
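The key property the abstract relies on, a single exit point per sequence position that decreases monotonically so that skipped computation never invalidates the KV cache, can be made concrete with a small schedule sketch. The following is a minimal, self-contained sketch assuming a linear decay from a maximum to a minimum number of active layers; the constants and names (`layers_to_run`, `MAX_ACTIVE`, `MIN_ACTIVE`) are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of a SkipDecode-style layer schedule (illustrative, not the
# paper's code). Later tokens run fewer layers, and the skipped layers are
# always the *lower* ones, so the layers a later token runs were also run by
# every earlier token and their KV cache entries already exist.

NUM_LAYERS = 24        # OPT-1.3b uses 24 decoder layers
MAX_ACTIVE = 24        # layers executed for the first generated token
MIN_ACTIVE = 10        # layers executed for the last token (illustrative value)
MAX_NEW_TOKENS = 128


def layers_to_run(pos: int) -> range:
    """Layer indices executed for the token at generation position `pos`.

    The active-layer count decreases monotonically with `pos`, and bypassed
    layers are removed from the bottom of the stack, so attention at any
    executed layer always finds cached KV entries for all previous tokens.
    """
    frac = min(pos / max(MAX_NEW_TOKENS - 1, 1), 1.0)
    active = round(MAX_ACTIVE - frac * (MAX_ACTIVE - MIN_ACTIVE))
    return range(NUM_LAYERS - active, NUM_LAYERS)


def check_kv_cache_is_always_valid() -> None:
    """Assert that every layer a token executes was executed by all prior
    tokens; otherwise attention would hit a missing KV cache entry."""
    cached_for_all = set(layers_to_run(0))
    for pos in range(1, MAX_NEW_TOKENS):
        current = set(layers_to_run(pos))
        assert current <= cached_for_all, f"missing KV entries at position {pos}"
        cached_for_all &= current  # layers cached for every token so far


if __name__ == "__main__":
    check_kv_cache_is_always_valid()
    for pos in (0, 32, 64, 127):
        print(pos, layers_to_run(pos))
```

The assertion encodes why batching and KV caching remain compatible: a later token only ever executes a subset of the layers executed by every earlier token, so the KV entries its attention needs are already cached, and the bypassed lower layers are simply never consulted.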
Authors: Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee