SPEED: Speculative Pipelined Execution for Efficient Decoding (2310.12072v2)
Abstract: Generative LLMs based on the Transformer architecture have recently emerged as dominant foundation models for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token, using predictions based on early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy, and show how speculation allows for training deeper decoders with parameter sharing at minimal runtime overhead.
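The sketch below is a minimal, self-contained illustration of the idea described in the abstract, not the paper's implementation: a toy parameter-shared decoder in which each in-flight token sits at a different layer depth, a draft of the next token is predicted from an early-layer hidden state, and all in-flight tokens advance one layer per step through a single batched multiply so the shared weights are read once per step rather than once per token. All dimensions, the names `speed_decode` and `EARLY_EXIT_LAYER`, and the single-matmul stand-in for a decoder layer (no attention or KV cache) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL = 50, 16          # toy vocabulary / hidden size (assumed)
N_LAYERS = 8                     # total decoder layers, all sharing one weight matrix
EARLY_EXIT_LAYER = 2             # layer whose hidden state is used to draft the next token

W_shared = rng.normal(scale=0.1, size=(D_MODEL, D_MODEL))   # weights shared across layers
W_embed = rng.normal(scale=0.1, size=(VOCAB, D_MODEL))      # token embeddings
W_lm_head = rng.normal(scale=0.1, size=(D_MODEL, VOCAB))    # output projection


def shared_layer(h):
    """One parameter-shared decoder 'layer' (stand-in: matmul + nonlinearity)."""
    return np.tanh(h @ W_shared)


def predict_token(h):
    """Greedy next-token prediction from a hidden state."""
    return int(np.argmax(h @ W_lm_head))


def speed_decode(first_token, num_new_tokens):
    """Speculative pipelined decoding through a shared-weight toy decoder.

    Every in-flight token sits at a different layer depth; each step advances
    all of them with ONE batched matmul, amortizing the shared-weight load.
    """
    output = []
    # Each pipeline entry: {"tok": token id, "h": hidden state, "depth": layers completed}
    pipeline = [{"tok": first_token, "h": W_embed[first_token], "depth": 0}]

    while len(output) < num_new_tokens:
        # Advance every in-flight token by one layer in a single batched call.
        H = shared_layer(np.stack([s["h"] for s in pipeline]))
        for state, h in zip(pipeline, H):
            state["h"], state["depth"] = h, state["depth"] + 1

        # Speculate: once the newest token clears the early-exit layer, draft its
        # successor from the early hidden state and inject it at depth 0.
        newest = pipeline[-1]
        if newest["depth"] == EARLY_EXIT_LAYER:
            draft = predict_token(newest["h"])
            pipeline.append({"tok": draft, "h": W_embed[draft], "depth": 0})

        # Retire the oldest token once it has run through all layers.
        if pipeline[0]["depth"] == N_LAYERS:
            final_next = predict_token(pipeline.pop(0)["h"])
            output.append(final_next)
            # Verify the draft that was speculated for this position.
            if not pipeline or pipeline[0]["tok"] != final_next:
                # Misprediction (or empty pipeline): discard speculative work
                # and restart from the verified token.
                pipeline = [{"tok": final_next, "h": W_embed[final_next], "depth": 0}]
            # On a match, the draft (and everything drafted after it) is kept.

    return output


if __name__ == "__main__":
    print(speed_decode(first_token=3, num_new_tokens=10))
```

Because each drafted token is checked against the full-depth prediction before it is emitted, a misprediction only costs the discarded speculative work; the sketch's output is identical to what purely sequential greedy decoding of the same toy model would produce.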
- Coleman Hooper
- Sehoon Kim
- Hiva Mohammadzadeh
- Hasan Genc
- Kurt Keutzer
- Amir Gholami
- Sophia Shao