SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens (2403.18647v2)
Abstract: We propose an acceleration scheme for LLMs through Speculative Decoding with Semantic Adaptive Tokens (SDSAT). The primary objective of this design is to enhance the model's ability to generate draft tokens accurately without compromising its accuracy. The core strategies are: 1) fine-tuning the model with semantic adaptive tokens that have flexible decoding capabilities, enabling it to generate high-quality draft tokens without any change to its structure; 2) using a training method that leaves the standard tokens unaffected, so the model acquires parallel decoding ability on top of its original framework with minimal training overhead; and 3) a "two-step-draft-then-verify" generation strategy that supports both greedy search and nucleus sampling. Experiments on the CodeLlama-13B and 7B models yield speedups of over 3.5X and 3.0X, respectively. Code is available at https://github.com/hasuoshenyun/SDSAT.
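The "two-step-draft-then-verify" strategy follows the general speculative-decoding pattern: cheaply draft several tokens, then let the full model verify them and keep the longest correct prefix. The sketch below is a minimal, generic illustration of that pattern under greedy verification; the interface (`next_token`, `draft_tokens`), the draft length `k`, and the toy demo are assumptions for illustration only, not the paper's implementation. In SDSAT the same model plays both drafter and verifier by decoding with semantic adaptive tokens, which this sketch does not model.

```python
# Minimal sketch of a generic draft-then-verify speculative decoding loop
# with greedy verification. All names and the toy models are illustrative
# assumptions, not SDSAT's actual code.
from typing import Callable, List


def speculative_generate(
    next_token: Callable[[List[int]], int],               # target model, one greedy step
    draft_tokens: Callable[[List[int], int], List[int]],  # drafter: k candidate tokens
    prompt: List[int],
    max_new_tokens: int = 64,
    k: int = 4,                                           # draft length per round
    eos_id: int = -1,
) -> List[int]:
    """Generic draft-then-verify loop with greedy verification."""
    seq = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        drafts = draft_tokens(seq, k)      # (1) draft k tokens cheaply
        for d in drafts:                   # (2) verify left to right
            # A real implementation scores all k drafts in a single batched
            # forward pass of the target model; this loop is for readability.
            t = next_token(seq)
            seq.append(t)                  # greedy rule: always keep the target's token
            produced += 1
            if t != d or t == eos_id or produced >= max_new_tokens:
                break                      # first mismatch (or stop) ends the round
        else:
            # every draft was accepted: the target's next step is a free bonus token
            seq.append(next_token(seq))
            produced += 1
        if seq[-1] == eos_id:
            break
    return seq


if __name__ == "__main__":
    # Toy demo: the target counts upward mod 97; the drafter guesses the same
    # rule, so nearly all drafts are accepted and each round emits ~k+1 tokens.
    target = lambda s: (s[-1] + 1) % 97
    drafter = lambda s, n: [(s[-1] + i + 1) % 97 for i in range(n)]
    print(speculative_generate(target, drafter, [0], max_new_tokens=12, k=4))
```

Because every appended token is the target model's own greedy choice, the output is token-for-token identical to plain greedy decoding; the speedup comes from validating several drafts per target forward pass. The paper additionally describes a nucleus-sampling variant, which requires a probabilistic acceptance rule not shown here.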
Authors: Chengbo Liu, Yong Zhu