S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models (2407.01955v1)
Abstract: Deployment of autoregressive LLMs is costly, and as these models grow in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate token generation and reduce deployment costs. Speculative decoding (SD) is among the most promising approaches for speeding up LLM decoding: an auxiliary, smaller draft model proposes candidate tokens, which the target model then verifies in parallel. In SD, a single draft model is usually trained to serve one specific target model; in practice, however, LLM deployments are diverse, and we may need to handle many target models, or more than one target model simultaneously. In this scenario, it is not clear which draft model should be paired with which target model, and searching among different draft models or training customized draft models can further increase deployment costs. In this paper, we first introduce a novel multi-target scenario for the deployment of draft models for faster inference. We then present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings. We evaluate our method on Spec-Bench in different settings, including base models such as Vicuna 7B, Vicuna 13B, and LLaMA Chat 70B. Our results show that our draft models perform better than the baselines when serving multiple target models at the same time.
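
The abstract relies on the draft-then-verify loop at the heart of speculative decoding. The sketch below illustrates that loop for the simple greedy case only; the `draft_next`/`target_next` callables, the acceptance rule, and all parameter names are illustrative assumptions, not the paper's S2D mechanism (which, per the title and abstract, draws its drafts from a single sorted, nested model so that one draft checkpoint can serve several targets).

```python
# Minimal greedy speculative-decoding sketch (in the spirit of Leviathan et al., 2023;
# Chen et al., 2023). Purely illustrative: model interfaces and constants are assumptions.
from typing import Callable, List

Token = int


def speculative_decode(
    prefix: List[Token],
    draft_next: Callable[[List[Token]], Token],   # cheap draft model (assumed interface)
    target_next: Callable[[List[Token]], Token],  # expensive target model (assumed interface)
    gamma: int = 4,            # number of tokens drafted per step
    max_new_tokens: int = 32,
    eos: Token = -1,
) -> List[Token]:
    """Draft `gamma` tokens with the small model, then verify them against the
    target, keeping the longest agreeing prefix plus one corrected token."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1) Draft: the cheap model proposes `gamma` tokens autoregressively.
        ctx = list(out)
        draft = []
        for _ in range(gamma):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) Verify: compare the target's greedy choice at each drafted position.
        #    (A real implementation scores all positions in one batched forward
        #    pass; the sequential loop here is for clarity only.)
        for i in range(gamma):
            target_tok = target_next(out)
            if target_tok != draft[i]:
                out.append(target_tok)  # first mismatch: keep target's token, discard the rest
                break
            out.append(draft[i])        # match: accept the drafted token at no extra target cost
        else:
            out.append(target_next(out))  # all drafts accepted: target adds one bonus token

        if out[-1] == eos:
            break
    return out


if __name__ == "__main__":
    # Toy usage with stand-in "models" (plain callables over token lists).
    def toy_draft(ctx: List[Token]) -> Token:
        return len(ctx) % 7

    def toy_target(ctx: List[Token]) -> Token:
        return len(ctx) % 7 if len(ctx) % 5 else 3  # disagrees with the draft occasionally

    print(speculative_decode([1, 2, 3], toy_draft, toy_target, gamma=4, max_new_tokens=10))
```

In the paper's multi-target setting, the single `draft_next` above would be replaced by sub-models of one sorted (nested) draft network, so the same drafting checkpoint can be paired with several target models instead of training a separate draft model per target.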
- Parsa Kavehzadeh (7 papers)
- Mohammadreza Pourreza (12 papers)
- Mojtaba Valipour (8 papers)
- Tianshu Zhu (1 paper)
- Haoli Bai (24 papers)
- Ali Ghodsi (73 papers)
- Boxing Chen (67 papers)
- Mehdi Rezagholizadeh (78 papers)