Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens (2402.15758v2)

Published 24 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated remarkable capabilities across various tasks. However, their widespread application is hindered by the resource-intensive decoding process. To address this challenge, current approaches incorporate additional decoding heads that predict multiple subsequent tokens in parallel, thereby accelerating inference. Nevertheless, the accuracy of these decoding heads falls short of the auto-regressive decoding approach. In light of these limitations, we propose Chimera, a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. To ensure both accuracy and efficiency, we present two strategies within the lightweight draft model: first, capturing short-range dependencies at the bottom layer; second, leveraging the readily available representations from the original LLM. Through empirical evaluation on the Vicuna and LLaMA-2 series, Chimera achieves an average latency speedup of 2.7x over the vanilla auto-regressive decoding approach, highlighting the potential of our framework to significantly improve the efficiency of LLMs during decoding.
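The mechanism underlying Chimera is speculative sampling: a cheap draft model proposes several tokens, and the full LLM verifies them all in a single forward pass, accepting or rejecting each proposal so that the output distribution exactly matches auto-regressive decoding, which is what makes the method lossless. The sketch below illustrates this generic draft-then-verify loop, not the paper's actual Chimera architecture; it assumes batch size 1 and a HuggingFace-style `model(ids).logits` interface, and `draft_model`, `target_model`, and the draft length `gamma` are illustrative placeholders.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, gamma=4):
    """One draft-then-verify step of generic speculative sampling.

    Hypothetical sketch: assumes batch size 1 and HuggingFace-style
    models whose forward pass returns an object with `.logits`.
    """
    # 1. Draft phase: the cheap model proposes `gamma` tokens
    #    auto-regressively, keeping its full distribution at each step.
    draft_ids, draft_dists = input_ids, []
    for _ in range(gamma):
        p = torch.softmax(draft_model(draft_ids).logits[:, -1], dim=-1)
        draft_dists.append(p)
        tok = torch.multinomial(p, 1)
        draft_ids = torch.cat([draft_ids, tok], dim=-1)

    # 2. Verify phase: a single target forward pass scores all proposals.
    tgt_logits = target_model(draft_ids).logits
    n, out = input_ids.shape[1], input_ids
    for i in range(gamma):
        q = torch.softmax(tgt_logits[:, n - 1 + i], dim=-1)   # target dist
        p = draft_dists[i]                                    # draft dist
        tok = draft_ids[:, n + i : n + i + 1]
        # Accept with probability min(1, q(tok)/p(tok)); this rule keeps
        # the output distribution identical to the target model's.
        if torch.rand(1) < (q.gather(-1, tok) / p.gather(-1, tok)).clamp(max=1.0):
            out = torch.cat([out, tok], dim=-1)
        else:
            # Rejection: resample from the residual (q - p)+ and stop.
            residual = (q - p).clamp(min=0)
            residual = residual / residual.sum(-1, keepdim=True)
            out = torch.cat([out, torch.multinomial(residual, 1)], dim=-1)
            return out
    # All gamma proposals accepted: emit one bonus token from the target.
    bonus = torch.multinomial(torch.softmax(tgt_logits[:, -1], dim=-1), 1)
    return torch.cat([out, bonus], dim=-1)
```

In Chimera itself, per the abstract, the draft model is not a separate network of this kind: it is a lightweight module that reuses readily available representations from the original LLM and a bottom layer specialized for short-range dependencies, which is how the paper keeps drafting cheap while improving the accuracy of the proposed tokens.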

Authors (7)
  1. Ziqian Zeng (32 papers)
  2. Jiahong Yu (4 papers)
  3. Qianshi Pang (2 papers)
  4. Zihao Wang (216 papers)
  5. Huiping Zhuang (43 papers)
  6. Xiaofeng Zou (2 papers)
  7. HongEn Shao (4 papers)
Citations (3)