
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding (2402.02082v1)

Published 3 Feb 2024 in cs.CL

Abstract: Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding to further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE is a proposal expansion method that uses the draft model's confidence scores to help select additional candidate tokens for verification. Extensive experiments on different benchmarks demonstrate that our proposed GliDe draft model significantly reduces the expected decoding latency. Additional evaluation using walltime reveals that GliDe can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CaPE. We will release our code, data, and the trained draft models.
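The paper itself supplies no code here, but the CaPE idea described in the abstract — using the draft model's confidence scores to pick extra candidate tokens for the target LLM to verify — can be sketched as follows. This is an illustrative sketch only; the `threshold` and `budget` knobs are hypothetical parameters, not values from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expand_candidates(draft_logits, threshold=0.1, budget=4):
    """CaPE-style proposal expansion (illustrative): in addition to the
    draft model's top-1 token, propose further candidates whose
    confidence exceeds `threshold`, up to `budget` tokens total.
    The extra candidates give the target LLM more chances to accept
    a token in a single parallel verification pass."""
    probs = softmax(draft_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    candidates = [ranked[0]]  # always keep the top-1 proposal
    for tok in ranked[1:]:
        if len(candidates) >= budget or probs[tok] < threshold:
            break
        candidates.append(tok)
    return candidates

# A confident draft distribution yields few candidates; a flatter one
# yields more tokens for the target model to verify in parallel.
print(expand_candidates([5.0, 1.0, 0.5, 0.1]))   # → [0]
print(expand_candidates([1.2, 1.1, 1.0, 0.9]))   # → [0, 1, 2, 3]
```

In a full speculative-decoding loop, the expanded candidate set would be scored by the frozen target LLM in one forward pass, and the longest accepted prefix would be emitted before drafting resumes.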

Authors (11)
  1. Cunxiao Du (16 papers)
  2. Jing Jiang (192 papers)
  3. Xu Yuanchen (1 paper)
  4. Jiawei Wu (43 papers)
  5. Sicheng Yu (13 papers)
  6. Yongqi Li (40 papers)
  7. Shenggui Li (13 papers)
  8. Kai Xu (312 papers)
  9. Liqiang Nie (191 papers)
  10. Zhaopeng Tu (135 papers)
  11. Yang You (173 papers)
Citations (13)
