Speculative Sampling via Exponential Races (2504.15475v1)
Published 21 Apr 2025 in cs.CL, cs.IT, and math.IT
Abstract: Speculative decoding accelerates LLM inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims to simulate a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens $k$ generated by the draft model for large $k$, which serves as an upper bound for all $k$. We also propose a novel speculative decoding method via exponential races (ERSD) that matches state-of-the-art performance.
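
The abstract does not spell out the ERSD algorithm, but the device it names rests on a standard sampling identity: drawing i.i.d. exponentials $E_i \sim \mathrm{Exp}(1)$ and taking $\arg\min_i E_i / p_i$ yields an exact sample from the categorical distribution $p$ (equivalent to the Gumbel-max trick). Sharing the same exponential noise between the draft and target distributions then couples the two samples so they agree often when the models are close, which is the mechanism speculative decoding exploits. Below is a minimal sketch of this idea, assuming a toy vocabulary and hypothetical helper names (`exponential_race_sample`, `coupled_draft_and_target`); the shared-noise coupling shown here is one simple coupling, not necessarily the paper's exact acceptance rule.

```python
import numpy as np

def exponential_race_sample(p, E):
    """Sample from categorical distribution p via an exponential race:
    token i finishes at time E[i] / p[i]; the winner of the race is
    distributed exactly according to p (Gumbel-max in disguise)."""
    return int(np.argmin(E / np.maximum(p, 1e-12)))

def coupled_draft_and_target(p_draft, p_target, rng):
    """Illustrative coupling of draft and target samples through SHARED
    exponential noise. Both races use the same E, so the two samples
    coincide frequently when the distributions are close, letting a
    speculative decoder accept the draft token in those cases."""
    E = rng.exponential(size=len(p_draft))  # shared race clocks
    x_draft = exponential_race_sample(p_draft, E)
    x_target = exponential_race_sample(p_target, E)
    return x_draft, x_target, x_draft == x_target

# Toy usage: similar draft (q) and target (p) distributions over 5 tokens.
rng = np.random.default_rng(0)
p_q = np.array([0.50, 0.20, 0.15, 0.10, 0.05])  # draft model q
p_p = np.array([0.45, 0.25, 0.15, 0.10, 0.05])  # target model p
trials = 10_000
agree = sum(coupled_draft_and_target(p_q, p_p, rng)[2] for _ in range(trials))
print(f"empirical agreement rate: {agree / trials:.3f}")
```

Because each race is marginally an exact sampler, the coupled scheme never biases the target distribution; the only design question is how often the shared noise makes the two argmins agree, which is what the paper's information-theoretic analysis relates to the draft length $k$.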