
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment (2501.19309v1)

Published 31 Jan 2025 in cs.LG and cs.CL

Abstract: The performance of LLMs is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive generation, leveraging a fast draft model to propose candidate tokens, which are then verified in parallel based on their likelihood under the target model. While this approach guarantees to reproduce the target output, it incurs a substantial penalty: many high-quality draft tokens are rejected, even when they represent objectively valid continuations. Indeed, we show that even powerful draft models such as GPT-4o, as well as human text cannot achieve high acceptance rates under the standard verification scheme. This severely limits the speedup potential of current speculative decoding methods, as an early rejection becomes overwhelmingly likely when solely relying on alignment of draft and target. We thus ask the following question: Can we adapt verification to recognize correct, but non-aligned replies? To this end, we draw inspiration from the LLM-as-a-judge framework, which demonstrated that LLMs are able to rate answers in a versatile way. We carefully design a dataset to elicit the same capability in the target model by training a compact module on top of the embeddings to produce "judgements" of the current continuation. We showcase our strategy on the Llama-3.1 family, where our 8b/405B-Judge achieves a speedup of 9x over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to 141 tokens/s for 8B/70B-Judge and 129 tokens/s for 8B/405B on 2 and 8 H100s respectively.

Authors (9)
  1. Gregor Bachmann (21 papers)
  2. Sotiris Anagnostidis (21 papers)
  3. Albert Pumarola (31 papers)
  4. Markos Georgopoulos (19 papers)
  5. Artsiom Sanakoyeu (25 papers)
  6. Yuming Du (16 papers)
  7. Edgar Schönfeld (21 papers)
  8. Ali Thabet (37 papers)
  9. Jonas Kohler (34 papers)

Summary

An Examination of Judge Decoding: Performance and Efficiency in Speculative Sampling

The paper "Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment" provides an insightful analysis and enhancement of speculative decoding (SD) for LLMs. The authors challenge the prevailing limitations of SD, particularly its dependency on alignment between draft and target model outputs, and propose a novel adaptive verification method using the concept of Judge Decoding. This work represents a significant contribution to the efficiency of autoregressive generation tasks by integrating a more sophisticated understanding of token appropriateness beyond mere probabilistic precision.

Overview and Methodological Innovation

At the heart of speculative decoding is a draft model that proposes several candidate tokens, which the larger, more computationally intensive target model then checks in a single parallel forward pass. Conventional SD accepts a drafted token only according to its likelihood under the target model, which guarantees that the output distribution matches the target's. This strict alignment imposes a significant constraint: high-quality draft tokens are frequently rejected despite being contextually and semantically valid continuations, which sharply limits the attainable speedup. The paper addresses this inefficiency by questioning the strict alignment protocol and advocating the more contextually sensitive verification scheme it calls judge decoding.
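To make the rejection penalty concrete, below is a minimal sketch of the standard likelihood-based verification rule (following Leviathan et al., 2023) that the paper critiques; the function name and NumPy implementation are illustrative rather than taken from the paper.

```python
import numpy as np

def verify_draft_tokens(p_target, p_draft, draft_tokens, rng=np.random.default_rng(0)):
    """Standard speculative-sampling verification: accept each drafted token
    with probability min(1, p_target / p_draft); on the first rejection,
    resample from the residual distribution and discard the rest.

    p_target, p_draft: (num_draft_tokens, vocab_size) probability arrays from
    the target and draft models; draft_tokens: the proposed token ids.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            emitted.append(int(tok))  # draft token kept verbatim
        else:
            # Rejection: sample a replacement from max(0, p_target - p_draft),
            # renormalized, and stop; later draft tokens are thrown away.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            emitted.append(int(rng.choice(len(residual), p=residual)))
            break
    return emitted
```

Because this rule looks only at likelihood ratios, a drafted token that is a perfectly valid continuation can still be rejected whenever the two models disagree, which is precisely the failure mode the paper quantifies for strong drafters such as GPT-4o and even human-written text.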

By introducing a compact judgement module on top of the target model's embeddings, the paper exploits the rich representational capacity of those embeddings to assess token correctness. The method borrows from the LLM-as-a-judge framework, previously noted for its ability to evaluate responses with high correlation to human judgment. Applying this idea at the token level lets the verification step recognize and accept contextually appropriate continuations, substantially raising the acceptance rate of candidate tokens without compromising output quality.
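The summary does not pin down the module's exact architecture, so the sketch below assumes a small linear head on the target's last-layer token embeddings and an acceptance rule that combines greedy token matching with the judge's verdict; the class name, the threshold tau, and that combination are assumptions for illustration.

```python
import torch
import torch.nn as nn

class JudgeHead(nn.Module):
    """Compact judgement module: scores each drafted token from the target
    model's last-layer embedding at that position (illustrative sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_draft_tokens, hidden_size), produced by the same
        # parallel target forward pass that standard SD already performs.
        return torch.sigmoid(self.score(hidden_states)).squeeze(-1)

@torch.no_grad()
def count_accepted(hidden_states, matches_target_top1, judge, tau=0.5):
    """Accept a drafted token if it equals the target's greedy choice or the
    judge rates it above tau; stop at the first token that fails both tests."""
    scores = judge(hidden_states)
    n_accepted = 0
    for match, score in zip(matches_target_top1, scores):
        if match or score.item() > tau:
            n_accepted += 1
        else:
            break
    return n_accepted
```

Because the head adds only a negligible number of parameters on top of embeddings the target model already computes, it can be trained quickly on a curated dataset of accepted and rejected continuations without touching the target model's weights, which is the role of the carefully designed dataset described in the abstract.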

Empirical Insights and Numerical Outcomes

The authors present strong empirical results using the Llama-3.1 family as draft and target models, with the 8B/405B-Judge pairing achieving a 9× speedup over standard Llama-405B inference while maintaining its quality across a broad range of benchmarks. The gains persist even in optimized inference frameworks, where the method reaches up to 141 tokens/s for the 8B/70B-Judge on 2 H100s and 129 tokens/s for the 8B/405B-Judge on 8 H100s, noteworthy figures given the computational intensity of large-model inference.
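To see why the acceptance rate, rather than raw drafting speed, dominates the attainable gain, one can plug numbers into the standard speculative-decoding speedup analysis; the acceptance probabilities, draft length, and cost ratio below are illustrative assumptions, not measurements from the paper.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification step when each of the
    gamma drafted tokens is accepted independently with probability alpha
    (standard speculative-decoding analysis, Leviathan et al., 2023)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Approximate wall-clock speedup over plain autoregressive decoding,
    where cost_ratio is the draft model's per-token cost relative to the
    target model's."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)

# Illustrative numbers: with gamma = 8 drafted tokens and a draft model that
# costs 2% of the target per token, raising per-token acceptance from 0.70
# to 0.95 more than doubles the achievable speedup.
print(round(speedup(0.70, 8, 0.02), 1))  # ~2.8
print(round(speedup(0.95, 8, 0.02), 1))  # ~6.4
```

This is why a verification scheme that accepts valid but non-aligned tokens can translate directly into the large end-to-end gains reported above.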

The evaluation extends beyond raw speedups, covering benchmarks traditionally used in SD studies, including GSM8K, HumanEval, and MT-Bench. Notably, the paper also addresses an established gap in how SD has often been assessed with un-optimized, general-purpose frameworks, reinforcing the operational and practical viability of judge decoding in real-world applications.

Implications and Future Directions

The implications of this work are manifold. Practically, more efficient SD enables faster deployment of LLMs in production environments, a critical need given the rising demand for low-latency responses in AI-driven applications. Theoretically, the research opens avenues to further refine token-level verification, potentially shifting how alignment and correctness are perceived and operationalized in LLMs. It underscores the possibility of training compact, task-agnostic verification layers that enrich LLM interpretability and responsiveness without full retraining.

Despite its contributions, the paper acknowledges constraints, such as the loss of the mathematical guarantee that the output distribution exactly matches the target model's, and the challenge of maintaining performance on novel tasks without tailored training data for the judge module. Future work could explore enhancing the adaptability of the verification module to diverse semantic nuances and expanding its scope across more varied and dynamically evolving use cases.

In summary, this paper provides a nuanced perspective on speculative sampling's capabilities, elegantly merging theoretical insight with pragmatic enhancements to redefine the boundaries of inference efficiency in large-scale AI models. Through judge decoding, the work sets a new benchmark for how LLMs can be leveraged with greater intelligence and agility, reflecting a deep understanding of both model architecture potential and operational requirements.
