Lossless Acceleration of LLMs with Hierarchical Drafting
The paper "Lossless Acceleration of LLMs with Hierarchical Drafting based on Temporal Locality in Speculative Decoding" addresses a critical concern in deploying LLMs for real-time applications: the need to accelerate inference processes without sacrificing accuracy. The researchers propose Hierarchy Drafting (HD), a novel speculative decoding strategy that organizes token drafting based on temporal locality, designed to optimize inference speed and consistency across diverse tasks.
Overview of Hierarchy Drafting (HD)
The primary innovation in this paper is Hierarchy Drafting, a method that, unlike many existing approaches, requires no fine-tuning or retraining. Instead, HD improves token drafting by organizing potential token sources into structured databases. The hierarchy is built around temporal locality: the tendency of certain tokens or sequences to recur within a narrow temporal or contextual window.
HD separates tokens into three distinct databases:
- Context-dependent Database (CD): Tokens with high relevance to the specific context or text generation process, exhibiting very high temporal locality.
- Model-dependent Database (MD): Phrases frequently generated by the LLM itself, reflecting moderate temporal locality across different processes.
- Statistics-dependent Database (SD): Universally frequent phrases derived from large text corpora, having the least locality across processes.
These databases are queried in order, from highest (CD) to lowest (SD) temporal locality, so that draft tokens come from the source most likely to match the model's actual output; a minimal sketch of this lookup follows.
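The lookup can be pictured as a simple fallback chain. The following is a minimal sketch rather than the paper's implementation: it assumes each database is a mapping from recent n-gram prefixes to candidate continuations, and names such as `HierarchyDrafter` and `draft` are illustrative.

```python
from collections import OrderedDict

class HierarchyDrafter:
    """Illustrative fallback chain over draft databases, ordered from
    highest (CD) to lowest (SD) temporal locality."""

    def __init__(self, context_db, model_db, stats_db):
        # Each database maps an n-gram prefix (tuple of token ids)
        # to a list of candidate continuation tokens.
        self.databases = OrderedDict(
            [("CD", context_db), ("MD", model_db), ("SD", stats_db)]
        )

    def draft(self, prefix, max_draft_len=8):
        """Return a draft continuation for `prefix`, taken from the
        first (most temporally local) database that contains it."""
        for _name, db in self.databases.items():
            continuation = db.get(tuple(prefix))
            if continuation:
                return continuation[:max_draft_len]
        return []  # no hit: fall back to plain autoregressive decoding
```

In use, the CD database would be populated from the current prompt and the generation so far, the MD database from the model's past outputs, and the SD database from corpus statistics, so the chain naturally prefers the freshest evidence.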
Technical Contributions
- Integration of Multiple Sources: By combining diverse token sources into a single hierarchical drafting process, HD consistently outperforms traditional speculative decoding methods that rely on a single token source. This integration mitigates the inconsistent performance that often results from task-specific token distributions.
- Efficient Use of Temporal Locality: By organizing the drafting process around temporal locality, HD optimizes both drafting accuracy and latency. Tokens with higher locality are prioritized, resulting in a higher acceptance rate and less computational overhead.
- Plug-and-Play Framework: HD's modular design allows new database sources to be added easily. As new methods or datasets become available, they can be incorporated into the existing hierarchy without significant reconfiguration (see the sketch after this list).
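Because the hierarchy is just an ordered collection of draft sources, adding one amounts to slotting a new mapping into the chain at the position matching its expected locality. Continuing the sketch above (the `DomainDB` source and its contents are hypothetical, not from the paper):

```python
from collections import OrderedDict

# Hypothetical extension: insert a domain-specific phrase database
# between MD and SD, where its temporal locality would plausibly sit.
drafter = HierarchyDrafter(context_db={}, model_db={}, stats_db={})

domain_db = {("merge", "sort"): ["is", "an", "efficient"]}
items = list(drafter.databases.items())
items.insert(2, ("DomainDB", domain_db))  # after CD and MD, before SD
drafter.databases = OrderedDict(items)
```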
Results and Implications
The experiments demonstrate that HD achieves significant inference speedups across several models, including Llama-2 and Vicuna, and maintains robust performance across various tasks such as translation, conversation, and question answering. These results underscore the practical utility of HD in environments where consistency and acceleration are crucial.
The acceptance ratio of drafted tokens in HD was found to surpass 70% in most scenarios, translating into faster token generation without any additional model training. This positions HD as a viable alternative to more resource-intensive methods, especially in settings where fine-tuning large models is impractical. The sketch below illustrates why a high acceptance ratio maps directly onto speedup.
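To see why acceptance rate drives speedup, recall the standard lossless verification step in speculative decoding: the target model scores the whole draft in a single forward pass and keeps only the prefix it would have generated itself. The sketch below assumes greedy decoding and a HuggingFace-style model whose forward call returns `.logits`; it is a generic illustration of verification, not the paper's code.

```python
import torch

def verify_draft(target_model, input_ids, draft_ids):
    """Accept the longest prefix of `draft_ids` matching what the
    target model would have produced greedily; the whole draft is
    checked in one forward pass instead of one pass per token."""
    seq = torch.cat([input_ids, draft_ids], dim=-1)
    logits = target_model(seq).logits  # shape (1, len(seq), vocab)
    # The prediction at position i scores the token at position i + 1,
    # so slice out the logits that correspond to the draft positions.
    preds = logits[0, input_ids.shape[-1] - 1 : -1].argmax(dim=-1)
    matches = (preds == draft_ids[0]).long().cumprod(dim=0)
    n_accepted = int(matches.sum())
    return draft_ids[:, :n_accepted]
```

With an acceptance ratio above 70%, most draft positions survive this check, so each target-model pass emits several tokens instead of one; since rejected tokens are simply regenerated, the output is identical to plain decoding.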
Future Directions
This paper's findings open several avenues for future research. Extending HD with more sophisticated temporal locality metrics, or integrating it with hardware-level optimizations, could yield further speedups. Additionally, applying HD to other LLM-based tasks, such as real-time dialogue systems or adaptive content generation, could provide further evidence of its versatility and efficiency.
In conclusion, HD offers a compelling strategy for enhancing LLM inference performance by leveraging token temporal locality, providing an efficient solution for real-time applications that demand both speed and accuracy.