H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (2306.14048v3)

Published 24 Jun 2023 in cs.LG

Abstract: LLMs, despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29$\times$, 29$\times$, and 3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9$\times$. The code is available at https://github.com/FMInference/H2O.

This work tackles a practical challenge in using LLMs to generate long texts. During generation, the model must save intermediate “key” and “value” representations for every token it has seen (together known as the KV cache) so they can be reused when predicting the next token. As a conversation or story grows longer, the memory needed to store these KV pairs increases linearly, which puts heavy demands on GPU memory and slows inference.

Below is an explanation of the main ideas behind the approach:

Understanding the KV Cache Problem

  • During text generation, LLMs compute an “attention” score for every previous token when predicting a new one. This means if you have, say, a thousand tokens already, the system must look at all thousand of them again at each step.
  • The intermediate results for each token, the key–value embeddings, are stored in a cache. The memory required grows linearly with the number of tokens, which makes inference on long texts expensive and sometimes impractical; a back-of-the-envelope size estimate is sketched below.
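
To make the scale concrete, here is a rough estimate of KV-cache memory as a function of sequence length and batch size. It is a minimal sketch, not code from the paper; the model shapes (48 layers, 56 heads of dimension 128, fp16) are illustrative assumptions that roughly match OPT-30B.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_heads, head_dim, bytes_per_elem=2):
    """Each cached token stores one key and one value vector per layer (hence the factor 2)."""
    return 2 * batch_size * seq_len * num_layers * num_heads * head_dim * bytes_per_elem

# Illustrative OPT-30B-like shapes: 4096-token context, batch size 64, fp16 activations.
size = kv_cache_bytes(batch_size=64, seq_len=4096, num_layers=48,
                      num_heads=56, head_dim=128, bytes_per_elem=2)
print(f"KV cache: {size / 2**30:.1f} GiB")  # grows linearly in seq_len and batch_size
```

Even at these moderate sizes the cache runs to hundreds of gigabytes, which is why shrinking it translates directly into larger feasible batch sizes and higher throughput.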

Key Observations That Inspire the Approach

  • Although the models are trained with full dense attention (looking at every token in the past), in practice the attention scores tend to be very “sparse.” In other words, only a small fraction of the previous tokens really matter for predicting the next word.
  • Empirically, the aggregated attention scores follow a power-law distribution: a small set of tokens (the “heavy hitters”) receives most of the attention mass when the model computes what to say next. The bookkeeping behind this observation is sketched below.
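
The sketch below shows the kind of bookkeeping involved: sum each key position's attention weights over heads and query steps, then check how much of the total mass the top 20% of positions hold. Random tensors are used only so the snippet runs on its own; the power-law concentration reported in the paper appears when the attention maps come from a real pretrained model.

```python
import torch

torch.manual_seed(0)
seq_len, num_heads, head_dim = 128, 4, 64

# Stand-in queries and keys; in practice these come from a pretrained model's attention layers.
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5
attn = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

# Accumulated attention each key position receives, summed over heads and query rows.
acc = attn.sum(dim=(0, 1))                       # shape: [seq_len]
top = torch.topk(acc, k=int(0.2 * seq_len))      # candidate "heavy hitter" positions
print("share of attention mass held by the top 20% of positions:",
      float(top.values.sum() / acc.sum()))
```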

The Heavy-Hitter Oracle (H2O) Approach

  • Recognizing that only a small group of tokens is really influential, the proposed method constructs a smarter eviction strategy for the KV cache. Instead of keeping all tokens or simply keeping only the most recent ones, H2O dynamically decides which tokens are the “heavy hitters.”
  • At every generation step, the method examines the attention scores and uses a greedy algorithm to determine which token to remove from the cache if necessary. The greedy algorithm is chosen because it is efficient and, under certain assumptions about the structure of attention (specifically, if it behaves like a submodular function), it can be nearly optimal.
  • The strategy blends the retention of recent tokens with that of tokens observed to carry high accumulated attention scores. In practice, even when the cache is shrunk to only 20% of its original memory requirement, the method maintains the quality of the generated text; a simplified version of the eviction rule is sketched below.
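
Below is a minimal sketch of that eviction rule as described above: always keep a window of the most recent tokens, then fill the remaining budget with the tokens whose accumulated attention scores are largest. The function name, its signature, and the assumption that acc_scores is indexed by absolute token position are illustrative choices, not the API of the released repository.

```python
import torch

def h2o_keep_set(acc_scores, cache_positions, budget, recent_window):
    """Return the token positions to keep in the KV cache (simplified H2O-style rule).

    acc_scores      -- 1-D tensor of accumulated attention per absolute token position
    cache_positions -- 1-D tensor of currently cached positions, in ascending order
    budget          -- total number of KV entries the cache is allowed to hold
    recent_window   -- number of most-recent tokens that are always retained
    """
    if cache_positions.numel() <= budget:
        return cache_positions                    # cache still fits, nothing to evict

    recent = cache_positions[-recent_window:]     # local context is always kept
    older = cache_positions[:-recent_window]

    # Greedily keep the older tokens with the largest accumulated attention
    # ("heavy hitters"), using whatever budget the recent window leaves over.
    n_heavy = budget - recent_window
    heavy = older[torch.topk(acc_scores[older], k=n_heavy).indices]

    return torch.cat([heavy.sort().values, recent])
```

At each decoding step, acc_scores would be updated with the newly computed attention weights before this rule is applied, so the cache stays at a fixed size while both the local window and the heavy hitters survive.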

Benefits and Theoretical Guarantees

  • Thanks to this approach, the memory footprint of the KV cache can be reduced significantly, leading to faster and more efficient inference.
  • The researchers provide a formal statement showing that, under some mild assumptions, the greedy algorithm’s performance is close to that of an ideal strategy. The idea is formulated as a “dynamic submodular maximization” problem, a way of mathematically capturing the goal of efficiently choosing a small but effective subset; the classical bound this style of analysis builds on is recalled after this list.
  • Experimental results on several LLM families (OPT, LLaMA, and GPT-NeoX) across a wide range of tasks show that H2O not only reduces memory use but also improves throughput (tokens generated per second) substantially: up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate and up to 3× over FlexGen on OPT-6.7B and OPT-30B, with up to 1.9× lower latency at the same batch size, while preserving or even slightly improving the quality of the generated text.
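
For orientation, the classical result that this style of analysis builds on is the greedy guarantee for maximizing a monotone submodular set function under a cardinality constraint (Nemhauser et al., 1978). The paper’s own theorem is stated for its dynamic variant and under its own assumptions; the sketch below is only the standard static statement, with the cardinality k playing the role of the cache budget.

```latex
% Standard greedy guarantee for a monotone submodular set function f and budget k:
\[
  f(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le k} f(S),
\]
% where S_greedy is built by repeatedly adding the element with the largest marginal gain.
```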

Practical Implications

  • This method is particularly important for applications that require long-context generation such as dialogue systems, story writing, or summarizing long documents.
  • By reducing the memory burden, systems can run faster and more economically, making advanced LLMs more accessible for a wider range of applications.

Overall, the paper presents an innovative solution to a key bottleneck in deploying LLMs. By identifying and retaining the critical “heavy hitter” tokens within the KV cache, the H2O approach allows for efficient text generation, reducing memory usage and increasing speed without sacrificing performance.

Authors (12)
  1. Zhenyu Zhang (249 papers)
  2. Ying Sheng (31 papers)
  3. Tianyi Zhou (172 papers)
  4. Tianlong Chen (202 papers)
  5. Lianmin Zheng (34 papers)
  6. Ruisi Cai (11 papers)
  7. Zhao Song (253 papers)
  8. Yuandong Tian (128 papers)
  9. Christopher Ré (194 papers)
  10. Clark Barrett (86 papers)
  11. Zhangyang Wang (374 papers)
  12. Beidi Chen (61 papers)