Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

Published 21 Feb 2025 in cs.CL and cs.AI | (2502.15294v3)

Abstract: The increasing context window size in LLMs has improved their ability to handle complex, long-text tasks. However, as the conversation rounds continue, it is required to store a large amount of KV cache in GPU memory, which significantly affects the efficiency and even availability of the model serving systems. This paper analyzes dialogue data from real users on the granularity of round and discovers that the LLM inference manifests a watershed layer, after which the distribution of round-level attention shows notable similarity. Based on this, we propose Round Attention - a novel round-level attention mechanism that selectively processes the KV cache of top-k relevant rounds, where k is dynamically determined through the attention matrix in the watershed layer. Theoretical analysis demonstrates that our method reduces memory usage by 54\% to 82\%, while experimental results confirm that loading sparse critical-round KV cache maintains answer accuracy without performance degradation.

Abstract PDF Upgrade to Chat

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

We haven't generated follow-up questions for this paper yet.

Generate Now

Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Authors (7)

Collections

Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Related Papers

Authors (7)

Collections