Recycled Attention: Efficient inference for long-context language models (2411.05787v1)

Published 8 Nov 2024 in cs.CL

Abstract: Generating long sequences of tokens given a long-context input imposes a heavy computational burden for LLMs. One of the computational bottlenecks comes from computing attention over a long sequence of input at each generation step. In this paper, we propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens, reducing the cost of data movement and attention computation. Compared to previously proposed inference-time acceleration methods which attend only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step. We evaluate our method on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and on long-context language modeling tasks. Applying our method to off-the-shelf LLMs achieves comparable speedup to baselines which only consider local context while improving the performance by 2x. We further explore two ideas to improve the performance-efficiency trade-off: (1) dynamically deciding when to perform a recycled or full attention step based on query similarities and (2) continued pre-training of the model with Recycled Attention.

Overview of "Recycled Attention: Efficient Inference for Long-Context Language Models"

The paper "Recycled Attention: Efficient inference for long-context LLMs" presents a novel approach aimed at reducing the computational burden associated with processing long sequences in LLMs. The primary contribution is the introduction of a method called Recycled Attention, which optimizes inference by alternating between full-context attention and attention over a subset of relevant tokens determined dynamically during generation steps. This approach addresses the inefficiencies caused by existing techniques that permanently evict tokens from the Key-Value (KV) cache.

Key Contributions

Recycled Attention offers a flexible approach to token selection, enhancing both performance and efficiency in long-context LLMs. Unlike previous methods that rely strictly on local contexts or maintain a fixed subset of tokens, this approach harnesses the attention pattern of recently processed tokens to infer the most crucial tokens for future steps. This method is particularly advantageous in scenarios requiring synthesis of non-local information, which is critical for long-context benchmarks and applications such as document retrieval or complex query answering.

Methodology

Recycled Attention maintains two KV caches: a full cache used for the sporadic full-attention steps and a dynamically refreshed recycled cache used for the partial-attention steps. The process is as follows (a minimal sketch appears after the list):

  • During pre-filling, the method computes full attention and initializes the recycled cache with the top K tokens based on attention scores.
  • In partial attention steps, the model attends over the recycled cache, crucially reducing the computational load by decreasing both attention computation and data movement.
  • Full attention steps are scheduled either at fixed intervals or dynamically, providing a balance between maintaining performance and minimizing computation.
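The schedule can be illustrated with a minimal, single-head sketch in PyTorch. The fixed stride S, the recycled-cache size K, the random tensors standing in for real queries, keys, and values, and the choice to append newly generated tokens to both caches are assumptions made for this sketch, not details taken from the paper.

```python
import torch

def attend(q, k, v):
    # Standard scaled dot-product attention; returns the output and attention weights.
    scores = (q @ k.transpose(-1, -2)) / (k.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

d, K, S = 64, 128, 16            # head dim, recycled-cache size, full-attention stride (assumed)
full_k = torch.randn(4096, d)    # full KV cache built during pre-filling
full_v = torch.randn(4096, d)

# Pre-filling: one full-attention pass selects the top-K most-attended positions.
q = torch.randn(1, d)
_, w = attend(q, full_k, full_v)
topk = torch.topk(w.squeeze(0), K).indices
recycled_k, recycled_v = full_k[topk], full_v[topk]

for step in range(64):                        # decoding loop
    q = torch.randn(1, d)                     # stand-in for the current query
    new_k, new_v = torch.randn(1, d), torch.randn(1, d)
    full_k = torch.cat([full_k, new_k])       # every new token enters the full cache
    full_v = torch.cat([full_v, new_v])

    if step % S == 0:
        # Full-attention step: attend over everything, then refresh the recycled
        # cache by recycling this step's attention pattern (top-K positions).
        out, w = attend(q, full_k, full_v)
        topk = torch.topk(w.squeeze(0), K).indices
        recycled_k, recycled_v = full_k[topk], full_v[topk]
    else:
        # Partial step: attend only over the small recycled cache. Appending the
        # new token here is an assumption of this sketch.
        recycled_k = torch.cat([recycled_k, new_k])
        recycled_v = torch.cat([recycled_v, new_v])
        out, _ = attend(q, recycled_k, recycled_v)
```

The point of the sketch is the asymmetry between the two branches: full-attention steps pay the cost of attending over the entire cache but refresh the recycled cache, which then makes the intervening partial steps cheap.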

Experimental Evaluation

The paper evaluates the Recycled Attention method across multiple tasks and models, comparing it with existing strategies such as StreamingLLM and H2O. Key results include:

  • On the RULER benchmark, Recycled Attention substantially outperforms the baselines, achieving 63% accuracy for Llama-3.1-8B at a 32K input context length compared to less than 25% for the baselines.
  • The method reduces inference latency while maintaining or improving perplexity on language modeling tasks with Llama and Qwen models.
  • Compared to baselines like StreamingLLM, Recycled Attention reaches a better trade-off between task performance and inference speed, particularly for tasks requiring synthesis of long-range dependencies.

Implications and Future Directions

Recycled Attention represents a meaningful advance in making LLMs more computationally efficient without sacrificing performance. The flexible and dynamic nature of token selection aligns token attention with task requirements, potentially influencing future architecture designs to adopt more dynamic and context-aware attention mechanisms.

The research opens pathways for further work on dynamically scheduling full-attention steps, and continued pre-training of models with Recycled Attention could improve the performance-efficiency trade-off further. Potential extensions of the methodology to modalities beyond text might also be considered, given the scalability and practicality demonstrated.
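The abstract describes one of these ideas as dynamically deciding when to perform a recycled or full attention step based on query similarities. A minimal sketch of such a trigger is shown below; the cosine-similarity test and the 0.9 threshold are assumptions for illustration, not the paper's exact criterion or hyperparameter.

```python
import torch
import torch.nn.functional as F

def needs_full_attention(q_current, q_last_full, threshold=0.9):
    # Trigger a full-attention step when the current query has drifted away from
    # the query used at the last full-attention step. The cosine-similarity test
    # and the 0.9 threshold are assumed for illustration.
    sim = F.cosine_similarity(q_current, q_last_full, dim=-1)
    return bool((sim < threshold).any())

# Example: a query close to the last full-attention query keeps using the
# recycled cache; a dissimilar query triggers a refresh.
q_full = torch.randn(1, 64)
print(needs_full_attention(q_full + 0.01 * torch.randn(1, 64), q_full))  # likely False
print(needs_full_attention(torch.randn(1, 64), q_full))                  # likely True
```

The intuition is that a query which has drifted away from the one used to build the current recycled cache signals that the cached top-K tokens may no longer be the most relevant ones.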

In conclusion, Recycled Attention addresses a critical challenge in LLM deployment by optimizing inference efficiency, particularly for long-context tasks. Its adaptability and reduced computation without considerable drops in performance highlight its potential for broader applications in AI, suggesting a promising avenue for improving the practicality of LLMs in resource-intensive scenarios.

Authors (3)
  1. Fangyuan Xu (10 papers)
  2. Tanya Goyal (24 papers)
  3. Eunsol Choi (76 papers)