- The paper demonstrates that Block-Attention significantly reduces TTFT by up to 98.7% while maintaining or improving model accuracy.
- It introduces a novel architecture that segments input sequences into blocks to independently compute and cache KV states, reducing redundant calculations.
- Empirical evaluations on four RAG benchmarks show that the approach improves inference efficiency and accuracy, making it promising for real-time applications.
Block-Attention for Low-Latency RAG
The paper, "Block-Attention for Low-Latency RAG," presents a novel attention mechanism aimed at optimizing inference latency in Retrieval-Augmented Generation (RAG) scenarios. Traditionally, RAG employs external retrieval mechanisms to source relevant passages, which are then incorporated into the input prompts of LLMs. While this approach mitigates knowledge hallucination and enhances the domain-specific expertise of LLMs, it also leads to significantly longer input sequences, thereby escalating the time to first token (TTFT).
Main Contributions and Implementation
Block-Attention divides the input sequence into blocks. Each block computes its key-value (KV) states via self-attention restricted to its own tokens; only the final block attends to all preceding blocks. Because the KV states of retrieved passages no longer depend on what precedes them, they can be pre-computed once and cached in memory, eliminating redundant computation during inference.
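To make that masking pattern concrete, here is a minimal single-layer sketch in PyTorch, assuming a prompt laid out as context blocks followed by a final query block. The helper names (`block_attention_mask`, `masked_attention`) are illustrative and not taken from the paper's code.

```python
import torch

def block_attention_mask(block_lengths):
    """Boolean mask (True = may attend) for the Block-Attention pattern:
    every block except the last attends only causally within itself;
    the final (query) block attends causally to the full prefix.
    Illustrative helper, not the authors' implementation."""
    total = sum(block_lengths)
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for i, length in enumerate(block_lengths):
        end = start + length
        if i < len(block_lengths) - 1:
            # Context block: causal attention restricted to its own span.
            mask[start:end, start:end] = causal[start:end, start:end]
        else:
            # Final query block: causal attention over everything before it.
            mask[start:end, :end] = causal[start:end, :end]
        start = end
    return mask

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention with the block mask applied.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: two retrieved passages (5 and 7 tokens) plus a 4-token query block.
lengths = [5, 7, 4]
x = torch.randn(sum(lengths), 32)
out = masked_attention(x, x, x, block_attention_mask(lengths))
print(out.shape)  # torch.Size([16, 32])
```

In a full model the same pattern applies at every layer, which is what makes a passage's KV states independent of the rest of the prompt and therefore cacheable.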
Key implementation steps include:
- Segmenting the input sequence into blocks.
- Re-computing positional encodings for each cached block according to its position in the assembled prompt (see the sketch after this list).
- Fine-tuning the LLM to adapt to the Block-Attention mechanism.
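One way to realize the positional step, assuming a RoPE-based model such as Llama 3: because rotary embeddings are relative rotations, key states cached for a block at positions [0, n) can be moved to their new offset in the assembled prompt with one extra rotation. The sketch below uses the interleaved RoPE pairing convention and an illustrative `rope_rotate` helper; it is not the authors' released implementation.

```python
import torch

def rope_rotate(k, offset, base=10000.0):
    """Shift cached key states by `offset` positions under rotary embeddings.

    Assumption: keys were cached with RoPE applied at positions [0, n);
    rotating each pair of dimensions by offset * theta_i yields the keys
    as if they had been encoded at positions [offset, offset + n).
    """
    n, d = k.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = offset * inv_freq            # extra rotation per dimension pair
    cos, sin = angles.cos(), angles.sin()
    k1, k2 = k[..., 0::2], k[..., 1::2]   # interleaved (even, odd) pairs
    rotated = torch.empty_like(k)
    rotated[..., 0::2] = k1 * cos - k2 * sin
    rotated[..., 1::2] = k1 * sin + k2 * cos
    return rotated

# A 7-token passage block cached at positions [0, 7) is reused as a block
# that starts at position 42 of a new prompt.
cached_keys = torch.randn(7, 64)
reusable_keys = rope_rotate(cached_keys, offset=42)
```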
The significance of this approach is demonstrated through extensive experiments on four RAG benchmarks—Natural Questions (NQ), TriviaQA (TQA), HotpotQA (HQA), and 2WikiMultiHopQA (2Wiki).
Numerical Results
The Block-Attention model matches or slightly surpasses standard self-attention models. On Llama3, Block-Attention reaches an average accuracy of 68.4% versus 67.9% for self-attention; on Mistral, 62.8% versus 59.6%. Most notably, Block-Attention reduces TTFT by up to 98.7% and first-token floating-point operations (FLOPs) by 99.8% for sequences of up to 32K tokens.
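As a rough sanity check on those figures, here is a back-of-the-envelope estimate (our simplification, not the paper's measurement setup): if only a short final query block must be prefilled while all retrieved-passage KV states come from cache, the attention work at the first token shrinks roughly in proportion to the query block's share of the prompt.

```python
# Illustrative estimate only: assumes a 50-token query block at the end of a
# 32K-token prompt, with all other KV states served from cache.
total_len, query_len = 32_000, 50

# Prefill attention cost scales roughly with (tokens processed) x (context attended).
full_prefill = total_len * total_len
block_prefill = query_len * total_len   # only the query block is prefilled

print(f"attention work retained: {block_prefill / full_prefill:.2%}")  # ~0.16%
```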
Theoretical and Practical Implications
Theoretically, Block-Attention offers a way around the context dependence of KV states in autoregressive models, allowing those states to be reused across different queries. This decoupling of sequential dependencies sets a precedent for future work on optimizing autoregressive inference. It also motivates treating input sequences as collections of independent blocks, a strategy that may extend beyond RAG to domains such as code generation and multi-turn dialogue.
Practically, Block-Attention has significant implications for the efficiency of RAG-based systems. By allowing KV states to be pre-computed and cached, it removes much of the latency inherent in processing long input sequences. This is particularly beneficial for dynamic RAG paradigms such as ReAct, which require multiple rounds of retrieval and re-encoding of passages. Developers can increase the number of retrieved passages without compromising responsiveness, maximizing the model's accuracy and utility in real-time applications.
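In a serving system, this can be as simple as a passage-level cache keyed by the passage text, populated on first encounter and consulted on every subsequent request. The sketch below is a minimal illustration; `compute_block_kv` is a hypothetical stand-in for the block-wise encoding step described earlier.

```python
import hashlib

class BlockKVCache:
    """Minimal passage-level KV cache for a RAG server (illustrative sketch)."""

    def __init__(self):
        self._store = {}  # passage hash -> cached (key, value) states

    def get_or_compute(self, passage, compute_block_kv):
        key = hashlib.sha256(passage.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: encode the passage once as an independent block.
            self._store[key] = compute_block_kv(passage)
        return self._store[key]

# On each request, every retrieved passage is looked up (or encoded once),
# and only the user's question block is prefilled from scratch.
```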
Future Developments
Future research directions could investigate the broader application of Block-Attention across various AI and NLP tasks. This includes exploring its potential in other dynamically adaptive systems and understanding the scalability of such modular attention mechanisms. Additionally, empirical studies could test the hypothesis that Block-Attention outperforms self-attention when input segments are semantically independent.
Conclusion
The paper outlines a viable solution to a prominent challenge faced in RAG models, achieving a significant reduction in inference latency while maintaining high accuracy. The empirical results strongly support the use of Block-Attention in real-world applications, providing a sophisticated yet practical tool for improving the performance and efficiency of LLMs. The benefits of Block-Attention, particularly its large reductions in TTFT and first-token FLOPs, underscore its importance in advancing the capabilities of RAG and similar adaptive AI architectures.