SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion (2312.07305v1)

Published 12 Dec 2023 in cs.CL and cs.AI

Abstract: Sparse attention is an efficient method that can significantly decrease computation cost, but current sparse attention tends to rely on window self-attention, which blocks the global information flow. To address this problem, we present Shifted Cross Chunk Attention (SCCA), which uses different KV shifting strategies to extend the receptive field in each attention layer. In addition, we combine Dilated Attention (DA) and Dilated Neighborhood Attention (DNA) to present Shifted Dilated Attention (SDA). Both SCCA and SDA accumulate attention results across heads in multi-head attention to approximate the receptive field of full attention. In this paper, we conduct language modeling experiments using different patterns of SCCA and combinations of SCCA and SDA. Combined with Positional Interpolation (PI) and LoRA, the proposed SCCA extends LLMs to longer contexts more effectively than current sparse attention. Notably, SCCA adapts LLaMA2 7B from a 4k context to 8k on a single V100. This attention pattern provides a plug-and-play fine-tuning method to extend model context while retaining the original architecture, and it is compatible with most existing techniques.

Summary

  • The paper introduces SCCA, which shifts key-value pairs across chunks to enhance long-context processing in Transformers.
  • It combines SCCA with SDA, Positional Interpolation, and LoRA to extend receptive fields without altering model architecture.
  • Experiments demonstrate improved perplexity on PG19 and Proof datasets, confirming enhanced efficiency in handling extended contexts.

Introduction to Sparse Attention Mechanisms

The Transformer architecture has gained widespread use in LLMs. However, efficiently processing long input sequences remains a challenge, because self-attention in standard Transformers has computational cost that scales quadratically with input length. Sparse attention mechanisms mitigate this by restricting computation to a subset of the input, reducing memory usage and computation time.

Enhancements to Sparse Attention: SCCA and SDA

This work introduces the Shifted Cross Chunk Attention (SCCA) method that improves information flow by employing different key-value (KV) shift strategies, allowing for direct attention to tokens outside the immediate window or chunk. SCCA partitions the input into chunks and uses shifting within multi-head attention to extend the receptive field, mimicking global attention patterns. Two shifting strategies are discussed: one that shifts KV pairs by a fixed amount in some heads, and another that varies the shift across different heads to simulate wider receptive fields. Furthermore, the paper combines Dilated Attention (DA) and Dilated Neighborhood Attention (DNA) to form Shifted Dilated Attention (SDA), which selects tokens globally in a dilated fashion across different heads.
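The core shifting operation can be illustrated with a short sketch. The snippet below is a minimal, self-contained illustration of the idea rather than the authors' implementation: it assumes (batch, heads, seq_len, head_dim) tensors, applies a half-chunk roll to the keys and values of half of the heads, and omits causal masking and the SDA head patterns for brevity.

```python
# Minimal sketch of the SCCA shifting idea (hypothetical shapes and names;
# the paper's exact head grouping, shift amounts, and masking may differ).
import torch
import torch.nn.functional as F

def shifted_cross_chunk_attention(q, k, v, chunk_size):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by chunk_size."""
    B, H, N, D = q.shape
    half = H // 2

    # Half of the heads attend within plain chunks; the other half see K/V
    # rolled by half a chunk, so their chunks straddle the original boundaries.
    k_shift = torch.roll(k[:, half:], shifts=-chunk_size // 2, dims=2)
    v_shift = torch.roll(v[:, half:], shifts=-chunk_size // 2, dims=2)
    k = torch.cat([k[:, :half], k_shift], dim=1)
    v = torch.cat([v[:, :half], v_shift], dim=1)

    # Reshape so attention is computed independently inside each chunk.
    def to_chunks(x):
        return x.reshape(B, H, N // chunk_size, chunk_size, D)

    qc, kc, vc = to_chunks(q), to_chunks(k), to_chunks(v)
    out = F.scaled_dot_product_attention(qc, kc, vc)  # chunk-local attention
    return out.reshape(B, H, N, D)
```

Because each head only attends within a chunk, the cost per head stays linear in sequence length for a fixed chunk size, while the rolled heads let information cross chunk boundaries; varying the shift amount across heads (the second strategy described above) widens the approximated receptive field further.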

Addressing Context Extrapolation

The paper goes on to discuss the challenge of context extrapolation in LLMs, which is critical for maintaining model performance when input lengths exceed those seen during training. Positional Interpolation tackles this by rescaling position indices so that longer contexts fit within the trained position range without full fine-tuning, but an efficient attention pattern is still necessary for good performance. This paper combines SCCA with Positional Interpolation and LoRA so that models can efficiently handle longer contexts without changing their original architectures.
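For reference, Positional Interpolation amounts to rescaling position indices before they enter the rotary embedding, so a longer sequence maps into the position range seen during training. The sketch below shows that rescaling for a RoPE-style setup; the function and argument names are illustrative assumptions, not taken from the paper's code.

```python
# Minimal sketch of Positional Interpolation for RoPE-style embeddings
# (illustrative names and defaults; not the paper's code).
import torch

def rope_angles_with_pi(seq_len, head_dim, train_len=4096, target_len=8192, base=10000.0):
    """RoPE rotation angles with positions rescaled by train_len / target_len (PI)."""
    scale = train_len / target_len  # e.g. 4096 / 8192 = 0.5
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)
```

Because only the positional scaling changes, PI composes naturally with LoRA adapters and with sparse patterns such as SCCA during a light fine-tuning pass.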

Experimental Results and Conclusion

The proposed attention mechanisms were tested on language modeling tasks using various SCCA patterns and combinations with SDA. The experiments show that SCCA combined with Positional Interpolation and LoRA outperforms current sparse attention methods, successfully extending the context size LLMs can handle, as reflected in perplexity scores on the PG19 and Proof datasets. Incorporating SCCA and SDA into fine-tuning thus offers an efficient, plug-and-play route to modeling longer contexts.