- The paper introduces SCCA, which shifts key-value pairs across chunks to enhance long-context processing in Transformers.
- It combines SCCA with SDA, Position Interpolation, and LoRA to extend receptive fields without altering the model architecture.
- Experiments show lower perplexity on the PG19 and Proof datasets, indicating more effective handling of extended contexts.
Introduction to Sparse Attention Mechanisms
The Transformer architecture underpins most large language models (LLMs), yet efficiently processing long context sequences remains a challenge: the cost of full self-attention grows quadratically with input length. Sparse attention mechanisms mitigate this by restricting computation to a subset of token pairs, thereby reducing memory usage and computation time.
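As a concrete illustration (not the paper's code), the PyTorch sketch below shows block-local attention, the simplest sparse pattern: each token attends only within its own chunk, so the quadratic cost of full attention drops to roughly O(seq_len x chunk_size). Causal masking is omitted for brevity, and all names and shapes are assumptions.

```python
# A minimal sketch of block-local sparse attention (illustrative, not from the paper).
import torch
import torch.nn.functional as F

def chunked_local_attention(q, k, v, chunk_size):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by chunk_size."""
    b, h, n, d = q.shape
    c = n // chunk_size
    # Fold the sequence into chunks; attention is then computed only within each chunk.
    q = q.view(b, h, c, chunk_size, d)
    k = k.view(b, h, c, chunk_size, d)
    v = v.view(b, h, c, chunk_size, d)
    scores = torch.einsum("bhcid,bhcjd->bhcij", q, k) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum("bhcij,bhcjd->bhcid", probs, v)
    return out.reshape(b, h, n, d)
```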
Enhancements to Sparse Attention: SCCA and SDA
This work introduces Shifted Cross Chunk Attention (SCCA), which improves information flow by applying different key-value (KV) shift strategies so that tokens can attend directly to positions outside their own window or chunk. SCCA partitions the input into chunks and shifts KV pairs across heads within multi-head attention, extending the receptive field and approximating a global attention pattern. Two shifting strategies are discussed: one shifts the KV pairs by a fixed amount in a subset of heads, while the other applies different shift offsets across heads to simulate a wider receptive field. Furthermore, the paper combines Dilated Attention (DA) and Dilated Neighborhood Attention (DNA) to form Shifted Dilated Attention (SDA), which selects tokens globally in a dilated fashion that differs across heads.
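One reading of the fixed-shift variant can be sketched by reusing chunked_local_attention from the earlier sketch: half of the heads keep the original chunk boundaries, while the other half see KV pairs rolled by half a chunk, so their chunks straddle the original boundaries and information flows across them. The multi-shift variant would instead assign a different roll offset to each head group. The function name and the half-chunk offset are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of SCCA's fixed KV shift (one interpretation, not the authors' code).
import torch

def scca_fixed_shift(q, k, v, chunk_size):
    b, h, n, d = q.shape
    shift = chunk_size // 2                      # assumed offset; the paper may use another value
    k_shift, v_shift = k.clone(), v.clone()
    # Roll KV along the sequence dimension for the second half of the heads only,
    # so those heads attend across the original chunk boundaries.
    k_shift[:, h // 2:] = torch.roll(k[:, h // 2:], shifts=-shift, dims=2)
    v_shift[:, h // 2:] = torch.roll(v[:, h // 2:], shifts=-shift, dims=2)
    return chunked_local_attention(q, k_shift, v_shift, chunk_size)
```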
Addressing Context Extrapolation
The paper then addresses context extrapolation in LLMs, which is critical for maintaining model performance when input lengths exceed those seen during training. Position Interpolation tackles this by rescaling position indices so that longer inputs fall within the trained position range, extending the context window without full retraining; an efficient attention pattern is still needed, however, to make long-context fine-tuning practical. This paper therefore combines SCCA with Position Interpolation and LoRA, allowing models to handle longer contexts efficiently without changing their original architecture.
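For context, Position Interpolation applied to RoPE can be sketched as compressing position indices by train_len / seq_len so that longer inputs stay inside the position range seen during pre-training; the paper pairs an interpolation scheme of this kind with SCCA and LoRA fine-tuning. The sketch below is a generic illustration with assumed parameter names, not the paper's code.

```python
# A generic sketch of Position Interpolation on RoPE angles (illustrative assumptions).
import torch

def interpolated_rope_angles(seq_len, head_dim, train_len, base=10000.0):
    # Compress positions only when the input exceeds the pre-training length.
    scale = train_len / seq_len if seq_len > train_len else 1.0
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)      # used to rotate q/k as in RoPE
```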
Experimental Results and Conclusion
The proposed attention mechanisms were evaluated on language modeling tasks using various SCCA patterns and combinations with SDA. The experiments show that SCCA combined with Position Interpolation and LoRA outperforms existing sparse attention methods, successfully extending the context size LLMs can handle, as reflected in perplexity scores on the PG19 and Proof datasets. Incorporating SCCA and SDA into the fine-tuning process represents a significant step toward efficiently modeling longer contexts.