FlashMask: Efficient and Rich Mask Extension of FlashAttention
The paper "FlashMask: Efficient and Rich Mask Extension of FlashAttention" addresses the computational and memory inefficiencies of vanilla attention mechanisms in Transformer models. Traditional attention mechanisms exhibit a quadratic complexity in both computational and memory requirements, which poses significant challenges when handling long sequences.
Contributions
- Column-wise Sparse Mask Representation: FlashMask introduces a column-wise sparse representation of attention masks that efficiently accommodates a wide variety of mask types while achieving linear memory complexity, i.e., O(N) in the sequence length N (see the back-of-the-envelope comparison after this list).
- Optimized Kernel Implementations: This new sparse representation enables the development of optimized kernels, enhancing computational efficiency without sacrificing accuracy.
- Extensive Evaluation: FlashMask shows substantial throughput improvements and computational speedups in practical LLM fine-tuning and alignment training tasks, outperforming existing methods like FlashAttention and FlexAttention.
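To make the memory claim concrete, the following back-of-the-envelope comparison contrasts a dense byte-per-entry mask with a column-wise index representation for a 128K-token sequence. The use of four int32 index vectors is an illustrative assumption; the exact number and width of the index vectors depends on the implementation.

```python
seq_len = 128 * 1024  # 128K-token sequence

# Dense 2D mask: one byte per (query, key) pair -> quadratic in seq_len.
dense_bytes = seq_len * seq_len

# Column-wise sparse mask: a few int32 index vectors of length seq_len
# (four vectors assumed here, e.g. start/end of masked intervals below
# and above the diagonal) -> linear in seq_len.
sparse_bytes = 4 * seq_len * 4

print(f"dense mask:  {dense_bytes / 2**30:.1f} GiB")   # 16.0 GiB
print(f"sparse mask: {sparse_bytes / 2**20:.1f} MiB")  # 2.0 MiB
```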
Masking Strategies and Impact
In Transformer models, different tasks necessitate specific masking strategies. Vanilla attention incurs heavy computational and memory overheads due to its O(N^2) complexity. FlashAttention mitigates this through IO-aware memory optimizations, but its kernels support only a limited set of predefined mask types and cannot express more complex masking patterns.
FlashMask's column-wise sparse representation encodes an attention mask via start and end row indices of the masked interval(s) in each column of the attention score matrix. This reduces the dense 2D mask to compact 1D intervals, facilitating efficient kernel processing, and extends FlashAttention's mask-handling capabilities to a broad range of complex mask types.
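A minimal NumPy sketch of the idea follows. It assumes a single masked [start, end) row interval per column (the paper's full representation also distinguishes intervals below and above the diagonal); the function and variable names are illustrative, not FlashMask's actual API.

```python
import numpy as np

def expand_column_mask(start, end, seq_len):
    """Reconstruct a dense boolean mask (True = masked out) from
    per-column [start, end) row intervals."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for j in range(seq_len):
        mask[start[j]:end[j], j] = True
    return mask

seq_len = 8

# Causal mask: in column j, rows 0..j-1 are masked, so each query only
# attends to keys at or before its own position.
causal = expand_column_mask(
    start=np.zeros(seq_len, dtype=int),
    end=np.arange(seq_len, dtype=int),
    seq_len=seq_len,
)

# Bidirectional document mask for two packed documents of lengths 5 and 3:
# columns of the first document mask the rows of the second, and vice versa.
doc = expand_column_mask(
    start=np.array([5, 5, 5, 5, 5, 0, 0, 0]),
    end=np.array([8, 8, 8, 8, 8, 5, 5, 5]),
    seq_len=seq_len,
)
```

Either mask is stored with O(N) integers yet expands to the same dense pattern that would otherwise have to be materialized and read from memory.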
Computational Efficiency and Analysis
FlashMask maintains linear space complexity and reduces unnecessary computation by exploiting sparsity in the attention mask: tiles of the score matrix that are entirely masked can be skipped, and entirely unmasked tiles need no element-wise masking. Memory access is also significantly reduced compared to materializing a dense mask, as demonstrated through complexity analysis and experimental results.
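The computational savings can be illustrated with a block-level classification sketch: using only the minimum and maximum of the column-wise indices within a key block, a tile of the score matrix can be labelled as fully masked (skipped outright), fully unmasked (computed without masking), or partially masked (element-wise masking applied). This is a simplified sketch under the single-interval-per-column assumption above, not the actual kernel logic.

```python
import numpy as np

def classify_tile(start, end, r0, r1, c0, c1):
    """Classify the score-matrix tile covering rows [r0, r1) and columns
    [c0, c1), given per-column masked row intervals [start[j], end[j])."""
    blk_start = start[c0:c1]
    blk_end = end[c0:c1]

    # Every column masks the whole row range -> the tile contributes
    # nothing and can be skipped entirely.
    if blk_start.max() <= r0 and blk_end.min() >= r1:
        return "skip"

    # No column's masked interval intersects the row range -> dense
    # computation with no element-wise masking needed.
    if np.all((blk_end <= r0) | (blk_start >= r1)):
        return "full"

    # Otherwise the mask must be applied inside the tile.
    return "partial"
```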
Experimental Validation and Results
- End-to-End Throughput: FlashMask delivers substantial improvements in end-to-end training throughput across model scales and sequence lengths, achieving speedups of 1.65x to 3.22x. Such gains are critical for training LLMs with long sequence lengths.
- Memory Consumption: By achieving linear memory overhead, FlashMask supports training sequences much longer than previously feasible with dense methods.
- Kernel Performance: FlashMask outperforms FlexAttention in various scenarios, with improvements ranging from 12.1% to 60.7% in kernel TFLOPs/s.
Future Directions
While FlashMask demonstrates superior efficiency in handling complex masking patterns, it does not support completely arbitrary masks, such as those with scattered masked regions within a column. Future research could explore more sophisticated sparse representations that balance expressiveness and computational efficiency. Moreover, integrating FlashMask's enhancements into other deep learning frameworks beyond PaddlePaddle could cater to a broader user base.
Conclusion
FlashMask extends FlashAttention with a column-wise sparse mask representation, enabling efficient handling of a wide range of attention masking types. It significantly reduces memory and computational overheads, and its efficacy is validated through extensive experimental evaluation. These advancements pave the way for more efficient Transformer models capable of handling long-context sequences, which is crucial for the continued evolution of LLMs.