Linear Attention Sequence Parallelism (2404.02882v3)

Published 3 Apr 2024 in cs.LG and cs.CL

Abstract: Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead compared with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP on other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8$\times$ longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP.


Summary

  • The paper introduces LASP, which exploits the right-product property of linear attention to scale sequence length up to 4096K on 128 A100 80G GPUs, up to 8x longer than existing SP methods.
  • The paper details system engineering optimizations such as kernel fusion and caching of KV states in GPU memory (HBM) to cut execution latency and memory overhead.
  • The paper demonstrates LASP's compatibility with diverse distributed data parallel methods, enabling efficient training of LLMs on large clusters.

Exploring the Frontiers of Scalability in Linear Attention-Based Models with LASP

Introduction

The development and application of linear attention mechanisms in LLMs offer a pathway around the scalability limitations imposed by traditional softmax-based attention in transformer models. Despite this advance, achieving efficient sequence parallelism (SP) in linear attention-based LLMs remains challenging, primarily because existing SP methods do not exploit the distinctive features of linear attention. This shortcoming not only limits parallelism efficiency but also restricts applicability in scenarios that demand processing very long sequences or employing large clusters of GPUs. To bridge this gap, Linear Attention Sequence Parallelism (LASP) offers an approach that improves the efficiency of SP in linear transformers by leveraging the inherent advantages of linear attention mechanisms.
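
The advantage LASP builds on is the right-product kernel trick: because linear attention drops the softmax, the output can be computed as Q(KᵀV) instead of (QKᵀ)V, replacing the n x n attention matrix with a small d x d state. The snippet below is a minimal sketch of this reordering; it omits the kernel feature map, normalization, and causal masking, and the function names are illustrative rather than taken from the paper's codebase.

```python
import torch

def softmax_attention(q, k, v):
    # Left-product-first: materializes an (n x n) score matrix, quadratic in n.
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def linear_attention(q, k, v):
    # Right-product-first: compute K^T V once as a (d x d) state, then apply Q.
    # Cost is O(n * d^2), linear in sequence length n.
    kv = k.transpose(-1, -2) @ v   # (d, d), independent of n
    return q @ kv                  # (n, d)

n, d = 2048, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2048, 64])
```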

Point-to-Point Communication Mechanism

One of the cornerstone innovations of LASP is its efficient point-to-point (P2P) communication mechanism, designed specifically to exploit the right-product kernel trick inherent to linear attention. Because only a compact intermediate state, rather than sequence-length-dependent activations, needs to be exchanged between devices, this mechanism significantly reduces the communication overhead typically encountered in SP. Key to the approach is its independence from attention-head partitioning, which makes it compatible with a variety of model architectures, including those with multi-head, multi-query, and grouped-query attention.
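
To make the communication pattern concrete, the sketch below shows one way such a ring-style P2P exchange of the per-head KV state could be expressed with torch.distributed. It is a conceptual illustration under simplifying assumptions (a single head, a d x d state, blocking send/recv), not the paper's implementation; the function name is hypothetical.

```python
import torch
import torch.distributed as dist

def ring_exchange_kv_state(kv_local, group=None):
    """Illustrative ring-style P2P exchange of the (d x d) KV state.

    Each rank holds one sequence chunk. It receives the accumulated KV
    state of all earlier chunks from its predecessor, adds its own
    chunk's contribution, and forwards the result to its successor.
    Only a d x d tensor is communicated, independent of chunk length.
    """
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    kv_prev = torch.zeros_like(kv_local)

    if rank > 0:
        dist.recv(kv_prev, src=rank - 1, group=group)  # prefix state from earlier chunks
    kv_out = kv_prev + kv_local
    if rank < world_size - 1:
        dist.send(kv_out, dst=rank + 1, group=group)   # pass the updated prefix onward
    return kv_prev  # used for this rank's inter-chunk computation
```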

System Engineering Optimization

The practical implementation of LASP includes several optimizations that enhance its performance on GPU clusters:

  • Kernel Fusion: By fusing both intra-chunk and inter-chunk computations, as well as the updates of the KV and dKV states, LASP minimizes kernel launches, thereby reducing execution latency.
  • KV State Caching: To avoid recomputing intermediate states, LASP caches KV states in the GPUs' High Bandwidth Memory (HBM), significantly cutting redundant work during the backward pass (a sketch of the chunked computation appears after this list).
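
As a rough illustration of how these pieces fit together, the sketch below combines the intra-chunk and inter-chunk terms for one local chunk and stashes the incoming KV state for reuse in the backward pass. It is a simplified, unfused PyTorch rendering under assumed shapes and names (single head, no normalization), not the fused CUDA kernels the paper describes.

```python
import torch

def lasp_chunk_forward(q, k, v, kv_prev, kv_cache):
    """Forward pass for one local sequence chunk (simplified sketch).

    q, k, v:  (chunk_len, d) projections for this device's chunk.
    kv_prev:  (d, d) accumulated KV state of all earlier chunks.
    kv_cache: list used to stash states needed by the backward pass.
    """
    chunk_len = q.shape[0]
    # Intra-chunk term: causal attention restricted to the local chunk.
    mask = torch.tril(torch.ones(chunk_len, chunk_len, device=q.device))
    intra = ((q @ k.transpose(-1, -2)) * mask) @ v
    # Inter-chunk term: contribution of all earlier chunks via the prefix state.
    inter = q @ kv_prev
    # Update the running KV state and cache the incoming state in HBM,
    # so the backward pass does not have to recompute it.
    kv_new = kv_prev + k.transpose(-1, -2) @ v
    kv_cache.append(kv_prev)
    return intra + inter, kv_new
```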

Compatibility with Distributed Data Parallel Training

LASP introduces a data-sequence hybrid parallelism scheme, making it compatible with a wide range of distributed data parallel (DDP) training methods. This compatibility is essential for distributed training on large clusters, especially when long sequences and large batches must be handled simultaneously. The design and implementation of LASP allow it to work seamlessly with conventional DDP as well as sharded implementations such as FSDP and ZeRO, without adversely affecting training efficiency or scalability.
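
One plausible way to organize such hybrid parallelism is to split the ranks into sequence-parallel groups (ranks sharing chunks of one long sequence) and data-parallel groups (ranks holding different batches, wrapped by DDP, FSDP, or ZeRO). The sketch below lays this out with torch.distributed process groups; the grouping scheme and function name are assumptions for illustration, not the paper's code.

```python
import torch.distributed as dist

def build_hybrid_groups(sp_size):
    """Split the world into sequence-parallel (SP) and data-parallel (DP) groups.

    Ranks in the same SP group share one long sequence split into chunks;
    ranks in the same DP group hold different batches and synchronize
    gradients via DDP, FSDP, or ZeRO. Purely illustrative layout.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % sp_size == 0
    dp_size = world_size // sp_size

    sp_group, dp_group = None, None
    # SP groups: consecutive ranks [0..sp_size-1], [sp_size..2*sp_size-1], ...
    for i in range(dp_size):
        ranks = list(range(i * sp_size, (i + 1) * sp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            sp_group = g
    # DP groups: ranks holding the same chunk position across SP groups.
    for j in range(sp_size):
        ranks = list(range(j, world_size, sp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return sp_group, dp_group
```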

Experimental Insights

Extensive experiments evaluate the scalability, efficiency, and applicability of LASP across varying sequence lengths and GPU cluster sizes. In these experiments, LASP scaled sequence length up to 4096K on 128 A100 80G GPUs with 1B-parameter models, supporting sequences up to 8 times longer than existing SP methods while maintaining or improving training speed.

Broad Implications

The introduction of LASP has practical and theoretical implications for the future of AI and machine learning research. Practically, it enables linear attention-based LLMs to process far longer sequences than previously feasible, opening new avenues for research and application in language modeling, biological sequence analysis, and beyond. Theoretically, LASP's approach to leveraging linear attention while optimizing communication and memory usage provides a template for designing efficient SP strategies in LLMs. Future developments may include further optimizing LASP for different hardware architectures or integrating it with emerging LLM architectures to unlock new capabilities.

In conclusion, LASP represents a significant advancement in the scalability and efficiency of linear attention-based LLMs. Its communication-efficient design and system engineering optimizations, together with its compatibility with various DDP methods, set a new benchmark for sequence parallelism in LLMs and offer a strong foundation for future advances in AI and machine learning.
