
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid (2502.07563v1)

Published 11 Feb 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as part of: https://github.com/OpenSparseLLMs/Linear-MoE.

Summary

  • The paper presents LASP-2, a novel sequence parallelism method that rethinks computation and communication for linear attention, significantly enhancing efficiency for very long sequences.
  • LASP-2 utilizes a single AllGather communication operation on intermediate memory states, improving parallelism compared to previous methods and enabling an extension (LASP-2H) for hybrid linear and standard attention models.
  • Experiments show LASP-2 achieves notable speed improvements (15.2% over LASP-1, 36.6% over Ring Attention) and scales linearly with sequence length while maintaining constant memory per GPU, handling sequences up to 2048K.

The paper presents LASP-2, a novel sequence parallelism (SP) method designed specifically for linear attention-based transformer models. LASP-2 addresses the inefficiencies of previous methods like LASP-1 and Ring Attention in handling very long input sequences, achieving significant improvements in both communication and computation parallelism.

Key Contributions

  1. Redesign of SP for Linear Attention: LASP-2 rethinks the entire communication and computation workflow for linear attention layers. Unlike LASP-1, which relies on a ring-style point-to-point (P2P) communication strategy, LASP-2 needs only a single AllGather collective over intermediate memory states whose size is independent of sequence length. Together with the reorganized workflow, this substantially improves both communication and computation parallelism, as well as their overlap; a minimal sketch of the AllGather pattern is given after this list.
  2. Extension to Hybrid Models: To cater to hybrid models that integrate both linear and standard attention layers, the paper extends LASP-2 to LASP-2H. This extension employs a unified all-gather-based communication strategy, allowing for effective SP in hybrid model architectures.
  3. Experimental Validation: The paper provides robust experimental results for LASP-2 and LASP-2H, evaluating models on sequence lengths up to 2048K across 64 GPUs. At long sequence lengths, LASP-2 improves training throughput by 15.2% over LASP-1 and by 36.6% over Ring Attention.
  4. Scalability and Memory Efficiency: LASP-2 can scale linearly with sequence length by increasing the number of GPUs, while maintaining constant memory usage per GPU. The pure sequence parallelism scenario examined confirms that LASP-2 can handle sequences up to 2048K with consistent computational resources.
  5. Convergence Performance: When standard attention modules are replaced with linear attention, LASP-2 reaches loss values comparable to other SP methods while delivering slightly higher throughput. Hybrid models that blend linear and standard attention layers converge as well as, or better than, the standard-attention baselines.
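
The following is a minimal, illustrative PyTorch sketch of the AllGather pattern described in item 1, not the authors' released implementation: each sequence-parallel rank summarizes its local chunk in a small memory state, and a single AllGather shares these states, whose size does not depend on sequence length. The function name, tensor shapes, and the non-causal simplification are assumptions made for brevity.

```python
# Illustrative sketch (not the authors' code): each sequence-parallel rank
# holds one chunk of Q, K, V with shape (chunk_len, d). Linear attention
# summarizes a chunk in a memory state K^T V of shape (d, d), whose size does
# not grow with sequence length, so a single AllGather over these states is
# enough to share context across ranks.
import torch
import torch.distributed as dist

def lasp2_style_forward(q, k, v, group=None):
    """Hypothetical AllGather-based linear-attention step (non-causal case)."""
    # Local memory state: (d, d), independent of chunk length.
    local_state = k.transpose(-2, -1) @ v

    # One AllGather of the small per-chunk states across all SP ranks.
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(local_state) for _ in range(world_size)]
    dist.all_gather(gathered, local_state, group=group)

    # Summing every chunk's state yields the global state (the causal case
    # instead combines only preceding chunks' states).
    global_state = torch.stack(gathered).sum(dim=0)

    # Right-product-first: q @ (K^T V) costs O(n * d^2) rather than O(n^2 * d).
    return q @ global_state
```

In the causal setting, each rank would combine only the states of preceding chunks, while attention within its own chunk is computed locally; that intra-/inter-chunk decomposition is sketched under Technical Insights below.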

Technical Insights

  • Linear Attention Reformulation: The paper leverages linear attention's right-product-first characteristic, exploiting the associativity of matrix multiplication to compute Q(K^T V) instead of (QK^T)V, which reduces both computation and memory complexity from quadratic to linear in sequence length.
  • Causal Mask Handling: LASP-2 decomposes the computation into intra-chunk and inter-chunk parts so that causal masking in autoregressive tasks is handled without giving up the linear-complexity advantage; a simple chunkwise sketch is given after this list.
  • Theoretical Analysis: A cost analysis shows that LASP-2 incurs less communication traffic than LASP-1, since only sequence-length-independent memory states are exchanged, yielding better scalability in distributed systems.
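
To make the right-product-first reformulation and the intra-/inter-chunk decomposition concrete, here is a single-device sketch under stated assumptions (function name, chunk size, and the unnormalized form are illustrative, not the paper's exact algorithm): each chunk attends to earlier chunks through a running K^T V prefix state and to itself through an explicitly masked intra-chunk product, keeping overall cost linear in sequence length.

```python
# Illustrative single-device sketch of chunkwise causal linear attention
# (unnormalized, for clarity): inter-chunk contributions flow through a running
# K^T V prefix state (right-product-first), while intra-chunk contributions use
# an explicit causal mask, so overall cost stays linear in sequence length.
import torch

def chunkwise_causal_linear_attention(q, k, v, chunk_size=64):
    """Hypothetical reference implementation; q, k, v have shape (seq_len, d)."""
    seq_len, d = q.shape
    out = torch.zeros(seq_len, v.shape[-1])
    state = torch.zeros(d, v.shape[-1])                 # running K^T V prefix state
    causal = torch.tril(torch.ones(chunk_size, chunk_size))

    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        n = end - start

        inter = qc @ state                              # attend to all previous chunks
        scores = (qc @ kc.transpose(-2, -1)) * causal[:n, :n]
        intra = scores @ vc                             # causal attention inside the chunk

        out[start:end] = inter + intra
        state = state + kc.transpose(-2, -1) @ vc       # update prefix state

    return out
```

Under sequence parallelism, this running prefix state is the kind of sequence-length-independent quantity LASP-2 exchanges with its single AllGather, which is why only these small states, rather than full-length activations, need to cross devices.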

Overall, LASP-2 offers a comprehensive solution for training transformer models on very long input sequences, making it particularly well suited to large-scale distributed systems. Its advances point to broad applicability for models built on linear sequence modeling and provide a solid framework for future research and practical deployments.
