Linear Attention Sequence Parallelism (2404.02882v3)

Published 3 Apr 2024 in cs.LG and cs.CL

Abstract: Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead compared with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP on other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8$\times$ longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP.


Summary

  • The paper introduces LASP, which exploits the right-product property of linear attention to scale sequence length up to 4096K on 128 A100 80G GPUs, up to 8x longer than existing SP methods.
  • The paper details system engineering optimizations such as kernel fusion and caching of KV states in GPU memory (HBM) to cut execution latency and memory overhead.
  • The paper demonstrates LASP's compatibility with diverse distributed data parallel methods, enabling efficient training of LLMs on large clusters.

Exploring the Frontiers of Scalability in Linear Attention-Based Models with LASP

Introduction

The development and application of linear attention mechanisms in LLMs offer a pathway around the scalability limitations imposed by traditional softmax-based attention in transformer models. Despite this advance, achieving efficient sequence parallelism (SP) in linear attention-based LLMs remains challenging, primarily because existing SP methods do not exploit the distinctive features of linear attention. This shortcoming not only limits parallelism efficiency but also restricts applicability in scenarios that demand processing very long sequences or employing large clusters of GPUs. To bridge this gap, Linear Attention Sequence Parallelism (LASP) offers an approach that improves the efficiency of SP in linear transformers by leveraging the inherent advantages of linear attention mechanisms.
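
The advantage LASP builds on is the right-product kernel trick: because linear attention drops the softmax, the output can be computed as Q(KᵀV) instead of (QKᵀ)V, replacing the n x n attention matrix with a small d x d state. The snippet below is a minimal sketch of this reordering; it omits the kernel feature map, normalization, and causal masking, and the function names are illustrative rather than taken from the paper's codebase.

```python
import torch

def softmax_attention(q, k, v):
    # Left-product-first: materializes an (n x n) score matrix, quadratic in n.
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def linear_attention(q, k, v):
    # Right-product-first: compute K^T V once as a (d x d) state, then apply Q.
    # Cost is O(n * d^2), linear in sequence length n.
    kv = k.transpose(-1, -2) @ v   # (d, d), independent of n
    return q @ kv                  # (n, d)

n, d = 2048, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2048, 64])
```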

Point-to-Point Communication Mechanism

One of the cornerstone innovations of LASP is its efficient point-to-point (P2P) communication mechanism, designed specifically to exploit the right-product kernel trick inherent to linear attention. Because only a compact intermediate state, rather than sequence-length-dependent activations, needs to be exchanged between devices, this mechanism significantly reduces the communication overhead typically encountered in SP. Key to the approach is its independence from attention-head partitioning, which makes it compatible with a variety of model architectures, including those with multi-head, multi-query, and grouped-query attention.
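
To make the communication pattern concrete, the sketch below shows one way such a ring-style P2P exchange of the per-head KV state could be expressed with torch.distributed. It is a conceptual illustration under simplifying assumptions (a single head, a d x d state, blocking send/recv), not the paper's implementation; the function name is hypothetical.

```python
import torch
import torch.distributed as dist

def ring_exchange_kv_state(kv_local, group=None):
    """Illustrative ring-style P2P exchange of the (d x d) KV state.

    Each rank holds one sequence chunk. It receives the accumulated KV
    state of all earlier chunks from its predecessor, adds its own
    chunk's contribution, and forwards the result to its successor.
    Only a d x d tensor is communicated, independent of chunk length.
    """
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    kv_prev = torch.zeros_like(kv_local)

    if rank > 0:
        dist.recv(kv_prev, src=rank - 1, group=group)  # prefix state from earlier chunks
    kv_out = kv_prev + kv_local
    if rank < world_size - 1:
        dist.send(kv_out, dst=rank + 1, group=group)   # pass the updated prefix onward
    return kv_prev  # used for this rank's inter-chunk computation
```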

System Engineering Optimization

The practical implementation of LASP includes several optimizations that enhance its performance on GPU clusters:

  • Kernel Fusion: By fusing both intra-chunk and inter-chunk computations, as well as the updates of the KV and dKV states, LASP minimizes kernel launches, thereby reducing execution latency.
  • KV State Caching: To avoid recomputing intermediate states, LASP caches KV states in the GPUs' High Bandwidth Memory (HBM), significantly cutting redundant work during the backward pass (a sketch of the chunked computation appears after this list).
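
As a rough illustration of how these pieces fit together, the sketch below combines the intra-chunk and inter-chunk terms for one local chunk and stashes the incoming KV state for reuse in the backward pass. It is a simplified, unfused PyTorch rendering under assumed shapes and names (single head, no normalization), not the fused CUDA kernels the paper describes.

```python
import torch

def lasp_chunk_forward(q, k, v, kv_prev, kv_cache):
    """Forward pass for one local sequence chunk (simplified sketch).

    q, k, v:  (chunk_len, d) projections for this device's chunk.
    kv_prev:  (d, d) accumulated KV state of all earlier chunks.
    kv_cache: list used to stash states needed by the backward pass.
    """
    chunk_len = q.shape[0]
    # Intra-chunk term: causal attention restricted to the local chunk.
    mask = torch.tril(torch.ones(chunk_len, chunk_len, device=q.device))
    intra = ((q @ k.transpose(-1, -2)) * mask) @ v
    # Inter-chunk term: contribution of all earlier chunks via the prefix state.
    inter = q @ kv_prev
    # Update the running KV state and cache the incoming state in HBM,
    # so the backward pass does not have to recompute it.
    kv_new = kv_prev + k.transpose(-1, -2) @ v
    kv_cache.append(kv_prev)
    return intra + inter, kv_new
```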

Compatibility with Distributed Data Parallel Training

LASP introduces a data-sequence hybrid parallelism scheme, making it compatible with a wide range of distributed data parallel (DDP) training methods. This compatibility is essential for distributed training on large clusters, especially when long sequences and large batches must be handled simultaneously. The design and implementation of LASP allow it to work seamlessly with conventional DDP as well as sharded implementations such as FSDP and ZeRO, without adversely affecting training efficiency or scalability.
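
One plausible way to organize such hybrid parallelism is to split the ranks into sequence-parallel groups (ranks sharing chunks of one long sequence) and data-parallel groups (ranks holding different batches, wrapped by DDP, FSDP, or ZeRO). The sketch below lays this out with torch.distributed process groups; the grouping scheme and function name are assumptions for illustration, not the paper's code.

```python
import torch.distributed as dist

def build_hybrid_groups(sp_size):
    """Split the world into sequence-parallel (SP) and data-parallel (DP) groups.

    Ranks in the same SP group share one long sequence split into chunks;
    ranks in the same DP group hold different batches and synchronize
    gradients via DDP, FSDP, or ZeRO. Purely illustrative layout.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % sp_size == 0
    dp_size = world_size // sp_size

    sp_group, dp_group = None, None
    # SP groups: consecutive ranks [0..sp_size-1], [sp_size..2*sp_size-1], ...
    for i in range(dp_size):
        ranks = list(range(i * sp_size, (i + 1) * sp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            sp_group = g
    # DP groups: ranks holding the same chunk position across SP groups.
    for j in range(sp_size):
        ranks = list(range(j, world_size, sp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return sp_group, dp_group
```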

Experimental Insights

Extensive experiments evaluate the scalability, efficiency, and applicability of LASP across varying sequence lengths and GPU cluster sizes. In these experiments, LASP scaled sequence length up to 4096K on 128 A100 80G GPUs with 1B-parameter models, supporting sequences up to 8 times longer than existing SP methods while maintaining or improving training speed.

Broad Implications

The introduction of LASP has practical and theoretical implications for the future of AI and machine learning research. Practically, it enables linear attention-based LLMs to process far longer sequences than previously feasible, opening new avenues for research and application in language modeling, biological sequence analysis, and beyond. Theoretically, LASP's approach to leveraging linear attention while optimizing communication and memory usage provides a template for designing efficient SP strategies in LLMs. Future developments may include further optimizing LASP for different hardware architectures or integrating it with emerging LLM architectures to unlock new capabilities.

In conclusion, LASP represents a significant advancement in the scalability and efficiency of linear attention-based LLMs. Its communication-efficient design and system engineering optimizations, together with its compatibility with various DDP methods, set a new benchmark for sequence parallelism in LLMs and offer a strong foundation for future advances in AI and machine learning.
