Linear Attention Sequence Parallelism (2404.02882v3)
Abstract: Sequence parallelism (SP) serves as a prevalent strategy for handling long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods such as linear attention, existing SP approaches do not take advantage of their right-product-first property, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism that leverages the right-product kernel trick of linear attention, which sharply decreases communication overhead compared with existing SP methods. We further enhance the computation efficiency of LASP through kernel fusion and intermediate state caching, making the LASP implementation hardware-friendly on GPUs. Moreover, we carefully ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP to other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with sequence lengths ranging from 2K to 4096K. LASP scales the sequence length up to 4096K on 128 GPUs, 8$\times$ longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP.
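To make the right-product kernel trick concrete, the sketch below shows why chunk-wise linear attention only needs to hand a small $d \times d$ key-value state from one sequence chunk to the next: computing $Q(K^{\top}V)$ instead of $(QK^{\top})V$ means a ring-style SP scheme can forward this compact state between devices rather than exchanging full-length key/value activations. This is a minimal PyTorch illustration under simplifying assumptions (identity feature map, no normalization, a hypothetical `linear_attention_chunk` helper), not the LASP implementation itself.

```python
import torch

def linear_attention_chunk(q, k, v, kv_state):
    """Process one sequence chunk of causal linear attention.

    q, k, v:   [chunk_len, d] activations held by the local device.
    kv_state:  [d, d] running sum of k_i v_i^T over all earlier chunks;
               this is the small state a ring-style SP scheme would pass
               device-to-device instead of full K/V tensors.
    Returns the chunk output and the updated state for the next chunk.
    """
    # Intra-chunk term: masked left product, O(chunk_len^2 * d).
    scores = q @ k.t()                          # [chunk_len, chunk_len]
    causal_mask = torch.tril(torch.ones_like(scores))
    intra = (scores * causal_mask) @ v          # [chunk_len, d]

    # Inter-chunk term: right product against the accumulated state, O(chunk_len * d^2).
    inter = q @ kv_state                        # [chunk_len, d]

    # Update the state with this chunk's keys/values before forwarding it.
    new_state = kv_state + k.t() @ v            # [d, d]
    return intra + inter, new_state
```

Because the state carried around the ring is only $d \times d$, the per-device communication volume is independent of the sequence length, which is the property behind the communication-efficiency claim in the abstract.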