BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
Abstract: Effective attention modules have played a crucial role in the success of Transformer-based LLMs, but the quadratic time and memory complexity of these attention modules also poses a challenge when processing long sequences. One potential solution to the long-sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overhead to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40% and achieving a 1.37× speedup when training on 128K-token sequences across 32 × A100 GPUs.
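The central operation implied by "aggregating local results into global ones" is merging per-device partial attention outputs with online-softmax rescaling, so that no device ever materializes the full score matrix. The following is a minimal single-process sketch under that assumption, not the paper's actual implementation: it simulates K/V shards that would live on different devices, and the function names `local_attention` and `merge_partials` are illustrative.

```python
import torch

def local_attention(q, k, v):
    """Attention over one local K/V shard. Returns the unnormalized output
    plus the row-wise softmax statistics (max, sum of exps) needed to merge
    shards exactly later. Shapes: q (n, d), k/v (m, d)."""
    scores = q @ k.T / q.shape[-1] ** 0.5              # (n, m)
    row_max = scores.max(dim=-1, keepdim=True).values  # (n, 1)
    exp_scores = torch.exp(scores - row_max)           # stabilized exponentials
    row_sum = exp_scores.sum(dim=-1, keepdim=True)     # (n, 1)
    out = exp_scores @ v                               # unnormalized local output
    return out, row_max, row_sum

def merge_partials(partials):
    """Merge per-shard (out, max, sum) triples into the exact global attention
    output via online-softmax rescaling."""
    out, g_max, g_sum = partials[0]
    for o, m, s in partials[1:]:
        new_max = torch.maximum(g_max, m)
        out = out * torch.exp(g_max - new_max) + o * torch.exp(m - new_max)
        g_sum = g_sum * torch.exp(g_max - new_max) + s * torch.exp(m - new_max)
        g_max = new_max
    return out / g_sum

# Toy check against single-device attention.
torch.manual_seed(0)
q, k, v = torch.randn(4, 8), torch.randn(16, 8), torch.randn(16, 8)
shards = [(k[i:i + 4], v[i:i + 4]) for i in range(0, 16, 4)]  # simulate 4 devices
merged = merge_partials([local_attention(q, ks, vs) for ks, vs in shards])
reference = torch.softmax(q @ k.T / 8 ** 0.5, dim=-1) @ v
assert torch.allclose(merged, reference, atol=1e-5)
```

In a real multi-device setting, each shard's triple would be produced locally and only these small per-row statistics and partial outputs would be communicated, which is where the memory and communication savings described in the abstract come from.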