Ring Attention with Blockwise Transformers for Near-Infinite Context (2310.01889v4)

Published 3 Oct 2023 in cs.CL

Abstract: Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.

Ring Attention with Blockwise Transformers for Near-Infinite Context

This paper presents a novel technique, Ring Attention with Blockwise Transformers (Ring Attention), that addresses the memory challenges Transformers face when dealing with long sequences. Its significance lies in extending model context length dramatically, toward near-infinite context, by distributing computation across multiple devices.

Approach and Methodology

The proposed Ring Attention technique distributes the sequence dimension across multiple devices and applies blockwise computation of self-attention and feedforward layers on each device. This method parallelizes efficiently without resorting to approximations. Key to the approach is a ring topology: devices pass key-value blocks around the ring in a rotating fashion while concurrently computing blockwise attention on the blocks they currently hold. By overlapping this communication with computation, Ring Attention achieves substantial memory savings and allows the maximum sequence length to scale linearly with the number of devices, enabling context sizes in the millions of tokens that were previously unmanageable with standard Transformers.
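The following is a minimal single-process simulation of this scheme (an illustrative sketch in NumPy, not the authors' JAX implementation): each simulated "device" owns one query block, key-value blocks advance one hop around the ring per step, and a numerically stable online softmax accumulates partial results so that the final output matches exact full attention.

```python
# Sketch: single-host simulation of Ring Attention's blockwise computation.
# Each "device" holds one query block; KV blocks rotate around the ring.
import numpy as np

def ring_attention_sim(q, k, v, num_devices):
    """q, k, v: [seq_len, d]. Splits the sequence into num_devices blocks."""
    seq_len, d = q.shape
    assert seq_len % num_devices == 0
    block = seq_len // num_devices
    q_blocks = q.reshape(num_devices, block, d)
    k_blocks = k.reshape(num_devices, block, d)
    v_blocks = v.reshape(num_devices, block, d)

    # Per-device running statistics for the online (streaming) softmax.
    out = np.zeros((num_devices, block, d))
    row_max = np.full((num_devices, block), -np.inf)
    denom = np.zeros((num_devices, block))

    for step in range(num_devices):
        for dev in range(num_devices):
            # After `step` hops, device `dev` holds the KV block that
            # originated on device (dev - step) % num_devices.
            src = (dev - step) % num_devices
            scores = q_blocks[dev] @ k_blocks[src].T / np.sqrt(d)

            new_max = np.maximum(row_max[dev], scores.max(axis=-1))
            correction = np.exp(row_max[dev] - new_max)   # rescale old stats
            p = np.exp(scores - new_max[:, None])

            out[dev] = out[dev] * correction[:, None] + p @ v_blocks[src]
            denom[dev] = denom[dev] * correction + p.sum(axis=-1)
            row_max[dev] = new_max

    return (out / denom[..., None]).reshape(seq_len, d)

# Sanity check against exact attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
scores = q @ k.T / np.sqrt(16)
ref = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (ref / ref.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(ring_attention_sim(q, k, v, num_devices=8), ref, atol=1e-6)
```

In a real deployment, the inner loop over devices runs in parallel (one iteration per device), and the block reassignment modeled here by index arithmetic is an actual point-to-point transfer around the ring that overlaps with the attention computation on the resident block.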

Experimental Results

The paper provides extensive experimental validation of the technique's effectiveness. For example, using TPUv4-1024, Ring Attention supports context sizes exceeding 16 million tokens, a 512-fold increase over prior memory-efficient Transformers. These results were consistent across hardware setups, including various configurations of A100 GPUs and TPUs, and across model sizes of 3B, 7B, 13B, and 30B parameters.
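As a rough sanity check on these figures (an informal back-of-envelope reading, not a calculation reported in the paper): because the sequence dimension is split evenly around the ring, the achievable context grows roughly as

    max context ≈ (number of devices) × (per-device block length),

so a 512-fold gain at 16 million tokens implies a memory-efficient baseline of about 16M / 512 ≈ 32K tokens on the same hardware, consistent with the linear scaling described above.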

Practical Implications

The practical applications of Ring Attention are numerous and significant. By eliminating the context-memory bottleneck, models can process long videos, large code repositories, or detailed scientific data without truncating the sequence. This broadens the scope for Transformers in areas like video-audio-language modeling, complex trial-and-error reinforcement learning, and scientific computation.

Theoretical Implications

Theoretically, this approach challenges the assumption that a Transformer's context length is bounded by the memory of a single device: per-device requirements depend on the local block size rather than the full sequence, so context can grow with the size of the cluster. It also highlights how overlapping communication with computation can redefine efficiency in distributed deep learning systems. Ring Attention exemplifies how memory-efficient designs can pave the way for more scalable AI systems.
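One way to see why the overlap is feasible (a sketch with generic symbols, not the paper's exact derivation): for a key-value block of c tokens with hidden size d, each ring hop transfers on the order of c·d values but performs on the order of c²·d attention FLOPs, so

    t_compute ∝ c²·d / (device FLOP rate),    t_comm ∝ c·d / (interconnect bandwidth),

and choosing the block size c large enough relative to the device's compute-to-bandwidth ratio lets the transfer of the next key-value block complete while the current one is being processed, hiding communication behind computation.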

Future Directions

Future research directions could explore the integration of Ring Attention with various forms of parallelism, such as data or tensor parallelism, for even larger models. Furthermore, examining its applicability to more diverse tasks, beyond natural language processing and reinforcement learning, could extend the utility of Transformers across other domains. There is also potential for optimizing the network bandwidth utilization, further enhancing scaling efficiency.

Conclusion

In sum, the Ring Attention approach offers a compelling solution to the memory constraints that limit the scalability of Transformers. Its ability to support training and inference over far longer context sequences, without approximation or additional overhead, marks a substantial step forward. As the challenges in AI continue to grow in complexity and scale, innovations like Ring Attention will be crucial in advancing the capabilities of AI models.

Authors (3)
  1. Hao Liu (497 papers)
  2. Matei Zaharia (101 papers)
  3. Pieter Abbeel (372 papers)
Citations (128)