
Ring Attention for Scalable Transformers

Updated 3 March 2026
  • Ring Attention is an attention paradigm that partitions input sequences across devices to compute exact self-attention while reducing per-device memory usage.
  • It employs a blockwise ring communication strategy, overlapping key-value transfers with local computations to improve scalability and efficiency.
  • Its versatile applications in language modeling, audio processing, and haptics highlight significant gains in context length handling and compute performance.

Ring Attention is an attention paradigm designed to address the prohibitive memory and communication bottlenecks in Transformer models when processing very long sequences. It achieves exact (unapproximated) self-attention across sequences whose length scales linearly with the number of available compute devices, by overlapping a blockwise communication protocol with local attention and feedforward operations. Ring Attention has been instantiated in language modeling, reinforcement learning, audio signal processing, and haptics; it also serves as a foundation for distributed attention methods and has motivated further optimization in both workload balancing and communication complexity.

1. Motivation and High-Level Structure

Traditional Transformers require materializing or recomputing an $L \times L$ attention matrix, with $O(L^2)$ memory and compute complexity. This limits practical context sizes to at most $\sim 10^5$–$10^6$ tokens, even with blockwise or memory-efficient variants. Ring Attention overcomes this by partitioning the input sequence of length $L$ into $D$ contiguous blocks of size $B = L/D$, assigning each block to its own device. Each device $i$ holds queries $Q_i \in \mathbb{R}^{B \times d}$ and initial key/value blocks $K_i, V_i \in \mathbb{R}^{B \times d}$. The devices then iteratively swap $K, V$ blocks in a unidirectional ring: in each inner-loop step, device $i$ sends its current $(K, V)$ to device $(i+1) \bmod D$ and receives $(K, V)$ from device $(i-1) \bmod D$, computing attention between its stationary $Q_i$ and the just-received $K, V$. This process proceeds for $D$ rounds so each device sees all $K, V$ blocks, after which the output for each $Q_i$ is fully accumulated. Communication is completely overlapped with the computation and incurs no additional overhead if the communication bandwidth meets or exceeds the compute/transfer rate condition (Liu et al., 2023).
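The rotation described above can be traced with a few lines of Python. This is only an illustrative sketch: the function name `ring_schedule` is hypothetical, and devices are simulated as list indices rather than real accelerators.

```python
def ring_schedule(num_devices):
    """Return held[t][i]: index of the K/V block on device i at ring step t."""
    held = []
    current = list(range(num_devices))  # step 0: device i holds its own block i
    for _ in range(num_devices):
        held.append(current)
        # Device i sends to (i + 1) % D, so device i receives from (i - 1) % D.
        current = [current[(i - 1) % num_devices] for i in range(num_devices)]
    return held


schedule = ring_schedule(4)
for step, blocks in enumerate(schedule):
    print(f"step {step}: {blocks}")
```

After $D$ steps every column of the printed table contains each block index exactly once, which is the property that lets every $Q_i$ attend to the whole sequence.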

2. Formal Algorithmic Description

The global attention outputs for partitioned queries, keys, and values,

$$Q = \begin{bmatrix} Q_1 \\ Q_2 \\ \vdots \\ Q_D \end{bmatrix}, \quad K = \begin{bmatrix} K_1 \\ K_2 \\ \vdots \\ K_D \end{bmatrix}, \quad V = \begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_D \end{bmatrix},$$

are computed as, for device $i$,

$$\text{Attention}(Q_i, K, V) = \sum_{j=1}^{D} \mathrm{softmax}\left(\frac{Q_i K_j^\top}{\sqrt{d}}\right) V_j.$$

Here the softmax normalizer for each query row runs over the keys of all $D$ blocks jointly, not over each block separately. At ring step $t$, device $i$ interacts with $K_j, V_j$ where $j = (i - t) \bmod D$. At each step, the device computes the blockwise attention contribution,

$$A_{i,j} = \mathrm{softmax}\left(\frac{Q_i K_j^\top}{\sqrt{d}}\right) V_j,$$

and accumulates it using numerically stable, incremental softmax normalization. The per-device memory footprint is $O(Bd)$, independent of total sequence length $L$. Each device maintains three blocks (local $Q_i$; current and in-transit $K, V$ pairs), accumulators, and feedforward activations. Pseudocode and communication/computation overlap strategies are given explicitly in (Liu et al., 2023).
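A minimal single-process sketch of this accumulation is shown below, assuming one attention head, no mask, and plain Python lists in place of device tensors. The helper names (`fold_block`, `ring_attention`, `full_attention`) are hypothetical and this is not the reference implementation from Liu et al.; it only checks that folding blocks in ring order, with a running row maximum and denominator, reproduces exact attention.

```python
import math
import random


def full_attention(Q, K, V):
    """Reference: exact softmax(Q K^T / sqrt(d)) V, one query row at a time."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def fold_block(q, K_blk, V_blk, state):
    """Fold one K/V block into the running (max, denominator, numerator) of row q."""
    m, denom, numer = state
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K_blk]
    m_new = max(m, max(scores))
    scale = math.exp(m - m_new)  # rescale what was accumulated under the old max
    denom *= scale
    numer = [x * scale for x in numer]
    for s, v in zip(scores, V_blk):
        w = math.exp(s - m_new)
        denom += w
        numer = [x + w * y for x, y in zip(numer, v)]
    return m_new, denom, numer


def split(X, D):
    """Cut a list of rows into D contiguous blocks of equal size."""
    B = len(X) // D
    return [X[i * B:(i + 1) * B] for i in range(D)]


def ring_attention(Q, K, V, D):
    """Simulate D devices; device i keeps Q_i and sees block (i - t) % D at step t."""
    Qb, Kb, Vb = split(Q, D), split(K, D), split(V, D)
    out = []
    for i in range(D):
        states = [(-math.inf, 0.0, [0.0] * len(V[0])) for _ in Qb[i]]
        for t in range(D):
            j = (i - t) % D
            states = [fold_block(q, Kb[j], Vb[j], s) for q, s in zip(Qb[i], states)]
        out.extend([x / denom for x in numer] for _, denom, numer in states)
    return out


random.seed(0)
Q, K, V = ([[random.gauss(0, 1) for _ in range(4)] for _ in range(8)] for _ in range(3))
ref = full_attention(Q, K, V)
got = ring_attention(Q, K, V, D=4)
max_err = max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g))
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

Only the running maximum, denominator, and numerator persist between ring steps, so the state per query row stays constant in size however many blocks pass through.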

3. Complexity Analysis

A comparison of computational and memory complexities:

| Method | Per-Device Memory | Per-Device Compute | Communication per Layer |
|---|---|---|---|
| Full-matrix attention | $O(L^2 d)$ | $O(L^2 d)$ | - |
| Blockwise (local) | $O(Ld)$ | $O(L^2 d)$ | - |
| Ring Attention | $O(Bd)$ | $O(L^2 d / D)$ | $O(Ld)$ (fully overlapped) |

Ring Attention achieves exact attention with total compute unchanged, but distributes computation and memory evenly: each device handles $O(L^2 d / D)$ compute and $O(Bd)$ memory, with communication volume $O(Ld)$ per device per layer. Provided execution is overlapped and network bandwidth is sufficient, per-device wall-clock compute time is reduced by a factor of $D$ (Liu et al., 2023). In contrast, naive distributed row-wise algorithms incur higher communication-to-compute ratios that impede scaling at high device count (Chen et al., 24 Dec 2025).
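To make the asymptotic entries concrete, the following sketch plugs numbers into them. The constants are assumptions rather than figures from the cited papers: attention is counted as $4L^2 d$ FLOPs (two matrix products at 2 FLOPs per multiply-add), a single head of width $d$ is assumed, and each device is assumed to forward a $K$ and a $V$ block on $D - 1$ hops. The function name `ring_costs` is hypothetical.

```python
def ring_costs(L, D, d):
    """Back-of-envelope per-device costs for one attention layer (single head)."""
    B = L // D
    return {
        "block_tokens": B,
        # One B x d block; per-device memory is a small multiple of this.
        "block_elems": B * d,
        # QK^T and PV each cost about 2 * L * L * d FLOPs in total, split over D devices.
        "attn_flops_per_device": 4 * L * L * d // D,
        # K and V blocks forwarded on D - 1 hops: 2 * (L - B) * d elements, i.e. O(L d).
        "elems_sent_per_device": 2 * B * d * (D - 1),
    }


costs_64 = ring_costs(L=2**20, D=64, d=128)
costs_128 = ring_costs(L=2**20, D=128, d=128)
for name in costs_64:
    print(f"{name:>24}: D=64 -> {costs_64[name]:>16,}   D=128 -> {costs_128[name]:>16,}")
```

Doubling $D$ halves the block size and the per-device FLOPs, while the number of elements each device sends stays close to $2Ld$.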

4. Applications and Empirical Performance

Language Modeling and RL: On clusters of up to 1024 TPUv4 chips, Ring Attention enables context windows up to 4 million tokens for LLMs (e.g., LLaMA-13B) and improves in-context reinforcement learning performance by facilitating aggregation over hundreds of trajectories. Experiments show model FLOPs utilization remains within a few percent of memory-efficient blockwise transformers even as context scales by 1–2 orders of magnitude. In direct comparison, Ring Attention enables up to 256× longer context than conventional blockwise methods with negligible engineering complexity overhead (Liu et al., 2023).

Audio and Neural Vocoding: Ring Attention has been incorporated into neural vocoders, such as RingFormer, where it fuses with convolution-augmented transformer blocks (Conformer). Here, ring attention allows the model to aggregate both local and global sequence information across tens or hundreds of thousands of audio frames, supporting both full-sequence ("global") and partial ("radius-$r$") context aggregation. This approach outperforms or matches baseline models such as HiFi-GAN and BigVGAN on metrics like MCD, STOI, and perceptual MOS, and enables multi-hundredfold real-time inference speeds (Hong et al., 2 Jan 2025).

Human Haptics: In a distinct domain, ring-shaped vortex air pulses have been used as non-contact haptic attention cues for accessibility, notably in SHITARA, where air vortices are projected and precisely delivered to a user's head to signal social cues for deaf/hard-of-hearing users. Detection and comfort experiments demonstrate high accuracy at up to 2.5 meters, with specific recommendations for device configuration, waveform shaping, spatial targeting, and safety (Kojima et al., 2023).

5. Extensions: Striped Attention and Mesh-Attention

Striped Attention: In causal transformer models, standard Ring Attention leads to workload imbalance because of the lower-triangular mask: some devices perform full computations, others (near the block-diagonal) only partial or zero work per iteration. Striped Attention resolves this by permuting the sequence such that each device's block samples tokens evenly throughout the sequence. This approach ensures each device processes approximately half the block in every round, halving per-round latency and yielding end-to-end throughput improvements up to 1.65× at very long sequence lengths, with negligible additional overhead (Brandon et al., 2023).
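The imbalance and its fix can be seen by counting unmasked query/key pairs per device at each ring step. The sketch below assumes the striped layout gives device $i$ the tokens $i, i + D, i + 2D, \dots$ (one natural reading of "samples tokens evenly"; the exact permutation in Brandon et al. may differ in detail) and uses the count of unmasked pairs as a stand-in for work. All function names are hypothetical.

```python
def token_layout(L, D, striped):
    """Token positions owned by each of D devices (contiguous blocks or stride-D stripes)."""
    if striped:
        return [list(range(i, L, D)) for i in range(D)]
    B = L // D
    return [list(range(i * B, (i + 1) * B)) for i in range(D)]


def unmasked_pairs(q_pos, k_pos):
    """Number of (query, key) pairs a causal mask keeps: key position <= query position."""
    return sum(1 for q in q_pos for k in k_pos if k <= q)


def slowest_device_work(L, D, striped):
    """Per ring step, the largest unmasked-pair count over devices (the step's critical path)."""
    blocks = token_layout(L, D, striped)
    return [max(unmasked_pairs(blocks[i], blocks[(i - t) % D]) for i in range(D))
            for t in range(D)]


contiguous = slowest_device_work(16, 4, striped=False)
striped = slowest_device_work(16, 4, striped=True)
print("contiguous:", contiguous)  # [10, 16, 16, 16]
print("striped:   ", striped)     # [10, 10, 10, 10]
```

With contiguous blocks, some device always faces a fully unmasked $B \times B$ tile after the first step, so every step is as slow as an unmasked one. With stripes, every tile is roughly triangular ($B(B+1)/2$ or $B(B-1)/2$ pairs), which is where the near-halving of per-round work comes from.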

Mesh-Attention: Ring Attention is now recognized as a limiting case ("row-tile") within a broader scheduling family. Mesh-Attention generalizes block assignments to two-dimensional tiles in the computation assignment matrix, reducing the per-device communication volume by a factor of up to $O(\sqrt{D})$ for $D$ devices compared to Ring Attention and maintaining higher computational locality. This method achieves up to 3.4× speedup and 85.4% lower communication volume experimentally at scale, although the basic ring protocol remains optimal for maximizing Q-block locality (Chen et al., 24 Dec 2025).

6. Limitations and Practical Considerations

Ring Attention is exact, but its wall-clock efficiency requires overlapping all block communication with computation; this is feasible only if the computation-to-communication ratio matches device/network bandwidth. As the number of devices $D$ increases for fixed sequence length $L$, the block size $B = L/D$ shrinks, so communication may become dominant and limit scalability; practical strong-scaling is often bounded by this effect. In causal attention, naive Ring Attention incurs substantial inefficiency under triangular masks, motivating striped or block-interleaved algorithms. At very high device counts, two-dimensional tiling (Mesh-Attention) or further optimizations are required for continued scalability.
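A rough way to see where overlap stops being possible is to compare, per ring step, the time to compute one $B \times B$ attention tile with the time to transfer one $K, V$ block pair. The model below is a simplifying assumption, not the exact condition stated by Liu et al.: it counts $4B^2 d$ FLOPs per step, $2Bd$ transferred elements, ignores feedforward compute and latency, and uses hypothetical hardware numbers. The function names are likewise hypothetical.

```python
def min_block_size(flops_per_s, bandwidth_bytes_per_s, bytes_per_elem=2):
    """Smallest block size B for which per-step compute time covers transfer time.

    Compute time  ~ 4 * B * B * d / flops_per_s
    Transfer time ~ 2 * B * d * bytes_per_elem / bandwidth_bytes_per_s
    Setting compute >= transfer gives B >= flops_per_s * bytes_per_elem / (2 * bandwidth).
    """
    return flops_per_s * bytes_per_elem / (2 * bandwidth_bytes_per_s)


def max_devices_with_overlap(L, flops_per_s, bandwidth_bytes_per_s, bytes_per_elem=2):
    """Largest D such that B = L / D still satisfies the overlap condition above."""
    return int(L // min_block_size(flops_per_s, bandwidth_bytes_per_s, bytes_per_elem))


# Hypothetical accelerator: 100 TFLOP/s compute, 100 GB/s interconnect, 2-byte elements.
FLOPS = 100 * 10**12
BANDWIDTH = 100 * 10**9
b_min = min_block_size(FLOPS, BANDWIDTH)
d_max = max_devices_with_overlap(1_000_000, FLOPS, BANDWIDTH)
print(f"minimum block size: {b_min:.0f} tokens")
print(f"max devices for L = 1,000,000 with full overlap: {d_max}")
```

Under these assumptions the threshold depends only on the ratio of compute throughput to bandwidth, not on $d$, which is why adding devices at fixed $L$ eventually pushes $B$ below the point where communication can be hidden.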

Ring Attention also imposes design choices regarding block size, device-to-block mapping, and incremental normalization. Implementation must ensure numerical stability (max-shifting, normalization), precise asynchronous communication, and minimal per-device memory overhead. Striped and mesh variants require additional up-front permutations or scheduling but retain the protocol's core logic.
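The max-shifting mentioned above can be illustrated with a small merge rule for two partial softmax summaries. This is a sketch for a single query row with scalar values; `partial_state` and `merge` are hypothetical names, and the large scores are chosen only to trigger overflow in the naive form.

```python
import math


def partial_state(scores, values):
    """Max-shifted summary (max, denominator, numerator) of one block for one query row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))


def merge(a, b):
    """Combine two block summaries; shifting by the shared max keeps every exponent <= 0."""
    m = max(a[0], b[0])
    scale_a, scale_b = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * scale_a + b[1] * scale_b, a[2] * scale_a + b[2] * scale_b


# Exponentiating a large raw score directly fails in double precision.
try:
    math.exp(1000.0)
    naive_ok = True
except OverflowError:
    naive_ok = False

block_a = partial_state([1000.0, 999.0], [1.0, 2.0])
block_b = partial_state([998.0], [3.0])
m, denom, numer = merge(block_a, block_b)
out = numer / denom
print(f"naive exp succeeded: {naive_ok}; merged attention output: {out:.6f}")
```

Because `merge` gives the same result in either argument order (up to rounding), blocks can arrive in whatever order the ring delivers them.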

7. Broader Impact and Generalizability

Ring Attention provides a template for exact attention computation on multi-device clusters for any large-sequence domain, from text to audio to biological data. It can be combined with memory compression, layer-wise reversal, or windowed/sparse attention for further trade-offs among context size, latency, and resource footprint. Its adoption has prompted the development of even more communication- and compute-efficient assignment schemes. In haptic interface design, “ring attention” via air vortex rings provides robust, contactless spatial alerting for accessibility, demonstrating that the principle transcends digital sequence computation and highlights the shared logic of “ring-mediated attention” across sensory and computational modalities (Liu et al., 2023, Hong et al., 2 Jan 2025, Chen et al., 24 Dec 2025, Brandon et al., 2023, Kojima et al., 2023).
