Infini-Attention: Infinite Context in Transformers
- Infini-attention spans two threads: rigorous infinite-width limit analyses of dot-product attention, and compressive-memory architectures that extend Transformers to unbounded-length inputs.
- It combines local sliding-window attention with global compressive memory and adaptive gating to maintain constant memory and compute regardless of sequence length.
- Empirical studies demonstrate state-of-the-art performance on tasks with up to 1M-token contexts while highlighting trade-offs in compression and retrieval fidelity.
Infini-attention refers to a family of neural attention mechanisms and theoretical frameworks that support unbounded-length input contexts, infinite-width neural scaling, or both. The term encompasses (1) rigorous infinite-width analyses of dot-product attention in deep networks, revealing hierarchical Gaussian structure and non-Gaussian marginal distributions, and (2) practical architectural augmentations allowing Transformer models to process arbitrarily long input sequences at bounded memory and compute. Infini-attention is realized via compressive or continuous-space memory, adaptive gating, and hybrid local/global retrieval. This entry provides a comprehensive overview of the key technical foundations, formal derivations, algorithmic forms, empirical validations, and present limitations of infini-attention.
1. Infinite-Width Dot-Product Attention: Theory and Limit Laws
In the infinite-width regime, classical neural network Gaussian process (NNGP) and neural tangent kernel (NTK) theories generally assume Gaussian pre-activations in deep architectures. However, dot-product attention layers exhibit fundamentally different asymptotic behavior under the practical scaling (i.e., the standard $1/\sqrt{n}$ score scaling, where $n$ is the hidden width) and a finite number of heads. Specifically, for an attention layer with input $X$, $H$ heads, and Gaussian-initialized weight matrices, Tensor Programs (TP) analysis demonstrates the following:
- The set of all pre-softmax similarity scores $S$ across heads forms a jointly Gaussian vector as $n \to \infty$, with covariance dictated by the input covariance structure and head independence.
- Conditioned on any realization of the similarity scores $S$, each attention head output is a deterministic (softmax-weighted) linear combination of Gaussian-propagated values, thus yielding a Gaussian conditional distribution per head.
- Marginalizing over the randomness of $S$ (which itself enters nonlinearly through the softmax) results in a mixture of Gaussians, i.e., a hierarchical Gaussian law for the layer outputs. Such mixtures are in general non-Gaussian, with potential for heavy tails and deviations from elliptical symmetry.
- The infinite-head or $1/n$ scaling regime recovers classical Gaussianity by degenerating or averaging away the randomness of $S$.
Formally, the multi-head attention limit is described as follows. Let $S = (S^{(1)}, \ldots, S^{(H)})$ be sampled from the joint Gaussian specified by the input covariances. For each head $h$, the conditional output is

$$Y^{(h)} \mid S^{(h)} = \mathrm{softmax}\big(S^{(h)}\big)\, V^{(h)},$$

where $V^{(h)}$ is also Gaussian, with covariances reflecting the propagated input structure. Summing over heads and marginalizing over $S$ yields the full output law. Extensive numerical validation confirms that this characterization accurately captures empirical distributions even for moderate width $n$, including the heavy-tail behavior and deviations from classical GP predictions (Sakai et al., 1 Jun 2025).
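The hierarchical (mixture-of-Gaussians) character of the limit can be illustrated numerically. The following is a minimal Monte Carlo sketch, not the Tensor Programs derivation itself: it assumes a single head, a single query position, and i.i.d. standard-Gaussian scores and values (these shapes and the scalar value channel are simplifying assumptions), and shows that the marginal of the softmax-weighted output is heavier-tailed than a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplified single-query, single-head setting: T key positions, scalar values.
T = 8                  # number of key positions
n_samples = 200_000    # Monte Carlo draws from the limiting prior

# Draw pre-softmax similarity scores S (jointly Gaussian in the limit).
S = rng.standard_normal((n_samples, T))

# Conditioned on S, the output is a softmax-weighted combination of Gaussian
# values V; marginally over S this is a Gaussian (scale) mixture.
W = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # softmax weights
V = rng.standard_normal((n_samples, T))
Y = (W * V).sum(axis=1)                                # head-output samples

# A pure Gaussian has excess kurtosis ~0; the mixture shows heavier tails.
Yc = Y - Y.mean()
excess_kurtosis = (Yc**4).mean() / (Yc**2).mean()**2 - 3.0
print(f"variance:        {Y.var():.4f}")
print(f"excess kurtosis: {excess_kurtosis:.3f}  (Gaussian reference: 0)")
```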
2. Compressive Memory Architectures for Unbounded Context
Architecturally, modern infini-attention mechanisms introduce a two-way hybridization of local (sliding-window) and memory-based (compressive/global) attention:
- Local Attention: Each Transformer block attends only over a finite window (typically a segment of $N$ recent tokens) via softmax dot-product attention over the segment's cached key-value tensors.
- Compressive Global Memory: Past segments' states are recursively summarized into a low-rank or fixed-size memory $M_s$, using associative updates of the form $M_s = M_{s-1} + \sigma(K_s)^{\top} V_s$, with $\sigma$ typically a positive nonlinearity such as ELU+1 or ReLU. A corresponding normalization accumulator $z_s = z_{s-1} + \sum_t \sigma(K_s)_t$ tracks the scaling for later retrieval.
- Linear Attention Retrieval: For any query $Q$ in the current segment, retrieval is performed as $A_{\mathrm{mem}} = \sigma(Q)\, M_{s-1} / \big(\sigma(Q)\, z_{s-1}\big)$. This enables constant-size, streaming access to all previous context.
- Head-varying Gating: A learned scalar or vector $\beta$ (or a small MLP in some variants) balances or fuses the local and global sources per head: $A = \mathrm{sigmoid}(\beta) \odot A_{\mathrm{mem}} + \big(1 - \mathrm{sigmoid}(\beta)\big) \odot A_{\mathrm{local}}$; a minimal sketch of this hybrid pattern follows the list.
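The sketch below is an illustrative NumPy re-implementation of this local/global hybrid under assumed shapes, with the ELU+1 feature map and a constant scalar gate standing in for learned components; it mirrors the update, retrieval, and gating rules above rather than reproducing the code of any cited system.

```python
import numpy as np

def elu1(x):
    """ELU + 1 feature map, a common choice for the positive nonlinearity sigma."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class InfiniAttentionHead:
    """One attention head combining sliding-window local attention with a
    fixed-size compressive memory (M, z); shapes are illustrative."""

    def __init__(self, d_key, d_value, beta=0.0):
        self.M = np.zeros((d_key, d_value))  # compressive memory, fixed size
        self.z = np.zeros(d_key)             # normalization accumulator
        self.beta = beta                     # gate parameter (learned in practice)

    def __call__(self, Q, K, V):
        # Local causal attention over the current segment.
        d_key = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_key)
        mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # causal mask
        P = np.exp(scores + mask)
        A_local = (P / P.sum(axis=-1, keepdims=True)) @ V

        # Linear-attention retrieval from the compressive memory (pre-update).
        sQ = elu1(Q)
        A_mem = (sQ @ self.M) / (sQ @ self.z + 1e-8)[:, None]

        # Associative memory update with the current segment's keys/values.
        sK = elu1(K)
        self.M = self.M + sK.T @ V
        self.z = self.z + sK.sum(axis=0)

        # Gated fusion of global (memory) and local outputs.
        g = 1.0 / (1.0 + np.exp(-self.beta))   # sigmoid gate
        return g * A_mem + (1.0 - g) * A_local

# Streaming usage: state size stays constant no matter how many segments pass.
head = InfiniAttentionHead(d_key=64, d_value=64)
rng = np.random.default_rng(0)
for _ in range(4):                              # four segments of 128 tokens each
    Q = rng.standard_normal((128, 64))
    K = rng.standard_normal((128, 64))
    V = rng.standard_normal((128, 64))
    out = head(Q, K, V)
print(out.shape, head.M.shape, head.z.shape)    # (128, 64) (64, 64) (64,)
```

Because retrieval reads $M_{s-1}$ and $z_{s-1}$ before the update, the memory a query sees always summarizes strictly earlier segments.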
This paradigm appears with slight variations in Infini-Transformer (Munkhdalai et al., 2024), EdgeInfinite (Chen et al., 28 Mar 2025), InfiniteVL (Tao et al., 9 Dec 2025), and recent empirical studies in small model pretraining (Huang et al., 29 Dec 2025). Some systems further supplement with query-focused or compressive “selector” memories to prioritize relevant history (e.g., query-conditioned mixing in IDEAL (Cao et al., 2024)).
3. Algorithmic Form and Complexity
Infini-attention mechanisms achieve a strict $O(1)$ memory footprint and streaming, per-segment compute that is independent of total context length (total time linear in the number of segments), compared to the $O(L^2)$ attention cost of classical attention over a length-$L$ context. At each step, only the following objects grow with model size, not sequence length:
- Local segment KV cache: of size $N \times d$ per segment, held only for the most recent tokens.
- Compressed memory matrix: of size $d_{\mathrm{key}} \times d_{\mathrm{value}}$ per head, often parameter-shared or small.
- Normalization vector: of dimension $d_{\mathrm{key}}$ per head.
Per-segment computation involves local masked dot-product attention plus linear retrieval. Global memory update is performed as a single matrix-multiplication per segment. Gating or MLP-based fusion (when present) constitutes negligible additional computational cost.
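As a concrete, back-of-the-envelope illustration of this constant-state property, the snippet below compares the per-head state of such a mechanism with a full KV cache at a 1M-token context; the dimensions and fp16 storage are illustrative assumptions, not values reported by the cited systems.

```python
# State-size comparison for one attention head (all dimensions are
# illustrative assumptions; fp16 = 2 bytes per element).
d_key, d_value = 128, 128   # per-head key/value dimensions
N = 2048                    # local segment / sliding-window length
L = 1_000_000               # total context length in tokens
bytes_per = 2               # fp16

# Infini-attention-style state: local KV cache + compressive memory + normalizer.
local_kv = N * (d_key + d_value) * bytes_per
memory_M = d_key * d_value * bytes_per
norm_z   = d_key * bytes_per
infini_state = local_kv + memory_M + norm_z

# Classical attention: KV cache grows linearly with the full context.
full_kv = L * (d_key + d_value) * bytes_per

print(f"infini-style state : {infini_state / 2**20:.2f} MiB (independent of L)")
print(f"full KV cache      : {full_kv / 2**20:.2f} MiB at L = {L:,} tokens")
```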
Variants exist that further reduce the per-segment complexity using pure linear retrieval (Gated DeltaNet (Tao et al., 9 Dec 2025)), or that combine with sparse top-$k$ cache selection for extreme efficiency at million-token scales, as in ReAttention (Liu et al., 2024).
4. Empirical Benchmarking and Practical Impact
Infini-attention models routinely set new state-of-the-art results on long-context benchmarks, including:
- 1M-token passkey retrieval, sustaining 100% accuracy across all probe positions at context lengths up to and beyond 1M tokens, outperforming both full-KV cache and conventional compression schemes (Munkhdalai et al., 2024, Huang et al., 29 Dec 2025).
- 500K-token book summarization with improved or best-in-class ROUGE scores, with higher overall fidelity as visible context increases (Munkhdalai et al., 2024).
- Low-resource small language models (SLMs, ~300M parameters) equipped with infini-attention achieving up to a 31-percentage-point absolute accuracy gain over standard architectures on 16K-token retrieval (Huang et al., 29 Dec 2025).
- In hardware-limited scenarios, memory-augmented attention yields memory growth roughly two orders of magnitude lower, along with superlinear latency reductions, enabling efficient unbounded-context serving on edge devices (Chen et al., 28 Mar 2025).
Queries on factual QA, multi-hop retrieval, and streaming summarization exhibit the most pronounced gains, with the compressive memory facilitating recall of early input fragments long after sliding windows have advanced. Query-focused memory modules additionally bias selection to task-relevant spans, as demonstrated in IDEAL's ablations (Cao et al., 2024).
5. Specializations, Variants, and Theoretical Extensions
Several advanced forms of infini-attention and related infinite-context methods exist:
- NTK-based Infinite Prefix Attention: By taking the prefix length to infinity, it is possible to characterize the attention and learning dynamics in terms of neural tangent kernels, yielding provably polynomially small approximation error and parameter-efficient fine-tuning that introduces only a small number of new parameters per head (Liang et al., 2024).
- Continuous-space/RBF Memory: The $\infty$-former replaces the discrete memory with a continuous representation obtained by ridge-regression fitting of the sequence onto a basis of radial basis functions. Attention is then computed by integrating over a learned Gaussian density, giving fixed complexity independent of sequence length and allowing a “sticky memories” allocation mechanism (Martins et al., 2021).
- Hierarchical Non-Gaussianity: Infinite-width limit analysis reveals that, under standard scaling and finite heads, the output distribution is a non-Gaussian, hierarchical mixture, fundamentally altering the infinite-width prior and its implications for initializations and downstream kernel learning (Sakai et al., 1 Jun 2025).
- Top-k Sparse and Retrieval-Augmented Methods: ReAttention and InfiniRetri exploit attention-based scoring for relevance-driven cache selection, achieving training-free, plug-and-play extension to essentially infinite input with a fixed attention scope (Liu et al., 2024, Ye et al., 18 Feb 2025); a minimal sketch of this selection pattern appears after the list.
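For the last family, the sketch below shows attention-score-based top-$k$ key/value selection in its simplest form; it is an illustrative stand-in for the relevance-driven cache-selection idea (the scoring rule, cache layout, and value of $k$ are assumptions), not the actual ReAttention or InfiniRetri implementation.

```python
import numpy as np

def topk_kv_select(q, K_cache, V_cache, k):
    """Keep only the k cached key/value pairs with the highest attention
    scores for the current query, so the attended scope stays fixed even
    as the underlying cache grows (illustrative scoring rule)."""
    scores = K_cache @ q / np.sqrt(q.shape[-1])   # relevance of each cached key
    idx = np.argpartition(scores, -k)[-k:]        # indices of the top-k scores
    idx = idx[np.argsort(-scores[idx])]           # order the selection by score
    return K_cache[idx], V_cache[idx]

# Usage: a cache that has grown to 100k entries is reduced to a fixed scope of 512.
rng = np.random.default_rng(0)
d = 64
K_cache = rng.standard_normal((100_000, d))
V_cache = rng.standard_normal((100_000, d))
q = rng.standard_normal(d)
K_sel, V_sel = topk_kv_select(q, K_cache, V_cache, k=512)
print(K_sel.shape, V_sel.shape)   # (512, 64) (512, 64)
```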
6. Limitations, Open Problems, and Future Directions
Despite the significant efficiency and accuracy improvements, infini-attention mechanisms are bounded by several limitations:
- Information Loss via Compression: Compressing large amounts of prior context into a low-rank or fixed-size memory degrades recovery of fine-grained or multi-span dependencies. Repeated compression over extremely long sequences causes retrieval accuracy to degrade, especially in low-capacity models and when the input distribution is biased towards short documents (Huang et al., 29 Dec 2025).
- Expressivity Trade-offs: Purely linear/compressive memory may fail to capture higher-order or relational signals needed for certain reasoning or generative tasks. Some systems remedy this through joint use of local attention, learned gating, or task-conditioned memory, but at additional computational or engineering overhead (Cao et al., 2024, Tao et al., 9 Dec 2025).
- Hyperparameter Sensitivity: Performance relies on proper selection of segment/window sizes, memory update rules (ELU vs. ReLU), and gating parameters. These elements must often be tuned empirically per task.
- Scalability and Drift: Open problems include stability and performance as memory saturates or drifts over extremely long streams, as well as principled scheduling of compression and segment updates for heterogeneous or multimodal inputs (Munkhdalai et al., 2024).
- Theoretical Unification: Ongoing research seeks to unify the practical memory-augmented attention structures with rigorous limit-law analyses, enabling hybrid models that preserve both tractable computation and theoretically grounded priors (Sakai et al., 1 Jun 2025, Hron et al., 2020).
Potential extensions include hierarchical and dynamic memory compression, task-aware retrieval modules, adaptive segment sizing, multi-modal compressive memories, and deeper connections to kernel and GP theory for infinite-depth analysis.
Infini-attention represents both a rigorous infinite-width limit law for attention and a practical engineering approach to unbounded-context, efficient Transformer architectures, spanning non-Gaussian hierarchical limit distributions and compressed-memory streaming algorithms. It is foundational both for the tractable analysis of deep self-attention networks and for the deployment of scalable models on long-context reasoning, retrieval, and summarization tasks (Sakai et al., 1 Jun 2025, Munkhdalai et al., 2024, Chen et al., 28 Mar 2025, Huang et al., 29 Dec 2025, Cao et al., 2024, Martins et al., 2021, Liang et al., 2024, Tao et al., 9 Dec 2025, Hron et al., 2020, Liu et al., 2024, Ye et al., 18 Feb 2025).