
Infini-Transformers: Infinite-Context Models

Updated 14 December 2025
  • Infini-Transformers are advanced architectures that employ infinite-dimensional kernel methods to process unbounded input contexts.
  • They integrate compressive memory and efficient attention mechanisms to summarize arbitrarily long sequences with constant resource usage.
  • These models achieve universal approximation and improved empirical performance on sequence tasks by balancing high capacity with bounded compute and memory overhead.

Infini-Transformers are a broad class of Transformer architectures designed to process unbounded input contexts and/or operate with infinite-dimensional (or function-valued) representations. This framework spans theoretical kernel-based perspectives, practical compressive-memory architectures for efficient infinite-context modeling, and operator-theoretic extensions to infinite-dimensional input spaces. Central to Infini-Transformers is the ability to retain, summarize, and utilize arbitrarily long or non-finite context while maintaining bounded computational and memory requirements, often leveraging advanced kernel methods, compressive memory mechanisms, or operator-valued representations.

1. Infinite-Dimensional Kernel Perspective

The foundational insight behind Infini-Transformers is that the canonical dot-product attention mechanism corresponds to a non-Mercer, infinite-dimensional binary kernel operating over Banach spaces, as formulated in the context of reproducing kernel Banach spaces (RKBS). For a query $q = W_Q x \in \mathbb{R}^d$ and key $k = W_K y \in \mathbb{R}^d$, the kernel takes the form

$$k(x, y) = \exp(q^T k / \sqrt{d}),$$

which admits an infinite series expansion in multi-indexed feature maps $\phi_p(x)$ and $\psi_p(y)$ with $p \in \mathbb{N}^d$:

$$k(x, y) = \sum_{p \in \mathbb{N}^d} \phi_p(x)\,\psi_p(y).$$

This expansion mirrors the Taylor series of the exponential and establishes that each attention head computes a bilinear form in an infinite-dimensional feature space.
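
As a quick numerical illustration (our own sketch, not code from the cited paper), the NumPy snippet below builds truncated tensor-power feature maps for the first few Taylor terms and checks that their dot products converge to $\exp(q^T k / \sqrt{d})$; the helper name `taylor_feature` and the tiny dimension $d = 3$ are arbitrary choices.

```python
import math
import numpy as np

def taylor_feature(v, order):
    """phi_n(v) = v^{(x)n} / sqrt(n!), flattened, so that
    phi_n(q) . phi_n(k) = (q . k)^n / n!."""
    feat = np.ones(1)
    for _ in range(order):
        feat = np.outer(feat, v).ravel()   # n-fold tensor power, flattened
    return feat / math.sqrt(math.factorial(order))

rng = np.random.default_rng(0)
d = 3
q, k = rng.normal(size=d), rng.normal(size=d)
qs, ks = q / d ** 0.25, k / d ** 0.25      # fold the 1/sqrt(d) scaling into the features

exact = math.exp(q @ k / math.sqrt(d))
for n_terms in (2, 4, 6, 10):
    approx = sum(taylor_feature(qs, n) @ taylor_feature(ks, n) for n in range(n_terms))
    print(f"{n_terms:2d} terms: {approx:.6f}   exact: {exact:.6f}")
```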

Within this RKBS framework, the transformer attention can be understood as learning a pair of functions $(f, g)$ in dual Banach spaces $B_X$, $B_Y$ that together predict (via the kernel) all cross-domain responses for data pairs $(x_i, y_j)$. The associated representer theorem ensures that the empirical risk minimizer over cross-domain data lies within a finite span of kernel sections $k(x_i, \cdot)$ and $k(\cdot, y_j)$:

$$f^*(\cdot) = \sum_{i, j} \alpha_{ij}\, k(x_i, \cdot), \quad g^*(\cdot) = \sum_{i, j} \alpha_{ij}\, k(\cdot, y_j).$$

Transformers deploy parameterized, finite-dimensional projections (learned $W_Q$, $W_K$) as feature maps, but the associated kernel remains infinite-dimensional in principle (Wright et al., 2021).

2. Universal Approximation and Empirical Performance

The universal approximation theorem for attention-based architectures establishes that a stack of transformer layers can approximate any continuous, bounded pairwise function $F(x, y)$ on compact domains arbitrarily well. Any target function can be represented as a sum of shallow parametric dot-products, transferring via the exponential to the space of attention kernels:

$$F(x, y) \approx \sum_{e=1}^{d} q_e(x)\, k_e(y).$$

This property guarantees that transformer attention can represent arbitrary pairwise dependencies, substantiating the model's high expressivity.
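
The separable-sum form above can be made concrete with a small experiment. The sketch below is illustrative only: it uses a truncated SVD on a sampled grid rather than learned attention features, which is merely one simple way to obtain functions $q_e$, $k_e$, and it reports how the approximation error shrinks as $d$ grows.

```python
import numpy as np

# Approximate a pairwise function F(x, y) on a grid by a sum of d separable
# products q_e(x) * k_e(y), realized here via a truncated SVD of the sampled
# matrix (not the construction used in the cited theorem).
xs = np.linspace(0.0, 1.0, 200)
ys = np.linspace(0.0, 1.0, 200)
F = np.exp(-np.abs(xs[:, None] - ys[None, :])) * np.cos(3.0 * xs[:, None] * ys[None, :])

U, S, Vt = np.linalg.svd(F)
for d in (1, 2, 4, 8):
    # q_e(x) <- U[:, e] * S[e],  k_e(y) <- Vt[e, :]
    F_d = (U[:, :d] * S[:d]) @ Vt[:d, :]
    print(f"d = {d}: max abs error {np.max(np.abs(F - F_d)):.2e}")
```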

Empirical studies confirm that retaining the full infinite-dimensionality of the exponentiated dot-product kernel is crucial for complex sequence modeling. On benchmarks such as IWSLT14 DE→EN and WMT14 EN→FR, infinite-dimensional kernels (including RBF and exponentiated intersection forms) produce substantial BLEU score improvements over low-dimensional alternatives. For other tasks, such as sentiment classification, the picture is more nuanced but still affirms the utility of rich, high-capacity feature mappings (Wright et al., 2021).

3. Practical Architectures for Infinite-Context Attention

Infini-Transformers have motivated the development of architectures that address the quadratic scaling of self-attention and the unbounded memory requirement when context grows. Several recent models realize efficient infinite-context computation:

Infini-attention

Infini-attention (Munkhdalai et al., 10 Apr 2024) fuses standard masked local attention (windowed dot-product) with a fixed-size, incrementally updated compressive memory for long-term context. At each Transformer block and for each attention head, two memory states are maintained: a segment-local key–value cache $(K_s, V_s)$ and a global compressive memory $(M_s^{(h)}, z_s^{(h)})$. The global memory is updated as:

$$M_s \leftarrow M_{s-1} + \sigma(K_s)^\top V_s, \quad z_s \leftarrow z_{s-1} + \sum_{t=1}^N \sigma(K_s[t]),$$

with $\sigma(x) = \mathrm{ELU}(x) + 1$. Attention combines local and memory-based context via a learned gate parameter per head. This yields $O(1)$ (constant) memory in the number of segments, bounded per-segment compute, and strong extrapolation to million-token contexts.
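
The update rule above is simple enough to state directly in code. The following single-head NumPy sketch is our own minimal rendering of the stated equations with toy dimensions; the actual model additionally offers a delta-rule memory variant and applies a learned gate, both omitted here.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the nonlinearity used in the update rule above
    return np.where(x > 0, x + 1.0, np.exp(x))

def update_compressive_memory(M, z, K_seg, V_seg):
    """One segment's update of the (d_k x d_v) memory M and (d_k,) normalizer z:
    M <- M + sigma(K)^T V,  z <- z + sum_t sigma(K[t])."""
    sK = elu_plus_one(K_seg)              # (N, d_k)
    M = M + sK.T @ V_seg                  # (d_k, d_v)
    z = z + sK.sum(axis=0)                # (d_k,)
    return M, z

# Toy usage: stream a few segments of length N through a single head.
rng = np.random.default_rng(0)
N, d_k, d_v = 8, 16, 16
M, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for _ in range(4):                        # four segments of history
    K_seg = rng.normal(size=(N, d_k))
    V_seg = rng.normal(size=(N, d_v))
    M, z = update_compressive_memory(M, z, K_seg, V_seg)
print(M.shape, z.shape)                   # memory size is independent of history length
```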

Continuum Transformers and Operator RKHS

Continuum Transformers generalize attention to infinite-dimensional function spaces, replacing finite-dimensional input tokens with functions $f_i: \Omega \to \mathbb{R}$ and employing bounded linear operators as projections. The attention mechanism computes operator-valued kernels over Hilbert spaces, and empirical in-context learning via the transformer forward pass can be described as gradient descent in an operator RKHS. With increasing depth, the model converges to the Bayes optimal predictor in context (Mishra et al., 23 May 2025).
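
A rough way to picture function-valued tokens is to discretize them on a grid and approximate the Hilbert-space inner products by quadrature. The sketch below is purely illustrative and not the architecture of the cited paper: tokens are grid-sampled functions on $[0, 1]$, the "operators" are arbitrary matrices acting on sample values, and attention scores come from a quadrature approximation of the $L^2$ inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_tokens = 64, 5
grid = np.linspace(0.0, 1.0, n_grid)
dx = grid[1] - grid[0]

# Function-valued tokens: each row is f_i sampled on the grid.
tokens = np.stack([np.sin(2 * np.pi * (i + 1) * grid) for i in range(n_tokens)])

# Discretized linear operators standing in for the query/key/value projections.
W_Q, W_K, W_V = (rng.normal(size=(n_grid, n_grid)) / np.sqrt(n_grid) for _ in range(3))
Q, K, V = tokens @ W_Q.T, tokens @ W_K.T, tokens @ W_V.T   # apply each operator to each function

# Quadrature L2 inner products: <Q_i, K_j> ~ sum_m Q_i(x_m) K_j(x_m) dx.
scores = (Q @ K.T) * dx
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V          # each output is again a function sampled on the grid
print(out.shape)           # (n_tokens, n_grid)
```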

Continuous-Space Memory (∞-former)

The ∞-former (Martins et al., 2021) compresses arbitrarily long context sequences into a bounded set of continuous basis functions via ridge regression over radial basis functions (RBFs), defining $\bar{X}(t) = B^\top \psi(t)$. Attention queries past context through parameterized Gaussian densities over the normalized time axis, yielding memory and compute per layer that is independent of total history length. "Sticky memories" prioritize regions with high past usage by allocating more resolution in those intervals via dynamic resampling.
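
A minimal sketch of the continuous-memory fit follows, assuming a uniform grid of Gaussian RBF centers and a small ridge penalty (all names and sizes below are our own choices, not values from the paper): fit $B$ by ridge regression so that $\bar{X}(t) = B^\top \psi(t)$ reconstructs a long sequence, then query the memory at arbitrary time points at a cost independent of the original length.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, n_basis = 4096, 8, 64            # long history, small embedding, bounded memory

X = rng.normal(size=(L, d_model))            # past hidden states to be compressed
t = np.linspace(0.0, 1.0, L)                 # normalized time axis

centers = np.linspace(0.0, 1.0, n_basis)
sigma = 1.0 / n_basis
Psi = np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))   # (L, n_basis)

# Ridge regression: B = (Psi^T Psi + lam I)^{-1} Psi^T X
lam = 1e-3
B = np.linalg.solve(Psi.T @ Psi + lam * np.eye(n_basis), Psi.T @ X)        # (n_basis, d_model)

# Query the continuous memory at arbitrary times; X_bar(t) = B^T psi(t), here batched.
t_query = np.array([0.1, 0.5, 0.9])
Psi_q = np.exp(-((t_query[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))
X_bar = Psi_q @ B
print(B.shape, X_bar.shape)                  # memory size fixed at (n_basis, d_model)
```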

Memory-Gated Segmental Attention (EdgeInfinite)

EdgeInfinite (Chen et al., 28 Mar 2025) introduces a segment-wise compressed memory module into standard multi-head self-attention. For each segment, a summary matrix $M_i$ and normalizer $z_i$ compress the history; attention over past context is reconstructed via

$$A_{\text{mem}} = \frac{\sigma(Q_i^r)\, M_{i-1}}{\sigma(Q_i^r)\, z_{i-1}}.$$

A trainable gating MLP fuses memory-based and local outputs, with only ≈0.15% new parameters. GPU memory and time-to-first-token (TTFT) remain flat with increasing input length, enabling efficient deployment on edge devices.
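
Complementing the Infini-attention update sketch above, the snippet below illustrates the memory readout equation and a gated fusion of memory-based and local outputs. It is a hypothetical simplification: a scalar sigmoid gate stands in for the trainable gating MLP, and all shapes are toy values.

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_readout(Q, M_prev, z_prev):
    """A_mem = sigma(Q) M_prev / (sigma(Q) z_prev), applied row-wise."""
    sQ = elu_plus_one(Q)                             # (N, d_k)
    return (sQ @ M_prev) / (sQ @ z_prev)[:, None]    # (N, d_v)

def gated_fusion(A_mem, A_local, gate_logit):
    """Scalar sigmoid gate used here for illustration; the real model uses a gating MLP."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return g * A_mem + (1.0 - g) * A_local

# Toy shapes only; M_prev and z_prev would come from the segment-wise memory update.
rng = np.random.default_rng(0)
N, d_k, d_v = 8, 16, 16
Q = rng.normal(size=(N, d_k))
M_prev = rng.normal(size=(d_k, d_v))
z_prev = np.abs(rng.normal(size=d_k)) + 1.0          # positive, as a sum of sigma(K) would be
A_local = rng.normal(size=(N, d_v))
out = gated_fusion(memory_readout(Q, M_prev, z_prev), A_local, gate_logit=0.0)
print(out.shape)                                     # (N, d_v)
```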

4. Complexity, Memory Scaling, and Empirical Results

A central challenge addressed by Infini-Transformer architectures is sublinear or bounded scaling of both memory and computational cost as input context length increases.

  • Full attention: $O(T^2)$ compute and memory in context length $T$.
  • Infini-attention and EdgeInfinite: Per-layer, per-segment cost $O(N^2 + N d_k d_v)$ (with segment size $N$), and constant $O(H L (d_k d_v + d_k))$ parameter/memory overhead. No need to cache all past $K, V$ (Munkhdalai et al., 10 Apr 2024, Chen et al., 28 Mar 2025).
  • ∞-former: Attention over long-term memory compresses to a fixed number of basis functions $N$, making complexity $O(L^2 + L \cdot N)$ per layer regardless of total processed tokens (Martins et al., 2021).

Selected empirical findings:

| Model/Task | Memory (params/GB) | Maximum Context | PPL / Acc (long-context) | Throughput/TTFT |
|---|---|---|---|---|
| Infini-attention (linear) | 1.6M | $10^6$ tokens | PG19: 9.65 | Linear scaling in input |
| EdgeInfinite (BlueLM-3B) | $+O(d^2)$ | 24k+ tokens | LongBench: +1.5 BLEU avg. | GPU: 1.6 GB (24k tokens) |
| ∞-former | $N = 512$ basis | 16k+ tokens | Sorting: ~80% @ 16k | Flat per-layer memory |

EdgeInfinite outperforms full-KV and other pruning baselines on multi-document QA, code tasks, and few-shot settings at a fraction of the memory cost. Infini-attention models achieve lower perplexity than memorizing or segment-only baselines even while compressing long-term memory by $100\times$ (Chen et al., 28 Mar 2025, Munkhdalai et al., 10 Apr 2024).

5. Theoretical Generalizations and Operator Learning

Recent theoretical progress characterizes the in-context learning behavior of Infini-Transformers when operating over function-valued sequences. For continuum transformers, attention layers are shown to perform explicit gradient descent steps in the space of linear operators on a Hilbert space, leveraging a generalized representer theorem. In the infinite-depth regime, the resulting predictor converges to the Bayes optimal solution for the observed context, formalized as the best linear unbiased predictor (BLUP) under an operator-valued Gaussian process prior (Mishra et al., 23 May 2025).

Empirical validation shows that continuum transformers decrease in-context MSE monotonically with depth whenever the attention kernel matches the data-generating kernel, and that model parameters converge to precise operator structures required for optimality.

6. Design Principles and Architectural Implications

Designing next-generation Infini-Transformers is guided by several emerging principles:

  • Kernel Flexibility: Kernel choice in attention is a hyperparameter, and exploring indefinite/asymmetric forms (e.g., intersection or polynomial kernels) can adapt to task structure (Wright et al., 2021); a minimal sketch of swapping the attention kernel appears after this list.
  • Multi-head as Vector-valued Kernels: Viewing attention heads as vector-valued kernels may motivate cross-head coupling strategies via multi-output RKBS techniques.
  • Compressive Summaries and Regularization: Architectural compression (e.g., matrix summarization, continuous signals, or segmental memory) enables infinite-context scaling with tradeoffs between precision, compute, and expressivity.
  • Gated Fusion and Specialization: Learnable gating parameters enable different heads or layers to specialize on short-range, long-range, or mixed-context retrieval (Munkhdalai et al., 10 Apr 2024).
  • Low-rank and Domain-specific Extensions: Low-rank approximations (e.g., Nyström methods) and domain-specific kernels (e.g., graph diffusion) offer further efficiency and generalization capacity.
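
To make the kernel-flexibility point concrete, the sketch referenced in the first bullet is given below. It is our own minimal NumPy illustration, not code from the cited work: attention is factored around a pluggable pairwise kernel, with the exponentiated dot-product recovering standard softmax attention and an RBF kernel swapped in as an alternative.

```python
import numpy as np

def kernel_attention(Q, K, V, kernel):
    """Attention with a pluggable pairwise kernel: weights are the row-normalized
    kernel matrix kernel(Q, K). The exponentiated dot-product reproduces standard
    softmax attention; other kernels simply swap in a different similarity."""
    S = kernel(Q, K)                                  # (n_q, n_k), nonnegative similarities
    return (S / S.sum(axis=-1, keepdims=True)) @ V

def exp_dot_kernel(Q, K):
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    return np.exp(S - S.max(axis=-1, keepdims=True))  # stabilized softmax numerator

def rbf_kernel(Q, K, gamma=0.5):
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 16)) for _ in range(3))
print(kernel_attention(Q, K, V, exp_dot_kernel).shape)   # standard softmax attention
print(kernel_attention(Q, K, V, rbf_kernel).shape)       # RBF-kernel attention
```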

A plausible implication is that continued advances in architectural compression, flexible kernel design, and functional representations will further expand the power and efficiency of Infini-Transformers in complex, long-context or non-Euclidean data regimes.

7. Open Challenges and Future Directions

Infini-Transformers represent a consolidation of both theoretical and practical innovations for unbounded context modeling. Remaining challenges include balancing lossy compression against precise retrieval (particularly for token-level QA), principled non-uniform memory allocation, and extending efficient infinite-context designs to non-textual or multimodal data. Theoretically, further work on operator-theoretic regimes for transformers—especially beyond Hilbert (into more general Banach) spaces—may yield insights into model capacity, optimization landscapes, and domain transferability.

Key references: (Wright et al., 2021, Munkhdalai et al., 10 Apr 2024, Martins et al., 2021, Chen et al., 28 Mar 2025, Mishra et al., 23 May 2025).
