Generalized Optimal Transport Attention (GOAT)
- The paper introduces GOAT, a novel attention mechanism that reformulates attention as a one-sided entropic optimal transport problem using a trainable prior.
- The methodology leverages a closed-form solution, integrating Fourier-based spectral components with a learnable sink to effectively incorporate positional biases.
- Empirical results indicate that GOAT enhances stability, improves extrapolation, and matches or outperforms positional schemes such as RoPE and ALiBi while maintaining efficiency on optimized kernels.
Generalized Optimal Transport Attention (GOAT) is a reformulation of the attention mechanism in neural networks. It generalizes standard scaled dot-product attention by introducing a trainable, continuous prior under the Entropic Optimal Transport (EOT) framework. GOAT replaces the implicit uniform prior in classical attention with a learnable prior that supports a more expressive structural inductive bias, providing enhanced stability, extrapolation, and efficiency while remaining fully compatible with fast attention kernels such as FlashAttention. This formulation yields a closed-form solution for the optimal transport problem and offers new insights into the behavior of attention sinks, incorporating positional and spatial priors directly into the attention computation (Litman et al., 21 Jan 2026).
1. Attention as Entropic Optimal Transport
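Standard attention mechanisms compute a probability distribution over value vectors for each query via softmax normalization. GOAT casts this process as an instance of one-sided Entropic Optimal Transport. Let $n$ denote the context length and $s \in \mathbb{R}^n$ the vector of unnormalized dot-product scores for a given query, $s_j = q^\top k_j / \sqrt{d}$. The EOT objective with Shannon entropy regularization seeks, over the probability simplex $\Delta^{n-1}$,
$$p^\star = \arg\max_{p \in \Delta^{n-1}} \; \langle p, s \rangle + H(p), \qquad H(p) = -\sum_{j=1}^{n} p_j \log p_j,$$
with the regularization strength absorbed into the scaling of $s$. This is equivalent to minimizing
$$-\langle p, s \rangle + \mathrm{KL}(p \,\|\, u),$$
where $u = \tfrac{1}{n}\mathbf{1}$ is the uniform prior, since $\mathrm{KL}(p \,\|\, u) = \log n - H(p)$. The solution recovers the conventional softmax:
$$p^\star_j = \frac{\exp(s_j)}{\sum_{l=1}^{n} \exp(s_l)}.$$
This establishes that standard attention is an instance of EOT regularized by a uniform prior.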
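The claim that softmax solves the entropy-regularized problem can be checked numerically. The following is a minimal sketch, assuming unit regularization strength and random stand-in scores; the names `entropic_objective`, `p_star`, and `competitors` are illustrative and do not come from the paper.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

def entropic_objective(p, s):
    """<p, s> + H(p), with Shannon entropy H(p) = -sum_j p_j log p_j."""
    p = np.clip(p, 1e-300, None)  # guard against log(0)
    return float(p @ s - (p * np.log(p)).sum())

rng = np.random.default_rng(0)
n = 8                      # context length (illustrative)
s = rng.normal(size=n)     # stand-in for scaled dot-product scores

p_star = softmax(s)
best = entropic_objective(p_star, s)

# At the optimum the objective equals log-sum-exp of the scores.
log_sum_exp = float(s.max() + np.log(np.exp(s - s.max()).sum()))

# Random points on the simplex should never beat the softmax solution:
# objective(p) = log_sum_exp - KL(p || p_star) <= log_sum_exp.
competitors = rng.dirichlet(np.ones(n), size=1000)
worst_gap = min(best - entropic_objective(p, s) for p in competitors)
```

Here `worst_gap` is the smallest margin by which softmax beats any sampled competitor. It stays nonnegative because the gap equals the KL divergence from the competitor to the softmax distribution.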
2. Generalizing with a Trainable Prior
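GOAT introduces a generalized EOT objective that replaces the uniform prior with a learnable continuous prior, $\pi \in \Delta^{n-1}$ (a distribution over the $n$ keys), applied to the score vector $s$:
$$p^\star = \arg\min_{p \in \Delta^{n-1}} \; -\langle p, s \rangle + \mathrm{KL}(p \,\|\, \pi).$$
The analytic solution is
$$p^\star_j = \frac{\pi_j \exp(s_j)}{\sum_{l=1}^{n} \pi_l \exp(s_l)},$$
which can be written as $p^\star = \mathrm{softmax}(s + \log \pi)$. This design retains a closed-form solution in the one-sided scenario, removing the need for iterative Sinkhorn steps that arise in the general two-marginal EOT case. The trainable prior $\pi$ enables the network to learn context- or position-dependent structural information, surpassing the limitations of a uniform prior.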
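The two ways of writing the closed form (prior-reweighted exponentials, and softmax of scores plus log prior) can be compared directly. This is a small sketch under assumed inputs: the prior is drawn at random purely for illustration, and `goat_weights` is a hypothetical helper name rather than an API from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def goat_weights(s, log_prior):
    """Closed-form one-sided EOT solution: softmax(s + log pi)."""
    return softmax(s + log_prior)

rng = np.random.default_rng(1)
n = 6
s = rng.normal(size=n)             # stand-in content scores
pi = rng.dirichlet(np.ones(n))     # hypothetical prior over the n keys

# Form 1: softmax of scores shifted by the log prior.
p_closed = goat_weights(s, np.log(pi))

# Form 2: exponentiated scores reweighted by the prior, then normalized.
p_reweighted = pi * np.exp(s) / (pi * np.exp(s)).sum()

# Special case: a uniform prior adds a constant to every logit,
# so the result is ordinary softmax attention.
p_uniform = goat_weights(s, np.log(np.full(n, 1.0 / n)))
```

Because the log prior enters as an additive term in the logits, no iteration is needed; a single normalization pass suffices.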
3. EOT Perspective on Attention Sinks
In attention, sinks are locations that retain persistent attention weight regardless of the content signal. From the EOT view, the total logits are $z_j = s_j + \log \pi_j$, with normalized prior $\pi$ ($\sum_j \pi_j = 1$). The dynamic-range signal $\delta = \max_j s_j - \min_j s_j$ quantifies the strength of the content-based scores.
Theorem 5.1 (Collapse to Prior) shows that in the low-signal regime, $\delta \to 0$, the posterior collapses onto the prior. Sinks are characterized by a margin (Definition 5.2): for query $i$, key $j$ is a sink if it meets the margin condition, which ensures that the attention weight $p_{ij}$ is lower-bounded independently of the context length $n$.
Theorem 5.3 contrasts prior types for context sensitivity. With a uniform prior, in the low-signal limit every attention weight tends to $1/n$, which vanishes as $n$ grows. In contrast, a peaked prior with margin $m$ imposes a bound on context sensitivity that decays exponentially in $m$, demonstrating that explicit learned sinks can exponentially suppress context noise.
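Both effects described above, collapse onto the prior at low signal and a sink weight that does not shrink with context length, can be illustrated with a short experiment. The sketch below is not the paper's construction: the peaked prior simply places a fixed mass `rho` on key 0 (an assumption made for illustration), and the inequality in the comments is an elementary bound that follows from the softmax form, not a restatement of Theorems 5.1 or 5.3.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)

# Part 1: the posterior approaches the prior as the score range shrinks.
n = 16
pi = rng.dirichlet(np.ones(n))     # arbitrary normalized prior (illustrative)
base = rng.normal(size=n)
collapse = []                      # (score range, total-variation distance to pi)
for scale in (1.0, 0.1, 0.001):
    s = scale * base
    p = softmax(s + np.log(pi))
    delta = s.max() - s.min()
    # Elementary bound: exp(-delta) <= p_j / pi_j <= exp(delta) for every j.
    collapse.append((delta, 0.5 * np.abs(p - pi).sum()))

# Part 2: weight on key 0 under a uniform prior vs. a peaked prior
# that keeps a fixed mass rho on key 0 (hypothetical construction).
rho, signal_range = 0.5, 0.5
sink_weights = {}
for n in (64, 1024, 16384):
    s = rng.uniform(0.0, signal_range, size=n)   # weak content signal
    uniform = np.full(n, 1.0 / n)
    peaked = np.full(n, (1.0 - rho) / (n - 1))
    peaked[0] = rho
    sink_weights[n] = (softmax(s + np.log(uniform))[0],
                       softmax(s + np.log(peaked))[0])
```

In `sink_weights`, the uniform-prior entry is at most `exp(signal_range) / n` and so shrinks as `n` grows, while the peaked-prior entry stays above `rho * exp(-signal_range)` for every `n`.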
4. FlashAttention Compatibility and Implementation
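GOAT leverages a parameterization where the log prior $\log \pi$ is embedded directly within the attention mechanism, maintaining compatibility with highly optimized kernels such as FlashAttention. Each attention head is split into a content subspace ($d_c$ dimensions) and a positional subspace ($d_p$ dimensions), with $d_c + d_p = d$. The augmented queries and keys are:
$$\tilde{q}_i = \begin{bmatrix} q_i \\ \phi^{q}_i \end{bmatrix}, \qquad \tilde{k}_j = \begin{bmatrix} k_j \\ \phi^{k}_j \end{bmatrix},$$
where $\phi^{q}_i, \phi^{k}_j \in \mathbb{R}^{d_p}$ are positional features whose inner product encodes the log prior, so that a standard SDPA call $\mathrm{SDPA}(\tilde{Q}, \tilde{K}, V)$ automatically realizes
$$\tilde{q}_i^\top \tilde{k}_j = q_i^\top k_j + (\phi^{q}_i)^\top \phi^{k}_j$$
inside the softmax, with no additional asymptotic or memory cost. GOAT thereby functions as a drop-in replacement in standard and acceleration-optimized attention layers.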
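The reason an unmodified attention kernel suffices is that a dot product over concatenated vectors splits into a content term plus a positional term. The sketch below checks this with a plain NumPy reference implementation standing in for a fused kernel; the dimension names `d_c` and `d_p`, the random positional features, and the shared `scale` are assumptions for illustration only.

```python
import numpy as np

def sdpa(Q, K, V, scale):
    """Plain scaled dot-product attention: softmax(scale * Q K^T) V."""
    logits = scale * (Q @ K.T)
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(3)
n, d_c, d_p, d_v = 5, 8, 4, 3            # illustrative sizes
Qc, Kc = rng.normal(size=(n, d_c)), rng.normal(size=(n, d_c))   # content part
Qp, Kp = rng.normal(size=(n, d_p)), rng.normal(size=(n, d_p))   # positional part
V = rng.normal(size=(n, d_v))
scale = 1.0 / np.sqrt(d_c + d_p)

# Route 1: concatenate the subspaces and call attention once, with no bias input.
Q_aug = np.concatenate([Qc, Qp], axis=1)
K_aug = np.concatenate([Kc, Kp], axis=1)
out_aug = sdpa(Q_aug, K_aug, V, scale)

# Route 2: content logits plus an explicit additive bias matrix.
bias = scale * (Qp @ Kp.T)               # plays the role of the log prior
logits = scale * (Qc @ Kc.T) + bias
logits = logits - logits.max(axis=-1, keepdims=True)
weights = np.exp(logits)
weights = weights / weights.sum(axis=-1, keepdims=True)
out_ref = weights @ V
```

Route 1 never materializes the `n x n` bias matrix, which is what keeps the approach compatible with kernels that accept only queries, keys, and values.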
5. Structural Priors and Length Extrapolation
GOAT's prior decomposes into a sum of a truncated Fourier (spectral) series and a key-only sink: $\log \pi_{ij} = f(i - j) + u_j$, up to normalization. The relative term, $f(i - j)$, is constructed via Bochner's theorem, with basis vectors of the form
$$\phi_r(t) = \sqrt{a_r}\,\begin{bmatrix} \cos(\omega_r t) \\ \sin(\omega_r t) \end{bmatrix}, \qquad \phi_r(i)^\top \phi_r(j) = a_r \cos\big(\omega_r (i - j)\big),$$
such that the dot product recovers the $r$-th series component. This design is translation-equivariant and generalizes across sequence lengths. The key-only sink $u_j$, implemented as a rank-one "lane," provides a minimal, content-disentangled, query-independent bias. The learned spectral prior is stable under arbitrary sequence scaling.
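A Fourier-feature prior with a key-only lane can be sketched as follows. This is an illustration under stated assumptions, not the paper's parameterization: it uses a cosine-only series with fixed, hand-picked frequencies `omega` and nonnegative amplitudes `amp`, and the sink values `u` are arbitrary. It relies only on the identity cos(a)cos(b) + sin(a)sin(b) = cos(a - b).

```python
import numpy as np

def fourier_features(pos, omega, amp):
    """Stack sqrt(a_r)*cos(w_r t) and sqrt(a_r)*sin(w_r t) for each position t."""
    ang = np.outer(pos, omega)                       # shape (n, R)
    root = np.sqrt(amp)
    return np.concatenate([root * np.cos(ang), root * np.sin(ang)], axis=1)

# Hypothetical spectral parameters (these would be learned in practice).
omega = 2.0 * np.pi / np.array([4.0, 16.0, 64.0, 256.0])
amp = np.array([0.5, 0.3, 0.2, 0.1])                 # nonnegative amplitudes

n = 12
pos = np.arange(n, dtype=float)
u = np.linspace(1.0, 0.0, n)                         # hypothetical key-only sink logits

Phi = fourier_features(pos, omega, amp)
relative = Phi @ Phi.T                               # sum_r a_r cos(w_r (i - j))

# Rank-one lane: a constant 1 on the query side, u_j on the key side.
q_pos = np.concatenate([Phi, np.ones((n, 1))], axis=1)
k_pos = np.concatenate([Phi, u[:, None]], axis=1)
bias = q_pos @ k_pos.T                               # relative[i, j] + u[j]

# Direct evaluation of the same bias, for comparison.
diff = pos[:, None] - pos[None, :]
direct = (amp * np.cos(diff[..., None] * omega)).sum(axis=-1) + u[None, :]

# Shifting every position by a constant leaves the relative term unchanged.
Phi_shift = fourier_features(pos + 100.0, omega, amp)
relative_shift = Phi_shift @ Phi_shift.T
```

The relative term depends only on `i - j`, so the same parameters apply at any absolute offset, while the extra lane adds a bias that varies with the key index alone.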
Empirically, on C4 language modeling (trained to 2048 tokens), GOAT matches or surpasses RoPE and ALiBi in in-distribution perplexity, and extrapolates to context lengths beyond the training window without degradation. On synthetic long-context retrieval tasks, GOAT maintains near-perfect accuracy well beyond the training window. The learned Fourier prior remains robust under sequence length extension.
6. Comparative Analysis with Other Attention Variants
A comparative summary of attention mechanisms:
| Mechanism | Prior Type | Structural Properties | Extrapolation/Inductive Bias |
|---|---|---|---|
| Softmax | Uniform | No structure, emergent content-norm sinks | Poor extrapolation, generic |
| RoPE | Multiplicative rotations | Content-structure entanglement | Catastrophic out-of-distribution degradation |
| ALiBi | Linear slope in $\lvert i - j \rvert$ | Fixed, underfits in-distribution, rigid | Extrapolates, limited adaptation |
| GOAT | Learned additive (Fourier+sink) | Fully expressive, translation-equivariant, disentangled | State-of-the-art extrapolation, stable, plug-and-play |
Empirical results indicate that in vision (e.g., ViT-Small), GOAT achieves higher zero-shot accuracy at elevated resolutions than absolute embeddings. In genomics, GOAT matches RoPE in speed, reduces peak GPU memory by 36%, and lowers the bits-per-base metric.
7. Theoretical and Empirical Properties
GOAT encompasses several formal and empirical properties:
- Provides a closed-form attention solution, $p^\star = \mathrm{softmax}(s + \log \pi)$ for scores $s$ and prior $\pi$, sidestepping iterative Sinkhorn procedures.
- The spectral plus sink parameterization is identified as the unique, finite-dimensional, SDPA-compatible, translation-equivariant, bounded prior (Theorem 7.1), maximizing entropy recency (Theorem 7.3) and furnishing minimal-rank sinks (Theorem 7.4).
- Implementation incurs no additional asymptotic computational or memory cost and interfaces directly with performance-optimized attention kernels.
- Empirical tests resolve key trade-offs among expressivity, stability, and efficiency, substantiating state-of-the-art length generalization, stable attention sinks, and improved outcomes in cross-modal domains (Litman et al., 21 Jan 2026).