Generalized Optimal Transport Attention (GOAT)

Updated 5 May 2026

The paper introduces GOAT, a novel attention mechanism that reformulates attention as a one-sided entropic optimal transport problem using a trainable prior.
The methodology leverages a closed-form solution, integrating Fourier-based spectral components with a learnable sink to effectively incorporate positional biases.
Empirical results demonstrate that GOAT enhances stability, improves extrapolation, and outperforms positional schemes such as RoPE and ALiBi while maintaining efficiency on optimized kernels.

Generalized Optimal Transport Attention (GOAT) is a reformulation of the attention mechanism in neural networks, generalizing standard scaled dot-product attention by introducing a trainable, continuous prior under the Entropic Optimal Transport (EOT) framework. GOAT replaces the implicit uniform prior in classical attention with a learnable prior that enables a more expressive structural inductive bias, providing enhanced stability, extrapolation, and efficiency while remaining fully compatible with fast attention kernels such as FlashAttention. This formulation yields a closed-form solution for the optimal transport problem and offers new insights into the behavior of attention sinks, incorporating positional and spatial priors directly into the attention computation (Litman et al., 21 Jan 2026).

1. Attention as Entropic Optimal Transport

Standard attention mechanisms compute a probability distribution over value vectors for each query via softmax normalization. GOAT casts this process as an instance of one-sided Entropic Optimal Transport. Let $L$ denote the context length and $\bm{s}\in\mathbb{R}^L$ the vector of unnormalized dot-product scores, $s_j = \langle q_{c,i}, k_{c,j} \rangle / \sqrt{d_c}$. The EOT objective with Shannon entropy regularization seeks

$\bm{p}^\star = \arg\min_{\bm{p}\in\Delta^{L-1}} \big\{\langle \bm{p}, -\bm{s}\rangle - \tau\,H(\bm{p})\big\}, \quad H(\bm{p})=-\sum_j p_j\log p_j,~\tau>0.$

Since $KL(p \| \mathcal{U}) = \log L - H(p)$, this is equivalent, up to an additive constant, to solving

$\min_{p\in\Delta}~\langle p, -s \rangle + \tau KL(p \| \mathcal{U}),$

where $\mathcal{U}$ is the uniform prior. The solution recovers the conventional softmax: $p_j^\star = \frac{\exp(s_j/\tau)}{\sum_{k} \exp(s_k/\tau)}.$ This establishes that standard attention is an instance of EOT regularized by a uniform prior.
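As a quick numerical sanity check of this equivalence, the following NumPy sketch evaluates the entropic objective at the softmax solution and at random points on the simplex. The helper names `softmax` and `eot_objective` and all numeric values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def eot_objective(p, s, tau):
    # <p, -s> - tau * H(p), with H the Shannon entropy.
    entropy = -np.sum(p * np.log(p))
    return -np.dot(p, s) - tau * entropy

rng = np.random.default_rng(0)
L, tau = 8, 0.7                      # illustrative context length and temperature
s = rng.normal(size=L)               # illustrative scores

p_star = softmax(s / tau)
best = eot_objective(p_star, s, tau)

# Random interior points of the simplex (Dirichlet samples) as competitors.
candidates = rng.dirichlet(np.ones(L), size=1000)
min_gap = min(eot_objective(p, s, tau) - best for p in candidates)
```

Here `min_gap` is the smallest excess objective among the sampled competitors; it should be non-negative, and `best` should agree with $-\tau \log \sum_k \exp(s_k/\tau)$, the value of the objective at the softmax solution.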

2. Generalizing with a Trainable Prior

GOAT introduces a generalized EOT objective that replaces the uniform prior with a learnable continuous prior, $\bm{\pi} \in \Delta^{L-1}$: $\bm{p}^\star = \arg\min_{p\in\Delta^{L-1}} \left\{ -\langle p, s \rangle + \tau KL(p \| \pi) \right\}.$ The analytic solution is

$p_j^\star = \frac{\pi_j \exp(s_j / \tau)}{\sum_k \pi_k \exp(s_k / \tau)},$

which can be written as $\bm{p}^\star = \mathrm{softmax}(\bm{s}/\tau + \log \bm{\pi})$. This design retains a closed-form solution in the one-sided scenario, removing the need for iterative Sinkhorn steps that arise in the general two-marginal EOT case. The trainable prior $\bm{\pi}$ enables the network to learn context- or position-dependent structural information, surpassing the limitations of a uniform prior.
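The closed form above can be checked numerically. The sketch below, with illustrative names (`goat_posterior`) and random inputs that are assumptions rather than anything from the paper, compares the prior-weighted softmax with a plain softmax over logits shifted by the log prior, and confirms that a uniform prior recovers standard attention weights.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def goat_posterior(s, pi, tau=1.0):
    # Prior-weighted softmax: p_j proportional to pi_j * exp(s_j / tau).
    w = pi * np.exp((s - np.max(s)) / tau)
    return w / w.sum()

rng = np.random.default_rng(1)
L, tau = 6, 0.5                       # illustrative values
s = rng.normal(size=L)
pi = rng.dirichlet(np.ones(L))        # an arbitrary prior on the simplex

p_direct = goat_posterior(s, pi, tau)
p_logits = softmax(s / tau + np.log(pi))     # same distribution via shifted logits

uniform = np.full(L, 1.0 / L)
p_uniform = goat_posterior(s, uniform, tau)  # reduces to the plain softmax
```

The shifted-logits form is the one that matters in practice, since it lets the prior enter as an additive term before a single softmax.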

3. EOT Perspective on Attention Sinks

In attention, sinks are locations that retain persistent attention weights regardless of context signal. From the EOT view, the total logits are the sum of the content scores and the log prior, $s_j/\tau + \log \pi_j$, with $\bm{\pi}$ the normalized prior. The dynamic range of the content scores, that is, the spread between the largest and smallest score, quantifies the strength of the content-based signal.

Theorem 5.1 (Collapse to Prior) shows that in the low-signal regime, where this dynamic range is small, the posterior collapses onto the prior. Sinks are characterized by a margin (Definition 5.2): for a given query, a key is a sink if it satisfies a margin condition that ensures its attention weight is lower-bounded independently of the context length $L$.

Theorem 5.3 contrasts prior types for context sensitivity. With a uniform prior, in the low-signal limit the posterior approaches the uniform distribution, so the weight on any single key decays as $1/L$ as $L$ grows. In contrast, a peaked prior with a margin yields a bound demonstrating that explicit learned sinks can exponentially suppress context noise.
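The qualitative behavior described in this section can be illustrated with a small NumPy experiment. Everything here is an assumption made for illustration: the `peaked_prior` helper, the fixed sink mass of 0.5 on key 0, and the chosen context lengths are hypothetical and do not reproduce the paper's definitions or bounds.

```python
import numpy as np

def posterior(s, pi, tau=1.0):
    # Prior-weighted softmax: p_j proportional to pi_j * exp(s_j / tau).
    w = pi * np.exp((s - np.max(s)) / tau)
    return w / w.sum()

def peaked_prior(L, sink_mass=0.5):
    # Hypothetical prior: fixed mass on key 0, remainder spread uniformly.
    pi = np.full(L, (1.0 - sink_mass) / (L - 1))
    pi[0] = sink_mass
    return pi

rng = np.random.default_rng(2)

# (a) Collapse to prior: shrink the score range and track the
#     total-variation distance between posterior and prior.
L = 64
pi = peaked_prior(L)
base = rng.uniform(-1.0, 1.0, size=L)
tv = []
for scale in (1.0, 0.1, 0.01):
    p = posterior(scale * base, pi)
    tv.append(0.5 * np.abs(p - pi).sum())

# (b) Weight on key 0 with zero content signal as the context grows.
uniform_sink, peaked_sink = [], []
for n in (16, 256, 4096):
    s = np.zeros(n)
    uniform_sink.append(posterior(s, np.full(n, 1.0 / n))[0])
    peaked_sink.append(posterior(s, peaked_prior(n))[0])
```

In part (a) the distance to the prior shrinks with the score range. In part (b) the weight on key 0 falls as $1/L$ under the uniform prior, while under the hypothetical peaked prior it stays at the prior mass regardless of context length.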

4. FlashAttention Compatibility and Implementation

GOAT leverages a parameterization where the log prior $\log \bm{\pi}$ is embedded directly within the attention mechanism, maintaining compatibility with highly optimized kernels such as FlashAttention. Each attention head is split into a content subspace ($d_c$ dimensions) and a positional subspace holding the remaining dimensions of the head. The augmented queries and keys concatenate the content and positional components, so that a standard SDPA call on the augmented queries and keys automatically realizes

$\bm{p}^\star = \mathrm{softmax}(\bm{s}/\tau + \log \bm{\pi}),$

with no additional asymptotic compute or memory cost. GOAT thereby functions as a drop-in replacement in standard and acceleration-optimized attention layers.
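To make the drop-in idea concrete, here is a minimal NumPy sketch of how concatenating a content subspace and a positional subspace can turn an additive log-prior term into an ordinary scaled dot product. The rescaling factors, the symbol `d_p` for the positional width, and the helper `sdpa_weights` are assumptions chosen so the algebra works out; the paper's exact augmentation may differ.

```python
import numpy as np

def sdpa_weights(q, k):
    # Attention weights of standard SDPA: softmax(q k^T / sqrt(d)) over keys.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
L, d_c, d_p = 5, 8, 4                 # illustrative sizes; d_p is a hypothetical name
d = d_c + d_p

q_c, k_c = rng.normal(size=(L, d_c)), rng.normal(size=(L, d_c))
q_p, k_p = rng.normal(size=(L, d_p)), rng.normal(size=(L, d_p))

# Target: content scores plus an additive (unnormalized) log-prior term.
content = q_c @ k_c.T / np.sqrt(d_c)
log_prior = q_p @ k_p.T
target = content + log_prior
target = np.exp(target - target.max(axis=-1, keepdims=True))
target /= target.sum(axis=-1, keepdims=True)

# Assumed rescaling that undoes the kernel's 1/sqrt(d) factor per subspace.
q_aug = np.concatenate([q_c * np.sqrt(d / d_c), q_p * np.sqrt(d)], axis=-1)
k_aug = np.concatenate([k_c, k_p], axis=-1)

weights = sdpa_weights(q_aug, k_aug)  # matches `target` up to rounding
```

Because the prior lives entirely inside the query and key tensors, no explicit $L \times L$ bias matrix has to be materialized, which is what keeps fused kernels usable.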

5. Structural Priors and Length Extrapolation

GOAT’s prior decomposes into a sum of a truncated Fourier (spectral) series and a key-only sink. The relative term is constructed via Bochner’s theorem, with basis vectors chosen such that the dot product of a query-side and a key-side basis vector recovers the corresponding series component. This design is translation-equivariant and generalizes across sequence lengths. The key-only sink, implemented as a rank-one “lane,” provides a minimal, content-disentangled, query-independent bias. The learned spectral prior is stable under arbitrary sequence scaling.
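One way such a construction can work is through paired cosine and sine features, whose dot product depends only on the offset between positions. The sketch below is an illustration under that assumption: the frequencies, coefficients, sink strength, and helper names are hypothetical, and the paper's actual basis may be parameterized differently.

```python
import numpy as np

def fourier_features(positions, freqs):
    # [cos(w*t), sin(w*t)] for each frequency w; shape (len(positions), 2*M).
    angles = np.outer(positions, freqs)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

freqs = 2.0 * np.pi / np.array([8.0, 16.0, 32.0, 64.0])   # hypothetical frequencies
coeffs = np.array([0.9, 0.5, 0.3, 0.1])                   # hypothetical series weights

def relative_bias(positions):
    phi = fourier_features(positions, freqs)
    # Query side carries the coefficients; key side is the bare basis.
    q_pos = phi * np.concatenate([coeffs, coeffs])
    k_pos = phi
    # Entry (i, j) equals sum_m coeffs[m] * cos(freqs[m] * (i - j)).
    return q_pos @ k_pos.T

pos = np.arange(10)
bias = relative_bias(pos)
shifted = relative_bias(pos + 1000)   # same offsets, far-away absolute positions

# Key-only sink as a rank-one lane: constant 1 on the query side,
# a per-key scalar on the key side.
u = np.zeros(10)
u[0] = 3.0                            # hypothetical sink strength on key 0
sink = np.outer(np.ones(10), u)
log_prior = bias + sink
```

The identity $\cos a \cos b + \sin a \sin b = \cos(a - b)$ is what makes `bias` a function of the offset alone, so shifting every position by the same amount leaves it unchanged.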

Empirically, on C4 language modeling (trained to 2048 tokens), GOAT matches or surpasses RoPE and ALiBi in in-distribution perplexity, and extrapolates to context lengths beyond the training length without degradation. On synthetic long-context retrieval tasks, GOAT maintains near-perfect accuracy well beyond the training window. The learned Fourier prior remains robust under sequence length extension.

6. Comparative Analysis with Other Attention Variants

A comparative summary of attention mechanisms:

Mechanism	Prior Type	Structural Properties	Extrapolation/Inductive Bias
Softmax	Uniform	No structure, emergent content-norm sinks	Poor extrapolation, generic
RoPE	Multiplicative rotations	Content-structure entanglement	Catastrophic out-of-distribution degradation
ALiBi	Linear slope in $|i-j|$	Fixed, underfits in-distribution, rigid	Extrapolates, limited adaptation
GOAT	Learned additive (Fourier+sink)	Fully expressive, translation-equivariant, disentangled	State-of-the-art extrapolation, stable, plug-and-play

Empirical results indicate that in vision (e.g., ViT-Small), GOAT achieves higher zero-shot accuracy at elevated resolutions than absolute embeddings. In genomics, GOAT matches RoPE in speed, reduces peak GPU memory by 36%, and lowers the bits-per-base metric.

7. Theoretical and Empirical Properties

GOAT encompasses several formal and empirical properties:

Provides a closed-form attention solution $\bm{p}^\star = \mathrm{softmax}(\bm{s}/\tau + \log \bm{\pi})$, sidestepping iterative Sinkhorn procedures.
The spectral plus sink parameterization is identified as the unique, finite-dimensional, SDPA-compatible, translation-equivariant, bounded prior (Theorem 7.1), maximizing entropy recency (Theorem 7.3) and furnishing minimal-rank sinks (Theorem 7.4).
Implementation incurs no additional asymptotic computational or memory cost and interfaces directly with performance-optimized attention kernels.
Empirical tests resolve key trade-offs among expressivity, stability, and efficiency, substantiating state-of-the-art length generalization, stable attention sinks, and improved outcomes in cross-modal domains (Litman et al., 21 Jan 2026).

References (1)

You Need Better Attention Priors (2026)
