Generalized Optimal Transport Attention (GOAT)

Updated 5 May 2026
  • The paper introduces GOAT, a novel attention mechanism that reformulates attention as a one-sided entropic optimal transport problem using a trainable prior.
  • The methodology leverages a closed-form solution, integrating Fourier-based spectral components with a learnable sink to effectively incorporate positional biases.
  • Empirical results demonstrate that GOAT enhances stability, improves extrapolation, and outperforms positional schemes such as RoPE and ALiBi while maintaining efficiency on optimized kernels.

Generalized Optimal Transport Attention (GOAT) is a reformulation of the attention mechanism in neural networks. It generalizes standard scaled dot-product attention by introducing a trainable, continuous prior under the Entropic Optimal Transport (EOT) framework. GOAT replaces the implicit uniform prior in classical attention with a learnable prior that enables a more expressive structural inductive bias, providing enhanced stability, extrapolation, and efficiency while remaining fully compatible with fast attention kernels such as FlashAttention. This formulation yields a closed-form solution for the optimal transport problem and offers new insights into the behavior of attention sinks, incorporating positional and spatial priors directly into the attention computation (Litman et al., 21 Jan 2026).

1. Attention as Entropic Optimal Transport

Standard attention mechanisms compute a probability distribution over value vectors for each query via softmax normalization. GOAT casts this process as an instance of one-sided Entropic Optimal Transport. Let $L$ denote the context length and $\bm{s}\in\mathbb{R}^L$ the vector of unnormalized dot-product scores, $s_j = \langle q_{c,i}, k_{c,j} \rangle / \sqrt{d_c}$. The EOT objective with Shannon entropy regularization seeks

$$\bm{p}^\star = \arg\min_{\bm{p}\in\Delta^{L-1}} \big\{\langle \bm{p}, -\bm{s}\rangle - \tau\,H(\bm{p})\big\}, \quad H(\bm{p})=-\sum_j p_j\log p_j,~\tau>0.$$

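Because $\mathrm{KL}(\bm{p} \,\|\, \mathcal{U}) = \log L - H(\bm{p})$ differs from $-H(\bm{p})$ only by a constant, this is equivalent to minimizing

$$\min_{\bm{p}\in\Delta^{L-1}}~\langle \bm{p}, -\bm{s} \rangle + \tau\,\mathrm{KL}(\bm{p} \,\|\, \mathcal{U}),$$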

where $\mathcal{U}$ is the uniform prior. The solution recovers the conventional softmax: $p_j^\star = \frac{\exp(s_j/\tau)}{\sum_{k} \exp(s_k/\tau)}$. This establishes that standard attention is an instance of EOT regularized by a uniform prior.
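The claim that softmax minimizes the entropic objective can be checked numerically. The following is a minimal sketch, not taken from the paper: it evaluates $\langle \bm{p}, -\bm{s}\rangle - \tau H(\bm{p})$ at the softmax and at random points of the simplex, and records the smallest gap. The helper names (`eot_objective`, `random_simplex_point`) and the chosen values of `tau` and the scores are illustrative assumptions.

```python
import math
import random


def softmax(scores, tau=1.0):
    """Numerically stable softmax of scores / tau."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def eot_objective(p, scores, tau=1.0):
    """Entropic objective <p, -s> - tau * H(p), with H the Shannon entropy."""
    linear = -sum(pj * sj for pj, sj in zip(p, scores))
    entropy = -sum(pj * math.log(pj) for pj in p if pj > 0.0)
    return linear - tau * entropy


def random_simplex_point(n, rng):
    """Random probability vector (normalized exponential draws)."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(0)
tau = 0.7                                   # illustrative temperature
scores = [rng.gauss(0.0, 1.0) for _ in range(6)]

p_star = softmax(scores, tau)
best = eot_objective(p_star, scores, tau)

# Objective gap of 1000 random candidates relative to the softmax solution.
gaps = [
    eot_objective(random_simplex_point(6, rng), scores, tau) - best
    for _ in range(1000)
]
min_gap = min(gaps)   # non-negative if softmax is the minimizer
```

The gap equals $\tau\,\mathrm{KL}(\bm{p}\,\|\,\bm{p}^\star)$, which is why no candidate can score below the softmax, and the optimal value itself is $-\tau \log \sum_k \exp(s_k/\tau)$.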

2. Generalizing with a Trainable Prior

GOAT introduces a generalized EOT objective that replaces the uniform prior with a learnable continuous prior, $\bm{\pi} \in \Delta^{L-1}$: $\bm{p}^\star = \arg\min_{\bm{p}\in\Delta^{L-1}} \left\{ -\langle \bm{p}, \bm{s} \rangle + \tau\,\mathrm{KL}(\bm{p} \,\|\, \bm{\pi}) \right\}$. The analytic solution is

$$p_j^\star = \frac{\pi_j \exp(s_j / \tau)}{\sum_k \pi_k \exp(s_k / \tau)},$$

which can be written as $\bm{p}^\star = \mathrm{softmax}(\bm{s}/\tau + \log\bm{\pi})$. This design retains a closed-form solution in the one-sided scenario, removing the need for iterative Sinkhorn steps that arise in the general two-marginal EOT case. The trainable prior $\bm{\pi}$ enables the network to learn context- or position-dependent structural information, surpassing the limitations of a uniform prior.
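The prior-weighted closed form is algebraically the same as a softmax over the scores with the log prior added as a bias, since $\pi_j \exp(s_j/\tau) = \exp(s_j/\tau + \log \pi_j)$. The sketch below checks this identity on a small example and confirms that a uniform prior recovers plain softmax. Function names and the example values are illustrative assumptions, not the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def goat_posterior(scores, prior, tau=1.0):
    """Closed form: p_j proportional to pi_j * exp(s_j / tau)."""
    m = max(s / tau for s in scores)
    w = [pj * math.exp(s / tau - m) for pj, s in zip(prior, scores)]
    z = sum(w)
    return [x / z for x in w]


def goat_posterior_via_logits(scores, prior, tau=1.0):
    """Same distribution, computed as softmax(s / tau + log pi)."""
    return softmax([s / tau + math.log(pj) for s, pj in zip(scores, prior)])


scores = [2.0, 0.5, -1.0, 0.0]      # illustrative content scores
prior = [0.1, 0.2, 0.3, 0.4]        # illustrative prior on the simplex
tau = 0.5

p_direct = goat_posterior(scores, prior, tau)
p_logits = goat_posterior_via_logits(scores, prior, tau)

# With a uniform prior the posterior reduces to ordinary softmax attention.
p_uniform = goat_posterior(scores, [0.25] * 4, tau)
p_softmax = softmax([s / tau for s in scores])
```

The additive form is the one that matters in practice, because an additive bias on the logits is what standard attention kernels can absorb.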

3. EOT Perspective on Attention Sinks

In attention, sinks are locations that retain persistent attention weights regardless of context signal. From the EOT view, the total logits are the content scores plus the log prior, $s_j/\tau + \log \pi_j$, where $\bm{\pi}$ is the normalized prior. The dynamic range of the content scores quantifies the strength of the content-based signal.

Theorem 5.1 (Collapse to Prior) shows that in the low-signal regime, the posterior collapses onto the prior. Sinks are characterized by a margin (Definition 5.2): for a query $i$, a key $j$ is a sink if its prior exceeds that of the competing keys by a sufficient margin, ensuring that its attention weight $p_j^\star$ is lower-bounded independently of the context length $L$.

Theorem 5.3 contrasts prior types for context sensitivity. With a uniform prior, in the low-signal limit every attention weight approaches $1/L$, which vanishes as $L$ grows. In contrast, a peaked prior with a sink margin bounds the attention mass that leaks to non-sink keys, demonstrating that explicit learned sinks can exponentially suppress context noise.
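The contrast between a uniform and a peaked prior can be illustrated with a small experiment. This sketch does not reproduce the paper's definitions or bounds: it assumes a hypothetical prior that places a fixed mass `sink_mass` on key 0 and spreads the rest uniformly, and it draws low-signal content scores with dynamic range `delta`. It relies only on the elementary bounds $p_0^\star \le e^{\Delta/\tau}/L$ under a uniform prior and $p_0^\star \ge \pi_0\, e^{-\Delta/\tau}$ under any prior, where $\Delta$ is the spread of the scores.

```python
import math
import random


def posterior(scores, prior, tau=1.0):
    """Prior-weighted softmax: p_j proportional to pi_j * exp(s_j / tau)."""
    m = max(scores)
    w = [pj * math.exp((s - m) / tau) for pj, s in zip(prior, scores)]
    z = sum(w)
    return [x / z for x in w]


def uniform_prior(L):
    return [1.0 / L] * L


def sink_prior(L, sink_mass=0.5):
    """Hypothetical peaked prior: key 0 holds sink_mass, the rest is uniform."""
    rest = (1.0 - sink_mass) / (L - 1)
    return [sink_mass] + [rest] * (L - 1)


rng = random.Random(0)
delta = 0.2          # dynamic range of the content scores (low-signal regime)
results = {}
for L in (16, 256, 4096):
    scores = [rng.uniform(0.0, delta) for _ in range(L)]
    w_uniform = posterior(scores, uniform_prior(L))[0]   # weight on key 0
    w_sink = posterior(scores, sink_prior(L))[0]         # weight on the sink
    results[L] = (w_uniform, w_sink)
```

Under the uniform prior the weight on key 0 shrinks roughly like $1/L$, while under the peaked prior it stays above `sink_mass * exp(-delta)` at every context length.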

4. FlashAttention Compatibility and Implementation

GOAT leverages a parameterization in which the log prior $\log\bm{\pi}$ is embedded directly within the attention mechanism, maintaining compatibility with highly optimized kernels such as FlashAttention. Each attention head is split into a content subspace ($d_c$ dimensions) and a positional subspace, which together make up the head dimension. The queries and keys are augmented by concatenating their content and positional components, so that a standard SDPA call on the augmented queries and keys automatically realizes the prior-weighted softmax $p_j^\star \propto \pi_j \exp(s_j/\tau)$, with no additional asymptotic or memory cost. GOAT thereby functions as a drop-in replacement in standard and acceleration-optimized attention layers.
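One way to see why concatenated content and positional subspaces need no special kernel is that a dot product over concatenated vectors is the sum of the dot products over each part. The sketch below is an assumption-laden illustration rather than the paper's exact construction: it takes $\tau = 1$, models the log-prior bias as the positional dot product $\langle q_p, k_p\rangle$, names the positional width `d_p`, and rescales the query so that a plain kernel dividing by $\sqrt{d}$ (with $d = d_c + d_p$) still scales the content part by $1/\sqrt{d_c}$.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sdpa_weights(q, keys):
    """Attention weights of plain scaled dot-product attention: softmax(q.k / sqrt(d))."""
    d = len(q)
    return softmax([dot(q, k) / math.sqrt(d) for k in keys])


rng = random.Random(0)
d_c, d_p, L = 4, 2, 5            # d_p is a hypothetical name for the positional width
d = d_c + d_p

q_c = [rng.gauss(0, 1) for _ in range(d_c)]                       # content query
q_p = [rng.gauss(0, 1) for _ in range(d_p)]                       # positional query
k_c = [[rng.gauss(0, 1) for _ in range(d_c)] for _ in range(L)]   # content keys
k_p = [[rng.gauss(0, 1) for _ in range(d_p)] for _ in range(L)]   # positional keys

# Augmented query: rescaled so the kernel's 1/sqrt(d) leaves the content part
# scaled by 1/sqrt(d_c) and the positional (log-prior) part unscaled.
q_aug = [x * math.sqrt(d / d_c) for x in q_c] + [x * math.sqrt(d) for x in q_p]
k_aug = [kc + kp for kc, kp in zip(k_c, k_p)]

fused = sdpa_weights(q_aug, k_aug)

# Reference: content scores plus an explicit additive bias, then softmax.
reference = softmax(
    [dot(q_c, kc) / math.sqrt(d_c) + dot(q_p, kp) for kc, kp in zip(k_c, k_p)]
)
```

Because the bias never materializes as an $L \times L$ matrix, the same idea carries over to fused kernels that only accept queries, keys, and values.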

5. Structural Priors and Length Extrapolation

GOAT’s prior decomposes into a sum of a relative-position term, given by a truncated Fourier (spectral) series, and a key-only sink. The relative term is constructed via Bochner’s theorem, with query and key basis vectors chosen such that their dot product recovers each component of the series. This design is translation-equivariant and generalizes across sequence lengths. The key-only sink, implemented as a rank-one “lane,” provides a minimal, content-disentangled, query-independent bias. The learned spectral prior is stable under arbitrary sequence scaling.
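The way a dot product of sinusoidal features yields a function of the relative offset follows from $\cos(\omega i)\cos(\omega j) + \sin(\omega i)\sin(\omega j) = \cos(\omega (i - j))$. The sketch below is a simplified illustration under stated assumptions: it uses a cosine-only series with hypothetical frequencies `freqs` and amplitudes `amps`, and models the key-only sink as one extra lane in which the query holds the constant 1 and the key holds its sink value. The paper's actual basis may differ.

```python
import math


def query_features(i, freqs, amps):
    """Positional query features for position i, plus a constant sink lane."""
    feats = []
    for w, a in zip(freqs, amps):
        feats += [a * math.cos(w * i), a * math.sin(w * i)]
    return feats + [1.0]


def key_features(j, freqs, sink_j):
    """Positional key features for position j, plus its key-only sink value."""
    feats = []
    for w in freqs:
        feats += [math.cos(w * j), math.sin(w * j)]
    return feats + [sink_j]


def bias(i, j, freqs, amps, sink_j):
    """Dot product of the positional features: the additive attention bias."""
    q = query_features(i, freqs, amps)
    k = key_features(j, freqs, sink_j)
    return sum(x * y for x, y in zip(q, k))


def series(offset, freqs, amps):
    """Truncated cosine series evaluated at a relative offset."""
    return sum(a * math.cos(w * offset) for w, a in zip(freqs, amps))


freqs = [0.1, 0.5, 1.3]          # illustrative frequencies
amps = [1.0, -0.4, 0.25]         # illustrative (learnable) amplitudes
sink_value = 2.0                 # illustrative key-only sink bias

b = bias(7, 3, freqs, amps, sink_value)
b_shifted = bias(7 + 1000, 3 + 1000, freqs, amps, sink_value)
expected = series(7 - 3, freqs, amps) + sink_value
```

Shifting both positions by the same amount leaves the relative term unchanged, which is the property that lets the bias be evaluated at positions never seen during training.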

Empirically, on C4 language modeling (trained with 2048-token contexts), GOAT matches or surpasses RoPE and ALiBi in in-distribution perplexity, and extrapolates to context lengths beyond the training window without degradation. On synthetic long-context retrieval tasks, GOAT maintains near-perfect accuracy well beyond the training window. The learned Fourier prior remains robust under sequence length extension.

6. Comparative Analysis with Other Attention Variants

A comparative summary of attention mechanisms:

| Mechanism | Prior Type | Structural Properties | Extrapolation/Inductive Bias |
|---|---|---|---|
| Softmax | Uniform | No structure, emergent content-norm sinks | Poor extrapolation, generic |
| RoPE | Multiplicative rotations | Content-structure entanglement | Catastrophic out-of-distribution degradation |
| ALiBi | Linear slope in $\lvert i-j \rvert$ | Fixed, underfits in-distribution, rigid | Extrapolates, limited adaptation |
| GOAT | Learned additive (Fourier+sink) | Fully expressive, translation-equivariant, disentangled | State-of-the-art extrapolation, stable, plug-and-play |

Empirical results indicate that in vision (e.g., ViT-Small), GOAT achieves higher zero-shot accuracy at elevated resolutions than absolute positional embeddings. In genomics, GOAT matches RoPE in speed, reduces peak GPU memory by 36%, and lowers the bits-per-base metric.

7. Theoretical and Empirical Properties

GOAT encompasses several formal and empirical properties:

  • Provides a closed-form attention solution, $p_j^\star \propto \pi_j \exp(s_j/\tau)$, sidestepping iterative Sinkhorn procedures.
  • The spectral plus sink parameterization is identified as the unique, finite-dimensional, SDPA-compatible, translation-equivariant, bounded prior (Theorem 7.1), maximizing entropy recency (Theorem 7.3) and furnishing minimal-rank sinks (Theorem 7.4).
  • Implementation incurs no additional asymptotic computational or memory cost and interfaces directly with performance-optimized attention kernels.
  • Empirical tests resolve key trade-offs among expressivity, stability, and efficiency, substantiating state-of-the-art length generalization, stable attention sinks, and improved outcomes in cross-modal domains (Litman et al., 21 Jan 2026).
