AGOT: Attention-Guided Optimal Transport

Updated 4 February 2026
  • Attention-Guided Optimal Transport is a method that combines entropy-regularized optimal transport with attention to produce structured and sparse alignments.
  • It leverages geometric cost matrices and the Sinkhorn algorithm to achieve globally consistent matching in tasks like cross-modal retrieval and network embedding.
  • AGOT is applied in diverse areas from forensic face identification and object-centric modeling to text summarization, demonstrating improved robustness and efficiency.

Attention-Guided Optimal Transport (AGOT) is a class of methods that synthesizes optimal transport (OT) theory and attention mechanisms to produce sparse, semantically meaningful, and globally consistent alignments or matchings between sets, sequences, graphs, or modalities. Unlike conventional attention schemes based solely on dot products or pairwise similarity, AGOT leverages the geometric properties of OT to enable structured, entropy-regularized, and, in some cases, cost-adapted attention. AGOT has been instantiated in diverse research contexts, including cross-modal retrieval, object-centric modeling, network embedding, document summarization, and general-purpose sequence modeling.

1. Mathematical Foundations of Attention-Guided Optimal Transport

The core of AGOT is the entropy-regularized Kantorovich optimal transport problem. Given two finite collections (nodes, tokens, features) represented by distributions $\mu$ and $\nu$ over their respective supports, and a cost matrix $C$ such that $C_{ij}$ quantifies the dissimilarity between the $i$-th element of the source and the $j$-th element of the target, entropy-regularized OT seeks a transport plan $T$:

$$T^* = \arg\min_{T \in \Pi(\mu, \nu)} \langle T, C \rangle - \varepsilon H(T)$$

where $\Pi(\mu, \nu)$ is the transportation polytope enforcing the marginal constraints, $\langle T, C \rangle = \sum_{i,j} T_{ij} C_{ij}$ is the total transport cost, $H(T) = -\sum_{i,j} T_{ij} (\log T_{ij} - 1)$ is the entropy of the plan, and $\varepsilon > 0$ is a regularization parameter controlling the softness of the transport. This relaxation enables efficient solutions via the Sinkhorn algorithm, which alternates row and column renormalizations of the corresponding Gibbs kernel $K = \exp(-C/\varepsilon)$.
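As a concrete illustration, the Sinkhorn iteration admits a very compact NumPy sketch (a generic implementation of the textbook algorithm, not any specific paper's code):

```python
import numpy as np

def sinkhorn(C, mu, nu, eps=0.1, n_iters=200):
    """Solve the entropy-regularized OT problem min <T, C> - eps * H(T).

    C      : (n, m) cost matrix
    mu, nu : source / target marginals, each summing to 1
    Returns the transport plan T* = diag(u) K diag(v).
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)        # column renormalization
        u = mu / (K @ v)          # row renormalization
    return u[:, None] * K * v[None, :]
```

The returned plan satisfies both marginal constraints up to numerical tolerance, and becomes increasingly sparse as `eps` shrinks.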

Crucially, attention is guided either in the construction of the cost matrix (where cross-attention is used to refine features prior to cost evaluation, as in the SPOT-Face framework (Prasad et al., 14 Jan 2026)), by parameterizing the plan directly in the logit domain (e.g., through bi-linear forms for textual alignment (Shen et al., 7 Oct 2025)), or by introducing trainable structural priors into the regularizer (as in GOAT (Litman et al., 21 Jan 2026)). In some frameworks, the cost itself is subject to a gradient-based minimization targeting desired entropy properties (MESH (Zhang et al., 2023)).

2. AGOT in Cross-Modal and Graph Matching

SPOT-Face (Prasad et al., 14 Jan 2026) exemplifies AGOT for cross-domain graph matching in forensic face identification, specifically matching skull/skeletal and sketch images to conventional face images. Each image is converted into a superpixel graph, with node features processed by a graph neural network (GNN). Cross-attention refines the node embeddings between modalities:

$$\tilde{x}_i = x_i + \sum_j \alpha_{ij}\, W_V y_j, \qquad \alpha_{ij} = \operatorname{softmax}_j\big(\langle W_Q x_i, W_K y_j \rangle\big)$$

(and symmetrically for the target-side embeddings $\tilde{y}_j$).

A cosine-distance cost matrix over the refined node embeddings $\tilde{x}_i$ and $\tilde{y}_j$ is constructed as

$$C_{ij} = 1 - \frac{\langle \tilde{x}_i, \tilde{y}_j \rangle}{\|\tilde{x}_i\| \, \|\tilde{y}_j\|}$$

Solving the entropy-regularized OT problem yields a transport plan $T^*$ encoding probabilistic correspondences between the refined node sets. These correspondences are then pooled to graph-level representations, on which discriminative losses such as the triplet loss are computed:

$$\mathcal{L}_{\text{tri}} = \max\big(0,\; d(z_a, z_p) - d(z_a, z_n) + m\big)$$

where $z_a$, $z_p$, $z_n$ are anchor, positive, and negative graph embeddings, $d$ is a distance, and $m$ is the margin.

The entire process, including GNN, attention, and Sinkhorn steps, is differentiable and optimized end-to-end. This approach substantially improves recall and mAP in identifying forensic matches, highlighting AGOT's capacity for learning structured cross-modal alignments.
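The refine-cost-transport-pool pipeline above can be sketched schematically in NumPy (a simplified illustration with identity projection matrices; the actual SPOT-Face model uses learned GNN and attention weights):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(X, Y):
    """Refine source node features X with target features Y (residual form)."""
    return X + softmax(X @ Y.T, axis=1) @ Y

def cosine_cost(X, Y):
    """C_ij = 1 - cos(x_i, y_j)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn(C, eps=0.1, n_iters=300):
    """Entropic OT with uniform marginals over the two node sets."""
    n, m = C.shape
    mu, nu = np.full(n, 1 / n), np.full(m, 1 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

def align_and_pool(X, Y):
    """Cross-attend, build the cost, solve OT, pool to a graph-level vector."""
    Xr, Yr = cross_attend(X, Y), cross_attend(Y, X)
    T = sinkhorn(cosine_cost(Xr, Yr))
    return (T @ Yr).sum(axis=0)   # correspondence-weighted summary
```

Because every step (attention, cost, Sinkhorn, pooling) is composed of differentiable operations, the same structure trains end to end under autodiff frameworks.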

3. Content-Aware Sparse Attention and Network Embedding

In textual network embedding (Chen et al., 2019), AGOT replaces standard dot-product attention between token sequences with a transport plan solving

$$T^* = \arg\min_{T \in \Pi(\mu, \nu)} \langle T, C \rangle - \varepsilon H(T)$$

for $C_{ij}$ typically the squared Euclidean or cosine distance between token embeddings of the two nodes' associated texts. The resulting plan $T^*$ provides a context-sensitive, sparse, and self-normalized alignment. Optionally, a CNN-based attention parser further processes $T^*$ to extract higher-level global or structural alignment features, yielding gains in link prediction and node classification tasks.
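The key difference from dot-product attention is that the OT plan is doubly normalized: both marginals are enforced, whereas a row-wise softmax only normalizes rows. A small numerical check (generic sketch with random token embeddings):

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sinkhorn(C, eps=0.3, n_iters=300):
    n, m = C.shape
    mu, nu = np.full(n, 1 / n), np.full(m, 1 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(2)
E1 = rng.standard_normal((6, 16))   # token embeddings of node 1's text
E2 = rng.standard_normal((9, 16))   # token embeddings of node 2's text
C = ((E1[:, None, :] - E2[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
C = C / C.max()                     # normalize cost scale

A = softmax_rows(-C)                # row-normalized only
T = sinkhorn(C)                     # both marginals enforced
```

Inspecting the column sums shows that `T` distributes attention mass evenly over the target tokens while `A` does not, which is exactly the self-normalization property the text describes.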

4. Object-Centric Modeling: Hard Assignments and Entropy Control

Slot attention mechanisms have been recast as a one-step Sinkhorn solution to the entropy-regularized OT problem (Zhang et al., 2023). The unregularized case ($\varepsilon \to 0$) yields hard, one-to-one assignments crucial for object-centric representations in complex scenes and dynamic videos, enabling unambiguous slot-to-object mappings. The MESH module interpolates between soft and hard assignments by optimizing the cost matrix to drive down the entropy of the Sinkhorn plan, thus recovering the tie-breaking and exclusivity properties of hard OT while retaining the parallelism and smooth gradients of the regularized regime. This approach yields substantial improvements in mAP and segmentation quality across multiple object discovery benchmarks.
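The soft-to-hard trade-off can be observed directly: as $\varepsilon$ shrinks, the entropy of the Sinkhorn plan drops toward that of a hard assignment (a generic numerical sketch, not the MESH cost optimization itself):

```python
import numpy as np

def sinkhorn(C, eps, n_iters=500):
    n, m = C.shape
    mu, nu = np.full(n, 1 / n), np.full(m, 1 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

def plan_entropy(T):
    """Shannon entropy of the plan's entries (near-zero entries dropped)."""
    p = T[T > 1e-12]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(3)
C = rng.random((6, 6))
# entropy decreases monotonically toward the hard-assignment limit
H = [plan_entropy(sinkhorn(C, eps)) for eps in (2.0, 0.5, 0.05)]
```

MESH goes one step further: instead of shrinking $\varepsilon$ (which degrades gradients), it perturbs the cost matrix $C$ by gradient steps that minimize this entropy at a fixed $\varepsilon$.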

5. Informative Attention, Reference Grounding, and Text Generation

InforME (Shen et al., 7 Oct 2025) leverages a form of AGOT tailored for abstractive summarization. Given encoder (source) and decoder (summary) representations, it constructs a transport plan $T$ via a bi-linear parameterization and row-wise softmax, aligning source tokens to reference summary tokens. The resulting plan provides "reverse attention" scores quantifying the informativeness of each source token relative to the reference. These scores are integrated into the standard cross-attention via convex fusion, leading to improved coverage of salient content. Complementing this, the Accumulative Joint Entropy Reduction (AJER) mechanism uses conditional sequence entropy to regularize named entity representations, further enhancing summary informativeness and factual consistency.
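The convex fusion step can be sketched as a mixture of the row-stochastic cross-attention weights with the OT-derived informativeness scores (variable names and the mixing weight $\lambda$ are illustrative assumptions, not InforME's exact parameterization):

```python
import numpy as np

def fuse_attention(cross_attn, ot_scores, lam=0.7):
    """Convexly fuse cross-attention with OT informativeness scores.

    cross_attn : (t, n) decoder-to-source attention, rows sum to 1
    ot_scores  : (n,) informativeness of each source token, sums to 1
    lam        : mixing weight in [0, 1]; lam=1 keeps pure cross-attention
    """
    fused = lam * cross_attn + (1.0 - lam) * ot_scores[None, :]
    return fused / fused.sum(axis=1, keepdims=True)   # keep rows stochastic
```

Because both inputs are normalized distributions, the convex combination stays row-stochastic; the explicit renormalization only guards against floating-point drift.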

Key empirical results on CNN/DailyMail:

Model Variant          R-1 / R-2 / R-L          Human informativeness
BART-large baseline    44.16 / 21.28 / 40.90    18.3%
+ OT only              44.67 / 21.16 / 41.59    n/a
+ AJER only            44.67 / 21.31 / 41.58    n/a
OT + AJER (InforME)    44.75 / 21.54 / 41.69    30.0%

The global nature of the OT plan under AGOT ensures that salient but contextually rare information is upweighted, a property not achieved by vanilla cross-attention.

6. Generalized Attention with Trainable Priors: The GOAT Framework

The standard softmax attention can be derived as a special case of a one-sided entropic OT with uniform prior (Litman et al., 21 Jan 2026). GOAT generalizes this by incorporating arbitrary, learning-adapted priors into the entropic regularizer:

$$T_i^* = \arg\min_{T_i \in \Delta_m} \langle T_i, C_i \rangle + \varepsilon \, \mathrm{KL}(T_i \,\|\, \pi_i)$$

where $T_i$ is the attention distribution of query $i$ over the $m$ keys ($\Delta_m$ denoting the probability simplex), $C_i$ its cost row, and $\pi_i$ a trainable prior, with explicit solution

$$T_{ij}^* = \frac{\pi_{ij} \exp(-C_{ij}/\varepsilon)}{\sum_{j'} \pi_{ij'} \exp(-C_{ij'}/\varepsilon)}$$

which recovers standard softmax attention for uniform $\pi_i$, $\varepsilon = 1$, and $C_{ij} = -\langle q_i, k_j \rangle$. The prior $\pi$ is parameterized to capture relative positional biases via shift-invariant (Fourier) or key-only sink terms. This enhances expressiveness (e.g., modeling arbitrary periodicities), enables learned stabilization against "attention sinks" (keys that absorb overwhelming probability when the content signal is weak), and supports length extrapolation without the pathologies of fixed-form position encodings.
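A prior-weighted softmax of this kind is easy to sketch (a generic illustration; GOAT's actual priors are learned shift-invariant or sink parameterizations):

```python
import numpy as np

def prior_softmax_attention(q, K, log_prior, eps=1.0):
    """Attention weights T_j proportional to pi_j * exp(q.k_j / eps).

    q         : (d,) query vector
    K         : (m, d) key matrix
    log_prior : (m,) log pi_j; a uniform log-prior recovers plain softmax
    """
    logits = (K @ q) / eps + log_prior
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

Since the prior enters only as an additive term in the logits, the result is a drop-in modification of standard attention, which is why kernel-level optimizations such as FlashAttention remain applicable.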

Empirical results indicate superior perplexity, compositional generalization, and computational efficiency—confirmed on language modeling, retrieval, and vision tasks—while fully retaining compatibility with optimized attention kernels such as FlashAttention.

7. Implementation Strategies, Efficiency, and Empirical Observations

AGOT is consistently implemented with a focus on end-to-end differentiability. Most frameworks employ the Sinkhorn-Knopp algorithm (5-80 iterations are typical depending on the domain and plan size), with moderate entropy regularization $\varepsilon$. Cost matrices are either constructed directly from cross-attended or bi-linearly paired features, or further optimized for specific entropy behavior (as in MESH). Batch sizes, embedding dimensions, and GNN or Transformer backbones are selected in line with task scale and complexity.

Empirically, AGOT methods demonstrate:

  • Sparse, globally consistent alignments replacing diffuse local attention.
  • Enhanced robustness to ambiguity and noise, by enforcing explicit marginal constraints.
  • Clear gains in cross-modal retrieval, node/graph classification, segmentation, summarization informativeness, and long-sequence modeling when compared with non-OT attention or unstructured matching approaches.
  • Compatibility with modern high-speed attention kernels and scalable computation.
