AGOT: Attention-Guided Optimal Transport
- Attention-Guided Optimal Transport is a method that combines entropy-regularized optimal transport with attention to produce structured and sparse alignments.
- It leverages geometric cost matrices and the Sinkhorn algorithm to achieve globally consistent matching in tasks like cross-modal retrieval and network embedding.
- AGOT is applied in diverse areas from forensic face identification and object-centric modeling to text summarization, demonstrating improved robustness and efficiency.
Attention-Guided Optimal Transport (AGOT) is a class of methods that synthesizes optimal transport (OT) theory and attention mechanisms to produce sparse, semantically meaningful, and globally consistent alignments or matchings between sets, sequences, graphs, or modalities. Unlike conventional attention schemes based solely on dot products or pairwise similarity, AGOT leverages the geometric properties of OT to enable structured, entropy-regularized, and, in some cases, cost-adapted attention. AGOT has been instantiated in diverse research contexts, including cross-modal retrieval, object-centric modeling, network embedding, document summarization, and general-purpose sequence modeling.
1. Mathematical Foundations of Attention-Guided Optimal Transport
The core of AGOT is the entropy-regularized Kantorovich optimal transport problem. Given two finite collections (nodes, tokens, features) represented by distributions $\mu \in \Delta^{n}$ and $\nu \in \Delta^{m}$ over their respective supports, and a cost matrix $C \in \mathbb{R}^{n \times m}$ such that $C_{ij}$ quantifies the dissimilarity between the $i$-th element of the source and the $j$-th element of the target, entropy-regularized OT seeks a transport plan $T^{\star}$:

$$T^{\star} = \arg\min_{T \in \Pi(\mu, \nu)} \; \langle T, C \rangle - \epsilon \, H(T),$$

where $\Pi(\mu, \nu) = \{ T \in \mathbb{R}_{+}^{n \times m} : T \mathbf{1}_m = \mu, \; T^{\top} \mathbf{1}_n = \nu \}$ is the transportation polytope enforcing marginal constraints, $\langle T, C \rangle = \sum_{ij} T_{ij} C_{ij}$ is the total transport cost, $H(T) = -\sum_{ij} T_{ij} (\log T_{ij} - 1)$ is the entropy of the plan, and $\epsilon > 0$ is a regularization parameter controlling the softness of the transport. This relaxation enables efficient solutions via the Sinkhorn algorithm, which alternates row and column renormalizations of the corresponding Gibbs kernel $K = \exp(-C/\epsilon)$.
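For concreteness, the following is a minimal NumPy sketch of the Sinkhorn iteration described above; the function and variable names (`sinkhorn`, `cost`, `mu`, `nu`, `eps`) are illustrative rather than drawn from any cited implementation.

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.1, n_iters=50):
    """Entropy-regularized OT plan via Sinkhorn-Knopp scaling.

    cost : (n, m) dissimilarity matrix C
    mu   : (n,) source marginal (sums to 1)
    nu   : (m,) target marginal (sums to 1)
    eps  : entropy regularization strength epsilon
    """
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)                  # column scaling
        u = mu / (K @ v)                    # row scaling
    return u[:, None] * K * v[None, :]      # transport plan T*

# Toy usage: align 4 source elements with 3 target elements.
rng = np.random.default_rng(0)
C = rng.random((4, 3))
T = sinkhorn(C, np.full(4, 0.25), np.full(3, 1 / 3))
print(T.sum(axis=1), T.sum(axis=0))         # approximately recovers the marginals
```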
Crucially, attention guides the transport in one of several ways: in the construction of the cost matrix, where cross-attention refines features prior to cost evaluation (as in the SPOT-Face framework (Prasad et al., 14 Jan 2026)); by parameterizing the plan directly in the logit domain, e.g., through bilinear forms for textual alignment (Shen et al., 7 Oct 2025); or by introducing trainable structural priors into the regularizer (as in GOAT (Litman et al., 21 Jan 2026)). In some frameworks, the cost itself is optimized by gradient descent to induce desired entropy properties in the resulting plan (MESH (Zhang et al., 2023)).
2. AGOT in Cross-Modal and Graph Matching
SPOT-Face (Prasad et al., 14 Jan 2026) exemplifies AGOT for cross-domain graph matching in forensic face identification, specifically matching skull/skeletal and sketch images to conventional face images. Each image is converted into a superpixel graph, with node features processed by a graph neural network (GNN). Cross-attention refines the node embeddings between modalities:
$$\tilde{X} = \mathrm{softmax}\!\left(\frac{(X W_Q)(Y W_K)^{\top}}{\sqrt{d}}\right) Y W_V, \qquad \tilde{Y} = \mathrm{softmax}\!\left(\frac{(Y W_Q)(X W_K)^{\top}}{\sqrt{d}}\right) X W_V,$$

where $X$ and $Y$ are the GNN node features of the two modalities.
A cosine-distance cost matrix is constructed as
$$C_{ij} = 1 - \frac{\langle \tilde{x}_i, \tilde{y}_j \rangle}{\lVert \tilde{x}_i \rVert \, \lVert \tilde{y}_j \rVert},$$

where $\tilde{x}_i$ and $\tilde{y}_j$ are the cross-attended node embeddings.
Solving the entropy-regularized OT problem yields a transport plan $T^{\star}$ encoding probabilistic correspondences between the refined node sets. These correspondences are then pooled to graph-level representations, on which discriminative losses (such as the triplet loss) are computed:

$$\mathcal{L}_{\text{triplet}} = \max\!\left(0, \; d(g_a, g_p) - d(g_a, g_n) + m\right),$$

where $g_a$, $g_p$, and $g_n$ are anchor, positive, and negative graph-level embeddings, $d(\cdot,\cdot)$ is a distance in the embedding space, and $m$ is the margin.
The entire process, including GNN, attention, and Sinkhorn steps, is differentiable and optimized end-to-end. This approach substantially improves recall and mAP in identifying forensic matches, highlighting AGOT's capacity for learning structured cross-modal alignments.
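A condensed PyTorch-style sketch of the resulting alignment-and-pooling step is shown below; it assumes node features `Xa` and `Xb` have already been produced by the GNN and cross-attention stages, and the mass-weighted pooling is one plausible instantiation of the graph-level readout rather than the authors' exact design.

```python
import torch
import torch.nn.functional as F

def sinkhorn_torch(C, eps=0.1, n_iters=50):
    """Differentiable Sinkhorn with uniform marginals (autograd flows through the loop)."""
    n, m = C.shape
    K = torch.exp(-C / eps)
    mu, nu = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    u = torch.ones(n) / n
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

def agot_graph_embedding(Xa, Xb):
    """Xa, Xb: cross-attention-refined node features of the two modalities."""
    Xa, Xb = F.normalize(Xa, dim=-1), F.normalize(Xb, dim=-1)
    C = 1.0 - Xa @ Xb.T                     # cosine-distance cost matrix
    T = sinkhorn_torch(C)                   # soft node-to-node correspondences
    weights = T.sum(dim=1, keepdim=True)    # mass transported from each source node
    return (weights * Xa).sum(dim=0)        # graph-level representation for the triplet loss
```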
3. Content-Aware Sparse Attention and Network Embedding
In textual network embedding (Chen et al., 2019), AGOT replaces standard dot-product attention between token sequences with a transport plan solving
$$T^{\star} = \arg\min_{T \in \Pi(\mu, \nu)} \; \langle T, C \rangle - \epsilon \, H(T),$$

for $C_{ij}$ typically the squared Euclidean or cosine distance between token embeddings of the two nodes' associated texts. The resulting $T^{\star}$ provides a context-sensitive, sparse, and self-normalized alignment. Optionally, a CNN-based attention parser further processes $T^{\star}$ to extract higher-level global or structural alignment features, leading to gains in link prediction and node classification tasks.
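A minimal sketch of this OT-based attention between two token sequences, reusing the `sinkhorn` routine from the Section 1 sketch and assuming pre-computed token embeddings (the CNN attention parser is omitted):

```python
import numpy as np

def ot_attention(E_u, E_v, eps=0.1):
    """OT alignment between the token embeddings of two nodes' texts.

    E_u : (n, d) token embeddings of node u's text
    E_v : (m, d) token embeddings of node v's text
    Returns a sparse, self-normalized alignment matrix of shape (n, m).
    """
    # squared-Euclidean cost between every pair of tokens
    C = ((E_u[:, None, :] - E_v[None, :, :]) ** 2).sum(-1)
    mu = np.full(E_u.shape[0], 1.0 / E_u.shape[0])
    nu = np.full(E_v.shape[0], 1.0 / E_v.shape[0])
    return sinkhorn(C, mu, nu, eps=eps)      # `sinkhorn` as defined in the Section 1 sketch
```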
4. Object-Centric Modeling: Hard Assignments and Entropy Control
Slot attention mechanisms have been recast as a one-step Sinkhorn solution to the entropy-regularized OT (Zhang et al., 2023). The unregularized case ($\epsilon \to 0$) yields hard, one-to-one assignments crucial for object-centric representations in complex scenes and dynamic videos, enabling unambiguous slot-to-object mappings. The MESH module interpolates between soft and hard assignments by optimizing the cost matrix to drive down the entropy of the Sinkhorn plan, thus recovering the tiebreaking and exclusive-equivalence properties of hard OT while retaining the parallelism and smooth gradients of the regularized regime. This approach has led to substantial improvements in mAP and segmentation quality across multiple object discovery benchmarks.
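The sketch below illustrates the core idea of driving down the entropy of a Sinkhorn plan by gradient steps on the cost matrix, in the spirit of MESH; the step size, number of inner steps, and the reuse of `sinkhorn_torch` from the SPOT-Face sketch are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def low_entropy_plan(C, eps=1.0, mesh_steps=5, lr=0.5, sinkhorn_iters=30):
    """Push a Sinkhorn plan toward a near-hard assignment by optimizing the cost."""
    C = C.clone().detach().requires_grad_(True)
    for _ in range(mesh_steps):
        T = sinkhorn_torch(C, eps=eps, n_iters=sinkhorn_iters)   # from the SPOT-Face sketch
        entropy = -(T * T.clamp_min(1e-9).log()).sum()           # entropy of the current plan
        (grad,) = torch.autograd.grad(entropy, C)
        C = (C - lr * grad).detach().requires_grad_(True)        # descend on plan entropy
    return sinkhorn_torch(C.detach(), eps=eps, n_iters=sinkhorn_iters)
```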
5. Informative Attention, Reference Grounding, and Text Generation
InforME (Shen et al., 7 Oct 2025) leverages a form of AGOT tailored for abstractive summarization. Given encoder (source) and decoder (summary) representations, it constructs a transport plan $T$ via a bilinear parameterization and row-wise softmax, aligning source tokens to reference summary tokens. The resulting plan provides "reverse attention" scores quantifying the informativeness of each source token relative to the reference. These scores are integrated into the standard cross-attention via convex fusion, leading to improved coverage of salient content. Complementing this, the Accumulative Joint Entropy Reduction (AJER) mechanism uses conditional sequence entropy to regularize named entity representations, further enhancing summary informativeness and factual consistency.
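A hedged sketch of the bilinear alignment and convex fusion described above; the bilinear matrix `W`, the fusion weight `lam`, and the reduction of the plan to per-source-token scores are assumptions for illustration, not the exact InforME formulation.

```python
import torch
import torch.nn.functional as F

def reverse_attention_scores(H_src, H_ref, W):
    """Bilinear reference-to-source alignment; one informativeness score per source token."""
    logits = H_ref @ W @ H_src.T            # (n_ref, n_src) bilinear logits
    T = F.softmax(logits, dim=-1)           # row-wise softmax: each reference token spreads mass over source tokens
    return T.sum(dim=0) / T.shape[0]        # average mass received by each source token ("reverse attention")

def fuse_attention(cross_attn, rev_scores, lam=0.3):
    """Convex fusion of decoder cross-attention rows with reverse-attention scores."""
    fused = (1 - lam) * cross_attn + lam * rev_scores[None, :]   # broadcast over decoder steps
    return fused / fused.sum(dim=-1, keepdim=True)               # renormalize over source tokens
```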
Key empirical results (on CNN/DailyMail):
| Model Variant | ROUGE-1 / ROUGE-2 / ROUGE-L | Human informativeness |
|---|---|---|
| BART-large baseline | 44.16 / 21.28 / 40.90 | 18.3% |
| + OT only | 44.67 / 21.16 / 41.59 | — |
| + AJER only | 44.67 / 21.31 / 41.58 | — |
| OT + AJER (InforME) | 44.75 / 21.54 / 41.69 | 30.0% |
The global nature of the OT plan under AGOT ensures that salient but contextually rare information is upweighted, a property not achieved by vanilla cross-attention.
6. Generalized Attention with Trainable Priors: The GOAT Framework
The standard softmax attention can be derived as a special case of a one-sided entropic OT with uniform prior (Litman et al., 21 Jan 2026). GOAT generalizes this by incorporating arbitrary, learning-adapted priors into the entropic regularizer:
$$p_i = \arg\min_{p \in \Delta^{m}} \; \sum_{j} p_j \, c_{ij} + \epsilon \, \mathrm{KL}\!\left(p \,\|\, \pi_i\right),$$

with explicit solution

$$p_{ij} = \frac{\pi_{ij} \exp(-c_{ij}/\epsilon)}{\sum_{k} \pi_{ik} \exp(-c_{ik}/\epsilon)},$$

which reduces to standard softmax attention when the prior $\pi_i$ is uniform, $\epsilon = 1$, and $c_{ij} = -q_i^{\top} k_j / \sqrt{d}$. The prior structure $\pi$ is parameterized to capture relative positional biases via shift-invariant (Fourier) or key-only sink terms. This enhances expressiveness (e.g., modeling arbitrary periodicities), enables learned stabilization against "attention sinks" (keys with overwhelming probability under low content signal), and supports length extrapolation without the pathologies of fixed-form position encodings.
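A small sketch of prior-augmented attention in this spirit, assuming a learned shift-invariant log-prior over relative key offsets (a simplified illustration, not GOAT's exact Fourier parameterization):

```python
import torch
import torch.nn.functional as F

def prior_attention(Q, K, V, log_prior):
    """Softmax attention with a learned log-domain prior over keys.

    Q, K, V   : (n, d), (m, d), (m, d_v)
    log_prior : (n, m) log pi_{ij}; all-zeros recovers vanilla softmax attention
    """
    scores = Q @ K.T / Q.shape[-1] ** 0.5              # content term -c_{ij}/eps
    weights = F.softmax(scores + log_prior, dim=-1)    # p_{ij} proportional to pi_{ij} exp(-c_{ij}/eps)
    return weights @ V

# Example: a shift-invariant prior that depends only on the offset j - i.
n = m = 8
rel = torch.arange(m)[None, :] - torch.arange(n)[:, None]    # (n, m) relative offsets
offset_bias = torch.nn.Parameter(torch.zeros(2 * n - 1))     # learnable per-offset log-prior
out = prior_attention(torch.randn(n, 16), torch.randn(m, 16),
                      torch.randn(m, 32), offset_bias[rel + n - 1])
```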
Empirical results indicate lower perplexity, stronger compositional generalization, and improved computational efficiency on language modeling, retrieval, and vision tasks, while fully retaining compatibility with optimized attention kernels such as FlashAttention.
7. Implementation Strategies, Efficiency, and Empirical Observations
AGOT is consistently implemented with a focus on end-to-end differentiability. Most frameworks employ the Sinkhorn-Knopp algorithm (5–80 iterations are typical, depending on the domain and plan size) with moderate entropy regularization $\epsilon$. Cost matrices are either constructed directly from cross-attended or bilinearly paired features, or further optimized for specific entropy behavior (as in MESH). Batch sizes, embedding dimensions, and GNN or Transformer backbones are selected in line with task scale and complexity.
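As a quick check of the end-to-end differentiability emphasized above, gradients can flow from a loss on the transport plan back to the features that produced the cost (a toy example reusing `sinkhorn_torch` from the SPOT-Face sketch):

```python
import torch

Xa = torch.randn(5, 16, requires_grad=True)
Xb = torch.randn(7, 16)
C = torch.cdist(Xa, Xb)                      # pairwise Euclidean cost
T = sinkhorn_torch(C, eps=0.1, n_iters=30)   # differentiable plan (SPOT-Face sketch)
loss = (T * C).sum()                         # transport cost as a toy objective
loss.backward()
print(Xa.grad.shape)                         # gradients reach the source features
```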
Empirically, AGOT methods demonstrate:
- Sparse, globally consistent alignments replacing diffuse local attention.
- Enhanced robustness to ambiguity and noise, by enforcing explicit marginal constraints.
- Clear gains in cross-modal retrieval, node/graph classification, segmentation, summarization informativeness, and long-sequence modeling when compared with non-OT attention or unstructured matching approaches.
- Compatibility with modern high-speed attention kernels and scalable computation.
References
- SPOT-Face: Forensic Face Identification using Attention Guided Optimal Transport (Prasad et al., 14 Jan 2026)
- Improving Textual Network Embedding with Global Attention via Optimal Transport (Chen et al., 2019)
- Unlocking Slot Attention by Changing Optimal Transport Costs (Zhang et al., 2023)
- InforME: Improving Informativeness of Abstractive Text Summarization With Informative Attention Guided by Named Entity Salience (Shen et al., 7 Oct 2025)
- You Need Better Attention Priors (Litman et al., 21 Jan 2026)