
Semantic Optimal Transport Attention

Updated 17 September 2025
  • Semantic optimal transport attention is a mechanism that reframes aspect-context alignment as a discrete optimal transport problem to capture nonlinear semantic dependencies.
  • It leverages the Sinkhorn algorithm with entropy regularization to compute cost-based attention weights that downweight noisy tokens.
  • The fusion with syntactic graph-aware attention and contrastive regularization leads to robust improvements in sentiment analysis benchmarks.

Semantic optimal transport attention (SOTA) combines optimal transport theory with deep attention mechanisms to achieve fine-grained, nonlinear semantic alignment between elements in natural language sequences. It is designed to overcome the limitations of standard attention models, particularly in aspect-based sentiment analysis (ABSA), where relationships between aspect terms and opinion words must be captured amidst noisy or irrelevant context.

1. Mathematical Formulation of Semantic Optimal Transport Attention

Semantic optimal transport attention is formulated by recasting the aspect–context alignment problem as a discrete optimal transport instance. Given context word embeddings $H^s \in \mathbb{R}^{n \times d}$ (from BERT) and an aggregated aspect embedding $h_a' \in \mathbb{R}^d$ (computed via average pooling over the aspect tokens),

$$h_a' = \frac{1}{m} \sum_{j=1}^{m} h_{\text{pos}(a_j)}$$

the semantic dissimilarity (cost) between each context token and the aspect center is computed using cosine distance: $\text{Cost} = 1 - \frac{H^s (h_a')^\top}{\|H^s\| \, \|h_a'\|}$

These cost values populate the cost matrix $C \in \mathbb{R}^{n \times 1}$. The context and the aspect center are further transformed into discrete probability distributions: $\mu = \text{softmax}(F_\mu(H^s)), \qquad \nu = [1] \in \Delta^1$, where $F_\mu$ is a feedforward layer and $\Delta^1$ denotes the 1-dimensional probability simplex.
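This construction can be sketched in a few lines of NumPy. The sketch below is a minimal illustration, assuming `H_s` holds the BERT context embeddings, `aspect_idx` indexes the aspect tokens, and `W_mu`, `b_mu` stand in for the feedforward layer $F_\mu$; all names are illustrative rather than taken from a reference implementation.

```python
import numpy as np

def aspect_cost_and_marginals(H_s, aspect_idx, W_mu, b_mu):
    """Cosine-distance cost to the aspect center plus the discrete marginals."""
    # Aspect semantic center h_a': average pooling over the aspect token embeddings.
    h_a = H_s[aspect_idx].mean(axis=0)                       # (d,)

    # Cost = 1 - cosine similarity between each context token and the aspect center.
    num = H_s @ h_a                                          # (n,)
    den = np.linalg.norm(H_s, axis=1) * np.linalg.norm(h_a) + 1e-12
    cost = (1.0 - num / den)[:, None]                        # cost matrix C in R^{n x 1}

    # Context marginal mu: softmax over feedforward scores F_mu(H^s).
    scores = H_s @ W_mu + b_mu                               # (n,)
    scores -= scores.max()                                   # numerical stability
    mu = np.exp(scores) / np.exp(scores).sum()               # (n,)

    # Aspect marginal nu: a single point mass on the aspect center.
    nu = np.array([1.0])
    return cost, mu, nu
```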

Each attention head $k$ utilizes a regularization parameter $\varepsilon^k$ to control the entropy of the transport plan. The Sinkhorn kernel is defined as

$$K^k = \exp(-\text{Cost}/\varepsilon^k)$$

and the dual variables are iteratively updated: $u \leftarrow \mu/(K^k v), \qquad v \leftarrow \nu/((K^k)^\top u)$. The optimal transport plan (attention weights) is then computed as

$$\pi^k = \text{diag}(u)\, K^k\, \text{diag}(v)$$

The resulting semantic attention $\mathbf{A}_{\mathrm{OT}}^k = \pi^k$ selectively weights context tokens by their transport cost to the aspect semantic center, producing fine-grained, nonlinear alignments.
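The Sinkhorn iterations themselves are straightforward to sketch. The following is a minimal NumPy version of one attention head, continuing the assumptions of the previous sketch (a fixed iteration count and small constants for numerical safety are assumptions; the exact stopping criterion is not specified in this summary).

```python
import numpy as np

def sinkhorn_attention(cost, mu, nu, eps=0.1, n_iters=50):
    """Entropy-regularized Sinkhorn iterations for one attention head k.

    cost: (n, m) cost matrix (here m = 1, the aspect center)
    mu:   (n,) context marginal, nu: (m,) aspect marginal
    eps:  per-head entropy regularization epsilon^k
    """
    K = np.exp(-cost / eps)                  # Gibbs kernel K^k = exp(-Cost / eps^k)
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v + 1e-12)             # u <- mu / (K^k v)
        v = nu / (K.T @ u + 1e-12)           # v <- nu / ((K^k)^T u)
    return np.diag(u) @ K @ np.diag(v)       # pi^k = diag(u) K^k diag(v)
```

The iterations alternately rescale rows and columns so that the plan's marginals match $\mu$ and $\nu$; the resulting $\pi^k$ is then used as the head's attention weights over context tokens.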

2. Modeling Nonlinear Semantic Dependencies

Unlike dot-product or cosine similarity attention, semantic OT attention directly minimizes transport cost, inherently supporting nonlinear and one-to-many correspondences. The entropy parameter $\varepsilon^k$ per head allows tuning between sharp, selective alignments (low entropy) and smooth, distributed ones (high entropy), enabling the multi-head architecture to model varying granularities of semantic association. This approach is robust to textual noise and confounding tokens, as the transport plan naturally downweights irrelevant words and highlights those most cost-efficiently aligned with the aspect.
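To see the sharpness/spread trade-off controlled by $\varepsilon^k$, the toy example below runs the Sinkhorn sketch from Section 1 on a small random cost matrix (a general $n \times m$ shape is used purely for illustration, not the $n \times 1$ aspect-center case): the plan's entropy grows with $\varepsilon$, moving from near-sparse, selective alignments toward the smooth product of the marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
cost = rng.random((5, 3))            # toy 5-source x 3-target cost matrix
mu = np.full(5, 1 / 5)               # uniform source marginal
nu = np.full(3, 1 / 3)               # uniform target marginal

for eps in (0.05, 0.2, 1.0):
    pi = sinkhorn_attention(cost, mu, nu, eps=eps)     # sketch from Section 1
    entropy = -(pi * np.log(pi + 1e-12)).sum()
    print(f"eps={eps:4.2f}  plan entropy={entropy:.3f}")
# Low eps -> sharp, selective alignments; high eps -> smooth, distributed ones.
```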

3. Integration with Syntactic and Semantic Channels

OTESGN employs both syntactic graph-aware attention (SGAA), which leverages dependency-tree structure to capture latent syntactic dependencies, and SOTA for semantic matching. The adaptive attention fusion (AAF) module integrates these heterogeneous signals: $A^k = \beta \cdot A_{\mathrm{SG}}^k + (1-\beta) \cdot A_{\mathrm{OT\_mat}}^k$, where $A_{\mathrm{SG}}^k$ is the structure-aware attention matrix from the syntactic graph, $A_{\mathrm{OT\_mat}}^k$ is the OT-based attention broadcast along columns, and $\beta$ is a learnable fusion parameter that typically converges to 0.1–0.3. The fusion ensures that both syntactic topology and semantic cost-based matching contribute dynamically to the attention distribution.
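As a deliberately simplified sketch of the fusion step, the snippet below broadcasts the $(n, 1)$ OT attention across the columns of the $(n, n)$ syntactic attention matrix and mixes the two with a scalar weight. In the model $\beta$ is learnable; here it is fixed, and all names are illustrative.

```python
import numpy as np

def fuse_attention(A_sg, pi_ot, beta=0.2):
    """Adaptive attention fusion sketch for one head.

    A_sg:  (n, n) syntactic graph-aware attention
    pi_ot: (n, 1) OT-based attention over context tokens
    beta:  fusion weight (learnable in the model, fixed here for illustration)
    """
    A_ot_mat = np.broadcast_to(pi_ot, A_sg.shape)   # replicate OT weights along columns
    return beta * A_sg + (1.0 - beta) * A_ot_mat
```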

4. Contrastive Regularization for Robustness

Contrastive regularization further improves feature discriminability by pulling together the aspect representations (after pooling and attention) of samples that share a sentiment label and pushing apart those with differing sentiment. The regularization is formalized as: $\mathcal{L}_c(\theta) = -\frac{1}{K} \sum_{i \in \mathcal{I}} \log \frac{ \exp(\text{sim}(h_{a,\mathrm{pool}}^i, h_{a,\mathrm{pool}}^{i^+})/\tau) }{ \sum_{j \in \mathcal{I}} \exp(\text{sim}(h_{a,\mathrm{pool}}^i, h_{a,\mathrm{pool}}^j)/\tau) }$, where $\text{sim}(x, y)$ denotes cosine similarity, $h_{a,\mathrm{pool}}$ is the pooled aspect representation after attention, $\tau$ is the temperature parameter, and $\mathcal{I}$ denotes the batch. This loss encourages label-aware clustering in representation space and amplifies resistance to label noise and sentiment ambiguity.
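A minimal batch-level sketch of this term is given below. It treats every same-label sample as a positive and excludes the anchor from the denominator; the summary does not pin down positive sampling or self-exclusion, so these are assumptions, and the function and variable names are illustrative.

```python
import numpy as np

def contrastive_regularizer(h_pool, labels, tau=0.1):
    """Label-aware contrastive term over pooled aspect representations in a batch."""
    h = h_pool / (np.linalg.norm(h_pool, axis=1, keepdims=True) + 1e-12)
    sim = (h @ h.T) / tau                       # pairwise cosine similarity / temperature
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue                            # anchor has no same-label partner in the batch
        denom = np.exp(sim[i]).sum() - np.exp(sim[i, i])   # exclude the anchor itself
        for p in positives:                     # average over same-label positives
            loss -= np.log(np.exp(sim[i, p]) / (denom + 1e-12))
            count += 1
    return loss / max(count, 1)
```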

5. Performance and Empirical Impact

OTESGN, which integrates semantic OT attention, syntactic attention, adaptive fusion, and contrastive regularization, achieves robust improvements on aspect-based sentiment analysis benchmarks. The method reports macro-F1 gains of +1.01% on Twitter and +1.30% on Laptop14 over the previous state of the art. Visual analysis shows more precise activation on opinion-bearing words and increased robustness to irrelevant context. A key cited benefit is improved, or at least preserved, localization of sentiment-predictive tokens in the presence of textual noise and long-range dependencies in social and informal domains.

6. Conclusion and Significance

Semantic optimal transport attention operationalizes fine-grained, cost-based alignment between context and aspect representations by solving a constrained transport problem with the Sinkhorn algorithm, parameterized to trade off sharpness and spread in the attention distribution. In the OTESGN framework, its fusion with syntactic graph-aware attention, regularized by contrastive learning, achieves stronger association between aspect terms and sentiment cues and yields empirical improvements on noisy and structurally complex text. The successful application to ABSA benchmarks suggests promising directions for extending SOTA to other tasks that require resilience to noisy alignments and nonlinear semantic dependencies, including broader opinion mining and multimodal sentiment inference.
