Sparse Sinkhorn Attention
- Sparse Sinkhorn Attention is a class of differentiable attention mechanisms that leverages the Sinkhorn operator to generate soft permutations for adaptive, sparse receptive fields.
- It employs block-wise permutation with dynamic sequence truncation (SortCut) in Transformers, and edge-level sparsification in GNNs, to reduce computational complexity while preserving quasi-global context.
- Empirical evaluations demonstrate competitive performance against full and local attention methods, highlighting efficiency gains and the potential for further structural optimizations.
Sparse Sinkhorn Attention is a class of attention mechanisms that combine differentiable sorting via the Sinkhorn operator with the construction of sparse or softly sparse receptive fields. By leveraging block-wise or structural permutation and optimal transport-driven selection, these methods achieve quasi-global attention patterns while reducing the computational and memory complexity typical of the standard self-attention paradigm. This family includes both sequence-based Transformer architectures and graph neural network (GNN) extensions, unified by their reliance on entropic optimal transport and end-to-end differentiability (Tay et al., 2020, Ding et al., 11 Feb 2024).
1. Architectural Foundations in Sequence Models
Sparse Sinkhorn Attention, as introduced in Transformer models, operates on an input sequence of length $\ell$ with embedding dimension $d$ by partitioning it into $N_B = \ell / B$ non-overlapping blocks of size $B$. Rather than relying on fixed local or pre-determined sparse attention patterns, a meta sorting network ("SortNet") computes a score vector for each block, yielding a non-negative sorting-logits matrix $R \in \mathbb{R}^{N_B \times N_B}$.
These logits are converted to a doubly stochastic matrix using the Sinkhorn normalization, which approximates a permutation matrix under entropic relaxation. The resulting "soft permutation" is applied to the input, effectively rearranging blocks according to learned, input-dependent priorities. Standard block-local dot-product attention is then performed on the permuted sequence blocks, granting each token access to non-local context at a reduced computational cost.
To address the requirements of autoregressive decoding, a causal variant of both block pooling and Sinkhorn normalization is introduced. For pure encoding, the SortCut algorithm enables dynamic sequence truncation by selecting only the top $k$ sorted blocks for attention, further scaling down memory requirements (Tay et al., 2020).
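A minimal sketch of this pipeline is shown below, assuming a single head, a single-linear-layer SortNet, and mean-pooled block representations; the helper names (`log_sinkhorn`, `sparse_sinkhorn_attention`) and the choice to attend only to the softly permuted counterpart blocks are illustrative simplifications, not the reference implementation.

```python
import torch

def log_sinkhorn(logits, n_iters=8):
    """Row/column normalization in log-space; returns a (near) doubly stochastic matrix."""
    for _ in range(n_iters):
        logits = logits - torch.logsumexp(logits, dim=-1, keepdim=True)  # row normalize
        logits = logits - torch.logsumexp(logits, dim=-2, keepdim=True)  # column normalize
    return logits.exp()

def sparse_sinkhorn_attention(x, block_size, w_sort, temperature=1.0):
    """x: (seq_len, d_model). Blocks the sequence, learns a soft block permutation,
    and runs block-local attention against the permuted blocks (keys/values)."""
    seq_len, d = x.shape
    n_blocks = seq_len // block_size
    blocks = x.view(n_blocks, block_size, d)

    # Block pooling + SortNet: one score per (source block, target block) pair.
    pooled = blocks.mean(dim=1)                        # (n_blocks, d)
    logits = pooled @ w_sort                           # (n_blocks, n_blocks) sorting logits
    perm = log_sinkhorn(logits / temperature)          # soft permutation (doubly stochastic)

    # Soft permutation of blocks: each output block is a convex mix of input blocks.
    permuted = torch.einsum("ij,jld->ild", perm, blocks)

    # Block-local attention: queries from original blocks, keys/values from permuted blocks.
    attn = torch.softmax(
        torch.einsum("bqd,bkd->bqk", blocks, permuted) / d ** 0.5, dim=-1)
    out = torch.einsum("bqk,bkd->bqd", attn, permuted)
    return out.reshape(seq_len, d)

# Example: 512 tokens, block size 64, d_model 128.
x = torch.randn(512, 128)
w_sort = torch.randn(128, 512 // 64)  # SortNet as a single linear map to n_blocks scores
y = sparse_sinkhorn_attention(x, block_size=64, w_sort=w_sort)
print(y.shape)  # torch.Size([512, 128])
```

In practice this runs per attention head, and the output can be combined with the token's own (unsorted) block, as in the mixture variants discussed below.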
2. Differentiable Sorting via the Sinkhorn Operator
The core of Sparse Sinkhorn Attention is its use of the Sinkhorn operator for soft sorting. For a non-negative matrix (here the exponentiated sorting logits $\exp(R)$), the Sinkhorn iterations alternately normalize rows and columns to produce a matrix in the Birkhoff polytope (the set of doubly stochastic matrices), serving as a soft approximation to a permutation matrix.
The update rules are:
- $S^0(R) = \exp(R)$;
- $S^t(R) = \mathcal{T}_c\big(\mathcal{T}_r(S^{t-1}(R))\big)$, where $\mathcal{T}_r(X)_{ij} = X_{ij} / \sum_{j'} X_{ij'}$ and $\mathcal{T}_c(X)_{ij} = X_{ij} / \sum_{i'} X_{i'j}$ are the row and column normalizations, respectively.
In log-space for numerical stability, each normalization subtracts a log-sum-exp term, e.g. $\log \mathcal{T}_r(X)_{ij} = \log X_{ij} - \log \sum_{j'} \exp\big(\log X_{ij'}\big)$ for the row step, and analogously for the column step.
Causal Sinkhorn Balancing adapts these normalizations by masking out future-information pathways to enforce autoregressive constraints. The Sinkhorn iterations are typically run for a small, fixed number of steps (Tay et al., 2020).
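A quick numeric check of these iterations (the matrix size, iteration count, and temperature sweep are illustrative choices, not values from the paper):

```python
import torch

def sinkhorn(logits, n_iters=20):
    # Log-space row/column normalization: subtracting the row (column) log-sum-exp
    # is equivalent to dividing each row (column) by its sum.
    for _ in range(n_iters):
        logits = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
        logits = logits - torch.logsumexp(logits, dim=-2, keepdim=True)
    return logits.exp()

scores = torch.randn(4, 4)
for tau in (1.0, 0.1, 0.01):
    P = sinkhorn(scores / tau)
    row_sums = [round(v, 3) for v in P.sum(-1).tolist()]
    print(f"tau={tau}: row sums ~ {row_sums}")
# Lower temperatures push P toward a hard permutation matrix;
# higher temperatures give a smoother doubly stochastic mixing matrix.
print(P.round())
```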
3. Dynamic Sequence and Subgraph Truncation
SortCut, an algorithmic innovation in the sequence domain, enables dynamic selection of the most relevant tokens by truncating to the top $k$ permutation-sorted blocks post-Sinkhorn. This reduces attention computation to the most salient sub-sequence, scaling memory complexity closer to linear in $\ell$ when $k \ll N_B$. The process involves block pooling, scoring with SortNet, Sinkhorn normalization, soft permutation, and finally local attention over the truncated set (Tay et al., 2020).
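A sketch of the truncation step under the same assumptions as above (`sortcut_keys_values`, the budget `k_blocks`, and the tensor shapes are illustrative):

```python
import torch

def sinkhorn(logits, n_iters=8):
    for _ in range(n_iters):
        logits = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
        logits = logits - torch.logsumexp(logits, dim=-2, keepdim=True)
    return logits.exp()

def sortcut_keys_values(x, block_size, w_sort, k_blocks):
    """Return only the top-k_blocks worth of (softly) sorted tokens to attend over."""
    n_blocks = x.shape[0] // block_size
    blocks = x.view(n_blocks, block_size, -1)
    logits = blocks.mean(dim=1) @ w_sort                # SortNet scores -> (n_blocks, n_blocks)
    perm = sinkhorn(logits)                             # soft permutation
    sorted_blocks = torch.einsum("ij,jld->ild", perm, blocks)
    # SortCut: keep only the first k_blocks of the sorted sequence as keys/values,
    # so every query attends to a truncated, input-dependent context.
    return sorted_blocks[:k_blocks].reshape(k_blocks * block_size, -1)

x = torch.randn(1024, 64)
w_sort = torch.randn(64, 1024 // 128)
kv = sortcut_keys_values(x, block_size=128, w_sort=w_sort, k_blocks=2)
print(kv.shape)  # torch.Size([256, 64]) -- attention now scales with k_blocks, not seq_len
```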
Analogous ideas are extended to graphs in the GSINA framework, where edge importance scores are soft-assigned to "invariant" or "variant" bins via a 2-row Sinkhorn-normalized transport plan, with a sparsity parameter $r$ controlling the expected fraction of preserved edges. Node-level attention aggregates these edge attentions for downstream prediction (Ding et al., 11 Feb 2024).
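The two-bin construction can be sketched as entropic optimal transport between edge scores and bin capacities; the cost matrix, the marginals $[r, 1-r]$, and the function names below follow the generic soft top-$r$ OT recipe and are illustrative assumptions rather than the GSINA reference code.

```python
import torch

def sinkhorn_ot(cost, a, b, tau=0.1, n_iters=50):
    """Entropic OT plan between marginal a (over edges) and marginal b (over two bins)."""
    K = torch.exp(-cost / tau)                # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # transport plan, shape (n_edges, 2)

def gsina_edge_attention(edge_scores, r=0.5, tau=0.1):
    n_edges = edge_scores.shape[0]
    # Cost of assigning an edge to the "invariant" bin falls as its score rises.
    cost = torch.stack([-edge_scores, edge_scores], dim=1)
    a = torch.full((n_edges,), 1.0 / n_edges)          # uniform mass over edges
    b = torch.tensor([r, 1.0 - r])                     # bin capacities set by sparsity r
    plan = sinkhorn_ot(cost, a, b, tau=tau)
    # Soft membership in the invariant bin, rescaled to [0, 1]; the expected
    # fraction of kept edges is approximately r.
    return (plan[:, 0] * n_edges).clamp(max=1.0)

scores = torch.randn(10)
alpha = gsina_edge_attention(scores, r=0.3, tau=0.1)
print(alpha)        # higher-scoring edges receive larger attention
print(alpha.sum())  # ~= r * n_edges = 3 (expected number of kept edges)
```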
4. Integration with Transformer and GNN Architectures
In Transformer models, Sparse Sinkhorn Attention is implemented per head: each attention head learns its own sort matrix and corresponding soft permutation, and attention operates locally within blocks post-permutation. In mixture variants, the sorted (quasi-global) and standard local attention outputs can be combined.
In graph neural networks, the Sinkhorn-OT machinery is deployed over edges, scoring each with a GNN+MLP subsystem and performing entropic optimal transport between edge scores and the two bins. The resulting edge attention weights reweight message passing, while node attention aggregates by neighborhood. The architecture remains fully differentiable, allowing end-to-end training with the usual task losses (Ding et al., 11 Feb 2024).
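As an illustration of how such edge attentions could reweight message passing, here is a minimal sketch; the aggregation scheme, tensor names, and single linear message transform are illustrative assumptions, not the GSINA reference code.

```python
import torch

def weighted_message_passing(h, edge_index, alpha, w):
    """h: (n_nodes, d) node features; edge_index: (2, n_edges) src/dst indices;
    alpha: (n_edges,) Sinkhorn edge attention; w: (d, d) message transform."""
    src, dst = edge_index
    messages = h[src] @ w                      # per-edge messages from source nodes
    messages = alpha.unsqueeze(-1) * messages  # reweight by edge attention
    out = torch.zeros_like(h)
    out.index_add_(0, dst, messages)           # sum weighted messages into destinations
    return out

h = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
alpha = torch.tensor([0.9, 0.1, 0.8, 0.2])     # e.g., from the OT step above
w = torch.randn(16, 16)
print(weighted_message_passing(h, edge_index, alpha, w).shape)  # torch.Size([5, 16])
```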
5. Algorithmic and Implementation Details
Key hyperparameters in the sequence model are:
- Temperature $\tau$ of the Sinkhorn (or Gumbel-Sinkhorn) relaxation
- Number of Sinkhorn iterations
- Block size $B$, selected per task
- SortCut truncation budget $k$ (number of retained blocks)
- Gumbel-Sinkhorn variant, with Gumbel noise added to the sorting logits
Sinkhorn normalization is performed in log-space for stability. The SortNet scoring network is typically a single linear layer, with deeper variants offering no improvement. Models were implemented in Tensor2Tensor and trained on both TPUs and GPUs. Sinkhorn Mixture variants compute a convex combination of block-sorted and unsorted attention outputs (Tay et al., 2020).
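The Gumbel-Sinkhorn variant simply perturbs the sorting logits with sampled Gumbel noise before normalization; the sketch below assumes illustrative noise-scale and temperature values.

```python
import torch

def gumbel_sinkhorn(logits, temperature=0.75, noise_scale=1.0, n_iters=8):
    # Sample standard Gumbel noise and add it to the logits, so the soft
    # permutation is sampled stochastically rather than chosen greedily.
    u = torch.rand_like(logits).clamp(1e-9, 1.0 - 1e-9)
    gumbel = -torch.log(-torch.log(u))
    x = (logits + noise_scale * gumbel) / temperature
    for _ in range(n_iters):
        x = x - torch.logsumexp(x, dim=-1, keepdim=True)   # row normalize (log-space)
        x = x - torch.logsumexp(x, dim=-2, keepdim=True)   # column normalize (log-space)
    return x.exp()

P = gumbel_sinkhorn(torch.randn(8, 8))
print(P.sum(dim=-1))   # each row sums to ~1: a noisy, soft block permutation
```

A Sinkhorn Mixture layer would then take a learned convex combination of the attention output computed with this noisy soft permutation and the plain (unsorted) block-local output.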
In the GSINA framework for GNNs, critical hyperparameters include the sparsity $r$, the entropic regularization temperature $\tau$, the Gumbel noise scale, and the number of Sinkhorn iterations; Gumbel noise is applied during training and set to 0 at inference. The complexity per Sinkhorn iteration is linear in the number of edges $|E|$. Both methods are trainable end-to-end via backpropagation through the Sinkhorn steps (Ding et al., 11 Feb 2024).
6. Empirical Evaluation
Sparse Sinkhorn Attention has been evaluated across multiple tasks and domains:
| Task/Metric | Baseline/Comparison | Sparse Sinkhorn Variant(s) |
|---|---|---|
| Algorithmic Seq2Seq Sorting (Edit/EM%) | Full Transformer: 0.4252/45.69 | B=32: 0.4054/49.24 |
| Word-level LM1B Perplexity | Full: 41.57 (Base), 27.59 (Big) | Sinkhorn: 40.79 (Base) |
| Char-level LM1B Bytes-per-char | Full: 1.283 (Base), 1.121 (Big) | Sinkhorn: 1.295 (Base) |
| CIFAR-10 Pixel-wise Bpd | Full: 3.198 | Sinkhorn: 3.197 |
| Document Classification (IMDb/SST, acc) | Full: 85.12/76.83 (Word) | Sinkhorn: 83.54/77.52 |
| Natural Language Inference (SNLI) | Full: 78.87 | Sinkhorn: 78.62 |
In graph settings (GSINA), controlling $r$ and $\tau$ allows interpolation between hard top-$r$ pruning and uniform inclusion. Empirical accuracy sweeps show that the accuracy-maximizing $r$ typically lies in $0.3$–$0.7$, confirming the importance of tuning the sparsity–softness tradeoff (Ding et al., 11 Feb 2024).
7. Advantages, Limitations, and Prospective Extensions
Key advantages:
- Memory complexity per layer is reduced from $O(\ell^2)$ (full attention) to block-local cost on the order of $O(\ell B + N_B^2)$, or to attention over only $k$ retained blocks with aggressive SortCut truncation, for sequence models (see the back-of-envelope count after this list).
- Receptive fields are learned and data-dependent rather than fixed or random.
- Differentiable sorting (via Sinkhorn) enables gradient-based learning of sparse patterns and soft reordering, endowing the model with novel inductive biases.
- Empirically, Sparse Sinkhorn Attention achieves competitive or superior performance relative to local, sparse, and full-attention baselines.
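As a back-of-envelope illustration of the first bullet, assuming the reconstructed block-local cost $O(\ell B + N_B^2)$ (the counts below are attention-score entries, not measured memory):

```python
# Rough count of attention-score entries for one head.
seq_len, block_size = 4096, 64
n_blocks = seq_len // block_size                       # 64 blocks

full_attention = seq_len ** 2                          # 16,777,216 entries
block_sinkhorn = seq_len * block_size + n_blocks ** 2  # 266,240: block-local scores + sort matrix

print(full_attention / block_sinkhorn)                 # ~63x fewer score entries
```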
Limitations:
- Computational overhead from the additional Sinkhorn iterations in every forward pass and, in the Gumbel-Sinkhorn variant, from Gumbel noise sampling.
- If relaxation parameters such as the temperature $\tau$ are poorly tuned, soft permutations may lead to blurred semantic boundaries between blocks.
- The autoregressive (causal) variant requires masked normalization, which is less efficient than the standard non-causal normalization.
Potential extensions:
- Learning a convex mixture of sorted and unsorted attention weights.
- Dynamic block sizing or multi-level (hierarchical) sorting.
- Integration with LSH-based or recurrence-augmented efficient transformers.
- Regularization of the relaxed (doubly stochastic) permutation matrix to encourage hard permutations when beneficial (Tay et al., 2020).
A plausible implication is that the Sinkhorn-permutation and OT-based attention paradigm may be adapted to any domain where structural sparsity and adaptive, differentiable selection are desired, extending beyond sequence and graph modalities.