
Sparse Sinkhorn Attention

Updated 18 November 2025
  • Sparse Sinkhorn Attention is a class of differentiable attention mechanisms that leverages the Sinkhorn operator to generate soft permutations for adaptive, sparse receptive fields.
  • It employs block-wise permutation with dynamic sequence truncation (SortCut) in Transformer architectures and edge-level sparsification in GNN extensions to reduce computational complexity while ensuring quasi-global context.
  • Empirical evaluations demonstrate competitive performance against full and local attention methods, highlighting efficiency gains and the potential for further structural optimizations.

Sparse Sinkhorn Attention is a class of attention mechanisms that combine differentiable sorting via the Sinkhorn operator with the construction of sparse or softly sparse receptive fields. By leveraging block-wise or structural permutation and optimal transport-driven selection, these methods achieve quasi-global attention patterns while reducing the computational and memory complexity typical of the standard self-attention paradigm. This family includes both sequence-based Transformer architectures and graph neural network (GNN) extensions, unified by their reliance on entropic optimal transport and end-to-end differentiability (Tay et al., 2020, Ding et al., 11 Feb 2024).

1. Architectural Foundations in Sequence Models

Sparse Sinkhorn Attention, as introduced in Transformer models, operates on an input sequence $X$ of length $\ell$ and embedding dimension $d$ by partitioning $X$ into $N_B$ non-overlapping blocks of size $B = \ell / N_B$. Rather than relying on fixed local or pre-determined sparse attention patterns, a meta sorting network ("SortNet") computes a score vector for each block, yielding a non-negative sorting logits matrix $R \in \mathbb{R}^{N_B \times N_B}$.

These logits are converted to a doubly stochastic matrix $S(R)$ using the Sinkhorn normalization, which approximates a permutation matrix under entropic relaxation. The resulting "soft permutation" is applied to the input, effectively rearranging blocks according to learned, input-dependent priorities. Standard block-local dot-product attention is then performed on the permuted sequence blocks, granting each token access to non-local context at a reduced computational cost.

To address the requirements of autoregressive decoding, a causal variant of both block pooling and Sinkhorn normalization is introduced. For pure encoding, the SortCut algorithm enables dynamic sequence truncation by selecting only the top $n$ sorted blocks for attention, further scaling down memory requirements (Tay et al., 2020).
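A minimal single-head sketch of this block-permute-then-attend step is given below. It is an illustration, not the reference implementation: it omits per-head query/key/value projections and causal handling, assumes the sequence length is divisible by the block size, and takes the doubly stochastic matrix `perm` $= S(R)$ as given (its computation is sketched in the next section).

```python
import torch

def sinkhorn_block_attention(x, perm, block_size):
    """Block-local attention after a soft block permutation.

    x:    (seq_len, dim) input sequence; seq_len must be divisible by block_size.
    perm: (num_blocks, num_blocks) doubly stochastic matrix S(R).
    Queries stay in their original block; keys/values come from the softly
    permuted blocks, giving each block access to non-local context.
    """
    seq_len, dim = x.shape
    num_blocks = seq_len // block_size
    blocks = x.view(num_blocks, block_size, dim)

    # Softly rearrange blocks: each "sorted" block is a convex mix of input blocks.
    permuted = torch.einsum('ij,jbd->ibd', perm, blocks)

    # Block-local dot-product attention: block i's queries attend only to
    # the keys/values of permuted block i (cost O(B^2) per block).
    scores = torch.einsum('ibd,ikd->ibk', blocks, permuted) / dim ** 0.5
    weights = torch.softmax(scores, dim=-1)
    out = torch.einsum('ibk,ikd->ibd', weights, permuted)
    return out.reshape(seq_len, dim)
```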

2. Differentiable Sorting via the Sinkhorn Operator

The core of Sparse Sinkhorn Attention is its use of the Sinkhorn operator for soft sorting. For a non-negative matrix $R$, the Sinkhorn iterations alternately normalize rows and columns to produce a matrix in the Birkhoff polytope (the set of doubly stochastic matrices), serving as a soft approximation to permutation matrices.

The update rules are:

  • $S^0(R) = \exp(R)$;
  • $S^{k}(R) = F_c(F_r(S^{k-1}(R)))$, where $F_r$ and $F_c$ are row and column normalizations, respectively.

In log-space for numerical stability:

  • $\log F_r(\log X) = \log X - \log\big(\exp(\log X)\,\mathbf{1}\big)$ (subtract each row's log-sum-exp)
  • $\log F_c(\log X) = \log X - \log\big(\mathbf{1}^{\top} \exp(\log X)\big)$ (subtract each column's log-sum-exp)

Causal Sinkhorn Balancing adapts these normalizations by masking out future-information pathways in the normalization process to enforce autoregressive constraints. The Sinkhorn iterations are typically run for a small fixed number of steps $K$ (e.g., $K = 10$) (Tay et al., 2020).
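A minimal sketch of these log-space updates, for the non-causal case and assuming a standard PyTorch environment:

```python
import torch

def log_sinkhorn(log_r: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Sinkhorn normalization in log-space.

    log_r: (num_blocks, num_blocks) sorting logits from SortNet (optionally
           temperature-scaled and Gumbel-perturbed beforehand).
    Alternately subtracts row and column log-sum-exps, then exponentiates,
    yielding an approximately doubly stochastic matrix S(R).
    """
    for _ in range(n_iters):
        log_r = log_r - torch.logsumexp(log_r, dim=-1, keepdim=True)  # row normalization F_r
        log_r = log_r - torch.logsumexp(log_r, dim=-2, keepdim=True)  # column normalization F_c
    return log_r.exp()
```

A causal variant would additionally mask out future-block entries before each normalization, per the description above.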

3. Dynamic Sequence and Subgraph Truncation

SortCut, an algorithmic innovation in the sequence domain, enables the dynamic selection of the most relevant tokens by truncating to the top $n$ permutation-sorted blocks post-Sinkhorn. This reduces attention computation to the most salient sub-sequence, scaling memory complexity closer to linear in $\ell$ when $n \ll N_B$. The process involves block-pooling, scoring with SortNet, Sinkhorn normalization, soft permutation, and finally local attention over a truncated set (Tay et al., 2020).
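A hedged sketch of this pipeline, reusing the `log_sinkhorn` helper above; for brevity the raw inputs stand in for queries, keys, and values, block pooling is a mean, and SortNet is reduced to a single linear layer (all assumptions, not the reference code):

```python
import torch

def sortcut_attention(x, sortnet, block_size, n_keep, tau=0.75, n_iters=10):
    """SortCut sketch: every query attends only to the first n_keep softly
    sorted blocks, shrinking the key/value memory from seq_len to
    n_keep * block_size entries."""
    seq_len, dim = x.shape
    num_blocks = seq_len // block_size
    blocks = x.view(num_blocks, block_size, dim)

    pooled = blocks.mean(dim=1)                    # block pooling
    logits = sortnet(pooled) / tau                 # (num_blocks, num_blocks) SortNet scores
    perm = log_sinkhorn(logits, n_iters)           # soft permutation S(R)

    permuted = torch.einsum('ij,jbd->ibd', perm, blocks)          # soft block sort
    memory = permuted[:n_keep].reshape(n_keep * block_size, dim)  # truncated keys/values
    weights = torch.softmax(x @ memory.t() / dim ** 0.5, dim=-1)
    return weights @ memory

# Example wiring: a single linear layer as SortNet (deeper variants reportedly
# offered no improvement), e.g. sortnet = torch.nn.Linear(dim, num_blocks).
```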

Analogous ideas are extended to graphs in the GSINA framework, where edge importance scores are soft-assigned to "invariant" or "variant" bins via a 2-row Sinkhorn-normalized transport plan, with a sparsity parameter $r$ controlling the expected fraction of preserved edges. Node-level attention aggregates these edge attentions for downstream prediction (Ding et al., 11 Feb 2024).
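The following is an illustrative 2-bin entropic OT sketch in the spirit of this construction, not the GSINA reference code: the $\pm$score cost rows and the name `edge_bin_sinkhorn` are assumptions, while the marginals encode that a fraction $r$ of the total edge mass lands in the invariant bin.

```python
import math
import torch

def edge_bin_sinkhorn(edge_scores, r, tau=1.0, n_iters=10):
    """Softly assign each edge to an 'invariant' vs. 'variant' bin via
    entropic OT, keeping roughly a fraction r of edge mass invariant."""
    num_edges = edge_scores.numel()
    # Log-affinities of each edge to the two bins (illustrative +/- cost).
    log_p = torch.stack([edge_scores, -edge_scores], dim=0) / tau   # (2, num_edges)
    # Target marginals: bins hold r and (1 - r) of the mass; edges are uniform.
    log_row = torch.tensor([r, 1.0 - r]).log()
    log_col = torch.full((num_edges,), -math.log(num_edges))
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) + log_row[:, None]
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_col[None, :]
    # Edge attention alpha^E in [0, 1]: each edge's share of mass in the invariant bin.
    return (log_p[0] - log_col).exp()
```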

4. Integration with Transformer and GNN Architectures

In Transformer models, Sparse Sinkhorn Attention is implemented per head: each attention head learns its own sorting matrix and corresponding soft permutation, and the attention mechanism operates locally within blocks post-permutation. In mixture variants, the block-sorted attention output is combined with a standard unsorted attention output.

In graph neural networks, the Sinkhorn-OT machinery is deployed over edges, scoring each with a GNN+MLP subsystem and performing entropic optimal transport between edge scores and the two bins. The resulting edge attention weights $\alpha^E$ reweight message passing, while node attention $\alpha^V$ aggregates by neighborhood. The architecture remains fully differentiable, allowing end-to-end training with the usual task losses (Ding et al., 11 Feb 2024).
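A brief sketch of how such edge attentions could reweight message passing (illustrative only; a plain sum aggregation is assumed rather than the GSINA reference architecture):

```python
import torch

def weighted_message_passing(x, edge_index, edge_alpha):
    """Reweight each message by its edge attention alpha^E, then sum-aggregate.

    x:          (num_nodes, dim) node features.
    edge_index: (2, num_edges) source/destination node indices.
    edge_alpha: (num_edges,) Sinkhorn-derived edge attention weights.
    """
    src, dst = edge_index
    messages = edge_alpha.unsqueeze(-1) * x[src]   # scale source features by alpha^E
    out = torch.zeros_like(x)
    out.index_add_(0, dst, messages)               # aggregate onto destination nodes
    return out
```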

5. Algorithmic and Implementation Details

Key hyperparameters in the sequence model are:

  • Temperature $\tau \in \{0.25, 0.5, 0.75, 1.0\}$
  • Sinkhorn iterations $K \in \{2, 5, 10, 20\}$
  • Block size $B$, selected per task (e.g., $B = 16,\ 32,\ 64$)
  • SortCut truncation budget $n$ (number of blocks)
  • Gumbel-Sinkhorn variant, with Gumbel noise $\epsilon$ added to logits

Sinkhorn normalization is performed in log-space for stability. The SortNet scoring network is typically a single linear layer, with deeper variants offering no improvement. Models were implemented in Tensor2Tensor and trained on both TPUs and GPUs. Sinkhorn Mixture variants compute a convex combination of block-sorted and unsorted attention outputs (Tay et al., 2020).
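A sketch of the Gumbel-Sinkhorn perturbation referenced above; the sampling form is the standard Gumbel trick, and the `noise_scale` argument is an assumption:

```python
import torch

def gumbel_sinkhorn_logits(logits, tau=0.75, noise_scale=1.0):
    """Add Gumbel noise to the SortNet logits and apply the temperature
    before running the (log-space) Sinkhorn normalization."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return (logits + noise_scale * gumbel) / tau
```

During training, `log_sinkhorn(gumbel_sinkhorn_logits(scores))` yields a stochastic soft permutation; the noise is typically omitted at evaluation.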

In the GSINA framework for GNNs, critical hyperparameters include sparsity $r$, entropic regularizer $\tau$, Gumbel noise parameter $\sigma$, and Sinkhorn iterations $n$. Default values in experiments are $\tau = 1$, $n = 10$, and $\sigma = 1$ in training ($\sigma = 0$ for inference). The complexity per Sinkhorn iteration is linear in the number of edges $N_e$. Both methods are trainable end-to-end via backpropagation through the Sinkhorn steps (Ding et al., 11 Feb 2024).
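Continuing the illustrative sketch from Section 3, a call with these defaults might look as follows; the value $r = 0.5$ is an arbitrary choice within the reported sweet spot, and `edge_scores` stands for the GNN+MLP edge scores:

```python
# Gumbel noise with sigma = 1 would be added to edge_scores during training
# and omitted (sigma = 0) at inference.
edge_alpha = edge_bin_sinkhorn(edge_scores, r=0.5, tau=1.0, n_iters=10)
```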

6. Empirical Evaluation

Sparse Sinkhorn Attention has been evaluated across multiple tasks and domains:

| Task / Metric | Baseline / Comparison | Sparse Sinkhorn Variant(s) |
|---|---|---|
| Algorithmic Seq2Seq sorting (edit dist. / EM%) | Full Transformer: 0.4252 / 45.69 | B=32: 0.4054 / 49.24 |
| Word-level LM1B (perplexity) | Full: 41.57 (Base), 27.59 (Big) | Sinkhorn: 40.79 (Base) |
| Char-level LM1B (bytes per char) | Full: 1.283 (Base), 1.121 (Big) | Sinkhorn: 1.295 (Base) |
| CIFAR-10 pixel-wise (bits per dim) | Full: 3.198 | Sinkhorn: 3.197 |
| Document classification (IMDb / SST, accuracy) | Full: 85.12 / 76.83 (word-level) | Sinkhorn: 83.54 / 77.52 |
| Natural language inference (SNLI, accuracy) | Full: 78.87 | Sinkhorn: 78.62 |

In graph settings (GSINA), controlling $r$ and $\tau$ allows interpolation between hard top-$k$ pruning and uniform inclusion. Empirical accuracy sweeps show that the accuracy-maximizing $r$ typically lies in $0.3$–$0.7$, confirming the importance of tuning the sparsity–softness tradeoff (Ding et al., 11 Feb 2024).

7. Advantages, Limitations, and Prospective Extensions

Key advantages:

  • Memory complexity per layer is reduced from $O(\ell^2)$ (full attention) to $O(B^2 + (\ell/B)^2)$, or to $O(\ell)$ with aggressive SortCut truncation, for sequence models (see the worked example after this list).
  • Receptive fields are learned and data-dependent rather than fixed or random.
  • Differentiable sorting (via Sinkhorn) enables gradient-based learning of sparse patterns and soft reordering, endowing the model with novel inductive biases.
  • Empirically, Sparse Sinkhorn Attention achieves competitive or superior performance relative to local, sparse, and full-attention baselines.
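As a rough illustration of the first point, taking the stated bound at face value for, say, $\ell = 4096$ and $B = 64$ (values chosen here purely for illustration): full attention materializes $\ell^2 \approx 1.7 \times 10^7$ attention scores, whereas $B^2 + (\ell/B)^2 = 4096 + 4096 = 8192$, roughly a $2000\times$ reduction; SortCut with a small block budget $n$ shrinks the key/value memory further, toward $O(\ell)$.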

Limitations:

  • Computational overhead due to Sinkhorn iterations ($K$ steps) and possible Gumbel noise sampling.
  • If the temperature $\tau$ and iteration count $K$ are poorly tuned, soft permutations may lead to blurred semantic boundaries.
  • The autoregressive (causal) variant requires masked normalization, which is less efficient than the standard non-causal case.

Potential extensions:

  • Learning a convex mixture of sorted and unsorted attention weights.
  • Dynamic block sizing or multi-level (hierarchical) sorting.
  • Integration with LSH-based or recurrence-augmented efficient transformers.
  • Regularization on $S(R)$ to encourage hard permutations when beneficial (Tay et al., 2020).

A plausible implication is that the Sinkhorn-permutation and OT-based attention paradigm may be adapted to any domain where structural sparsity and adaptive, differentiable selection are desired, extending beyond sequence and graph modalities.
