Sinkhorn-Sort: Differentiable Sorting
- Sinkhorn-Sort is a differentiable relaxation of sorting that uses entropic-regularized optimal transport and Sinkhorn iteration to approximate permutation matrices.
- It enables end-to-end backpropagation by providing smooth gradients, balancing between exact permutations and mean operations through temperature tuning.
- Applications include numeric sorting, quantile regression, policy learning, and Transformer attention mechanisms, offering improved accuracy and reduced computational complexity.
Sinkhorn-Sort is a class of algorithmically differentiable operators providing smooth relaxations of sorting and ranking, constructed via entropic-regularized optimal transport and implemented efficiently using Sinkhorn iteration. These methods underpin a spectrum of modern deep learning solutions requiring end-to-end learnable sorting, ranking, and permutation modules, enabling differentiability for learning tasks that traditionally rely on discrete, piecewise-constant operations.
1. Mathematical Foundations: Sorting as Optimal Transport
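The central insight of Sinkhorn-Sort is that sorting can be cast as a one-dimensional optimal transport (OT), or Kantorovich assignment, problem. Given an input vector $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the goal is to map $x$ to its sorted version. This is formulated as a transport problem between the input values and a fixed, increasing target vector $y = (y_1 < \dots < y_n)$, both endowed with uniform discrete measures $a = b = \mathbf{1}_n / n$.
The assignment cost is captured by the matrix $C_{ij} = (y_j - x_i)^2$, more generally $C_{ij} = h(y_j - x_i)$ for convex $h$. The unregularized OT problem seeks the matrix $P^\star$ minimizing $\langle P, C \rangle$ under the row and column sum constraints $P \mathbf{1}_n = a$ and $P^\top \mathbf{1}_n = b$ (so that $nP$ is doubly stochastic). For strictly convex $h$, $P^\star$ is $1/n$ times a permutation matrix encoding the sorting permutation. The sorted vector and rank vector are recovered as:
- $S(x) = n \, P^{\star\top} x$
- $R(x) = n \, P^\star (1, 2, \dots, n)^\top$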
This perspective connects discrete sorting directly to continuous optimization structures (Cuturi et al., 2019).
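As a small illustration of this correspondence, the sketch below builds the permutation-type transport plan directly from `np.argsort` (rather than by solving the linear program) and reads the sorted values and ranks off it. The function name `sort_via_permutation_plan` is hypothetical, and the plan uses uniform weights $1/n$ as an assumption consistent with the setup above.

```python
import numpy as np

def sort_via_permutation_plan(x):
    """Build the hard (unregularized) plan for sorting and read off sort and ranks."""
    x = np.asarray(x, dtype=float)
    n = x.size
    order = np.argsort(x, kind="stable")    # order[j] = index of the j-th smallest entry
    P = np.zeros((n, n))
    P[order, np.arange(n)] = 1.0 / n        # send mass 1/n from x[order[j]] to target slot j
    sorted_x = n * P.T @ x                  # S(x) = n P^T x
    ranks = n * P @ np.arange(1, n + 1)     # R(x) = n P (1, ..., n)^T
    return P, sorted_x, ranks

P, s, r = sort_via_permutation_plan([0.3, -1.2, 2.5, 0.9])
# s is [-1.2, 0.3, 0.9, 2.5] and r is [2, 1, 4, 3] up to floating-point rounding
```

Both marginals of `P` equal $1/n$, so `n * P` is an ordinary permutation matrix.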
2. Entropic Regularization and the Sinkhorn Operator
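To enable differentiability, a Shannon entropy penalty $H(P) = -\sum_{i,j} P_{ij} (\log P_{ij} - 1)$ is introduced, controlled by temperature $\varepsilon > 0$. The regularized OT problem becomes:
$$P_\varepsilon = \arg\min_{P \mathbf{1}_n = a,\; P^\top \mathbf{1}_n = b} \; \langle P, C \rangle - \varepsilon H(P),$$
where $C$ is the cost matrix and $a = b = \mathbf{1}_n / n$ are the uniform marginals. The regularized problem admits a unique solution in the (scaled) Birkhoff polytope, which is everywhere differentiable in $C$, and hence in the input $x$, when $\varepsilon > 0$ (Cuturi et al., 2019, Mena et al., 2018). As $\varepsilon \to 0$, $P_\varepsilon$ approaches the permutation matrix solution, but gradients vanish almost everywhere.
The Sinkhorn operator parameterizes $P_\varepsilon$ as $\operatorname{diag}(u) \, K \operatorname{diag}(v)$ with $K = \exp(-C / \varepsilon)$ taken elementwise, and $u, v \in \mathbb{R}^n_{>0}$ chosen by alternating matrix scaling (Sinkhorn-Knopp iteration, with elementwise division):
- $u \leftarrow a / (K v)$
- $v \leftarrow b / (K^\top u)$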
This procedure converges exponentially fast to the regularized solution (Cuturi et al., 2019, Mena et al., 2018).
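The scaling updates can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `sinkhorn_plan`, the uniform marginals, the squared-distance cost, and the target grid `y` are all choices made for the example.

```python
import numpy as np

def sinkhorn_plan(C, eps, n_iters=100):
    """Entropy-regularized transport plan with uniform marginals (plain-domain updates)."""
    n, m = C.shape
    a = np.full(n, 1.0 / n)          # row marginal
    b = np.full(m, 1.0 / m)          # column marginal
    K = np.exp(-C / eps)             # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    return u[:, None] * K * v[None, :]

x = np.array([0.3, -1.2, 2.5, 0.9])
y = np.linspace(0.0, 1.0, x.size)    # assumed increasing target values
C = (x[:, None] - y[None, :]) ** 2
P = sinkhorn_plan(C, eps=1.0, n_iters=200)
```

Because the loop ends on a column update, the column sums of `P` match `b` exactly, while the row sums match `a` only up to the remaining convergence error.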
3. Sinkhorn-Sort and Differentiable Relaxations
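The $\varepsilon$-smooth sort operator, or S-sort, is defined as $\tilde{S}_\varepsilon(x) = n \, P_\varepsilon^\top x$, where $P_\varepsilon$ is the entropy-regularized transport plan at temperature $\varepsilon$, providing a differentiable approximation to sorting. Each coordinate
$$[\tilde{S}_\varepsilon(x)]_j = n \sum_{i=1}^{n} (P_\varepsilon)_{ij} \, x_i$$
smoothly interpolates between the mean of $x$ (for large $\varepsilon$) and exact sorting (as $\varepsilon \to 0$).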
Key properties:
- Differentiability: $\tilde{S}_\varepsilon$ is everywhere differentiable for $\varepsilon > 0$.
- Gradient trade-off: Small $\varepsilon$ approximates permutations but causes vanishing gradients; large $\varepsilon$ increases smoothness but dilutes permutation structure.
- Computational complexity: Each Sinkhorn iteration requires $O(n^2)$ time; a small, fixed number of iterations typically suffices in practice.
- Automatic Differentiation: S-sort can be differentiated by unrolling iterations or via the implicit function theorem, ensuring nonzero gradients throughout (Cuturi et al., 2019, Mena et al., 2018).
Generalizations yield soft CDF (“K-rank”), soft quantiles (“K-quantile”), and empirical CDF approximations.
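The interpolation between the mean and the exact sort can be checked numerically. The sketch below assumes inputs already scaled to $[0, 1]$, an evenly spaced target grid, a squared-distance cost, and uniform weights; `soft_sort_and_rank` is a hypothetical helper name, and the soft rank formula $n P (1, \dots, n)^\top$ is the uniform-weight case.

```python
import numpy as np

def soft_sort_and_rank(x, eps, n_iters=100):
    """Sketch of a smooth sort and smooth rank for x in [0, 1] with uniform weights."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = np.linspace(0.0, 1.0, n)                       # assumed increasing target grid
    K = np.exp(-(x[:, None] - y[None, :]) ** 2 / eps)  # Gibbs kernel of the squared cost
    a = b = np.full(n, 1.0 / n)
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iters):                           # Sinkhorn-Knopp scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    soft_sorted = n * P.T @ x                          # n P^T x
    soft_ranks = n * P @ np.arange(1, n + 1)           # n P (1, ..., n)^T
    return soft_sorted, soft_ranks

x = np.array([0.41, 0.0, 1.0, 0.57])
sharp, ranks = soft_sort_and_rank(x, eps=0.005)   # close to sorted x and to ranks [2, 1, 4, 3]
blurry, _ = soft_sort_and_rank(x, eps=1e4)        # every entry close to x.mean()
```

For much smaller `eps`, the plain-domain kernel `K` can underflow, which is where log-domain updates become necessary.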
4. Neural and Policy Network Integration
Sinkhorn-Sort is integrated into neural architectures via two principal pipelines:
Neural Sinkhorn-Sort (Mena et al., 2018): A permutation-equivariant neural network $g_\theta$ outputs a score matrix $X \in \mathbb{R}^{n \times n}$, which is optionally perturbed with Gumbel noise, scaled by a temperature $\tau$, exponentiated, and Sinkhorn-normalized to yield a doubly-stochastic matrix $P$. The soft-sorted output is $\hat{x} = P x$, with the rows of $P$ indexing output positions. This formulation supports end-to-end backpropagation through sorting, enabling training on global or structured losses.
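A minimal sketch of this pipeline follows, with the normalization done in the log domain. The learned network is replaced by a hand-written score matrix (an assumption made purely so the example is self-contained), and `gumbel_sinkhorn`, `noise_scale`, and `slots` are hypothetical names.

```python
import numpy as np

def _logsumexp(Z, axis):
    m = Z.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(Z - m).sum(axis=axis, keepdims=True))

def gumbel_sinkhorn(scores, tau=1.0, n_iters=50, noise_scale=0.0, rng=None):
    """Map a square score matrix to an approximately doubly-stochastic matrix."""
    log_p = np.asarray(scores, dtype=float)
    if noise_scale > 0.0:                                # optional Gumbel perturbation
        rng = np.random.default_rng() if rng is None else rng
        log_p = log_p + noise_scale * rng.gumbel(size=log_p.shape)
    log_p = log_p / tau                                  # temperature scaling
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)        # normalize rows
        log_p = log_p - _logsumexp(log_p, axis=0)        # normalize columns
    return np.exp(log_p)

x = np.array([0.41, 0.0, 1.0, 0.57])
slots = np.linspace(0.0, 1.0, x.size)                    # stand-in for learned output positions
scores = -(slots[:, None] - x[None, :]) ** 2             # hypothetical scores, not a trained network
P = gumbel_sinkhorn(scores, tau=0.005)
x_soft = P @ x                                           # rows of P index output positions
```

In a trained model the scores would come from the network and the loss would be backpropagated through the normalization loop; at low temperature `P` is close to a hard permutation matrix.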
Sinkhorn Policy Gradient (SPG) (Emami et al., 2018): In reinforcement learning, a Sinkhorn layer serves as the actor in an actor-critic framework. It produces soft permutation matrices from the input and a score network, and Hungarian rounding is applied at inference to obtain discrete permutations. The critic receives both continuous and discrete assignments, with loss terms that align the critic's value on relaxed and discrete actions. Training employs temperature annealing, regularization, and straight-through estimators for efficient permutation policy learning.
Comparison with related approaches:
Sinkhorn-Sort is reported to outperform earlier sequence-to-sequence and NeuralSort methods in sorting and ranking accuracy on long sequences, and to generalize better to unfamiliar input distributions (Cuturi et al., 2019, Mena et al., 2018).
5. Transformer and Attention Applications
Sinkhorn-Sort is leveraged in attention mechanisms to achieve quasi-global sparse attention with efficient memory characteristics. In the Sinkhorn Transformer (Tay et al., 2020):
- The input sequence is partitioned into blocks, pooled, and passed through a meta-sorting network that scores the blocks and permutes them via a Sinkhorn-approximated permutation matrix.
- The sorted blocks are “unpooled,” and attention is computed within and across these re-ordered blocks, yielding content-based locality.
- Variants such as causal Sinkhorn balancing are used to enforce autoregressive properties, and algorithms like SortCut dynamically truncate output sequences post sorting to meet computational budgets.
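The block-level pipeline in the list above can be sketched as follows. This is a simplified illustration, not the Sinkhorn Transformer implementation: the meta-sorting network is replaced by a single random linear map `W`, blocks are sum-pooled, the relaxed permutation is applied to whole blocks, and the attention step itself is omitted. The name `soft_block_sort` is hypothetical.

```python
import numpy as np

def _logsumexp(Z, axis):
    m = Z.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(Z - m).sum(axis=axis, keepdims=True))

def soft_block_sort(X, block_size, W, tau=1.0, n_iters=20):
    """Softly reorder blocks of X (seq_len, d); W (d, n_blocks) is a stand-in scorer."""
    seq_len, d = X.shape
    n_blocks = seq_len // block_size
    blocks = X[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    pooled = blocks.sum(axis=1)                      # one summary vector per block
    log_r = (pooled @ W) / tau                       # block-to-block scores
    for _ in range(n_iters):                         # Sinkhorn balancing in the log domain
        log_r = log_r - _logsumexp(log_r, axis=1)
        log_r = log_r - _logsumexp(log_r, axis=0)
    R = np.exp(log_r)                                # relaxed block permutation
    mixed = np.einsum("ij,jbd->ibd", R, blocks)      # output block i mixes source blocks j
    return R, mixed.reshape(n_blocks * block_size, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))                         # toy sequence: 16 tokens, 8 features
W = rng.normal(size=(8, 4)) * 0.1                    # hypothetical scorer for 4 blocks
R, X_sorted = soft_block_sort(X, block_size=4, W=W)
```

Local attention would then be computed between each original block and its softly re-ordered counterpart, so the cost depends on the block size and the number of blocks rather than on the full sequence length squared.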
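This structure reduces the memory complexity of attention from $O(\ell^2)$ to $O(B^2 + N_B^2)$ ($\ell$ = sequence length, $B$ = block size, $N_B = \ell / B$ = number of blocks), with further reductions via SortCut to $O(\ell N_k)$ when only the top-$N_k$ sorted blocks are kept (Tay et al., 2020).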
Empirical results: Sinkhorn Transformers match or improve upon vanilla Transformers and other efficient sparse models on language modeling, image generation, algorithmic sequence sorting, and classification tasks, consistently providing memory and compute savings (Tay et al., 2020).
6. Practical Implementation and Optimization
Effective deployment of Sinkhorn-Sort requires careful tuning of hyperparameters:
- Temperature: $\varepsilon$ (or $\tau$) balances permutation sharpness and gradient flow; the best value is task-dependent and generally lies between the near-discrete and near-uniform extremes.
- Iteration count: A sufficient number of Sinkhorn steps (e.g., 10–20) ensures near-doubly-stochastic solutions.
- Numerical stability: Log-domain updates are advisable for small temperature regimes to avoid underflow/overflow.
- Regularization: Penalties enforcing doubly-stochasticity and entropy bonuses may improve stability.
- Parallelism: Batch and tensorized implementations accelerate training and inference.
A variety of strategies, including Gumbel noise for stochastic approximations and temperature annealing, are effective in balancing permutation discreteness with trainability (Cuturi et al., 2019, Mena et al., 2018, Emami et al., 2018, Tay et al., 2020).
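To illustrate the numerical-stability point, the sketch below runs Sinkhorn on dual potentials in the log domain at a temperature where the plain kernel $\exp(-C/\varepsilon)$ underflows to an all-zero row. The names `sinkhorn_log`, `f`, and `g` are illustrative, and uniform marginals and a squared-distance cost are assumed.

```python
import numpy as np

def _logsumexp(Z, axis):
    m = Z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(Z - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_log(C, eps, n_iters=100):
    """Log-domain Sinkhorn on dual potentials f, g with uniform marginals."""
    n, m = C.shape
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        # Each update enforces one marginal of P = exp((f_i + g_j - C_ij) / eps).
        f = eps * (log_a - _logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)

x = np.array([0.41, 0.0, 1.0, 0.57])
y = np.linspace(0.0, 1.0, x.size)        # assumed increasing target grid
C = (x[:, None] - y[None, :]) ** 2
eps = 1e-5                               # np.exp(-C / eps) has an all-zero row at this temperature
P = sinkhorn_log(C, eps)
soft_sorted = x.size * P.T @ x           # numerically equal to np.sort(x) at this temperature
```

With plain-domain updates the zero row would lead to a division by zero in the first scaling step, whereas the potentials `f` and `g` absorb the large exponents here.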
7. Extensions, Applications, and Empirical Performance
Table: Representative Sinkhorn-Sort Applications and Results
| Domain | Approach | Performance Highlights |
|---|---|---|
| Numeric sorting | Neural Sinkhorn-Sort (Mena et al., 2018) | Perfect sorting on short sequences; >99% accuracy on longer ones |
| Policy learning | SPG (Emami et al., 2018) | KT=0.999 (N=20), 0.985 (N=50) |
| Quantile regression | S-quantile (Cuturi et al., 2019) | Competitive quantile/MSE error on UCI |
| Transformer attention | Sinkhorn Transformer (Tay et al., 2020) | Matches/outperforms full attention with less memory |
Sinkhorn-Sort has been validated across tasks such as sorting integers, quantile regression, differentiable top-$k$ losses, attention for long sequence models, and policy learning for assignment and matching. In each, it delivers efficient, differentiable surrogates for discrete permutation operations, often outperforming or matching baselines while enabling scalable learning (Cuturi et al., 2019, Mena et al., 2018, Tay et al., 2020, Emami et al., 2018).
Sinkhorn-Sort bridges discrete combinatorial structure and differentiable computation by leveraging entropic OT, providing a foundation for permutation- and ranking-centric deep learning architectures. It enables a wide array of applications—from sorting operators for neural network modules to complexity-efficient sparse attention and policy learning—by harnessing the smooth interpolations within the Birkhoff polytope enabled by the Sinkhorn operator.