Stochastic Optimization of Sorting Networks

Updated 10 June 2026

The paper introduces NeuralSort, a novel differentiable surrogate that enables gradient-based training by continuously relaxing permutation matrices.
It leverages softmax-based relaxations and Gumbel reparameterization to approximate traditional sorting operations while maintaining differentiability within neural networks.
Empirical results in semantic sorting, quantile regression, and differentiable k-nearest neighbors demonstrate significant performance improvements and efficient computation.

Stochastic optimization of sorting networks addresses the fundamental challenge of making sorting operations differentiable and amenable to gradient-based learning. Classical sorting is non-differentiable, which traditionally prohibits its integration within end-to-end trainable neural architectures. "Stochastic Optimization of Sorting Networks via Continuous Relaxations" introduces NeuralSort, a differentiable surrogate for sorting based on continuous relaxations of permutation matrices. The framework enables stochastic optimization over permutations by combining NeuralSort with a reparameterized estimator for the Plackett–Luce distribution using Gumbel perturbations, so that objectives involving sorting can be trained within deep learning pipelines (Grover et al., 2019).

1. Non-Differentiability of Sorting Operations

The standard sort function takes $s\in\mathbb{R}^n$ and produces a permutation $\pi$ that orders the elements, typically represented by a permutation matrix $P\in\{0,1\}^{n\times n}$ with exactly one "1" per row and column. The sorting operator is piecewise constant: a sufficiently small change in $s$ leaves the ranking unchanged unless it crosses a tie, so the Jacobian $\partial P/\partial s$ is zero almost everywhere and undefined at ties. Embedding a sort operation inside a computational graph therefore passes no useful gradient information back to its inputs, a fundamental barrier to direct gradient-based optimization of objectives that depend on the output ordering (Grover et al., 2019).
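The piecewise-constant behaviour can be checked numerically. The sketch below is illustrative only: the helper name `sort_permutation_matrix` and the choice of descending order are assumptions, not taken from the paper.

```python
def sort_permutation_matrix(s):
    """Hard permutation matrix P with P[i][j] = 1 if s[j] is the i-th largest entry.

    Descending order is an assumption made for this sketch.
    """
    n = len(s)
    order = sorted(range(n), key=lambda j: -s[j])
    return [[1 if order[i] == j else 0 for j in range(n)] for i in range(n)]

s = [0.3, 2.0, -1.1]
eps = 1e-6
P = sort_permutation_matrix(s)

# Nudge each coordinate by eps. No pair of scores is close enough to swap,
# so P stays identical and every finite-difference quotient is exactly zero.
unchanged = []
for k in range(len(s)):
    s_eps = list(s)
    s_eps[k] += eps
    unchanged.append(sort_permutation_matrix(s_eps) == P)

print(P)
print(unchanged)
```

Because the matrix does not move under small perturbations, any loss computed from it has zero gradient with respect to `s`, which is the barrier the relaxation is meant to remove.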

2. NeuralSort: Continuous Relaxation of Permutations

2.1 Unimodal Row-Stochastic Matrix Relaxation

Permutation matrices $P$ are relaxed to unimodal row-stochastic matrices $U$ with the following properties:

$U_{ij}\geq 0$ for all $i,j$
$\sum_{j=1}^n U_{ij}=1$ for all $i$
Each row $i$ has a unique arg max at column $u_i$, and $(u_1,\dots,u_n)$ forms a permutation of $\{1,\dots,n\}$.

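The pairwise-difference matrix $A_s$, with entries $A_s[i,j]=|s_i-s_j|$, enables the exact construction of the permutation matrix $P_{\mathrm{sort}(s)}$ that sorts $s$ in descending order:

$P_{\mathrm{sort}(s)}[i,j]=\begin{cases}1 & \text{if } j=\arg\max\left[(n+1-2i)\,s-A_s\mathbb{1}\right],\\ 0 & \text{otherwise,}\end{cases}$

where $\mathbb{1}$ is the all-ones vector (Grover et al., 2019).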
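As a sanity check of the arg-max construction, the following minimal sketch builds the matrix row by row from $(n+1-2i)\,s-A_s\mathbb{1}$, where $A_s[i,j]=|s_i-s_j|$, and confirms that applying it to $s$ gives the scores in descending order. The function name `sort_matrix_via_argmax` is hypothetical, and the example assumes no ties.

```python
def sort_matrix_via_argmax(s):
    """Build the sorting permutation matrix from pairwise absolute differences."""
    n = len(s)
    # row_sums[j] = sum_k |s_j - s_k|, the j-th entry of A_s times the all-ones vector
    row_sums = [sum(abs(sj - sk) for sk in s) for sj in s]
    P = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):  # i is 1-indexed, as in the formula
        coeff = n + 1 - 2 * i
        c = [coeff * s[j] - row_sums[j] for j in range(n)]
        j_star = max(range(n), key=lambda j: c[j])  # column holding the i-th largest score
        P[i - 1][j_star] = 1
    return P

s = [0.3, 2.0, -1.1, 0.9]
P = sort_matrix_via_argmax(s)

# Multiplying P by s should reorder the scores from largest to smallest.
applied = [sum(P[i][j] * s[j] for j in range(len(s))) for i in range(len(s))]
print(P)
print(applied)
```

No comparison-based sort is called inside the function: the ordering emerges purely from sums, absolute values and an arg max, which is what makes a softmax relaxation possible.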

2.2 Softmax-Based Relaxation

NeuralSort replaces row-wise hard arg max with a softmax, yielding the continuous relaxation:

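$\widehat{P}_{\mathrm{sort}(s)}[i,:](\tau)=\operatorname{softmax}\left[\frac{(n+1-2i)\,s-A_s\mathbb{1}}{\tau}\right]$

with temperature parameter $\tau>0$, where $A_s[i,j]=|s_i-s_j|$ and $\mathbb{1}$ is the all-ones vector. Written entrywise, the same relaxation is:

$\widehat{P}_{\mathrm{sort}(s)}[i,j](\tau)=\frac{\exp\left(\left((n+1-2i)\,s_j-\sum_{k=1}^{n}|s_j-s_k|\right)/\tau\right)}{\sum_{l=1}^{n}\exp\left(\left((n+1-2i)\,s_l-\sum_{k=1}^{n}|s_l-s_k|\right)/\tau\right)}$

These relaxations yield differentiable matrices for any $\tau>0$, recovering hard permutation matrices in the $\tau\to 0^+$ limit (in the absence of ties).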
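A dependency-free sketch of the relaxation may help make the role of the temperature concrete. It applies a row-wise softmax to $((n+1-2i)\,s-A_s\mathbb{1})/\tau$ with $A_s[i,j]=|s_i-s_j|$. The names `softmax` and `neural_sort` are chosen here for illustration, and a practical implementation would use a tensor library with automatic differentiation rather than Python lists.

```python
import math

def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def neural_sort(s, tau=1.0):
    """Relaxed permutation matrix: row i is a softmax peaked at the i-th largest score."""
    n = len(s)
    row_sums = [sum(abs(sj - sk) for sk in s) for sj in s]  # A_s times all-ones vector
    rows = []
    for i in range(1, n + 1):
        coeff = n + 1 - 2 * i
        rows.append(softmax([(coeff * s[j] - row_sums[j]) / tau for j in range(n)]))
    return rows

s = [0.3, 2.0, -1.1, 0.9]
P_warm = neural_sort(s, tau=5.0)   # high temperature: diffuse rows
P_cold = neural_sort(s, tau=0.01)  # low temperature: rows close to one-hot

# The row-wise arg max gives the descending sort order at any temperature.
argmax_warm = [max(range(len(s)), key=lambda j: row[j]) for row in P_warm]
argmax_cold = [max(range(len(s)), key=lambda j: row[j]) for row in P_cold]
print(argmax_warm, argmax_cold)
```

Every row sums to one at both temperatures, but only the low-temperature matrix is close to a hard permutation; the high-temperature one spreads mass across columns, which is what lets gradients reach all scores.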

3. Stochastic Gradient Estimation via Plackett–Luce and Gumbel Tricks

3.1 Plackett–Luce Permutation Distribution

The Plackett–Luce (PL) distribution models random permutations $\pi$ parameterized by positive scores $s\in\mathbb{R}^n_{>0}$:

$q(\pi\mid s)=\prod_{i=1}^{n}\frac{s_{\pi_i}}{\sum_{k=i}^{n}s_{\pi_k}}$

This reflects a sequential draw without replacement, in which each item is chosen with probability proportional to its score among the items not yet drawn.
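The sequential-draw reading translates directly into code. The sketch below, with the hypothetical helper `plackett_luce_prob`, multiplies the probability of each successive pick (its score divided by the total score of the items still available) and verifies that the probabilities of all $n!$ orderings sum to one for a small example.

```python
from itertools import permutations

def plackett_luce_prob(perm, s):
    """Probability of drawing items in the order `perm` given positive scores s."""
    remaining = sum(s)  # total score of items not yet drawn
    prob = 1.0
    for idx in perm:
        prob *= s[idx] / remaining
        remaining -= s[idx]
    return prob

s = [3.0, 1.0, 2.0]
probs = {perm: plackett_luce_prob(perm, s) for perm in permutations(range(len(s)))}
total = sum(probs.values())

# Ordering (0, 2, 1): pick item 0 with 3/6, then item 2 with 2/3, then item 1 with 1/1.
print(probs[(0, 2, 1)])
print(total)
```

Enumerating all orderings is only feasible for very small $n$, which is why sampling-based estimators are used in practice.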

3.2 Gumbel Reparameterization for Sampling and Gradients

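Sampling from PL$(s)$ is enabled using the Gumbel-max trick with i.i.d. Gumbel$(0,1)$ noise $g_1,\dots,g_n$:

\begin{align*} \tilde{s}_i &= \log s_i + g_i \\ \pi &= \operatorname{sort\_indices}(\tilde s) \end{align*}

This renders permutation sampling as a deterministic (but non-differentiable) function of $s$ and $g$. For an objective $f$ defined on permutation matrices, the expectation of interest is:

$\mathbb{E}_{\pi\sim\mathrm{PL}(s)}\left[f(P_\pi)\right]=\mathbb{E}_{g}\left[f\left(P_{\mathrm{sort}(\log s+g)}\right)\right]$

Approximating the discrete $P_{\mathrm{sort}(\log s+g)}$ with its NeuralSort relaxation $\widehat{P}_{\mathrm{sort}(\log s+g)}$ yields:

$\mathbb{E}_{g}\left[f\left(\widehat{P}_{\mathrm{sort}(\log s+g)}\right)\right]$

Gradients w.r.t. $s$ are then given by:

$\nabla_s\,\mathbb{E}_{g}\left[f\left(\widehat{P}_{\mathrm{sort}(\log s+g)}\right)\right]=\mathbb{E}_{g}\left[\nabla_s f\left(\widehat{P}_{\mathrm{sort}(\log s+g)}\right)\right]$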

which can be efficiently approximated using Monte Carlo sampling.
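The Gumbel-max sampling step can be illustrated with a short simulation. This sketch perturbs the log-scores with Gumbel(0,1) noise, sorts in descending order, and compares how often each item lands first against the Plackett–Luce value $s_i/\sum_k s_k$. The helper names, the seed and the sample count are arbitrary choices for illustration.

```python
import math
import random

def sample_gumbel(rng):
    """Inverse-CDF sampling: if U ~ Uniform(0, 1), then -log(-log U) ~ Gumbel(0, 1)."""
    u = rng.random()
    while u == 0.0:  # rng.random() can return 0.0, which would break the logarithm
        u = rng.random()
    return -math.log(-math.log(u))

def sample_pl_permutation(s, rng):
    """Draw a permutation by sorting Gumbel-perturbed log-scores in descending order."""
    perturbed = [math.log(si) + sample_gumbel(rng) for si in s]
    return sorted(range(len(s)), key=lambda i: -perturbed[i])

rng = random.Random(0)
s = [3.0, 1.0, 2.0]
num_samples = 20000

first_counts = [0] * len(s)
for _ in range(num_samples):
    first_counts[sample_pl_permutation(s, rng)[0]] += 1

# Empirical frequency of each item being ranked first, versus s_i / sum(s).
freqs = [c / num_samples for c in first_counts]
expected = [si / sum(s) for si in s]
print(freqs)
print(expected)
```

All randomness enters through the noise `g`, so for a fixed draw the sample is a deterministic function of `s`. Replacing the hard sort with a relaxed one is what then makes that function differentiable.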

4. Stochastic Optimization Workflow

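Each iteration of the NeuralSort stochastic optimization loop proceeds as follows: sample i.i.d. Gumbel$(0,1)$ noise $g$; perturb the log-scores to obtain $\tilde s=\log s+g$; compute the relaxed permutation matrix with NeuralSort at temperature $\tau$; evaluate the downstream loss on the relaxed matrix, averaging over several noise samples; and backpropagate to update the scores and any model parameters.

In the limit $\tau\to 0^+$, row-wise arg max can be applied to recover hard permutations, supporting straight-through optimization (Grover et al., 2019).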
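The loop in this section can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the squared-error objective against a fixed target permutation, the learning rate, the sample count, and above all the use of central finite differences in place of backpropagation (a real implementation would rely on automatic differentiation). The parameters `theta` play the role of the log-scores $\log s$.

```python
import math
import random

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def neural_sort(s, tau):
    """Relaxed permutation matrix; row i is a softmax peaked at the i-th largest score."""
    n = len(s)
    row_sums = [sum(abs(sj - sk) for sk in s) for sj in s]
    return [softmax([((n + 1 - 2 * i) * s[j] - row_sums[j]) / tau for j in range(n)])
            for i in range(1, n + 1)]

def gumbel(rng):
    return -math.log(-math.log(rng.random() or 1e-12))  # guard against log(0)

def loss(P_hat, target):
    n = len(target)
    return sum((P_hat[i][j] - target[i][j]) ** 2 for i in range(n) for j in range(n))

def mc_loss(theta, noise, target, tau):
    """Monte Carlo estimate of the relaxed objective for fixed Gumbel draws."""
    total = 0.0
    for g in noise:
        perturbed = [t + gi for t, gi in zip(theta, g)]  # log s + g
        total += loss(neural_sort(perturbed, tau), target)
    return total / len(noise)

def mc_grad(theta, noise, target, tau, h=1e-4):
    """Finite-difference stand-in for backprop; the same noise is reused on both sides."""
    grad = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += h
        dn[k] -= h
        grad.append((mc_loss(up, noise, target, tau)
                     - mc_loss(dn, noise, target, tau)) / (2 * h))
    return grad

rng = random.Random(0)
n, tau, lr, num_samples = 3, 1.0, 0.5, 8
target = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]  # desired order: item 2, then 0, then 1
theta = [0.0] * n

for step in range(200):
    noise = [[gumbel(rng) for _ in range(n)] for _ in range(num_samples)]
    grad = mc_grad(theta, noise, target, tau)
    theta = [t - lr * g for t, g in zip(theta, grad)]

# Hard projection: rank items by their learned log-scores.
hard_order = sorted(range(n), key=lambda j: -theta[j])
print(theta, hard_order)
```

Holding the noise fixed while differentiating is the reparameterization idea in miniature: the randomness is sampled once per step, and the gradient flows through the deterministic relaxed sort.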

5. Complexity Analysis and Computational Characteristics

The construction of the pairwise-difference matrix $A_s$ requires $O(n^2)$ operations, fully parallelizable on GPUs. Each softmax operation per row costs $O(n)$, yielding $O(n^2)$ complexity per forward pass with no iterative normalization. Memory requirements are $O(n^2)$ per relaxed permutation per sample.

Comparatively:

Approach	Forward Pass Complexity	GPU Parallelism	Differentiability
Standard sorting	$O(n\log n)$	Limited	No
Sinkhorn-based relaxations	$O(n^2)$ per iteration	Good	Yes (doubly-stochastic)
NeuralSort	$O(n^2)$ one-shot	High	Yes (unimodal row-stochastic)

NeuralSort's single-pass $O(n^2)$ computation is competitive with, and often faster than, iterative Sinkhorn methods for practical $n$ (Grover et al., 2019).

6. Empirical Performance and Impact

NeuralSort and its stochastic extension (via PL reparameterization) were evaluated on several tasks:

Semantic Sorting (large-MNIST): Deterministic NeuralSort achieves approximately 84% exact permutation accuracy, outperforming Sinkhorn baselines (3–9%) and a naïve row-stochastic predictor (9%). Individual-rank accuracy (correctly placed elements) improves from 60% (Sinkhorn) to 92% (NeuralSort).
Quantile Regression (median estimation): Mean squared error decreases from the Sinkhorn baseline to NeuralSort, with $R^2$ improving from 0.25 to 0.94.
Differentiable k-Nearest Neighbors (top-$k$ selection):
- MNIST: 99.5% (NeuralSort) vs. 97.2% (standard kNN), 99.4% (CNN)
- Fashion-MNIST: 93.5% (NeuralSort) vs. 85.8% (kNN), 93.4% (CNN)
- CIFAR-10: 90.7% (NeuralSort) vs. 35.4% (kNN), 95.1% (CNN)
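The two sorting metrics quoted above, exact permutation accuracy and individual-rank accuracy, can be computed from hard permutations as in the sketch below. The function name `permutation_accuracies` and the toy predictions are hypothetical and are not results from the paper.

```python
def permutation_accuracies(predicted, actual):
    """Return (exact-match accuracy, fraction of correctly placed elements)."""
    exact = 0
    correct_elems = 0
    total_elems = 0
    for p, a in zip(predicted, actual):
        exact += int(list(p) == list(a))  # whole ordering must match
        correct_elems += sum(int(x == y) for x, y in zip(p, a))  # position-wise matches
        total_elems += len(a)
    return exact / len(actual), correct_elems / total_elems

# Toy data: four sequences of length three, two of them predicted perfectly.
predicted = [[2, 0, 1], [0, 1, 2], [1, 0, 2], [2, 1, 0]]
actual = [[2, 0, 1], [0, 2, 1], [1, 0, 2], [0, 1, 2]]

exact_acc, elem_acc = permutation_accuracies(predicted, actual)
print(exact_acc, elem_acc)
```

Exact accuracy is the stricter of the two, since a single misplaced element fails the whole sequence, which is why it sits below the individual-rank figure for every method in the list.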

Across all tasks, stochastic NeuralSort offers comparable accuracy to its deterministic version while enabling principled uncertainty estimation for permutations. This framework supports a one-shot, differentiable surrogate for sorting, hard permutation projection for metrics, and a reparameterized estimator for optimizing over permutation distributions in deep learning pipelines (Grover et al., 2019).

References (1)

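Grover, A., Wang, E., Zweig, A., and Ermon, S. (2019). Stochastic Optimization of Sorting Networks via Continuous Relaxations. International Conference on Learning Representations (ICLR).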
