
Stochastic Optimization of Sorting Networks

Updated 10 June 2026
  • The paper introduces NeuralSort, a novel differentiable surrogate that enables gradient-based training by continuously relaxing permutation matrices.
  • It leverages softmax-based relaxations and Gumbel reparameterization to approximate traditional sorting operations while maintaining differentiability within neural networks.
  • Empirical results in semantic sorting, quantile regression, and differentiable k-nearest neighbors demonstrate significant performance improvements and efficient computation.

Stochastic optimization of sorting networks addresses the fundamental challenge of making sorting operations differentiable and amenable to gradient-based learning. Classical sorting is non-differentiable, which traditionally prohibits its integration within end-to-end trainable neural architectures. "Stochastic Optimization of Sorting Networks via Continuous Relaxations" introduces NeuralSort, a differentiable surrogate for sorting based on continuous relaxations of permutation matrices. The framework enables stochastic optimization over permutations by combining NeuralSort with a reparameterized estimator for the Plackett–Luce distribution using Gumbel perturbations, making sorting operators directly usable within deep learning pipelines (Grover et al., 2019).

1. Non-Differentiability of Sorting Operations

The standard sort function takes $s\in\mathbb{R}^n$ and produces a permutation $\pi$ that orders the elements, typically represented by a permutation matrix $P\in\{0,1\}^{n\times n}$ with exactly one "1" per row and column. The sorting operator is piecewise constant: an infinitesimal change in $s$ leaves the ranking unchanged unless it crosses a tie, so the Jacobian $\partial P/\partial s$ is zero almost everywhere and undefined at ties. Embedding a sort operation inside a computational graph therefore blocks all gradient information with respect to its inputs, which is a fundamental barrier to gradient-based optimization of objectives that depend on the output ordering (Grover et al., 2019).
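The piecewise-constant behaviour can be checked numerically. The following is a minimal sketch in plain Python; the helper names `sort_permutation` and `changed_positions` are hypothetical, and sorting in decreasing order is an assumption made for illustration.

```python
def sort_permutation(s):
    """Return the indices that order s from largest to smallest."""
    return sorted(range(len(s)), key=lambda i: -s[i])

def changed_positions(s, k, eps=1e-6):
    """Count how many ranks change when s[k] is nudged by eps."""
    bumped = list(s)
    bumped[k] += eps
    before, after = sort_permutation(s), sort_permutation(bumped)
    return sum(a != b for a, b in zip(before, after))

s = [0.3, 2.0, -1.1, 0.9]
print(sort_permutation(s))                                # [1, 3, 0, 2]
print([changed_positions(s, k) for k in range(len(s))])   # [0, 0, 0, 0]
```

Because no rank changes under a small perturbation, a finite-difference estimate of the derivative of the permutation with respect to any input is exactly zero here, which is the situation the relaxation is meant to avoid.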

2. NeuralSort: Continuous Relaxation of Permutations

2.1 Unimodal Row-Stochastic Matrix Relaxation

Permutation matrices $P$ are relaxed to unimodal row-stochastic matrices $U$ with the following properties:

  1. $U_{ij}\geq 0$ for all $i,j$
  2. $\sum_{j=1}^n U_{ij}=1$ for all $i$
  3. Each row $i$ has a unique arg max at column $u_i$, and $(u_1,\dots,u_n)$ forms a permutation of $\{1,\dots,n\}$.

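The pairwise-difference matrix $A_s$, with entries $A_s[i,j]=|s_i-s_j|$, enables the exact construction of $P_{\mathrm{sort}(s)}$, the permutation matrix that sorts $s$ in decreasing order:

$$P_{\mathrm{sort}(s)}[i,j]=1 \quad\text{if } j=\arg\max\left[(n+1-2i)\,s-A_s\mathbb{1}\right],$$

otherwise, $P_{\mathrm{sort}(s)}[i,j]=0$, where $\mathbb{1}$ denotes the all-ones vector (Grover et al., 2019).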
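As a sanity check on this construction, here is a small sketch in plain Python. It assumes the rule from Grover et al. (2019) in which row $i$ (1-based) places its single 1 at the arg max of $(n+1-2i)\,s - A_s\mathbb{1}$ with $A_s[i,j]=|s_i-s_j|$; the function name `hard_sort_matrix` is hypothetical.

```python
def hard_sort_matrix(s):
    """Build the permutation matrix that sorts s in decreasing order."""
    n = len(s)
    # Row sums of the pairwise-difference matrix: b[j] = sum_k |s_j - s_k|.
    b = [sum(abs(sj - sk) for sk in s) for sj in s]
    P = []
    for i in range(1, n + 1):  # i is the 1-based rank
        row_scores = [(n + 1 - 2 * i) * s[j] - b[j] for j in range(n)]
        winner = max(range(n), key=row_scores.__getitem__)
        P.append([1 if j == winner else 0 for j in range(n)])
    return P

s = [0.3, 2.0, -1.1, 0.9]
P = hard_sort_matrix(s)
# Multiplying P by s should return s in decreasing order.
sorted_s = [sum(P[i][j] * s[j] for j in range(len(s))) for i in range(len(s))]
print(sorted_s)  # [2.0, 0.9, 0.3, -1.1]
```

No comparison-based sort is called anywhere; the ordering comes purely from sums, absolute values, and a row-wise arg max, which is what makes the later softmax substitution possible.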

2.2 Softmax-Based Relaxation

NeuralSort replaces row-wise hard arg max with a softmax, yielding the continuous relaxation:

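$$\hat{P}_{\mathrm{sort}(s)}[i,:](\tau)=\operatorname{softmax}\left[\frac{(n+1-2i)\,s-A_s\mathbb{1}}{\tau}\right]$$

with temperature parameter $\tau>0$, where $A_s$ is the pairwise-difference matrix of Section 2.1 and $\mathbb{1}$ is the all-ones vector. An alternative softmax-based formulation is: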

P{0,1}n×nP\in\{0,1\}^{n\times n}1

These relaxations yield differentiable matrices for any $\tau>0$, recovering hard permutation matrices in the $\tau\to 0^+$ limit (in the absence of ties).
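The effect of the temperature can be illustrated with a short sketch in plain Python. It assumes the row-wise form softmax of $((n+1-2i)\,s - A_s\mathbb{1})/\tau$ from Grover et al. (2019); the function names `softmax` and `neuralsort` are hypothetical, and a practical implementation would be vectorized in a deep learning framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def neuralsort(s, tau=1.0):
    """Relaxed sort matrix: row i is a softmax over candidate items for rank i."""
    n = len(s)
    b = [sum(abs(sj - sk) for sk in s) for sj in s]  # A_s times the all-ones vector
    return [softmax([((n + 1 - 2 * i) * s[j] - b[j]) / tau for j in range(n)])
            for i in range(1, n + 1)]

s = [0.3, 2.0, -1.1, 0.9]
for tau in (5.0, 1.0, 0.01):
    P_hat = neuralsort(s, tau)
    # Rows always sum to 1; they sharpen toward one-hot vectors as tau shrinks.
    print(tau, [round(p, 3) for p in P_hat[0]])

# Row-wise arg max projects the relaxation back to a hard permutation.
hard = [max(range(len(s)), key=row.__getitem__) for row in neuralsort(s, 0.01)]
print(hard)  # [1, 3, 0, 2]
```

The row-wise arg max does not depend on $\tau$, so the hard projection is the same at every temperature; only the sharpness of the rows, and hence the gradient signal, changes.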

3. Stochastic Gradient Estimation via Plackett–Luce and Gumbel Tricks

3.1 Plackett–Luce Permutation Distribution

The Plackett–Luce (PL) distribution models random permutations $\pi$ parameterized by positive scores $s\in\mathbb{R}^n_{+}$:

$$q(\pi\mid s)=\prod_{i=1}^{n}\frac{s_{\pi_i}}{\sum_{k=i}^{n}s_{\pi_k}}$$

This reflects a sequential draw without replacement, in which each remaining item is selected with probability proportional to its score.
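The sequential-draw interpretation translates directly into a product of ratios. The sketch below, in plain Python, assumes the standard PL form in which each pick has probability equal to its score divided by the total score of the items still available; `pl_probability` is a hypothetical helper name.

```python
import itertools

def pl_probability(perm, s):
    """Probability of drawing `perm` (item indices, first pick first) under PL(s)."""
    prob = 1.0
    remaining = sum(s[j] for j in perm)  # total score of items not yet drawn
    for j in perm:
        prob *= s[j] / remaining
        remaining -= s[j]
    return prob

s = [3.0, 1.0, 2.0]
probs = {perm: pl_probability(perm, s)
         for perm in itertools.permutations(range(len(s)))}
print(probs[(0, 2, 1)])     # (3/6) * (2/3) * (1/1), about 0.333
print(sum(probs.values()))  # about 1.0, since the six permutations are exhaustive
```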

3.2 Gumbel Reparameterization for Sampling and Gradients

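Sampling from $\mathrm{PL}(s)$ is enabled using the Gumbel-max trick with i.i.d. $\mathrm{Gumbel}(0,1)$ noise $g_1,\dots,g_n$:

\begin{align*}
\tilde{s}_i &= \log s_i + g_i \\
\pi &= \operatorname{sort\_indices}(\tilde{s})
\end{align*}

where the indices are taken in decreasing order of $\tilde{s}$. This renders permutation sampling as a deterministic (but non-differentiable) function of $s$ and $g$. For a downstream objective $f$ of the permutation matrix, the expectation of interest is:

$$L(s)=\mathbb{E}_{\pi\sim\mathrm{PL}(s)}\left[f(P_\pi)\right]=\mathbb{E}_{g}\left[f\left(P_{\mathrm{sort}(\log s+g)}\right)\right]$$

Approximating the discrete $P_{\mathrm{sort}(\log s+g)}$ with NeuralSort yields:

$$\hat{L}(s)=\mathbb{E}_{g}\left[f\left(\hat{P}_{\mathrm{sort}(\log s+g)}(\tau)\right)\right]$$

Gradients w.r.t. $s$ are then given by:

$$\nabla_s\hat{L}(s)=\mathbb{E}_{g}\left[\nabla_s f\left(\hat{P}_{\mathrm{sort}(\log s+g)}(\tau)\right)\right]$$

which can be efficiently approximated using Monte Carlo sampling.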
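The Gumbel-max sampling step can be checked empirically. This is a minimal sketch in plain Python, assuming Gumbel(0, 1) noise drawn by inverse-CDF sampling and a decreasing sort of the perturbed log-scores; `sample_gumbel` and `sample_pl` are hypothetical names.

```python
import math
import random

def sample_gumbel(rng):
    """Draw Gumbel(0, 1) noise as -log(-log(U)) with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def sample_pl(s, rng):
    """Sample a permutation from PL(s) by sorting Gumbel-perturbed log-scores."""
    perturbed = [math.log(si) + sample_gumbel(rng) for si in s]
    return sorted(range(len(s)), key=lambda i: -perturbed[i])

rng = random.Random(0)
s = [3.0, 1.0, 2.0]
draws = [sample_pl(s, rng) for _ in range(20000)]
# Under PL(s), item j is ranked first with probability s_j / sum(s).
first_freq = [sum(d[0] == j for d in draws) / len(draws) for j in range(len(s))]
print(first_freq)  # should be close to [0.5, 0.167, 0.333]
```

All randomness enters through the noise, so for a fixed draw the sampled permutation is a deterministic function of the scores. Replacing the hard sort with the relaxed operator is what lets gradients flow through that function.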

4. Stochastic Optimization Workflow

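The NeuralSort stochastic optimization loop follows directly from Section 3: at each step, draw Gumbel noise $g$, form the perturbed log-scores $\log s+g$, apply the relaxed sorting operator at temperature $\tau$, evaluate the objective on the resulting matrix, and backpropagate through the relaxation to update $s$ and any model parameters.

In the limit $\tau\to 0^+$, row-wise arg max can be applied to recover hard permutations, supporting straight-through optimization (Grover et al., 2019).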
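One possible shape for such a loop is sketched below in plain Python. It is an illustration under several assumptions rather than the authors' implementation: the log-scores `theta` (standing for $\log s$) are optimized directly, the objective is a cross-entropy between relaxed rows and a target ordering, and central finite differences with the noise held fixed stand in for the backpropagation an autodiff framework would provide. All function names are hypothetical.

```python
import math
import random

def neuralsort(scores, tau):
    """Relaxed sort matrix with softmax rows (row i corresponds to rank i + 1)."""
    n = len(scores)
    b = [sum(abs(a - c) for c in scores) for a in scores]
    rows = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * scores[j] - b[j]) / tau for j in range(n)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def gumbel(rng):
    u = rng.uniform(1e-12, 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))

def loss(theta, noise, target, tau):
    """Monte Carlo estimate of the relaxed objective over the given noise draws."""
    total = 0.0
    for g in noise:
        P_hat = neuralsort([t + gi for t, gi in zip(theta, g)], tau)
        total -= sum(math.log(P_hat[i][target[i]] + 1e-12)
                     for i in range(len(target)))
    return total / (len(noise) * len(target))

def train(target, steps=200, lr=0.05, tau=1.0, samples=8, seed=0, eps=1e-4):
    rng = random.Random(seed)
    n = len(target)

    def draw(k):
        return [[gumbel(rng) for _ in range(n)] for _ in range(k)]

    eval_noise = draw(64)  # fixed draws for a before/after comparison
    theta = [0.0] * n      # log-scores, theta_j = log s_j
    initial = loss(theta, eval_noise, target, tau)
    for _ in range(steps):
        noise = draw(samples)  # fresh Monte Carlo sample at every step
        grad = []
        for j in range(n):
            # Same noise on both sides: a pathwise (reparameterized) derivative.
            up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
            down = theta[:j] + [theta[j] - eps] + theta[j + 1:]
            grad.append((loss(up, noise, target, tau)
                         - loss(down, noise, target, tau)) / (2 * eps))
        theta = [t - lr * gj for t, gj in zip(theta, grad)]
    final = loss(theta, eval_noise, target, tau)
    return theta, initial, final

target = [2, 0, 3, 1]  # item 2 should rank first, then items 0, 3, 1
theta, initial, final = train(target)
recovered = sorted(range(len(theta)), key=lambda j: -theta[j])
print(recovered, round(initial, 3), round(final, 3))
```

The hard ordering in `recovered` is obtained by sorting the learned log-scores and is used only for reporting; training itself touches nothing but the relaxed matrix.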

5. Complexity Analysis and Computational Characteristics

The construction of the pairwise-difference matrix $A_s$ requires $O(n^2)$ operations, fully parallelizable on GPUs. Each softmax operation per row costs $O(n)$, yielding $O(n^2)$ complexity per forward pass with no iterative normalization. Memory requirements are $O(n^2)$ per relaxed permutation per sample.

Comparatively:

| Approach | Forward Pass Complexity | GPU Parallelism | Differentiability |
|---|---|---|---|
| Standard sorting | $O(n\log n)$ | Limited | No |
| Sinkhorn-based relaxations | $O(n^2)$ per iteration | Good | Yes (doubly-stochastic) |
| NeuralSort | $O(n^2)$ one-shot | High | Yes (unimodal row-stochastic) |

NeuralSort’s single-pass $O(n^2)$ computation is competitive with, and often faster than, iterative Sinkhorn methods for practical $n$ (Grover et al., 2019).

6. Empirical Performance and Impact

NeuralSort and its stochastic extension (via PL reparameterization) were evaluated on several tasks:

  • Semantic Sorting (large-MNIST, $n=5$): Deterministic NeuralSort achieves approximately 84% exact permutation accuracy, outperforming Sinkhorn baselines (3–9%) and a naïve row-stochastic predictor (9%). Individual-rank accuracy (correctly placed elements) improves from 60% (Sinkhorn) to 92% (NeuralSort).
  • Quantile Regression (median estimation): Mean squared error decreases from P/s\partial P/\partial s9 (Sinkhorn) to PP0 (NeuralSort), with PP1 improving from 0.25 to 0.94 for PP2.
  • Differentiable k-Nearest Neighbors (PP3, top PP4 selection):
    • MNIST: 99.5% (NeuralSort) vs. 97.2% (standard kNN), 99.4% (CNN)
    • Fashion-MNIST: 93.5% (NeuralSort) vs. 85.8% (kNN), 93.4% (CNN)
    • CIFAR-10: 90.7% (NeuralSort) vs. 35.4% (kNN), 95.1% (CNN)

Across all tasks, stochastic NeuralSort offers accuracy comparable to its deterministic version while enabling principled uncertainty estimation for permutations. The framework provides a one-shot, differentiable surrogate for sorting, hard permutation projection for evaluation metrics, and a reparameterized estimator for optimizing over permutation distributions in deep learning pipelines (Grover et al., 2019).
