
NeuralSort: Differentiable Sorting Operator

Updated 13 December 2025
  • NeuralSort is a continuous relaxation of the sorting operator that uses temperature-controlled softmax to approximate permutation matrices.
  • It replaces the discrete, non-differentiable permutation matrix produced by sorting with a smooth, row-stochastic matrix suitable for gradient-based optimization.
  • Practical applications include learning-to-rank, differentiable k-nearest neighbor, and direct optimization of ranking metrics like NDCG.

NeuralSort is a continuous, temperature-controlled relaxation of the sorting operator that enables differentiable sorting within neural computation graphs. The operator is central to a body of work addressing the challenge of making rank-based operations, which are typically non-differentiable and thus incompatible with gradient-based optimization, tractable for end-to-end learning systems. NeuralSort replaces the discrete permutation matrix corresponding to sorting with a unimodal row-stochastic matrix whose rows approximate soft assignments to ranks; as the temperature approaches zero, this matrix converges to the exact permutation matrix. This property has enabled applications in learning-to-rank (LTR), differentiable k-nearest neighbor algorithms, and direct optimization of ranking metrics such as NDCG, providing a rigorous bridge between discrete combinatorial objectives and continuous optimization landscapes (Grover et al., 2019; Swezey et al., 2020; Pobrotyn et al., 2021).

1. Mathematical Definition and Core Construction

Let $s \in \mathbb{R}^n$ denote a vector of real-valued scores to be sorted in descending order. The discrete sort can be represented as a permutation matrix $P_{\mathrm{sort}(s)} \in \{0, 1\}^{n \times n}$, where each row corresponds to a “rank” and selects the entry with the corresponding sorted value. NeuralSort constructs a continuous relaxation $\widehat{P}_\tau(s)$, parameterized by a scalar temperature $\tau > 0$, as follows:

  1. Define the pairwise absolute-difference matrix:

$$(A_s)_{ij} = |s_i - s_j|$$

  2. Set the rank offsets $o_i = n + 1 - 2i$ for $i = 1, \ldots, n$.
  3. For each row $i$, compute:

$$\widehat{P}_\tau(s)[i, :] = \mathrm{softmax}\left( \frac{o_i \cdot s - A_s 1}{\tau} \right)$$

where $1 \in \mathbb{R}^n$ is the all-ones vector and the softmax is applied row-wise.

By design, $\widehat{P}_\tau(s)$ is row-stochastic: its rows are non-negative and sum to one, each with a unique peak corresponding to a “soft” rank. As $\tau \to 0^+$, the softmax sharpens into a hard argmax, so $\widehat{P}_\tau(s) \to P_{\mathrm{sort}(s)}$ whenever $s$ has distinct entries, which holds almost surely for continuously distributed scores (Grover et al., 2019; Swezey et al., 2020; Pobrotyn et al., 2021).
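The construction above can be written in a few lines; the following is a minimal sketch assuming PyTorch, where the function name `neural_sort` and its defaults are illustrative rather than the reference implementation:

```python
import torch

def neural_sort(s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Continuous relaxation of the descending sort of a score vector s of shape (n,).

    Returns a unimodal row-stochastic matrix of shape (n, n) whose i-th row is a
    soft assignment of rank i to the entries of s.
    """
    n = s.shape[0]
    A = torch.abs(s.unsqueeze(0) - s.unsqueeze(1))            # (A_s)_{ij} = |s_i - s_j|
    o = n + 1 - 2 * torch.arange(1, n + 1, dtype=s.dtype)     # rank offsets o_i = n + 1 - 2i
    # logits[i, j] = o_i * s_j - (A_s 1)_j, followed by a row-wise temperature softmax
    logits = o.unsqueeze(1) * s.unsqueeze(0) - A.sum(dim=-1).unsqueeze(0)
    return torch.softmax(logits / tau, dim=-1)
```

Multiplying the result by the scores, `neural_sort(s, tau) @ s`, yields a soft approximation of the descending sort of `s` that sharpens as `tau` decreases.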

2. Theoretical Properties and Unimodality

NeuralSort guarantees several key properties:

  • Unimodality: Each row of $\widehat{P}_\tau(s)$ has a unique maximum, ensuring correspondence to a single ranked item.
  • Row-stochasticity: All entries are non-negative and rows sum to one.
  • Consistency: Under mild conditions (distinct scores), $\displaystyle \lim_{\tau \to 0^+} \widehat{P}_\tau(s) = P_{\mathrm{sort}(s)}$.
  • Differentiability: The mapping $s \mapsto \widehat{P}_\tau(s)$ is everywhere continuous and (almost everywhere) differentiable for $\tau > 0$.
  • Permissibility for backpropagation: The operator can be inserted anywhere a permutation or sort would be required in a computational graph, enabling end-to-end optimization via automatic differentiation (Grover et al., 2019; Pobrotyn et al., 2021).
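These properties are easy to check numerically with the `neural_sort` sketch above; the snippet below is an illustrative sanity check, not part of any published codebase:

```python
import torch

torch.manual_seed(0)
s = torch.randn(6)                        # random scores are distinct almost surely

P = neural_sort(s, tau=1.0)
assert torch.all(P >= 0)                                      # non-negative entries
assert torch.allclose(P.sum(dim=-1), torch.ones(6))           # rows sum to one (row-stochastic)
assert torch.unique(P.argmax(dim=-1)).numel() == 6            # unimodal: each row peaks at a distinct item

# Consistency: for small tau the relaxation approaches the exact permutation matrix.
P_hard = torch.zeros(6, 6)
P_hard[torch.arange(6), torch.argsort(s, descending=True)] = 1.0
assert torch.allclose(neural_sort(s, tau=1e-4), P_hard, atol=1e-3)
```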

3. Applications in Learning to Rank and Ranking Metrics

A principal motivation for NeuralSort is to allow direct minimization of rank-based objectives such as NDCG and ARP. Let $y \in \mathbb{R}^n$ denote graded relevances, $g_j = 2^{y_j} - 1$ the gain vector, and $f_\theta$ a scoring model producing predictions $\hat{y}$. The exact (non-differentiable) NDCG@$k$ metric is

$$\mathrm{DCG}(y, \pi) = \sum_{j=1}^k \frac{g_{\pi_j}}{\log_2(1+j)}, \qquad \mathrm{NDCG}(y, \pi) = \frac{\mathrm{DCG}(y, \pi)}{\mathrm{DCG}(y, \pi^\ast)},$$

where $\pi = \mathrm{sort}(\hat{y})$ is the permutation induced by the predicted scores and $\pi^\ast$ is the ideal permutation.

NeuralSort provides a differentiable surrogate by replacing $P_{\mathrm{sort}(\hat{y})}$ with $\widehat{P}_\tau(\hat{y})$:

$$\hat{\mathrm{DCG}}(y, \hat{y}; \tau) = \sum_{j=1}^k \frac{\left[\widehat{P}_\tau(\hat{y})\, g\right]_j}{\log_2(1+j)}, \qquad \hat{\mathrm{NDCG}}(y, \hat{y}; \tau) = \frac{\hat{\mathrm{DCG}}(y, \hat{y}; \tau)}{\mathrm{DCG}(y, \pi^\ast)}.$$

This framework underpins several LTR surrogates, notably in PiRank and NeuralNDCG, where the loss is defined as $1 - \hat{\mathrm{NDCG}}(y, \hat{y}; \tau)$ (Swezey et al., 2020; Pobrotyn et al., 2021).
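A compact sketch of such a surrogate, reusing the `neural_sort` function from the Section 1 sketch (PyTorch assumed; `soft_ndcg` is an illustrative name and omits details such as masking of padded documents that production LTR losses handle):

```python
import torch

def soft_ndcg(y_true: torch.Tensor, y_pred: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Differentiable NDCG@k surrogate: the hard sort of y_pred is replaced by NeuralSort."""
    gains = 2.0 ** y_true - 1.0                                   # g_j = 2^{y_j} - 1
    discounts = 1.0 / torch.log2(torch.arange(2, k + 2, dtype=gains.dtype))  # 1 / log2(1 + j), j = 1..k

    P = neural_sort(y_pred, tau)                                  # soft permutation of predicted scores
    soft_dcg = (P[:k] @ gains * discounts).sum()                  # DCG of softly permuted gains

    ideal_gains, _ = torch.sort(gains, descending=True)           # hard ideal DCG for normalization
    ideal_dcg = (ideal_gains[:k] * discounts).sum()
    return soft_dcg / ideal_dcg

# Loss in the style of PiRank / NeuralNDCG: minimize 1 - soft NDCG@k.
# loss = 1.0 - soft_ndcg(relevances, model_scores, k=10, tau=1.0)
```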

4. Algorithmic Implementation and Extensions

A vectorized “forward pass” for a batch of $B$ score vectors $S \in \mathbb{R}^{B \times n}$ proceeds as:

  • Compute the pairwise-difference tensor $A^{(b)}_{ij} = |S_{b,i} - S_{b,j}|$.
  • Precompute the rank-offset vector $o$.
  • Build the pre-softmax logits $U_{b,i,j} = \left(o_i \cdot S_{b,j} - \sum_{k=1}^n A^{(b)}_{j,k}\right) / \tau$.
  • Apply a row-wise softmax to obtain $\widehat{P}_{b,i,:}$.

An optional Sinkhorn normalization can be applied to obtain doubly-stochastic matrices (Pobrotyn et al., 2021).
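A vectorized sketch of this forward pass, again assuming PyTorch (the function name `neural_sort_batched` is illustrative):

```python
import torch

def neural_sort_batched(S: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Batched NeuralSort: S has shape (B, n); returns soft permutation matrices of shape (B, n, n)."""
    n = S.shape[-1]
    A = torch.abs(S.unsqueeze(2) - S.unsqueeze(1))                # A[b, i, j] = |S[b, i] - S[b, j]|
    row_sums = A.sum(dim=-1)                                      # (B, n): (A^{(b)} 1)_j
    o = n + 1 - 2 * torch.arange(1, n + 1, dtype=S.dtype)         # rank offsets, shared across the batch
    logits = o.view(1, n, 1) * S.unsqueeze(1) - row_sums.unsqueeze(1)   # U[b, i, j]
    return torch.softmax(logits / tau, dim=-1)                    # row-wise softmax per batch element
```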

For large list sizes $L$, a direct application of NeuralSort incurs $O(L^2)$ complexity. PiRank introduces a divide-and-conquer extension:

  • View the vector as the leaves of a $d$-level tree with branching factors $b_j$ so that $L = \prod_{j=1}^d b_j$.
  • At each merge level, apply NeuralSort to blocks, retaining only the top-$k_j$ soft scores per node.
  • Compose the soft permutation across levels; the total complexity is reduced to $O(d \cdot k^2 L^{2/d})$ (Swezey et al., 2020).
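The following is a simplified two-level sketch of this idea, reusing `neural_sort_batched` from above; the block size, top-$k$, and the way survivors are merged are illustrative assumptions, and the actual PiRank loss composes the soft permutations across all $d$ levels:

```python
import torch

def block_soft_topk(s: torch.Tensor, block_size: int, k: int, tau: float) -> torch.Tensor:
    """Soft-sort each contiguous block of s and keep its top-k soft scores."""
    blocks = s.view(-1, block_size)                            # (L / block_size, block_size)
    P = neural_sort_batched(blocks, tau)                       # NeuralSort applied only within small blocks
    soft_sorted = torch.bmm(P, blocks.unsqueeze(-1)).squeeze(-1)
    return soft_sorted[:, :k].reshape(-1)                      # concatenate the per-block top-k scores

# Two-level merge: per-block sorts avoid the O(L^2) cost of a flat NeuralSort over all L items.
s = torch.randn(64)
survivors = block_soft_topk(s, block_size=8, k=4, tau=1.0)       # 64 scores -> 32 candidates
P_top = neural_sort_batched(survivors.unsqueeze(0), tau=1.0)[0]  # final soft sort over the survivors
```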

5. Empirical Performance and Benchmarks

In benchmarks on public LTR datasets (MSLR-WEB30K, Yahoo! C14), PiRank’s NeuralSort-based surrogates matched or outperformed established baselines (RankNet, LambdaRank, Softmax loss, Approximate-NDCG, NeuralSort cross-entropy) on 13 of 16 metrics, with statistical significance on NDCG@5, 10, and 15. For example, PiRank achieved the best NDCG@10 on both datasets: 0.4464 on MSLR-WEB30K and 0.7385 on Yahoo! C14.

An ablation showed that increasing the training list size $L_\text{train}$ substantially improves performance for fixed test list sizes and top-$k$, with relative NDCG@1 gains greater than 10% as $L_\text{train}$ increases from 10 to 100 for $k = 1$. A synthetic experiment on the divide-and-conquer depth parameter $d$ confirmed the theoretical wall-clock speedups: $O(L^2)$ for $d = 1$ (flat NeuralSort) versus $O(L^{4/3})$ for $d = 3$ (binary-merge PiRank) (Swezey et al., 2020).

When applied to differentiable $k$-nearest neighbor classification, NeuralSort achieved accuracy competitive with task-specific convolutional networks and markedly superior to classic $k$NN baselines: 99.5% on MNIST, 93.5% on Fashion-MNIST, and 90.7% on CIFAR-10 (Grover et al., 2019).

6. Connections to Stochastic Optimization and Reparameterized Gradients

NeuralSort enables reparameterized stochastic optimization under permutation-valued distributions. Notably, for the Plackett–Luce distribution over permutations:

  • A permutation sample from $q(z \mid s)$ with $s > 0$ can be reparameterized by adding i.i.d. Gumbel noise to $\log s$ and sorting the perturbed log-scores.
  • By replacing the discrete sort with the NeuralSort relaxation in the surrogate loss, one obtains a low-variance, reparameterized gradient estimator suitable for policy gradients and variational inference in permutation-structured problems (Grover et al., 2019).
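A minimal sketch of this reparameterization, reusing the `neural_sort` function above (the helper name and the construction of Gumbel samples from uniform noise are standard but illustrative here):

```python
import torch

def relaxed_plackett_luce_sample(log_s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Reparameterized, relaxed sample of a permutation from a Plackett-Luce distribution with scores s > 0.

    Adding i.i.d. Gumbel noise to log s and sorting gives an exact PL sample; replacing the
    hard sort with NeuralSort yields a soft permutation that is differentiable in log_s.
    """
    u = torch.rand_like(log_s).clamp_min(1e-10)       # uniform noise, clamped for numerical safety
    gumbel = -torch.log(-torch.log(u))                # standard Gumbel(0, 1) noise
    return neural_sort(log_s + gumbel, tau)
```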

7. Limitations, Variants, and Practical Considerations

Several implementation aspects affect NeuralSort’s practical deployment:

  • Temperature selection: The smoothness-accuracy tradeoff is governed by $\tau$: small $\tau$ yields sharper approximations but potentially high gradient variance. Empirically $\tau \in [0.01, 100]$ is robust, with $\tau = 1.0$ often effective, and temperature annealing can sharpen the sort progressively (Pobrotyn et al., 2021).
  • Sinkhorn normalization (optional): For applications requiring doubly-stochastic constraints, post-processing via Sinkhorn scaling is feasible, though not inherently part of the original NeuralSort formulation (Pobrotyn et al., 2021).
  • Scalability to large lists: Direct $O(L^2)$ cost is prohibitive for large $L$, motivating hierarchical merge-style relaxations as in PiRank (Swezey et al., 2020).

NeuralSort’s unimodal row-stochastic relaxation is distinct from doubly-stochastic approaches (e.g., the Sinkhorn operator), and it demonstrated superior accuracy on sorting and quantile regression tasks for small $n$ (Grover et al., 2019).
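For the optional Sinkhorn post-processing mentioned above, a minimal sketch of the standard Sinkhorn-Knopp iteration is shown below (the iteration count and stabilizing epsilon are illustrative choices):

```python
import torch

def sinkhorn(P: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Alternate row and column normalization to push a positive matrix toward double stochasticity."""
    for _ in range(n_iters):
        P = P / (P.sum(dim=-1, keepdim=True) + eps)   # normalize rows
        P = P / (P.sum(dim=-2, keepdim=True) + eps)   # normalize columns
    return P
```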


References:

(Grover et al., 2019) Stochastic Optimization of Sorting Networks via Continuous Relaxations
(Swezey et al., 2020) PiRank: Scalable Learning To Rank via Differentiable Sorting
(Pobrotyn et al., 2021) NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting
