Unimodal Row-Stochastic Matrices
- Unimodal row-stochastic matrices are nonnegative real matrices in which every row sums to one and attains its maximum at a single column index, serving as a continuous relaxation of permutation matrices.
- They enable end-to-end gradient optimization in learning-to-rank and k-nearest-neighbor tasks by approximating hard sorting operations with soft differentiable surrogates.
- These matrices underpin methods like NeuralSort and PiRank, exhibiting theoretical convergence to discrete permutations as the temperature parameter approaches zero.
A unimodal row-stochastic matrix $U \in \mathbb{R}^{n \times n}$ satisfies two structural properties: (1) each row is stochastic, i.e., $U[i,j] \ge 0$ and $\sum_{j} U[i,j] = 1$ for all $i$, and (2) each row is unimodal, meaning it attains its maximum value at exactly one column index (the argmax in each row is unique). This object arises as a continuous relaxation of permutation matrices and forms the analytic core of differentiable sorting operators such as NeuralSort. Such relaxations enable end-to-end optimization over permutation-valued structures via gradient-based methods while preserving a tight connection to the discrete sorting operator in the zero-temperature limit. Unimodal row-stochastic matrices are now central to ranking, sorting, and k-nearest-neighbor tasks in differentiable learning-to-rank and related domains.
1. Mathematical Formulation and Key Properties
Given an input score vector $s \in \mathbb{R}^n$, the canonical sorting operation outputs a permutation matrix $P_{\operatorname{sort}(s)} \in \{0,1\}^{n \times n}$, where $\operatorname{sort}(s)$ is the index order that sorts $s$ in descending order. Each row and column of $P_{\operatorname{sort}(s)}$ is a standard basis vector (one-hot), so $P_{\operatorname{sort}(s)}$ is doubly stochastic and unimodal by construction.
NeuralSort, proposed in (Grover et al., 2019), relaxes $P_{\operatorname{sort}(s)}$ to a soft version $\widehat{P}_{\operatorname{sort}(s)}(\tau)$ whose $i$-th row is defined as:

$$\widehat{P}_{\operatorname{sort}(s)}[i,:](\tau) = \operatorname{softmax}\!\left[\frac{(n+1-2i)\,s - A_s \mathbb{1}}{\tau}\right]$$

for $i = 1, \dots, n$ and temperature $\tau > 0$, where $A_s[i,j] = |s_i - s_j|$ is the matrix of pairwise absolute differences and $\mathbb{1}$ is the all-ones vector. The result is always a unimodal row-stochastic matrix:
- Row stochasticity: Each row is nonnegative and sums to 1, as guaranteed by the softmax normalization.
- Unimodality: Each row’s maximum occurs at a unique column index (almost surely, i.e., whenever $s$ has no ties), with row $i$ placing its largest mass on the position of the $i$-th largest entry of $s$.
As $\tau \to 0^{+}$, $\widehat{P}_{\operatorname{sort}(s)}(\tau) \to P_{\operatorname{sort}(s)}$, so the relaxation is exact in the low-temperature limit. For any $\tau > 0$, $\widehat{P}_{\operatorname{sort}(s)}(\tau)$ is fully differentiable with respect to $s$.
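A minimal PyTorch sketch of this construction (the function name `neuralsort` and the toy example are ours, not from the cited papers) illustrates the operator and the fact that gradients flow through it:

```python
import torch

def neuralsort(s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Relaxed (unimodal row-stochastic) sorting matrix for a 1-D score vector s.

    Row i concentrates its mass on the position of the i-th largest entry of s;
    smaller tau gives harder (more peaked) rows.
    """
    n = s.shape[0]
    A = torch.abs(s.unsqueeze(0) - s.unsqueeze(1))           # A[i, j] = |s_i - s_j|
    b = A.sum(dim=1)                                          # A_s @ 1
    i = torch.arange(1, n + 1, dtype=s.dtype).unsqueeze(1)    # row index i = 1..n
    logits = ((n + 1 - 2 * i) * s.unsqueeze(0) - b.unsqueeze(0)) / tau
    return torch.softmax(logits, dim=-1)                      # each row sums to 1

s = torch.tensor([0.3, 1.7, -0.2], requires_grad=True)
P_soft = neuralsort(s, tau=0.1)        # rows peak at argsort(s, descending): [1, 0, 2]
soft_sorted = P_soft @ s               # differentiable surrogate of sort(s, descending)
soft_sorted[0].backward()              # gradients flow back to the scores s
```

At $\tau = 0.1$ the rows are already nearly one-hot; raising $\tau$ spreads the mass and smooths the gradients.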
2. Role in Continuous Relaxations of Sorting and Ranking
Classical sorting is piecewise constant; thus, it has zero (almost everywhere) or undefined gradients, posing a barrier to direct gradient-based optimization. Relaxing permutation matrices to the set of unimodal row-stochastic matrices yields differentiable surrogates for sorting. NeuralSort (Grover et al., 2019) and its descendants directly exploit this property to construct smooth, gradient-friendly architectures.
The use of these matrices allows for “soft” ranking and enables gradient flow through sorting operations in neural networks. Soft surrogates can be inserted anywhere a ranking, selection, or ordering is required in the computation graph, and straight-through estimators can be used to reconcile hard selection during the forward pass with soft gradients in the backward pass.
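A common way to combine the two is the straight-through pattern sketched below (a generic recipe with function names of our choosing, not a specific implementation from the cited papers): the forward pass uses the hard row-wise argmax of the relaxed matrix, while the backward pass differentiates through the soft matrix.

```python
import torch

def hard_from_soft(P_soft: torch.Tensor) -> torch.Tensor:
    """Row-wise argmax of a unimodal row-stochastic matrix as a 0/1 matrix."""
    idx = P_soft.argmax(dim=-1, keepdim=True)
    return torch.zeros_like(P_soft).scatter_(-1, idx, 1.0)

def straight_through(P_soft: torch.Tensor) -> torch.Tensor:
    """Hard (one-hot per row) matrix in the forward pass, soft gradients in the backward pass."""
    P_hard = hard_from_soft(P_soft)
    return (P_hard - P_soft).detach() + P_soft
```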
In the context of stochastic optimization over permutations, unimodal row-stochastic matrices also facilitate Monte Carlo and reparameterization approaches for the Plackett-Luce distribution over permutations via the Gumbel-Max trick.
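For instance, a sketch under the assumption that the `neuralsort` function from the snippet above is in scope and that scores are given in log space: perturbing the log-scores with Gumbel noise and passing them through the relaxed sort yields a reparameterized relaxed sample whose hard limit follows the Plackett-Luce distribution.

```python
import torch

def sample_relaxed_permutation(log_scores: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Gumbel-perturb the log-scores, then apply the relaxed sort.

    The descending argsort of the perturbed log-scores is a Plackett-Luce sample;
    the relaxed matrix is its differentiable (reparameterized) surrogate.
    """
    gumbel = -torch.log(-torch.log(torch.rand_like(log_scores)))  # Gumbel(0, 1) noise
    return neuralsort(log_scores + gumbel, tau)
```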
3. Application to Learning-to-Rank and Information Retrieval
Contemporary learning-to-rank methods demand surrogates for non-differentiable metrics such as Normalized Discounted Cumulative Gain (NDCG). Both the NeuralNDCG (Pobrotyn et al., 2021) and PiRank (Swezey et al., 2020) frameworks utilize unimodal row-stochastic matrices to provide differentiable approximations of ranking-based objectives.
The soft permutation matrix enables direct relaxation of DCG and NDCG by soft-sorting the relevance gain vector or the discount vector, yielding surrogate metrics that become arbitrarily accurate as $\tau \to 0$. Sinkhorn scaling can optionally be applied to approximately enforce double stochasticity, though unimodal row-stochasticity alone suffices to preserve the essential sorting structure and differentiability.
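A minimal sketch in the spirit of these surrogates (not the NeuralNDCG or PiRank reference implementations; it reuses the `neuralsort` function defined above and assumes graded relevance labels) soft-sorts the gain vector by the model's scores and applies the usual logarithmic discounts:

```python
import torch

def soft_ndcg(scores: torch.Tensor, relevance: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable NDCG surrogate: soft-sort gains with the relaxed permutation matrix."""
    n = scores.shape[0]
    gains = 2.0 ** relevance - 1.0
    discounts = 1.0 / torch.log2(torch.arange(2, n + 2, dtype=scores.dtype))  # 1/log2(rank+1)
    P_soft = neuralsort(scores, tau)              # rows ordered by decreasing score
    soft_dcg = (P_soft @ gains) @ discounts       # gains soft-sorted by the model's scores
    ideal_dcg = torch.sort(gains, descending=True).values @ discounts
    return soft_dcg / ideal_dcg
```

Maximizing this quantity (or minimizing its negative) trains the scoring model directly against a smooth NDCG proxy.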
4. Algorithmic Implementations and Computational Complexity
The basic procedure to construct a unimodal row-stochastic sorting matrix involves:
- Computing the pairwise absolute-difference matrix $A_s$ with $A_s[i,j] = |s_i - s_j|$.
- Calculating, for each row $i$, the logit vector $(n+1-2i)\,s - A_s\mathbb{1}$.
- Applying the row-wise, temperature-controlled softmax normalization (dividing the logits by $\tau$ before the softmax).
The computational cost of NeuralSort is $O(n^2)$ per instance due to the pairwise operations and softmax, which is efficient and parallelizable on modern hardware. For very large $n$, divide-and-conquer algorithms (e.g., PiRank's recursive merge-sort strategy (Swezey et al., 2020)) exploit the unimodality to reduce runtime and memory requirements, constructing only the first $k$ rows as needed.
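As a simpler illustration of row truncation (our own sketch, not PiRank's divide-and-conquer algorithm), the first $k$ rows can be built without materializing the full $n \times n$ difference matrix, since $A_s\mathbb{1}$ can be obtained from a sort and prefix sums:

```python
import torch

def neuralsort_topk_rows(s: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """First k rows of the relaxed sort matrix, avoiding the n x n pairwise matrix."""
    n = s.shape[0]
    # b[i] = sum_j |s_i - s_j|, via sorting and prefix sums (O(n log n) time, O(n) memory).
    sorted_s, order = torch.sort(s)                      # ascending order
    prefix = torch.cumsum(sorted_s, dim=0)
    total = prefix[-1]
    ranks = torch.empty(n, dtype=torch.long)
    ranks[order] = torch.arange(1, n + 1)                # 1-indexed rank of each s_i
    b = (2 * ranks - n) * s + total - 2 * prefix[ranks - 1]
    # Only rows i = 1..k of the relaxed matrix are formed (k x n output).
    i = torch.arange(1, k + 1, dtype=s.dtype).unsqueeze(1)
    logits = ((n + 1 - 2 * i) * s.unsqueeze(0) - b.unsqueeze(0)) / tau
    return torch.softmax(logits, dim=-1)
```

This reduces the output and intermediate memory from $O(n^2)$ to $O(kn)$, which is the regime that matters for top-$k$ ranking losses.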
5. Theoretical Guarantees and Limiting Behavior
A central mathematical property is that $\lim_{\tau \to 0^{+}} \widehat{P}_{\operatorname{sort}(s)}(\tau) = P_{\operatorname{sort}(s)}$ under mild assumptions (no ties in $s$). The unimodal structure ensures that the softmax in each row favors the column corresponding to the $i$-th largest entry of $s$, and as the temperature decreases the probability mass in each row concentrates at that column, thus recovering a permutation matrix.
For any fixed $\tau > 0$, the soft matrix remains continuous and differentiable with respect to $s$, interpolating between fully “soft” (high-entropy) rows and “hard” permutations as $\tau$ is varied. This gives practitioners explicit control of the bias–variance trade-off in learning.
6. Empirical Performance and Applications
Empirical studies consistently show that unimodal row-stochastic relaxations provide tight surrogates for hard permutations:
- On sequence sorting tasks (large-MNIST), NeuralSort achieves over 83% accuracy in predicting the correct ordering (Grover et al., 2019).
- For differentiable k-nearest-neighbor (kNN) classification, the approach significantly improves performance on MNIST, Fashion-MNIST, and CIFAR-10 compared to baseline non-differentiable and alternative relaxation methods.
- In large-scale learning-to-rank benchmarks, both NeuralNDCG and PiRank, founded upon unimodal row-stochastic relaxations, outperform or match classical listwise and pairwise approaches in NDCG, OPA, ARP, and MRR metrics (Pobrotyn et al., 2021, Swezey et al., 2020).
7. Related Relaxations and Extensions
Unimodal row-stochastic matrices serve as a bridge between discrete permutation matrices and continuous, differentiable operators. Sinkhorn-based operators (yielding doubly stochastic matrices) are closely related but enforce both row and column normalization. Soft sorting operators derived via Sinkhorn-regularized optimal transport can be compared empirically with NeuralSort; both achieve near-identical or slightly superior performance depending on task and parameterization (Cuturi et al., 2019).
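For reference, the basic Sinkhorn normalization alternates row and column rescaling of a positive matrix until it is approximately doubly stochastic; the sketch below shows this bare iteration only (the optimal-transport sorting operators of Cuturi et al. (2019) build on an entropy-regularized formulation rather than this plain normalization):

```python
import torch

def sinkhorn(M: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternately rescale rows and columns of a positive matrix toward double stochasticity."""
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)   # make rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)   # make columns sum to 1
    return M
```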
Divide-and-conquer extensions, as in PiRank, leverage the unimodality property to construct scalable surrogates for top-$k$ rankings without instantiating the full $n \times n$ matrix. This reduces computational complexity, especially for large list sizes in industrial-scale information retrieval systems.
In summary, unimodal row-stochastic matrices are the key mathematical structure enabling continuous, tractable, and differentiable relaxation of sorting and permutation operators. Their use underpins modern advances in differentiable ranking, provides theoretical guarantees on convergence to hard permutations, and supports practical applications across machine learning, information retrieval, and combinatorial optimization (Grover et al., 2019, Pobrotyn et al., 2021, Swezey et al., 2020).