Optimal Transport SOFT Top-k
- Optimal Transport (OT) is a framework that couples probability measures at minimal transport cost subject to marginal constraints, and is widely applied to build smooth relaxations of discrete selection.
- The SOFT top-k operator employs entropic and quadratic regularizations to approximate non-differentiable top-k selection, enabling gradient-based learning.
- Efficient algorithms such as Sinkhorn iterations (for the relaxed solution) and implicit differentiation (for its gradients) make OT methods scalable for tasks like quantile regression and sparse attention.
Optimal transport (OT) seeks a coupling between two probability measures that matches prescribed marginals and optimizes a given cost. In many classical and modern applications, such as machine learning layers, regression, and selection routines, OT underpins differentiable relaxations of sorting, ranking, quantile estimation, and top-k selection, operations that are otherwise non-smooth or combinatorial. The SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator and related OT-based approximations form a core methodology for end-to-end learning in such settings, drawing on entropic regularization, stochastic root-finding, and sparsity constraints.
1. Discrete Top-k Selection, Non-differentiability, and OT Relaxations
The standard top-k operator maps a score vector $x \in \mathbb{R}^n$ to the set of its $k$ largest elements. Formally, it produces a binary indicator $A \in \{0,1\}^n$ with $\sum_{i=1}^{n} A_i = k$, where $A_i = 1$ if $x_i$ is among the top-k. This mapping is piecewise constant and discontinuous: its gradient is undefined at ties and vanishes almost everywhere, precluding gradient-based optimization. This motivates relaxation via OT.
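Discrete sorting and top-k can be posed as finite OT problems: each data point is a source, the targets correspond to the selection bins, and the goal is to maximize the inner product between a binary indicator $A$ and the scores $x \in \mathbb{R}^n$ subject to the selection constraints:
$$\max_{A \in \{0,1\}^n} \ \langle A, x \rangle \quad \text{subject to} \quad \sum_{i=1}^{n} A_i = k.$$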
This combinatorial optimization aligns precisely with an OT problem where mass must be transported from uniform source marginals to targets encoding the selection cardinalities (Xie et al., 2020, Cuturi et al., 2019).
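The two claims of this section can be checked numerically. The following is a minimal NumPy sketch, not taken from the cited papers: the helper name `hard_topk_indicator` and the example scores are illustrative assumptions. It shows that the hard indicator maximizes the inner product with the scores and that it does not react to small perturbations, which is why its gradient carries no learning signal.

```python
from itertools import combinations

import numpy as np


def hard_topk_indicator(x, k):
    """Binary indicator of the k largest entries of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    idx = np.argsort(-x, kind="stable")[:k]  # indices of the k largest scores
    a = np.zeros_like(x)
    a[idx] = 1.0
    return a


x = np.array([0.3, 1.2, -0.5, 0.9, 0.1])
a = hard_topk_indicator(x, k=2)  # selects indices 1 and 3

# Brute force over all size-2 subsets: the top-k set maximizes <A, x>.
best_subset = max(combinations(range(x.size), 2), key=lambda s: x[list(s)].sum())

# A perturbation that keeps the ordering leaves the output unchanged,
# so any finite-difference slope of the indicator is exactly zero.
delta = 1e-3 * np.array([1.0, -1.0, 1.0, -1.0, 1.0])
a_perturbed = hard_topk_indicator(x + delta, k=2)
```

The brute-force search is only feasible for tiny inputs; it is included to make the link between selection and the linear objective explicit.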
2. Entropic OT and the SOFT Top-k Operator
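Introducing an entropy penalty renders the OT problem strictly convex and smooths the solution. Given scores $x = (x_1, \dots, x_n) \in \mathbb{R}^n$:
- Let the source be supported on $x_1, \dots, x_n$ with $\mu = \mathbf{1}_n / n$ (uniform source).
- Target measure $\nu$ on the two targets $y_1 = 1$ (selected) and $y_2 = 0$ (not selected): $\nu = [k/n,\ (n-k)/n]^\top$.
- Cost matrix $C \in \mathbb{R}^{n \times 2}$:
$$C_{ij} = (x_i - y_j)^2, \quad \text{i.e.} \quad C_{i1} = (x_i - 1)^2, \quad C_{i2} = x_i^2.$$
The entropic OT (EOT) problem is
$$\Gamma^{*} = \operatorname*{argmin}_{\Gamma} \ \langle C, \Gamma \rangle + \varepsilon \sum_{i,j} \Gamma_{ij} \ln \Gamma_{ij}$$
subject to $\Gamma \mathbf{1}_2 = \mu$, $\Gamma^\top \mathbf{1}_n = \nu$, $\Gamma_{ij} \ge 0$.
The solution $\Gamma^{*}$ is unique, smooth in $x$, and computed efficiently by Sinkhorn iterations:
$$\Gamma^{*} = \operatorname{diag}(u)\, K\, \operatorname{diag}(v), \qquad K = e^{-C/\varepsilon}, \qquad u \leftarrow \mu \oslash (K v), \qquad v \leftarrow \nu \oslash (K^\top u),$$
updating scaling vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^2$ alternately to enforce marginal constraints (Xie et al., 2020, Cuturi et al., 2019).
The output soft top-k indicator is $A^{\varepsilon} = n\, \Gamma^{*} [1, 0]^\top$, which converges in the $\varepsilon \to 0$ limit to the discrete indicator.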
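The Sinkhorn construction can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the reference implementation of the cited papers: it assumes two targets at 1 ("selected") and 0 ("not selected"), a squared-distance cost, target masses $[k/n, (n-k)/n]$, and reads the soft indicator off the "selected" column of the plan. The function name `soft_topk` and the default values of `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def soft_topk(x, k, eps=0.5, n_iters=200):
    """Entropic-OT soft top-k via Sinkhorn (sketch; assumes 0 < k < n)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    mu = np.full(n, 1.0 / n)                 # uniform source marginal
    nu = np.array([k / n, (n - k) / n])      # mass for "selected" / "not selected"
    y = np.array([1.0, 0.0])                 # assumed target locations
    C = (x[:, None] - y[None, :]) ** 2       # (n, 2) squared-distance cost
    K = np.exp(-C / eps)                     # Gibbs kernel
    v = np.ones(2)
    for _ in range(n_iters):
        u = mu / (K @ v)                     # enforce row marginals
        v = nu / (K.T @ u)                   # enforce column marginals
    gamma = u[:, None] * K * v[None, :]      # transport plan
    return n * gamma[:, 0], gamma            # soft indicator, plan


x = np.array([0.3, 1.2, -0.5, 0.9, 0.1])
A, gamma = soft_topk(x, k=2, eps=0.5)        # smooth scores in (0, 1), summing to k
A_sharp, _ = soft_topk(x, k=2, eps=0.05)     # close to the hard indicator
```

With a larger `eps` the output is a smooth ranking-preserving relaxation; with a smaller `eps` it approaches the hard indicator, but Sinkhorn then needs more iterations to satisfy the row marginals tightly and the kernel `K` can underflow.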
3. Differentiability and Gradient Computation
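The entropic OT framework ensures the mapping $x \mapsto A^{\varepsilon}$ is differentiable. The solution $\Gamma^{*}$ admits dual variables $\xi^{*} \in \mathbb{R}^n$ and $\zeta^{*} \in \mathbb{R}^2$ for the two marginal constraints, and gradients can be computed via implicit differentiation of the Sinkhorn fixed-point equations or through KKT conditions. For the objective $\langle C, \Gamma \rangle + \varepsilon \sum_{i,j} \Gamma_{ij} \ln \Gamma_{ij}$, stationarity gives
$$\Gamma^{*}_{ij} = \exp\!\left( \frac{\xi^{*}_i + \zeta^{*}_j - C_{ij}}{\varepsilon} - 1 \right).$$
The derivatives $d\Gamma^{*}/dC$ and ultimately $dA^{\varepsilon}/dx$ are computable in closed form due to smoothness of the marginal constraint system. Full block-matrix gradient expressions appear in the respective appendices (Xie et al., 2020). This property is central to enabling backpropagation in complex models using SOFT top-k routines as differentiable layers.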
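Implicit differentiation can be illustrated on a simplified instance. The sketch below is not the block-matrix expression of Xie et al. (2020); it assumes a two-target setup (targets at 1 and 0, squared-distance cost, target masses $[k/n, (n-k)/n]$). Under those assumptions the converged output has the form $A_i = \sigma((2x_i - 1)/\varepsilon + t)$, with the scalar $t$ fixed by $\sum_i A_i = k$. Differentiating that constraint gives $\partial A_i / \partial x_j = (2/\varepsilon)\,(w_i \delta_{ij} - w_i w_j / \sum_l w_l)$ with $w_i = A_i (1 - A_i)$. The code compares this expression with central finite differences; the function names are hypothetical.

```python
import numpy as np


def soft_topk(x, k, eps=0.5, n_iters=500):
    """Sinkhorn soft top-k under the assumed two-target setup (targets 1 and 0)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    mu = np.full(n, 1.0 / n)
    nu = np.array([k / n, (n - k) / n])
    K = np.exp(-np.stack([(x - 1.0) ** 2, x ** 2], axis=1) / eps)
    v = np.ones(2)
    for _ in range(n_iters):
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return n * u * K[:, 0] * v[0]


def soft_topk_jacobian(x, k, eps=0.5, n_iters=500):
    """dA/dx obtained by implicitly differentiating the constraint sum_i A_i = k."""
    a = soft_topk(x, k, eps, n_iters)
    w = a * (1.0 - a)                        # sigmoid derivative at the solution
    return (2.0 / eps) * (np.diag(w) - np.outer(w, w) / w.sum())


x = np.array([0.3, 1.2, -0.5, 0.9, 0.1])
J = soft_topk_jacobian(x, k=2)

# Central finite differences through the full Sinkhorn loop, for comparison.
h = 1e-5
J_fd = np.zeros((x.size, x.size))
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = h
    J_fd[:, j] = (soft_topk(x + e, 2) - soft_topk(x - e, 2)) / (2 * h)
```

The implicit route needs only the converged output, not the history of Sinkhorn iterates, which is the practical reason for preferring it over unrolled backpropagation when many iterations are required.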
4. Sparsity-Constrained OT and Soft Top-k via Quadratic Regularization
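In contrast to entropic regularization, quadratic regularization enables explicit control over sparsity via cardinality constraints. For marginals $a \in \triangle^m$ and $b \in \triangle^n$, a cost matrix $C \in \mathbb{R}^{m \times n}$ with columns $c_j$, and a transport plan $T$ with columns $t_j$, the primal formulation imposes:
$$\min_{T \in U(a, b)} \ \langle T, C \rangle + \sum_{j=1}^{n} \Omega(t_j), \qquad \Omega(t) = \frac{\gamma}{2} \|t\|_2^2 + \delta_{B_k}(t),$$
where $U(a, b) = \{T \ge 0 : T \mathbf{1}_n = a,\ T^\top \mathbf{1}_m = b\}$, $B_k = \{t \in \mathbb{R}^m : \|t\|_0 \le k\}$ is the $\ell_0$ ball, and $\delta_{B_k}$ is the corresponding indicator.
The semi-dual problem is
$$\max_{\alpha \in \mathbb{R}^m} \ \langle \alpha, a \rangle - \sum_{j=1}^{n} \Omega^{*}_{b_j}(\alpha - c_j),$$
with explicit expressions for the conjugate and its gradient:
$$\Omega^{*}_{b}(s) = \max_{t \in b\triangle^m \cap B_k} \ \langle s, t \rangle - \frac{\gamma}{2} \|t\|_2^2, \qquad \nabla \Omega^{*}_{b}(s) = \operatorname*{argmax}_{t \in b\triangle^m \cap B_k} \ \langle s, t \rangle - \frac{\gamma}{2} \|t\|_2^2.$$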
This yields the "soft top-k" operator: take the $k$ largest entries of the input vector $s$ scaled by $1/\gamma$, subtract a normalization scalar $\tau$ (found by simplex projection), and threshold at zero. This operator interpolates between hard top-k for $\gamma \to 0$ and fully dense mapping as $\gamma$ increases (Liu et al., 2022).
Gradient-based optimization of this sparse OT is tractable and efficient, scaling as 2 per update.
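The top-k-then-project operator described above can be sketched as follows. This is a minimal sketch under assumptions, not the code of Liu et al. (2022): it uses the standard sort-based Euclidean projection onto a scaled simplex, treats the quadratic regularization strength as a parameter `gamma` and the column mass as `b`, and the function names `project_simplex` and `sparse_soft_topk` are hypothetical.

```python
import numpy as np


def project_simplex(v, b=1.0):
    """Euclidean projection of v onto {t >= 0, sum(t) = b}, computed by sorting."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                          # entries in decreasing order
    css = np.cumsum(u) - b
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]     # last index kept in the support
    tau = css[rho] / (rho + 1)                    # normalization scalar
    return np.maximum(v - tau, 0.0)


def sparse_soft_topk(s, k, gamma=1.0, b=1.0):
    """Keep the k largest entries of s, project s/gamma on them, zero elsewhere."""
    s = np.asarray(s, dtype=float)
    top = np.argsort(-s, kind="stable")[:k]       # support: k largest entries
    t = np.zeros_like(s)
    t[top] = project_simplex(s[top] / gamma, b)
    return t


s = np.array([2.0, 1.0, 0.5, -1.0])
t_small_gamma = sparse_soft_topk(s, k=2, gamma=1.0)   # [1.0, 0.0, 0.0, 0.0]
t_large_gamma = sparse_soft_topk(s, k=2, gamma=4.0)   # [0.625, 0.375, 0.0, 0.0]
```

In this example the output never has more than `k` nonzeros and always carries total mass `b`; a small `gamma` concentrates the mass on the largest entry, while a larger `gamma` spreads it across the selected support.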
5. Quantile Optimization and Soft Top-k in Semidiscrete OT
Recent work studies minimizing quantiles—not means—of the cost in semidiscrete OT. Here, one measure is continuous (3 with density 4), the other discrete (5 with probabilities 6), and the cost 7.
The quantile OT problem is:
8
where 9 is the 0-quantile of cost.
Optimal solutions reduce to a finite-dimensional convex program in binary variables 1. The "tie-breaking" necessary to preserve prescribed marginals, when the assignment cell 2 of 3 comprises multiple indices, is realized by entropy-regularized softmax distributions on the active set.
The softmax tie-breaking rule is formally:
4
where 5 solves a strictly convex, unconstrained dual. Stochastic approximation algorithms suffice for finding 6. Provably, induced randomized kernels converge at rate 7, as does the estimation of optimal quantile threshold 8 (Zhu et al., 11 Feb 2026).
This framework exhibits a "soft power diagram" geometry: for each class 9, the cell 0 is defined by sublevel sets; where these overlap, probabilistic mixing occurs based on 1. The geometry interpolates between deterministic power diagrams at the mean and randomized, partially overlapping cells at higher quantiles.
6. Algorithmic Implementation and Scalability
Algorithmic implementation is streamlined by the underlying regularized OT structure. For the SOFT top-k:
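- Sinkhorn iterations for small $\varepsilon$ ($O(n)$ per iteration on the $n \times 2$ kernel, hence $O(nL)$ for $L$ iterations)
- Memory requirements are low (storage of kernel $K = e^{-C/\varepsilon}$, scaling vectors)
- For sparse OT, projections onto top-$k$+simplex constraints are efficiently computable by sorting ($O(m \log m)$ for a length-$m$ vector)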
- Closed-form expressions for gradients are available for both entropic and quadratic regimes
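One practical caveat for the small-regularization regime is numerical: the Gibbs kernel $e^{-C/\varepsilon}$ can underflow to zero, after which the plain scaling updates divide by zero. A common remedy, which is an addition here and not something the text above prescribes, is to run the same updates on log-scalings. The sketch below assumes a two-target setup (targets at 1 and 0, squared-distance cost, target masses $[k/n, (n-k)/n]$); the names `logsumexp` and `soft_topk_log` are local, hypothetical helpers.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def soft_topk_log(x, k, eps=1e-4, n_iters=500):
    """Log-domain Sinkhorn soft top-k (sketch; assumes 0 < k < n)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    log_mu = np.full(n, -np.log(n))
    log_nu = np.log(np.array([k / n, (n - k) / n]))
    y = np.array([1.0, 0.0])                       # assumed target locations
    M = -((x[:, None] - y[None, :]) ** 2) / eps    # log of the Gibbs kernel
    g = np.zeros(2)                                # log of column scalings
    for _ in range(n_iters):
        f = log_mu - logsumexp(M + g[None, :], axis=1)   # row marginals
        g = log_nu - logsumexp(M + f[:, None], axis=0)   # column marginals
    return n * np.exp(f + M[:, 0] + g[0])          # soft indicator


x = np.array([0.3, 1.2, -0.5, 0.9, 0.1])
A_log = soft_topk_log(x, k=2, eps=1e-4)

# For comparison: at this eps one row of the naive kernel underflows to exactly zero.
C = (x[:, None] - np.array([1.0, 0.0])[None, :]) ** 2
naive_row_mass = np.exp(-C / 1e-4).sum(axis=1)
```

The log-domain variant costs a few extra exponentials per iteration but keeps every intermediate quantity finite, which matters when the soft operator is pushed close to its hard limit.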
Root-finding for quantile thresholds in the semidiscrete setting exploits monotonicity and empirical process concentration (8 rates). Each top-k or quantile evaluation reduces to a linear program or explicit projection step (Zhu et al., 11 Feb 2026, Liu et al., 2022).
7. Applications and Extensions
The SOFT top-k paradigm is central to differentiable $k$-nearest neighbors, beam search, sparse mixture-of-experts layers, sparse attention modules, soft cumulative ranking/statistics, and quantile regression. In neural architectures, soft top-k layers allow end-to-end training by replacing discontinuous selection with differentiable relaxations, while maintaining interpretability and efficient memory usage (Xie et al., 2020, Cuturi et al., 2019, Liu et al., 2022).
Extensions encompass:
- Sorted SOFT top-k with explicit rank preservation
- Unbalanced OT relaxations for adaptive cardinality
- Geometric partitioning with soft power diagrams driven by quantile or tail cost objectives
- Sparse assignment constraints for computational savings in large-scale or expert-model architectures
Empirical analyses indicate comparable or improved performance relative to cross-entropy or standard softmax models across diverse benchmarks.
The use of optimal transport, especially its entropic and sparsity-constrained regularizations, has enabled robust, efficient, and provably accurate soft top-k and quantile-based operations for core machine learning tasks (Zhu et al., 11 Feb 2026, Liu et al., 2022, Xie et al., 2020, Cuturi et al., 2019).