
Optimal Transport SOFT Top-k

Updated 30 June 2026
  • Optimal transport (OT) is a framework for coupling probability measures at minimal cost subject to marginal constraints; it is widely used to build smooth relaxations of discrete selection.
  • The SOFT top-k operator uses entropic regularization, and related operators use quadratic regularization, to approximate non-differentiable top-k selection and enable gradient-based learning.
  • Efficient algorithms such as Sinkhorn iterations and implicit differentiation compute the relaxed solutions and their gradients, making OT methods scalable for tasks like quantile regression and sparse attention.

Optimal transport (OT) seeks a coupling between two probability measures that matches prescribed marginals and minimizes a given cost. In many classical and modern applications, such as machine learning layers, regression, or selection routines, OT underpins differentiable relaxations of sorting, ranking, quantile estimation, and top-k selection, operations that are otherwise non-smooth or combinatorial. The SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator and related differentiable OT-based approximations constitute a core methodology for end-to-end learning in such settings, leveraging entropic regularization, stochastic root-finding, and sparsity constraints.

1. Discrete Top-k Selection, Non-differentiability, and OT Relaxations

The standard top-k operator maps a score vector $s\in\mathbb{R}^n$ to the set of the $k$ largest elements. Formally, it produces a binary indicator $a\in\{0,1\}^n$ with $\sum_{i=1}^n a_i = k$, where $a_i = 1$ if $s_i$ is among the top-k. This mapping is piecewise constant and discontinuous: its gradient is undefined at ties and vanishes almost everywhere, precluding gradient-based optimization. This motivates relaxation via OT.

Discrete sorting and top-k can be posed as finite OT problems: each data point is a source, targets correspond to the $k$ selection bins, and the goal is to maximize the inner product subject to selection constraints:

$$a^* = \arg\max_{a \in \{0,1\}^n,\ \sum_i a_i = k} a^\top s$$

This combinatorial optimization aligns precisely with an OT problem where mass must be transported from uniform source marginals to targets encoding the selection cardinalities (Xie et al., 2020, Cuturi et al., 2019).
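
As a small illustration of why the hard operator blocks gradient flow, the following sketch computes the binary indicator by sorting and shows that a small change in one score leaves the output unchanged. The function name `hard_top_k` is a hypothetical helper introduced here, not something defined in the cited papers.

```python
def hard_top_k(scores, k):
    """Binary indicator of the k largest scores (hypothetical helper)."""
    # Indices sorted by decreasing score; the first k are selected.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    indicator = [0] * len(scores)
    for i in order[:k]:
        indicator[i] = 1
    return indicator


scores = [0.1, 0.9, 0.4, 0.8, 0.2]
base = hard_top_k(scores, 2)

# Nudging a score without changing the ranking gives the same indicator,
# so the derivative with respect to that score is zero.
nudged = list(scores)
nudged[2] += 1e-3
same_after_nudge = hard_top_k(nudged, 2) == base
```

The output only changes when a perturbation reorders scores across the selection boundary, which is exactly where the mapping jumps.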

2. Entropic OT and the SOFT Top-k Operator

Introducing an entropy penalty renders the OT problem strictly convex and smooths the solution. Given scores $s\in\mathbb{R}^n$:

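  • Let $\mu \in \mathbb{R}^n$, $\mu = \mathbf{1}_n / n$ (uniform source).
  • Target measure $\nu \in \mathbb{R}^2$: $\nu = (k/n,\ (n-k)/n)^\top$.
  • Cost matrix $C \in \mathbb{R}^{n \times 2}$ (squared distance to the "selected" target 1 in the first column and to the "unselected" target 0 in the second):

$$C_{i1} = (s_i - 1)^2, \qquad C_{i2} = s_i^2$$

The entropic OT (EOT) problem is

$$\Gamma^* = \arg\min_{\Gamma} \ \langle C, \Gamma \rangle + \epsilon \sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$$

subject to $\Gamma \mathbf{1}_2 = \mu$, $\Gamma^\top \mathbf{1}_n = \nu$, $\Gamma \ge 0$.

The solution $\Gamma^*$ is unique, smooth in $s$, and computed efficiently by Sinkhorn iterations:

$$\Gamma^* = \mathrm{diag}(u)\, K\, \mathrm{diag}(v), \qquad K = \exp(-C/\epsilon)$$

updating scaling vectors $u \leftarrow \mu \oslash (K v)$, $v \leftarrow \nu \oslash (K^\top u)$ alternately to enforce marginal constraints (Xie et al., 2020, Cuturi et al., 2019).

The output soft top-k indicator is $a^\epsilon = n\,\Gamma^* e_1$ with $e_1 = (1, 0)^\top$, which converges in the $\epsilon \to 0$ limit to the discrete indicator.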
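
The Sinkhorn construction above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the reference implementation of Xie et al.: the cost is taken to be the squared distance of each score to a "selected" target at 1 and an "unselected" target at 0, scores are assumed to lie roughly in [0, 1] so that the kernel does not underflow, and the name `soft_top_k` and its parameters are hypothetical.

```python
import math


def soft_top_k(scores, k, eps=0.1, n_iters=500):
    """Entropic-OT soft top-k indicator via Sinkhorn (illustrative sketch)."""
    n = len(scores)
    mu = [1.0 / n] * n              # uniform source marginal
    nu = [k / n, (n - k) / n]       # mass for "selected" / "unselected"
    # Assumed cost: column 0 is distance to target 1, column 1 to target 0.
    kernel = [[math.exp(-((s - 1.0) ** 2) / eps), math.exp(-(s ** 2) / eps)]
              for s in scores]
    u = [1.0] * n
    v = [1.0, 1.0]
    for _ in range(n_iters):
        # Row scaling: enforce the source marginal.
        u = [mu[i] / (kernel[i][0] * v[0] + kernel[i][1] * v[1])
             for i in range(n)]
        # Column scaling: enforce the target marginal.
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(2)]
    # Rescaled first column of the plan is the soft indicator.
    return [n * u[i] * kernel[i][0] * v[0] for i in range(n)]


soft = soft_top_k([0.1, 0.9, 0.4, 0.8, 0.2], k=2)
```

Because the column scaling is applied last, the entries sum to `k` up to rounding. Smaller `eps` sharpens the indicator but slows Sinkhorn convergence, so the iteration count may need to grow as `eps` shrinks.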

3. Differentiability and Gradient Computation

The entropic OT framework ensures the mapping $s \mapsto a^\epsilon$ is differentiable for every $\epsilon > 0$. The solution $\Gamma^*$ admits dual variables, and gradients can be computed via implicit differentiation of the Sinkhorn fixed-point equations or through KKT conditions:

$$\frac{\partial a^\epsilon}{\partial s} = \frac{\partial a^\epsilon}{\partial \Gamma^*}\,\frac{\partial \Gamma^*}{\partial C}\,\frac{\partial C}{\partial s}$$

The derivatives $\partial \Gamma^* / \partial C$ and ultimately $\partial a^\epsilon / \partial s$ are computable in closed form due to smoothness of the marginal constraint system. Full block-matrix gradient expressions appear in the respective appendices (Xie et al., 2020). This property is central to enabling backpropagation in complex models using SOFT top-k routines as differentiable layers.
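
To make the implicit-differentiation idea concrete, here is a hedged sketch for the two-target case. It assumes squared-distance costs to targets 1 and 0, so that the row constraints reduce each entry to $a_i = \sigma((2 s_i - 1)/\epsilon + t)$, with the scalar $t$ fixed by $\sum_i a_i = k$. Differentiating that constraint gives $\partial a_i / \partial s_j = (2/\epsilon)\,(w_i \delta_{ij} - w_i w_j / \sum_l w_l)$ with $w_i = a_i (1 - a_i)$. This reduced form is a derivation made for illustration, not the block-matrix expression of Xie et al., and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_top_k_exact(scores, k, eps=0.1):
    """Solve sum_i sigmoid((2*s_i - 1)/eps + t) = k for t by bisection."""
    z = [(2.0 * s - 1.0) / eps for s in scores]
    lo, hi = -max(z) - 50.0, -min(z) + 50.0   # brackets the root for 0 < k < n
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(sigmoid(zi + mid) for zi in z) > k:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    return [sigmoid(zi + t) for zi in z]


def soft_top_k_jacobian(a, eps=0.1):
    """Jacobian d a_i / d s_j from implicit differentiation of sum(a) = k."""
    w = [ai * (1.0 - ai) for ai in a]
    total = sum(w)
    n = len(a)
    return [[(2.0 / eps) * ((w[i] if i == j else 0.0) - w[i] * w[j] / total)
             for j in range(n)] for i in range(n)]


a = soft_top_k_exact([0.1, 0.9, 0.4, 0.8, 0.2], k=2)
jac = soft_top_k_jacobian(a)
```

Each column of the Jacobian sums to zero, reflecting that the total selected mass stays fixed at `k` when a score changes.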

4. Sparsity-Constrained OT and Soft Top-k via Quadratic Regularization

In contrast to entropic regularization, quadratic regularization enables explicit control over sparsity via cardinality constraints. The primal formulation imposes:

$$\min_{T \in \mathcal{U}(\mu,\nu)} \ \langle T, C \rangle + \sum_{j} \Omega(t_j), \qquad \Omega(t) = \frac{\gamma}{2}\|t\|_2^2 + \delta_{\mathcal{B}_k}(t)$$

where $t_j$ is the $j$-th column of the plan $T$, $\mathcal{B}_k = \{t : \|t\|_0 \le k\}$ is the $\ell_0$ ball, and $\delta_{\mathcal{B}_k}$ is the corresponding indicator.

The semi-dual problem is

$$\max_{\alpha} \ \alpha^\top \mu - \sum_{j} \Omega^*_{\nu_j}(\alpha - c_j)$$

with explicit expressions for the conjugate and its gradient:

$$\Omega^*_{b}(z) = \max_{t \in b\Delta \cap \mathcal{B}_k} \ \langle z, t \rangle - \frac{\gamma}{2}\|t\|_2^2, \qquad \big[\nabla \Omega^*_{b}(z)\big]_i = \begin{cases} [z_i - \tau]_+ / \gamma & i \in \text{top-}k(z) \\ 0 & \text{otherwise} \end{cases}$$

This yields the "soft top-k" operator: take the $k$ largest entries of $z$, subtract a normalization scalar $\tau$ (found by simplex projection), and threshold. This operator interpolates between hard selection for $k = 1$ and fully dense mapping as $k$ increases (Liu et al., 2022).

Gradient-based optimization of this sparse OT is tractable and efficient: each update requires only a top-$k$ selection and a simplex projection per column.
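
The "top-k, shift, threshold" recipe can be sketched as follows. This is an illustrative version under simplifying assumptions: it projects onto the unit simplex (total mass 1) instead of a column-specific mass, it uses the standard sort-based simplex projection, and the names `project_simplex` and `sparse_soft_top_k` are hypothetical rather than taken from Liu et al.

```python
def project_simplex(values):
    """Euclidean projection onto the probability simplex (sort-based)."""
    ordered = sorted(values, reverse=True)
    running = 0.0
    tau = 0.0
    for j, value in enumerate(ordered, start=1):
        running += value
        candidate = (running - 1.0) / j
        # The condition holds for a prefix of the sorted values;
        # the last candidate that satisfies it is the threshold.
        if value - candidate > 0:
            tau = candidate
    return [max(x - tau, 0.0) for x in values]


def sparse_soft_top_k(z, k, gamma=1.0):
    """Keep the k largest entries of z/gamma and project them onto the simplex."""
    top = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    projected = project_simplex([z[i] / gamma for i in top])
    out = [0.0] * len(z)
    for i, p in zip(top, projected):
        out[i] = p
    return out


weights = sparse_soft_top_k([0.2, 1.0, 0.7, 0.1], k=2, gamma=1.0)
```

The result has at most `k` nonzero entries by construction. In this sketch, shrinking `gamma` concentrates the mass on the single largest entry, while larger `gamma` spreads it more evenly across the selected entries.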

5. Quantile Optimization and Soft Top-k in Semidiscrete OT

Recent work studies minimizing quantiles of the cost, rather than its mean, in semidiscrete OT. Here, one measure is continuous ($\mu$ with density $f$), the other discrete ($\nu$ with probabilities $\nu_1, \dots, \nu_m$), and the cost is $c(x, y)$.

The quantile OT problem is:

$$\min_{\pi \in \Pi(\mu,\nu)} \ Q_\alpha^\pi(c)$$

where $Q_\alpha^\pi(c)$ is the $\alpha$-quantile of the cost under the coupling $\pi$.

Optimal solutions reduce to a finite-dimensional convex program in finitely many binary variables. The "tie-breaking" necessary to preserve prescribed marginals, when the assignment cell $A(x)$ of $x$ comprises multiple indices, is realized by entropy-regularized softmax distributions on the active set.

The softmax tie-breaking rule is formally:

$$\pi(j \mid x) = \frac{\exp(g_j)}{\sum_{l \in A(x)} \exp(g_l)}, \qquad j \in A(x)$$

where $g$ solves a strictly convex, unconstrained dual. Stochastic approximation algorithms suffice for finding $g$. The induced randomized kernels converge at a provable rate, as does the estimate of the optimal quantile threshold $t^*$ (Zhu et al., 11 Feb 2026).

This framework exhibits a "soft power diagram" geometry: for each class $j$, the cell $A_j = \{x : j \in A(x)\}$ is defined by sublevel sets; where these overlap, probabilistic mixing occurs based on $g$. The geometry interpolates between deterministic power diagrams at the mean and randomized, partially overlapping cells at higher quantiles.
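
The softmax tie-breaking step described above can be illustrated with a short sketch. It assumes the dual vector and the active set for a point are already available. The exact parameterization (for example, any temperature factor) is an assumption here, not taken from Zhu et al., and the name `softmax_on_active_set` is hypothetical.

```python
import math


def softmax_on_active_set(dual, active):
    """Softmax of the dual values restricted to the active indices."""
    # Subtract the maximum for numerical stability.
    shift = max(dual[j] for j in active)
    weights = {j: math.exp(dual[j] - shift) for j in active}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


# A point whose assignment cell contains indices 0 and 2 only.
probs = softmax_on_active_set([0.0, 1.0, 2.0], active=[0, 2])
```

Indices outside the active set receive no mass, and when the active set is a single index the rule reduces to a deterministic assignment.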

6. Algorithmic Implementation and Scalability

Algorithmic implementation is streamlined by the underlying regularized OT structure. For the SOFT top-k:

  • Sinkhorn iterations on the $n \times 2$ kernel cost $O(n)$ each ($O(nL)$ for $L$ iterations)
  • Memory requirements are low (storage of kernel $K$, scaling vectors)
  • For sparse OT, projections onto top-$k$ + simplex constraints are efficiently computable by sorting ($O(n \log n)$)
  • Closed-form expressions for gradients are available for both entropic and quadratic regimes

Root-finding for quantile thresholds in the semidiscrete setting exploits monotonicity and empirical process concentration. Each top-k or quantile evaluation reduces to a linear program or explicit projection step (Zhu et al., 11 Feb 2026, Liu et al., 2022).
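
As a generic illustration of monotone root-finding for a quantile threshold, the sketch below bisects on the empirical distribution function of sampled costs. It is a simplified stand-in under the assumption that cost samples are available as a list. It is not the stochastic algorithm of Zhu et al., and the name `empirical_quantile_threshold` is hypothetical.

```python
def empirical_quantile_threshold(costs, alpha, tol=1e-9):
    """Smallest t (up to tol) with at least a fraction alpha of costs <= t."""
    lo, hi = min(costs), max(costs)
    n = len(costs)
    # The empirical CDF is non-decreasing in t, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(1 for c in costs if c <= mid) / n >= alpha:
            hi = mid    # enough mass below mid: the threshold is at most mid
        else:
            lo = mid    # too little mass below mid: the threshold is larger
    return hi


threshold = empirical_quantile_threshold(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], alpha=0.5)
```

Each bisection step costs one pass over the samples, and the number of steps grows only logarithmically in the ratio of the initial bracket width to the tolerance.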

7. Applications and Extensions

The SOFT top-k paradigm is central to differentiable $k$-nearest neighbors, beam search, sparse mixture-of-experts layers, sparse attention modules, soft cumulative ranking/statistics, and quantile regression. In neural architectures, soft top-k layers allow end-to-end training by replacing discontinuous selection with differentiable relaxations, while maintaining interpretability and efficient memory usage (Xie et al., 2020, Cuturi et al., 2019, Liu et al., 2022).

Extensions encompass:

  • Sorted SOFT top-k with explicit rank preservation
  • Unbalanced OT relaxations for adaptive cardinality
  • Geometric partitioning with soft power diagrams driven by quantile or tail cost objectives
  • Sparse assignment constraints for computational savings in large-scale or expert-model architectures

Empirical analyses indicate comparable or improved performance relative to cross-entropy or standard softmax models across diverse benchmarks.


The use of optimal transport, especially its entropic and sparsity-constrained regularizations, has enabled robust, efficient, and provably accurate soft top-k and quantile-based operations for core machine learning tasks (Zhu et al., 11 Feb 2026, Liu et al., 2022, Xie et al., 2020, Cuturi et al., 2019).
