TopKSim: Differentiable Top-K Averaging
- Top-K Averaging (TopKSim) is a class of differentiable operators that approximate hard top-k selection using continuous relaxations.
- It employs tournament-style and LapSum-based methods to transform discrete ranking into a smooth, gradient-friendly process.
- TopKSim is applied in retrieval, ranking, and set aggregation tasks, demonstrating state-of-the-art efficiency and accuracy on benchmarks.
Top-$k$ Averaging (TopKSim) refers to a class of differentiable operators designed to aggregate the top-$k$ elements of a set of vectors according to a learned, input-specific score. In classical settings, selecting and averaging the top-$k$ elements is non-differentiable due to hard thresholding and discrete sorting operations, making it incompatible with gradient-based optimization. Modern TopKSim operators use continuous relaxations of the top-$k$ selection process, enabling end-to-end learning in neural architectures for retrieval, ranking, and set-aggregation tasks. Representative methods include the Successive Halving Top-$k$ Operator (Pietruszka et al., 2020) and the LapSum framework (Struski et al., 2025), both of which have been shown to offer efficient, scalable, and accurate approximations of the hard top-$k$ average.
1. Mathematical Formulation of TopKSim
Given a matrix of representations $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$ and a corresponding score vector $s \in \mathbb{R}^n$, the classical (hard) top-$k$ averaging operator is defined by

$$\mathrm{TopK}(X, s) = \frac{1}{k} \sum_{i \in T_k(s)} x_i,$$

where $T_k(s)$ denotes the indices of the $k$ largest values in $s$. This operation reduces a set to its $k$ most "important" representatives based on scoring, and takes their unweighted mean.

The hard top-$k$ mask $m \in \{0,1\}^n$ has $m_i = 1$ for the indices corresponding to the $k$ largest $s_i$ and $m_i = 0$ otherwise, with $\sum_i m_i = k$. TopKSim relaxes $m$ to a continuous, differentiable vector $w \in [0,1]^n$, maintaining $\sum_i w_i = k$, and forms the soft average

$$\widehat{\mathrm{TopK}}(X, s) = \frac{1}{k} \sum_{i=1}^{n} w_i x_i.$$

As the temperature or sharpness parameter of the relaxation tends to its limiting value (e.g., $\varepsilon \to 0$ or $\alpha \to 0$), $w$ approaches $m$ and $\widehat{\mathrm{TopK}} \to \mathrm{TopK}$.
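As a concrete sketch of the relaxation above, the following minimal NumPy example contrasts the hard and soft top-$k$ averages. The sigmoid mask around a fixed threshold is an illustrative stand-in for the more principled relaxations discussed below; all function names are hypothetical.

```python
import numpy as np

def hard_topk_avg(X, s, k):
    """Hard top-k average: mean of the rows of X with the k largest scores."""
    idx = np.argsort(s)[-k:]
    return X[idx].mean(axis=0)

def soft_topk_avg(X, s, k, tau=0.1):
    """Soft relaxation (sketch): sigmoid mask around the midpoint of the
    k-th and (k+1)-th largest scores; requires k < len(s)."""
    srt = np.sort(s)[::-1]
    b = 0.5 * (srt[k - 1] + srt[k])           # threshold between top-k and the rest
    w = 1.0 / (1.0 + np.exp(-(s - b) / tau))  # continuous mask in (0, 1)
    return (w[:, None] * X).sum(axis=0) / w.sum()

X = np.arange(32.0).reshape(8, 4)
s = np.array([0.1, 2.0, -1.0, 3.0, 0.5, 2.5, -0.3, 1.0])
hard = hard_topk_avg(X, s, k=3)            # mean of rows 3, 5, 1
soft = soft_topk_avg(X, s, k=3, tau=1e-2)  # nearly identical at low temperature
```

At low temperature the sigmoid saturates to 0/1 and the two outputs coincide, matching the limiting behavior described above.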
2. Differentiable Relaxations of Top-$k$ Selection
Tournament-Style (Successive Halving) Relaxation
The Successive Halving Top-$k$ Operator (Pietruszka et al., 2020) replaces the discrete top-$k$ selection with a differentiable, tournament-based elimination process. The method:
- Iteratively pairs candidates, computing a boosted two-element softmax over each pair's scores,
- Merges features and scores according to the softmax weights,
- Halves the number of active candidates at each round, repeating until $k$ survivors remain,
- Tracks soft contributions via a sequence of merge matrices, yielding a continuous selection mask $w$.
The mask is formally given as

$$w^\top = \mathbf{1}^\top P^{(R)} \cdots P^{(1)},$$

where the product of per-round merge matrices $P^{(R)} \cdots P^{(1)} \in \mathbb{R}^{k \times n}$ accumulates the effects of the per-round pairwise soft eliminations, so that $w_i$ measures the total soft contribution of element $i$ to the $k$ survivors. Differentiability is preserved via standard softmax gradient propagation; the entire procedure is compatible with end-to-end backpropagation.
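A minimal NumPy sketch of the tournament idea follows. A plain two-element softmax stands in for the paper's boosted variant, and the sketch assumes the number of candidates and $k$ are powers of two; variable names are illustrative.

```python
import numpy as np

def tournament_topk_avg(X, s, k, tau=0.1):
    """Differentiable tournament sketch: repeatedly merge adjacent pairs
    with a two-element softmax until k soft 'survivors' remain, then
    average them. Assumes len(s) and k are powers of two."""
    X, s = X.astype(float).copy(), s.astype(float).copy()
    while len(s) > k:
        a, b = s[0::2], s[1::2]
        m = np.maximum(a, b)                      # stabilize the softmax
        ea, eb = np.exp((a - m) / tau), np.exp((b - m) / tau)
        pa = ea / (ea + eb)                       # soft probability the left entry wins
        X = pa[:, None] * X[0::2] + (1 - pa)[:, None] * X[1::2]
        s = pa * a + (1 - pa) * b
    return X.mean(axis=0)

X = np.arange(32.0).reshape(8, 4)
s = np.array([5.0, 0.0, 1.0, 0.0, 6.0, 0.0, 2.0, 0.0])
y = tournament_topk_avg(X, s, k=2, tau=1e-2)  # approx. mean of rows 0 and 4
```

With a small temperature each pair's winner passes through with weight close to 1, so the output approaches the hard top-2 average of the two highest-scoring rows.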
LapSum-Based (Closed-Form) Relaxation
LapSum (Struski et al., 2025) provides a general mechanism for differentiable order statistics using a sum of Laplace cumulative distribution functions (CDFs). For a (sorted) score vector $s \in \mathbb{R}^n$ and scale parameter $\alpha > 0$, define

$$F(b) = \sum_{i=1}^{n} \Phi_\alpha(s_i - b),$$

where $\Phi_\alpha$ is the CDF of a zero-mean Laplace distribution with scale $\alpha$.

Soft top-$k$ proceeds by:
- Identifying $b^*$, the unique threshold such that the relaxed selection sums to $k$, i.e., $F(b^*) = k$,
- Setting $w_i = \Phi_\alpha(s_i - b^*)$, giving a soft mask with $\sum_i w_i = k$,
- Forming the normalized averaging weights $w_i / k$ and the soft mean $\hat{y} = \frac{1}{k} \sum_i w_i x_i$.

The entire procedure, both forward (TopKSim) and backward (gradient), runs in $O(n \log n)$ time and $O(n)$ memory.
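The steps above can be sketched in NumPy as follows. Bisection stands in for LapSum's closed-form threshold inversion, and the function names are illustrative, not the library's API.

```python
import numpy as np

def laplace_cdf(z):
    """CDF of a zero-mean, unit-scale Laplace distribution."""
    return np.where(z < 0, 0.5 * np.exp(np.minimum(z, 0)),
                    1.0 - 0.5 * np.exp(-np.maximum(z, 0)))

def lapsum_topk_weights(s, k, alpha=0.1, iters=100):
    """Soft top-k mask: find threshold b with sum_i CDF((s_i - b)/alpha) = k.
    Bisection is used here for clarity; LapSum inverts the sum in closed form."""
    lo, hi = s.min() - 50 * alpha, s.max() + 50 * alpha
    for _ in range(iters):
        b = 0.5 * (lo + hi)
        if laplace_cdf((s - b) / alpha).sum() > k:
            lo = b        # mask too heavy: raise the threshold
        else:
            hi = b        # mask too light: lower the threshold
    return laplace_cdf((s - b) / alpha)

s = np.array([3.0, 0.0, 2.0, -1.0, 1.0])
w = lapsum_topk_weights(s, k=2, alpha=0.01)  # approx. hard mask for scores 3.0, 2.0
X = np.arange(15.0).reshape(5, 3)
y = (w[:, None] * X).sum(axis=0) / w.sum()   # soft top-2 average
```

Because the sum of CDFs is strictly decreasing in the threshold, the root is unique, and the resulting weights sum to $k$ up to solver tolerance.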
3. Computational Complexity and Implementation
Both tournament-style and LapSum-based TopKSim methods are designed for efficiency.
- Tournament-Style (Successive Halving): Each round pairs and softmaxes the surviving candidates, requiring $O(\log(n/k))$ rounds. Because the candidate count halves each round, the total cost is $O(n)$ arithmetic operations, plus an optional $O(n \log n)$ for per-round sorts. Compared to prior iterative softmax relaxations ($O(nk)$), this is substantially faster for moderate $k$ (Pietruszka et al., 2020).
- LapSum: LapSum requires a single $O(n \log n)$ presort, $O(n)$ forward/backward passes, and $O(n)$ search/inversion steps, with all operations amenable to parallelization. Implementations are concise in C++/CUDA, as the primitives are simple scans and pointwise updates (Struski et al., 2025).
This efficiency makes both approaches scalable to large values of $n$ and $k$, overcoming practical limitations of earlier differentiable sorting, ranking, or selection layers.
4. Gradient Backpropagation and End-to-End Training
TopKSim layers, as constructed using either Successive Halving or LapSum, support fully differentiable backpropagation.
- Tournament-Style: Gradients flow through each boosted-softmax and merge stage, with per-pair derivatives following standard softmax calculus, and the complete gradient obtained via the chain rule.
- LapSum: The derivative of the output mean w.r.t. the inputs $X$ is a simple weighted-average gradient. For gradients w.r.t. the scores $s$, Jacobians are computed analytically in closed form via implicit-function-theorem calculus, requiring only quantities already computed in the forward pass.
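The smoothness property can be checked numerically. Below, a central finite difference on an illustrative sigmoid-relaxed top-$k$ average (a hypothetical scheme for this sketch, not either paper's exact construction) yields a finite, nonzero gradient w.r.t. a score outside the top-$k$, whereas the hard operator's gradient w.r.t. the scores is zero almost everywhere.

```python
import numpy as np

def soft_topk_avg(X, s, k=2, tau=0.5):
    """Illustrative sigmoid-relaxed top-k average; the threshold is set
    just below the k-th largest score (hypothetical scheme for this sketch)."""
    b = np.sort(s)[::-1][k - 1] - tau
    w = 1.0 / (1.0 + np.exp(-(s - b) / tau))
    return (w[:, None] * X).sum(axis=0) / w.sum()

X = np.arange(12.0).reshape(4, 3)
s = np.array([2.0, 0.5, 1.5, -1.0])
loss = lambda sv: soft_topk_avg(X, sv).sum()

eps = 1e-5
e1 = np.zeros_like(s); e1[1] = eps              # perturb a score outside the top-2
g = (loss(s + e1) - loss(s - e1)) / (2 * eps)   # finite, nonzero soft gradient
```

The nonzero finite-difference gradient confirms that even de-selected elements receive learning signal through the soft mask, which is exactly what hard top-$k$ selection destroys.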
This property is essential for integrating TopKSim modules into neural models for learning similarity, aggregation, and retrieval functions.
5. Empirical Performance and Comparison with Prior Methods
Empirically, LapSum-based soft top-$k$/TopKSim achieves state-of-the-art accuracy, speed, and memory use in large-scale ranking and aggregation tasks.
- On benchmarks such as CIFAR-100 (ResNet-18) and ImageNet-1K/21K-P (ResNeXt-101), LapSum consistently matches or surpasses prior soft-sort and soft-permute methods (e.g., NeuralSort, SoftSort, SinkhornSort) in Top-1 and Top-5 accuracy, especially as class count increases.
- For large $n$ and $k$, LapSum is 2–5× faster than existing alternatives and remains $O(n)$ in memory, even as competitors become infeasible.
- The total weight error $\left|\sum_i w_i - k\right|$ is at machine precision, addressing a common deficiency of other relaxations (Struski et al., 2025).
- The Successive Halving operator (Pietruszka et al., 2020) exhibits higher normalized Chamfer cosine similarity (nCCS) than iterative baselines for all tested $k$, with accuracy degrading only slowly as $k$ increases and provably converging to the hard top-$k$ as the relaxation is sharpened.
6. Integration and Applications in Retrieval and Set Aggregation
TopKSim layers are directly usable in neural architectures requiring soft top-$k$ aggregation for retrieval, similarity, or set-representation tasks.
- For retrieval, TopKSim provides a continuous relaxation of "best match" or top-$k$ pooling, which can be embedded into ranking losses or metric-learning pipelines.
- Set-aggregation tasks benefit from TopKSim by obtaining permutation-invariant, differentiable summaries of salient elements.
- The sharpness or temperature parameter (the boosting temperature in Successive Halving, the scale $\alpha$ in LapSum) can be tuned or annealed to control the bias-variance tradeoff between faithful top-$k$ emulation and smooth optimization.
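A quick numerical illustration of annealing, using a sigmoid relaxation with a fixed midpoint threshold as a stand-in for either method's sharpness parameter: decreasing the temperature drives the soft mask toward the hard one.

```python
import numpy as np

s = np.array([3.0, 0.0, 2.0, -1.0, 1.0]); k = 2
hard = np.zeros_like(s); hard[np.argsort(s)[-k:]] = 1.0   # hard top-2 mask

srt = np.sort(s)[::-1]
b = 0.5 * (srt[k - 1] + srt[k])              # midpoint threshold between top-k and rest
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# max deviation of the soft mask from the hard mask at decreasing temperatures
errs = [np.abs(sigmoid((s - b) / tau) - hard).max() for tau in (1.0, 0.1, 0.01)]
```

The deviation shrinks monotonically as the temperature is annealed, mirroring the bias side of the tradeoff: a sharper mask is more faithful to hard top-$k$ but yields steeper, less smooth gradients.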
Table 1 summarizes key computational features of major TopKSim methods:
| Method | Forward/Backward Cost | Memory | Notes |
|---|---|---|---|
| Successive Halving Top-k (Pietruszka et al., 2020) | $O(n)$ – $O(n \log n)$ | $O(n)$ | Chain of soft tournament merges |
| LapSum (Struski et al., 2025) | $O(n \log n)$ | $O(n)$ | Closed-form, invertible soft threshold |
| Iterated Softmax (prior art) | $O(nk)$ | $O(n)$ | $k$ global softmax passes needed |
Both frameworks enable practical, fully differentiable top-$k$ aggregation for large datasets and architectures.
7. Summary and Outlook
TopKSim denotes a family of differentiable top-$k$ averaging operators, with representative realizations including the Successive Halving operator (Pietruszka et al., 2020) and the LapSum methodology (Struski et al., 2025). These methods provide mathematically principled, computationally efficient, and empirically robust means to approximate hard top-$k$ selection and averaging in neural applications requiring optimization through subset selection. Their modularity and compatibility with standard autodiff allow seamless integration into ranking, retrieval, and robust permutation-invariant architectures. A plausible implication is that continued development of closed-form, scalable, and accurate soft order statistics will further broaden the range of tasks amenable to differentiable subset selection.