Top-k and Differentiable Masking

Updated 5 May 2026

Top-k and differentiable masking are techniques to enable sparse selection in neural networks by approximating hard, non-differentiable operators with smooth surrogates.
They utilize various relaxation frameworks such as entropic optimal transport, dynamic programming, and Gumbel sampling to maintain effective gradient flow and computational efficiency.
Applications span NLP, vision, and recommender systems, enhancing model interpretability, sparsity control, and scalability in deep learning architectures.

Top-k and differentiable masking are foundational operations for sparse selection in neural networks, enabling structured control of information flow in deep architectures. The hard top-k operator, which selects the $k$ largest (or smallest) elements of an input vector, is non-differentiable and ill-suited for gradient-based learning. Differentiable masking refers to smooth, trainable relaxations of such discrete selection mechanisms, allowing end-to-end optimization in large-scale models. The rich literature on these topics encompasses optimal transport–based relaxations, differentiable dynamic programming, convex-regularized linear programs, efficient Gumbel sampling, and amortized gating networks, with applications spanning vision, language, recommender systems, structured prediction, and scientific computing.

1. Mathematical Formulations of Top-k and Differentiable Masking

The hard top-k operator $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ outputs a binary vector that marks the $k$ highest-scoring positions in its argument. Mathematically, this can be written as:

$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$

This operation's discontinuity means $\frac{\partial}{\partial s_i} \mathrm{TOPK}_k(s)$ vanishes almost everywhere, precluding backpropagation. Differentiable masking aims to construct a smooth surrogate $m(s) \in [0,1]^n$ with $\sum_i m_i = k$ that approximates the hard mask while exposing well-behaved gradients.

Several frameworks have been established for achieving this:

Entropic optimal transport relaxation: SOFT top-k via EOT, yielding a mask as the marginal of the transport plan (Xie et al., 2020).
Dynamic programming with soft-max recursion: Smoothing the combinatorial DP underlying knapsack/top-k to obtain differentiable, structured masks (Vivier-Ardisson et al., 29 Jan 2026).
Convex analysis and isotonic optimization: Solving a regularized linear program over the capped simplex, with solutions via isotonic regression (Sander et al., 2023).
Laplace CDF–based invertible maps: LapSum constructs soft top-k and sorting by inverting sums of Laplace CDFs, yielding closed-form, O( $n\log n$ ) solutions (Struski et al., 8 Mar 2025).
Relaxed capped simplex projection: DFTopK proposes a quadratic program for thresholded caps, leading to linear-time, closed-form masks (Zhu et al., 13 Oct 2025).
Gumbel-based relaxed sampling: Using the Gumbel-Top-k trick for stochastic, differentiable selection in subset sampling (Jeon et al., 18 Jan 2025).
Amortized gating via learned probes: DiffMask attaches small MLPs at each network layer to output masks through Hard-Concrete reparameterization (Cao et al., 2020).

2. Algorithms and Relaxation Techniques

The core of any differentiable masking operator is the surrogate function, which balances faithfulness to the discrete mask with gradient tractability and computational efficiency.

Sinkhorn/EOT-based soft top-k: Formulate the selection as mass transport from items to $\{0,1\}$ , entropize the cost function, and extract top-k marginal via iterative Sinkhorn scaling. Forward and backward algorithms run in $O(n)$ per pass; sorted variants support order-aware selections (Xie et al., 2020).
Dynamic programming (DP) smoothing: Replace hard max in the Bellman recursion for top-k with convex regularized (e.g., Shannon entropy or Gini/Tsallis) soft-max, enabling efficient, batched vector-Jacobian products and pathwise gradients. Shannon entropy gives equivariance; alternative regularizers trade off sparsity (Vivier-Ardisson et al., 29 Jan 2026).
Convex regularization + isotonic regression: The capped simplex constraint is regularized with a $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 0-norm. The optimal mask is found by a sequence of isotonic merges (“pool adjacent violators”) or via Dykstra's projections, all in $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 1 or better (Sander et al., 2023).
LapSum closed-form inversion: Soft top-k selection is framed as finding the threshold $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 2 where the sum of Laplace CDFs matches $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 3, then constructing a mask as $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 4. Analytical formulas for derivatives and masks enable efficient large-scale use (Struski et al., 8 Mar 2025).
Capped simplex QP / DFTopK: Solve $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 5 with $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 6 via linear-time thresholding and projection, yielding nearly diagonal Jacobians and O( $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 7) runtime (Zhu et al., 13 Oct 2025).
Gumbel-Top-k relaxation: Sequential, temperature-controlled Gumbel perturbations with straight-through gradients, selecting $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 8 samples without replacement for efficient inference and gradient flow (Jeon et al., 18 Jan 2025).
Successive halving: Tournament style selection via repeated two-way softmax pairings reduces complexity relative to full softmax enumeration and is amenable to parallel execution (Pietruszka et al., 2020).
Amortized Hard-Concrete gating: DIFFMASK networks combine per-layer MLP probes with a Hard-Concrete gate per position, forming $\mathrm{TOPK}_k: \mathbb R^n \to \{0,1\}^n$ 9 by reparameterized sampling and soft-OR across layers (Cao et al., 2020).

3. Integration in Neural Architectures

Differentiable masking and top-k selection are critical in architecting sparsity, interpretability, and structured computation. Integration strategies include:

Feature and input attribution: DIFFMASK enables layer-wise analysis of input contribution by learning the minimal unmasked subset required to preserve model predictions, supporting both input and hidden-layer masking (Cao et al., 2020).
Sparse attention: Soft top-k is embedded in attention modules to restrict computation to the $k$ 0 best keys, with masking implemented via entropic OT or LapSum relaxations. CUDA-fused kernels support tractable blockwise masking for diffusion models (Zhang et al., 13 Feb 2026).
Mixture-of-Experts (MoE) routing: Convex or LapSum-based top-k gates select a subset of experts for each example, enabling dynamic-path and resource-aware inference (Sander et al., 2023 Struski et al., 8 Mar 2025).
Sparse k-NN and ranking: Soft top-k with optimal transport, LapSum, or dynamic programming regularizers enables fully end-to-end, differentiable approximate-nearest-neighbor selection and ranking-metric computation in recommender systems (Xie et al., 2020 Lee et al., 2020 Zhu et al., 13 Oct 2025).
Patch/evidence selection in medical imaging: Differentiable top-k modules are used to sample and aggregate relevant 3D patches, replacing sliding window inference in segmentation or maximizing anomaly localization (Jeon et al., 18 Jan 2025 Huang et al., 2023).
Structured RL and decision-focused learning: Knapsack-style DP relaxations allow action or assortment selection in resource-constrained decision latent spaces, supporting direct regret or reward-based objectives (Vivier-Ardisson et al., 29 Jan 2026).

4. Comparison to Hard Top-k and Masking

Traditional hard top-k selection is optimal for subset cardinality but inaccessible to gradient-based learning, and various masking heuristics (greedy, beam search, erasure) suffer from either combinatorial intractability or hindsight bias (Cao et al., 2020). Differentiable alternatives offer:

Gradient flow and supervisor signal: All differentiable top-k surrogates expose dense or block-sparse gradients to upstream layers, enabling parameter learning that directly reflects selection and weighting impact (Xie et al., 2020 Zhu et al., 13 Oct 2025).
Budget and sparsity control: Through Lagrangian constraints, entropy regularization, or hard-thresholding post hoc, these methods allow fine-grained control over the expected or exact selected subset size (Sander et al., 2023 Cao et al., 2020).
Statistical faithfulness: Differentiable methods can attribute relevance to features as they become significant across layers or sequence positions, rather than only after output scoring, supporting interpretability (Cao et al., 2020).
Computational scaling: While early approaches (e.g., iterative softmax, O( $k$ 1)) struggled with large $k$ 2 or $k$ 3, modern methods (LapSum, DFTopK, Dykstra, Gumbel) achieve $k$ 4 or $k$ 5 with efficient memory footprints (Struski et al., 8 Mar 2025 Zhu et al., 13 Oct 2025 Sander et al., 2023).

5. Empirical Results and Applications

Extensive experiments in various domains have established the practical impact of differentiable masking and top-k:

NLP (BERT/SQuAD/SST): DIFFMASK discards up to 95% of tokens by the final layer with negligible loss of QA accuracy (90%+ retained). Sentiment-relevant tokens persist across layers, while function words drop early (Cao et al., 2020).
Vision (MLP pruning, transformer routing, patch selection): Soft top-k pruning maintains accuracy with 90% sparsity, fine-tuned vision transformers achieve lower top- $k$ 6 error for $k$ 7, and mixture-of-experts routing yields precision gains (Sander et al., 2023 Struski et al., 8 Mar 2025).
Recommender systems and ranking: Differentiable ranking metrics based on relaxed sorting achieve $k$ 8 to $k$ 9 improvements in NDCG/Hit@K over baselines, with mild computational overhead (Lee et al., 2020 Zhu et al., 13 Oct 2025).
Medical imaging (segmentation, anomaly detection): Differentiable top-k patch sampling reduces inference cost 9–11× with no accuracy degradation, and differentiable top-k feature adaptation outperforms hard selection in AUROC by $[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$0–$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$1 (Jeon et al., 18 Jan 2025 Huang et al., 2023).
Efficient sparse attention: SpargeAttention2 achieves $[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$2 sparsity and 16.2× attention-module speedup on video diffusion, outperforming all prior methods at the same quality (Zhang et al., 13 Feb 2026).

6. Computational Complexity and Implementation Insights

Methodologically, modern differentiable top-k methods differ significantly in asymptotic and realized performance:

Methodology	Complexity	Gradient Properties
Hard sort/top-k	$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$3	Zero (nondiff.)
Sinkhorn/EOT	$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$4	Dense Jacobian
Dynamic prog. smooth	$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$5/$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$6	Pathwise, equivariant
LapSum	$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$7	Closed-form, sparse
Capped simplex/DFTopK	$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$8	Nearly diagonal
Successive Halving	$[\mathrm{TOPK}_k(s)]_i = \begin{cases} 1 & \text{if } s_i \text{ is among the $k $largest entries of$ s$} \ 0 & \text{otherwise} \end{cases}$9	Composed, backprop
Gumbel-Top-k	$\frac{\partial}{\partial s_i} \mathrm{TOPK}_k(s)$ 0	Stochastic, ST estimator

Efficient implementations rely on batch vectorization, fused CUDA kernels (block-sparse attention (Zhang et al., 13 Feb 2026)), and order-statistics routines. For convolutional or transformer-based architectures, differentiable masking is plug-and-play, requiring minor head or router modifications.

7. Theoretical Perspectives and Open Directions

Recent work has rigorously characterized the regularizers permitting equivariance (only Shannon entropy yields full permutation equivariance for soft max recurrences) (Vivier-Ardisson et al., 29 Jan 2026). Sparsity can be enforced by choosing regularizers with bounded derivatives (e.g., Gini, Tsallis). Analytical gradient expressions (LapSum, capped simplex) minimize global competition among selections, addressing “gradient conflicts” plaguing sorting-based surrogates (Zhu et al., 13 Oct 2025).

Open directions include combining subsetwise and budget-adaptive sparsity, hybrid discrete-continuous selection (e.g., blockwise top-k + top-p union), and leveraging differentiable masking to probe decision emergence across network depths (as in DIFFMASK (Cao et al., 2020)). The ubiquity of these operators across scientific and industrial workloads further motivates continued advances in scalability, stability, and interpretive fidelity.