Hard Top-k Routing Techniques
- Hard top-k routing is a discrete selection process that chooses exactly k top candidates under combinatorial and capacity constraints in applications like Mixture-of-Experts and route planning.
- It employs strategies such as intra-GPU rectification and A*-style search to mitigate overflow and vacancy issues in both deep learning and graph optimization contexts.
- Recent advances demonstrate improved token utilization and rapid sub-second route planning, balancing optimality, speed, and resource management.
Hard top-k routing refers to the class of combinatorial selection and allocation strategies in which, given a set of candidates (such as experts in neural MoE models or vertices in a network), exactly $k$ top-scoring or otherwise preferred options are chosen for each instance, subject to various constraints (such as capacity or cost). This paradigm arises in sparse deep learning architectures—most notably Mixture-of-Experts (MoE) models—and in combinatorial optimization for route planning on graphs, as exemplified by the top-k route search and KOSR (top-k optimal sequenced routes) literature. Hardness stems from the discrete and combinatorial nature of selection, non-convex constraints, and requirements for optimality or near-optimality, often compounded by per-selection limitations such as per-node expert token quotas, global budgets, or monotone submodular set objectives.
1. Definition and Scope of Hard Top-k Routing
Hard top-k routing encompasses problems and algorithms where, for each input instance (such as a token in MoE or a node in routing), the algorithm selects exactly $k$ recipients (experts, outgoing links, or next hops) based on some scoring function. Formally, given scores $s_{i,j}$ over instances $i$ and options $j$, each instance $i$ is routed to its $k$ highest-scoring options—the top-$k$ selection—often under additional per-option capacity or global cost constraints.
The “hardness” of top-k routing stems from its combinatorial search structure, the potential for conflicts under mutual constraints (such as overflows in MoE or budget/visit constraints in routing), and the impossibility of relaxing the problem into convex optimization as would be feasible in “soft” routing schemes. Key applications include:
- Sparse Mixture-of-Experts neural networks, which route each token to a small subset of $k$ of the experts per layer (Zeng et al., 17 Feb 2024).
- Route recommendation and trip planning in transport and GIS, including sequenced visits to points of interest (POIs) under cost or feature constraints (Liu et al., 2018, Liang et al., 2017).
2. Hard Top-k Routing in Mixture-of-Experts Architectures
In sparse MoE models, hard top-k routing governs the assignment of each token representation to exactly $k$ experts chosen from $N$ candidates. The standard hard top-k gating procedure consists of:
- Computing scores $s_j(x)$ for each expert $j = 1, \dots, N$.
- Selecting $\mathcal{T}(x)$ as the set of size $k$ corresponding to the top-$k$ scores for each token $x$.
- Computing normalized gates $g_j(x)$ for $j \in \mathcal{T}(x)$ via softmax over $\{s_j(x) : j \in \mathcal{T}(x)\}$; $g_j(x) = 0$ otherwise.
- Aggregating expert outputs as $y = \sum_{j \in \mathcal{T}(x)} g_j(x)\, E_j(x)$.
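The gating steps above can be sketched numerically; this is a minimal NumPy illustration of hard top-k gate computation, not any particular framework's implementation:

```python
import numpy as np

def top_k_gate(scores: np.ndarray, k: int) -> np.ndarray:
    """Hard top-k gating: for each token (row), keep the k highest expert
    scores, renormalize them with a softmax, and zero out the rest."""
    n_tokens, _ = scores.shape
    gates = np.zeros_like(scores)
    # indices of the k best experts per token (unordered within the top-k)
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    rows = np.arange(n_tokens)[:, None]
    picked = scores[rows, topk]                              # (n_tokens, k)
    picked = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates[rows, topk] = picked / picked.sum(axis=1, keepdims=True)
    return gates

scores = np.array([[2.0, 1.0, 0.5, -1.0],
                   [0.0, 3.0, 2.5,  0.1]])
g = top_k_gate(scores, k=2)
# each row has exactly 2 nonzero gates that sum to 1
```

A true MoE layer would then dispatch each token only to its $k$ selected experts rather than materializing the dense gate matrix.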
Under a capacity constraint of $C$ tokens per expert, hard top-k routing induces two inefficiencies:
- Overflow (dropped tokens): more than $C$ tokens are routed to an expert; the surplus are dropped, leading to ineffective compute utilization and potential accuracy drop.
- Vacancy (padding): fewer than $C$ tokens are routed to an expert, which triggers padding (often with zeros), wasting compute and harming model utility.
To address these, “Rectify-Router” introduces:
- Intra-GPU Rectification: Reassign overflowed tokens to alternative experts residing on the same GPU, provided they have spare capacity, avoiding costly inter-GPU communication and maximizing local utilization.
- Fill-in Rectification: Fill vacant slots at underutilized experts with next-best tokens (those ranking the expert $(k{+}1)$-th in their preference lists), increasing effective compute utilization and restoring gradient flow to informative tokens.
Combined, these methods recover nearly all "wasted" compute without altering expert capacity or incurring extra interconnect cost. Experimental results indicate a notable increase in top-1 routing accuracy on LLaMA-MoE 7B, with minimal training-speed penalty and inference slowdown (Zeng et al., 17 Feb 2024).
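The overflow-handling idea can be illustrated with a toy capacity-aware assignment; this sketch is a simplification of intra-GPU rectification (real systems restrict the fallback set to experts co-located on the same device, and all names here are hypothetical):

```python
def rectified_assign(prefs, capacity, n_experts):
    """Toy capacity-aware assignment: each token lists experts in preference
    order; a token that overflows its first choice is reassigned to its next
    preferred expert with spare capacity instead of being dropped."""
    load = [0] * n_experts          # tokens currently placed on each expert
    assignment = {}
    for tok, pref in enumerate(prefs):
        for e in pref:              # walk the token's preference list
            if load[e] < capacity:
                assignment[tok] = e
                load[e] += 1
                break               # token placed; nothing is dropped
    return assignment, load

# four tokens all preferring expert 0, each with fallback expert 1
prefs = [[0, 1], [0, 1], [0, 1], [0, 1]]
assignment, load = rectified_assign(prefs, capacity=2, n_experts=2)
# naive routing would drop two tokens; rectification places all four
```

Fill-in rectification would run a symmetric pass afterwards, pulling next-best tokens into any slots that remain vacant.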
3. Combinatorial Hardness and Routing on Graphs
Hard top-k routing generalizes classic combinatorial problems such as path planning with attribute or order constraints. The top-k optimal sequenced routes (KOSR) problem, for example, seeks the $k$ least-cost routes from source to destination that visit one vertex in each of a prescribed category sequence in order (Liu et al., 2018). This extends classic single-best-sequenced routing to the top-$k$ regime and is NP-hard when the category visitation order is unconstrained.
Practical algorithms for hard top-k routing in graphs include:
- Dominance-pruning based expansion (PruningKOSR): Avoids redundant exploration by maintaining, for each partial route terminating at a vertex, only those not strictly dominated (i.e., beaten on cost) by other partials of the same length and endpoint. This contracts the search space sharply relative to the full combinatorial blowup.
- A*-style search with admissible heuristic (StarKOSR): Embeds the above in an A* search prioritizing completions with minimum estimated cost-to-go, leveraging two-hop labeling for constant-time distance queries.
Performance evaluations on large road networks show that StarKOSR reduces top-$k$ query times from hour-scale to sub-second, outpacing baselines by several orders of magnitude (Liu et al., 2018).
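The flavor of best-first top-$k$ search can be conveyed by a much simpler cousin of these algorithms: finding the costs of the $k$ cheapest walks between two vertices by allowing each vertex to be settled up to $k$ times. This sketch omits KOSR's category-sequence constraint and heuristic entirely; the graph data is invented for illustration:

```python
import heapq

def k_shortest_walk_costs(graph, src, dst, k):
    """Best-first search returning the costs of the k cheapest src->dst
    walks; each vertex may be popped up to k times, a bounded-label idea
    loosely analogous to keeping k non-dominated partial routes."""
    heap = [(0, src)]
    pops = {}                       # how many times each vertex was settled
    costs = []
    while heap and len(costs) < k:
        c, v = heapq.heappop(heap)
        pops[v] = pops.get(v, 0) + 1
        if pops[v] > k:
            continue                # label bound reached: prune
        if v == dst:
            costs.append(c)
            continue
        for w, cw in graph.get(v, []):
            heapq.heappush(heap, (c + cw, w))
    return costs

graph = {'a': [('b', 1), ('c', 4)],
         'b': [('c', 1), ('d', 5)],
         'c': [('d', 1)]}
costs = k_shortest_walk_costs(graph, 'a', 'd', 3)
# cheapest three walks: a->b->c->d (3), a->c->d (5), a->b->d (6)
```

StarKOSR layers an admissible cost-to-go heuristic (via two-hop labeling) on top of this kind of expansion, which is what makes it A*-like.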
4. Submodular and Feature-aware Generalizations
In route recommendation and planning with explicit diversity, feature matching, or user personalization objectives, hard top-k routing is further complicated by set-based, submodular utility functions. The "top-k route search through submodularity modeling of recurrent POI features" problem formalizes route selection as the maximization of a monotone submodular gain (such as diversity of visited features) under cost or length constraints (Liang et al., 2017). The gain of a route $P$, viewed as a set of POIs, is defined as:

$$\mathrm{Gain}(P) = \sum_{f} w_f \, g_f(P),$$

with each $g_f$ submodular and monotone for feature $f$, and $w_f$ user-supplied weights.
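A minimal concrete instance of such a gain is weighted feature coverage; the saturating "covered once" aggregation below is an illustrative stand-in for the paper's recurrent-feature model, and the POI data is invented:

```python
def coverage_gain(route, poi_features, weights):
    """Weighted-coverage gain: a feature contributes its weight once if any
    POI on the route offers it.  Coverage is monotone and submodular, so it
    is a valid (if minimal) stand-in for the per-feature gains g_f."""
    covered = set().union(*(poi_features[p] for p in route)) if route else set()
    return sum(weights[f] for f in covered)

poi_features = {'museum': {'art'}, 'cafe': {'food'}, 'gallery': {'art', 'food'}}
weights = {'art': 2.0, 'food': 1.0}

g1 = coverage_gain(['museum'], poi_features, weights)             # 2.0
g2 = coverage_gain(['museum', 'gallery'], poi_features, weights)  # 3.0
# diminishing returns: 'gallery' alone is worth 3.0, but adds only 1.0
# after 'museum' already covers 'art'
```

The diminishing marginal contribution shown in the comment is exactly the submodularity property that PACER's upper-bound pruning exploits.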
Exact algorithms such as PACER employ:
- Stateful enumeration: Each compact state is defined by a subset of candidate POIs; store for each such set the best gain and costs associated with particular endpoints.
- Two-layer pruning: (i) Cost-dominance pruning—discard partials with higher cost for the same visited set and endpoint; (ii) Gain upper-bound pruning—if the best-completable gain from a state cannot beat the current $k$-th best, prune further expansion using a continuous greedy bound on the remaining budget.
- Heuristic variants (PACER-SC, greedy insertion) trade off optimality for speed on large candidate sets, achieving near-optimal results (solution quality within 95–99% of optimal) at a fraction of the exact algorithm's runtime.
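A greedy-insertion heuristic of the kind mentioned above can be sketched as follows; the details (ratio-based selection, the toy gain, all data) are assumptions for illustration rather than PACER's exact procedure:

```python
def greedy_route(pois, costs, budget, gain_fn):
    """Greedy insertion under a budget: repeatedly add the affordable POI
    with the highest marginal gain per unit cost, stopping when nothing
    improves or nothing fits."""
    route, spent = [], 0
    remaining = set(pois)
    while remaining:
        base = gain_fn(route)
        best, best_ratio = None, 0.0
        for p in remaining:
            if spent + costs[p] > budget:
                continue                      # does not fit the budget
            ratio = (gain_fn(route + [p]) - base) / costs[p]
            if ratio > best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            break
        route.append(best)
        spent += costs[best]
        remaining.discard(best)
    return route, spent

# toy instance: gain = number of distinct features covered (invented data)
poi_features = {'museum': {'art'}, 'cafe': {'food'}, 'park': {'nature'}}
gain = lambda r: len(set().union(set(), *(poi_features[p] for p in r)))
route, spent = greedy_route(poi_features, {'museum': 2, 'cafe': 1, 'park': 3},
                            budget=3, gain_fn=gain)
# picks 'cafe' (best ratio), then 'museum'; 'park' no longer fits
```

For monotone submodular gains, greedy strategies of this shape carry classical approximation guarantees, which is why they remain near-optimal in practice.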
This submodular route-search framework provides provable optimality for small to moderate budget queries and near-optimality at scale (Liang et al., 2017).
5. Complexity, Scalability, and Trade-offs
The fundamental obstacle in hard top-k routing problems is the combinatorial explosion in candidate selections and assignments:
| Scenario | Baseline Complexity | Advanced Algorithmic Complexity | Typical Empirical Speedup |
|---|---|---|---|
| MoE Sparse Routing ($N$ tokens, $M$ experts) | $O(NM)$ gating + all-to-all dispatch + per-expert compute | $O(NM)$ gating + local top-$k$ select and local rectification | negligible overhead for near-full utilization (Zeng et al., 17 Feb 2024) |
| KOSR on Graphs ($l$ categories) | exponential in $l$ in the worst case | dominance and A* pruning (Liu et al., 2018) | orders of magnitude faster |
| Submodular Routing ($n$ POIs, route length $\le p$) | $O(n^p)$ brute force | stateful enumeration with pruning (PACER) | $\ge 10\times$ (PACER+2 vs. brute force) (Liang et al., 2017) |
Empirically, applying dominance and upper-bound pruning, as well as local rectification (for hard capacity), retains almost all the possible “value” (accuracy in MoE, gain in route search), while transforming intractable problems to those solvable in milliseconds to seconds for real-world data sizes.
6. Practical Considerations and Open Questions
Hard top-k routing solutions, while effective in their designated regimes, raise several open technical questions and avenues for further research:
- Resource and locality-aware scheduling: In MoE, extending intra-GPU rectification to broader topologies, such as inter-rack communication where cost is still low, could further improve utilization (Zeng et al., 17 Feb 2024).
- Adaptive and dynamic top-k selection: Recent work moves beyond a static $k$ to an adaptive or learnable $k$ per instance, though the detailed algorithms for such schemes fall outside the scope of the hard static-$k$ paradigm discussed here (Yue et al., 14 Oct 2024).
- Capacity-aware gating and fair allocation: Endowing routing networks with the ability to dynamically predict and accommodate variable slot availability could further reduce overflows and vacancies (Zeng et al., 17 Feb 2024).
- Submodular knapsack and personalized diversity: Combining monotonicity and diminishing returns in set objectives complicates route enumeration and suggests a need for further theoretical work on scalable (approximate) search or amortized index-based approaches (Liang et al., 2017).
- Scalability to larger or more dynamic contexts: Cache size, two-hop label pre-processing for graphs, and routing overhead for massive (or rapidly evolving) networks remain bottlenecks, suggesting potential for parallel, distributed, or streaming algorithmic extensions.
7. Representative Algorithms and Empirical Results
Key algorithmic contributions and experimental findings include:
- Rectify-Router for MoE top-k routing: Recovers nearly 100% of expert token utilization, increases top-1 routing accuracy by 4.7%, with negligible runtime and no extra parameter cost (Zeng et al., 17 Feb 2024).
- StarKOSR for top-k sequenced routes: Achieves sub-second top-30 answer times on large road-network graphs, with a pruned search space two to four orders of magnitude smaller than full enumeration (Liu et al., 2018).
- PACER and variants for submodular POI routing: PACER+2 computes exact top-$k$ routes on city-scale queries within seconds for moderate budgets; PACER-SC and greedy variants provide near-optimal answers in sub-second time for larger-scale or more relaxed constraints, with systematic analysis of the trade-offs between speed and gain (Liang et al., 2017).
Taken together, the evolution of hard top-k routing methodology links the discrete resource allocation perspective from deep learning with combinatorial optimization in graphs and submodular maximization, yielding unified insights into efficient, scalable selection under strict allocation constraints.