
SeqTopK: Sequence-Level TopK Algorithms

Updated 21 December 2025
  • SeqTopK is a framework that selects top-K elements across entire sequences rather than per-token decisions, enabling globally optimized routing and privacy-aware data release.
  • It leverages sequence-level top-K identification to dynamically allocate resources and improve expert routing with minimal computational overhead.
  • Empirical and theoretical analyses demonstrate that SeqTopK scales effectively for tasks like mixture-of-experts, differentially private releases, and spatio-temporal pattern mining.

Sequence-Level TopK (SeqTopK) refers to a set of algorithms and methodologies designed to select or mine the top-$k$ entities (such as experts, items, or patterns) from data arranged as sequences, rather than making independent per-token or per-step selections. The sequence-level formulation allows dynamic resource allocation, joint privacy-preserving selection, and efficient mining across various domains including mixture-of-experts routing, differentially private data release, spatio-temporal event mining, and graph-based sequential pattern discovery. This article provides a comprehensive analysis of SeqTopK approaches, their algorithmic foundations, integration strategies, and empirical properties as documented in the referenced literature.

1. Algorithmic Foundations and Mathematical Formulation

SeqTopK generalizes the traditional TopK selection paradigm by treating the sequence as a unified entity. In mixture-of-experts (MoE) architectures, per-token TopK routing allocates exactly $K$ experts to each token $t$ in a sequence of length $T$, activating $T \cdot K$ experts in total. SeqTopK instead selects the top $T \cdot K$ (token, expert) pairs from the complete $T \times E$ score matrix, performing a budget-constrained optimization across the sequence (Wen et al., 9 Nov 2025).

Formally, for routing logits $L_t \in \mathbb{R}^E$ per token, softmax produces scores $S_t$; the top $T \cdot K$ entries of $\mathrm{vec}(S) \in \mathbb{R}^{TE}$ are selected, yielding a sparse routing mask $R$:

$$R_{t,e} = \begin{cases} 1 & \text{if } (t,e) \text{ is among the top } T \cdot K \text{ entries of } S, \\ 0 & \text{otherwise.} \end{cases}$$
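A minimal numpy sketch of this mask construction (illustrative only; the function name `seqtopk_mask` is ours, not from the paper, and MoE frameworks would use the analogous tensor ops):

```python
import numpy as np

def seqtopk_mask(scores, K):
    """Sequence-level TopK: mark the top T*K (token, expert) pairs
    in the full T x E score matrix, rather than K per token."""
    T, E = scores.shape
    flat = scores.ravel()
    top = np.argsort(flat)[::-1][: T * K]   # indices of the top T*K scores
    mask = np.zeros(T * E, dtype=bool)
    mask[top] = True
    return mask.reshape(T, E)

rng = np.random.default_rng(0)
S = rng.random((4, 8))        # T=4 tokens, E=8 experts
R = seqtopk_mask(S, K=2)      # total budget T*K = 8 activations
```

Note that `R.sum(axis=1)` may now differ across tokens: the budget is enforced globally, not per token.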

In differentially private TopK release, SeqTopK employs a joint exponential mechanism over all length-$k$ sequences $s$ drawn from $d$ elements, weighting each sequence by its maximum loss relative to the true top-$k$ values. The mechanism samples $s$ with probability proportional to $\exp(-\epsilon L(s)/2)$, where $L(s) = \max_j \left[ h_{(j)} - h[s[j]] \right]$ and $h_{(j)}$ denotes the $j$-th largest count (Gillenwater et al., 2022, Wu et al., 2024).
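A brute-force sketch of this joint mechanism for tiny inputs (it enumerates all length-$k$ permutations, so it is only feasible for very small $d$ and $k$; the cited papers give the efficient samplers, and the function name here is ours):

```python
import itertools
import math
import random

def joint_em_topk(h, k, eps, rng=random.Random(0)):
    """Joint exponential mechanism over length-k sequences of indices:
    sample s with probability proportional to exp(-eps * L(s) / 2),
    where L(s) = max_j [ h_(j) - h[s[j]] ] is the max loss vs. the
    true top-k counts. Brute force, for illustration only."""
    d = len(h)
    sorted_h = sorted(h, reverse=True)    # h_(1) >= h_(2) >= ...
    seqs, weights = [], []
    for s in itertools.permutations(range(d), k):
        L = max(sorted_h[j] - h[s[j]] for j in range(k))
        seqs.append(s)
        weights.append(math.exp(-eps * L / 2))
    return rng.choices(seqs, weights=weights, k=1)[0]

counts = [50, 40, 30, 5, 1]
s = joint_em_topk(counts, k=2, eps=2.0)   # likely (0, 1) given the gaps
```

Because the whole sequence is sampled at once, there is no peeling-style composition over $k$ rounds.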

In pattern mining, SeqTopK denotes the discovery of the $K$ most significant sequential patterns over event instances or database graphs, leveraging global ranking metrics such as DensityRatio or support over induced sequences (Maciąg, 2017, Lei et al., 2018).

2. Integration in Mixture-of-Experts Models

SeqTopK routing is implemented as a two-line code change in MoE frameworks. Rather than selecting $K$ experts per token via `scores[t, :].topk(K)`, one performs a sequence-level top-k via `scores.flatten(0, 1).topk(T * K)`, mapping the flat indices back to their (token, expert) coordinates (Wen et al., 9 Nov 2025).

This modification is parameter-free: no additional gating logic, thresholds, or losses are introduced. Crucially, pretrained MoE models configured for standard TopK can be fine-tuned to exploit SeqTopK's dynamic allocation in hundreds to thousands of steps. The overhead in training, pre-training, and decoding is consistently below 1%, as confirmed by memory and computation analysis.
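The index mapping behind the two-line change can be sketched in numpy (the paper's implementation uses the analogous `torch.topk` calls; this variable naming is ours):

```python
import numpy as np

T, E, K = 3, 4, 2
rng = np.random.default_rng(1)
scores = rng.random((T, E))

# Standard routing: K experts per token, row by row.
per_token = np.argsort(scores, axis=1)[:, ::-1][:, :K]

# SeqTopK: top T*K entries of the flattened matrix, then map each
# flat index i = t*E + e back to (t, e) via divmod.
flat_idx = np.argsort(scores.ravel())[::-1][: T * K]
pairs = [divmod(int(i), E) for i in flat_idx]    # (token, expert) pairs
```

The total number of activated pairs is identical in both schemes; only their distribution over tokens changes.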

3. Dynamic Resource Allocation and Statistical Behavior

SeqTopK's key property is adaptive expert allocation within a global budget. Under sequence-level competition, tokens with higher complexity (quantified by the output entropy $H_t = -\sum_j p_{t,j} \log p_{t,j}$) capture more expert slots, while easy tokens may receive fewer than $K$ experts. Empirical studies report correlations above 0.7 between token entropy and expert allocation, with practical distributions spanning 1–2 experts for low-entropy tokens to 10–18 for high-entropy tokens (Wen et al., 9 Nov 2025).
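The entropy measure used here is straightforward to compute per token (a minimal sketch with made-up probability rows):

```python
import numpy as np

def token_entropy(p):
    """Per-token output entropy H_t = -sum_j p_{t,j} * log(p_{t,j})."""
    p = np.clip(p, 1e-12, None)      # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

probs = np.array([
    [0.97, 0.01, 0.01, 0.01],   # confident prediction -> low entropy
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain  -> high entropy
])
H = token_entropy(probs)         # H[1] = ln(4), the maximum for 4 classes
```

Under SeqTopK, the second (high-entropy) token would tend to win more of the shared expert budget than the first.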

In differentially private TopK release, sequence-level approaches avoid composition losses and allow utility functions that exploit order and joint structure, resulting in improved statistical fidelity for moderate $k$ values (Gillenwater et al., 2022, Wu et al., 2024).

SeqTopK pattern mining in spatio-temporal and database graph contexts utilizes global ranking (e.g., minimum DensityRatio across patterns, support over all induced sequences) combined with depth-first exploration and dynamic pruning (Maciąg, 2017, Lei et al., 2018).

4. Computational Complexity and Sampling Algorithms

Efficient sampling is central to SeqTopK. In DP TopK, Gillenwater et al. instantiate the exponential mechanism over $O(d^k)$ sequences but reduce sampling to $O(dk \log k + d \log d)$ time and $O(dk)$ space by leveraging utility value structuring and combinatorial arguments (Gillenwater et al., 2022). FastJoint (Wu et al., 2024) further improves this to $O\left(d + \frac{k^2}{\epsilon} \ln d\right)$ via group-by partitioning and loss truncation, with theoretical guarantees.

In graph pattern mining, SeqTopK overcomes the combinatorial explosion of induced sequence databases by using two-step sampling: (1) uniform path sampling based on vertex degree and reachability, and (2) uniform transaction-sequence sampling per path. The weighted frequency estimator $\hat{f}(s)$ provides unbiased estimates, and with $m = O((l+1)\ln|Q_l|/\epsilon^2)$ samples, simultaneous approximation of all pattern frequencies is guaranteed (Lei et al., 2018).
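The sampling-based estimation principle can be illustrated on a flat transaction database (a simplification of the paper's graph-specific two-step sampler: here transactions are sampled uniformly, whereas the paper additionally reweights non-uniform path samples; all names below are ours):

```python
import random

def estimate_freq(database, contains_pattern, m, rng=random.Random(0)):
    """Monte Carlo frequency estimate: sample m transactions uniformly
    and average the indicator of pattern containment. The estimator is
    unbiased, and its error concentrates as m grows."""
    hits = sum(contains_pattern(rng.choice(database)) for _ in range(m))
    return hits / m

db = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
contains_ab = lambda t: "a" in t and "b" in t   # pattern {a, b}, true freq 2/4
est = estimate_freq(db, contains_ab, m=2000)
```

With $m$ scaled as in the bound above, a union bound over the pattern space $Q_l$ gives the simultaneous approximation guarantee.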

Event-based spatio-temporal SeqTopK leverages spatial joins and recursive expansion, maintaining a top-K ranked list and applying dynamic thresholding (SeqIndex) for efficient pruning (Maciąg, 2017).
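The top-K list with dynamic-threshold pruning can be sketched with a min-heap (an illustration of the pruning idea; the actual SeqIndex structure in Maciąg (2017) is more involved, and the function name is ours):

```python
import heapq

def topk_with_threshold(candidates, K):
    """Maintain a running top-K list: a candidate survives only if its
    score beats the current K-th best (the dynamic pruning threshold)."""
    heap = []                             # min-heap of current top-K
    for score, item in candidates:
        if len(heap) < K:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:          # beats current threshold
            heapq.heapreplace(heap, (score, item))
        # else: candidate pruned without further expansion
    return sorted(heap, reverse=True)

cands = [(0.3, "p1"), (0.9, "p2"), (0.5, "p3"), (0.1, "p4"), (0.7, "p5")]
top2 = topk_with_threshold(cands, K=2)    # [(0.9, 'p2'), (0.7, 'p5')]
```

In the spatio-temporal setting, the same threshold also cuts off recursive expansion of candidate sequences whose best attainable score cannot enter the top-K.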

5. Empirical Performance and Use Cases

SeqTopK routing demonstrates consistent gains in LLM benchmarks on math, coding, law, and writing. On OLMoE-A1B-7B, SeqTopK yields a 5.9% improvement at $K=8$, rising to 12.5% at $K=4$ and 7.5% at $K=2$, with gains reaching up to 16.9% under extreme sparsity ($K=2$) (Wen et al., 9 Nov 2025).

In DP TopK, joint exponential mechanisms achieve lower $\ell_\infty$ and $\ell_1$ errors than peeling and approximate protocols, especially when natural count gaps are present, and runtime remains manageable up to $k \sim 200$ at $d \sim 10^5$ (Gillenwater et al., 2022, Wu et al., 2024).

SeqTopK pattern mining on synthetic and real datasets verifies near-perfect ranking similarity and sublinear runtime scaling in $K$, with practical applicability up to moderate sequence lengths and graph sizes (Maciąg, 2017, Lei et al., 2018).

6. Relation to Prior Work and Methodological Distinctions

SeqTopK differs substantively from per-token TopK routing, peeling-based private selection, and parameter-heavy adaptive gating schemes. It maintains strict budget compliance, requires zero new parameters, and is robust to extreme sparsity regimes, a critical trait for next-generation sparse models and privacy-preserving data release.

7. Scalability, Theoretical Guarantees, and Implementation Notes

SeqTopK methods scale linearly or near-linearly in the number of items/domains ($d$), experts ($E$), and sequence length ($T$). Overhead from sequence-level routing or sampling remains sublinear in practical scenarios, confirmed empirically and theoretically. Privacy proofs for joint mechanisms follow directly from exponential-mechanism theory; pattern estimation guarantees derive from concentration inequalities.

Implementation in MoE requires minimal adjustment. In the graph setting, precomputing reachability and leveraging standard pattern miners suffice; for private TopK, group-size computation and sample reconstruction use basic combinatorics, with pseudocode supplied in the references for direct use.

SeqTopK's sequence-centric perspective unifies adaptive allocation, privacy, and pattern discovery, offering demonstrable advantages over traditional TopK and parameter-heavy adaptive schemes. The approach has become integral in scalable language modeling, high-utility privacy release, and pattern mining in high-dimensional or relational settings (Wen et al., 9 Nov 2025, Gillenwater et al., 2022, Wu et al., 2024, Maciąg, 2017, Lei et al., 2018).
