
SeqTopK: Sequence-Level TopK Algorithms

Updated 21 December 2025
  • SeqTopK is a framework that selects top-K elements across entire sequences rather than per-token decisions, enabling globally optimized routing and privacy-aware data release.
  • It leverages sequence-level top-K identification to dynamically allocate resources and improve expert routing with minimal computational overhead.
  • Empirical and theoretical analyses demonstrate that SeqTopK scales effectively for tasks like mixture-of-experts, differentially private releases, and spatio-temporal pattern mining.

Sequence-Level TopK (SeqTopK) refers to a set of algorithms and methodologies designed to select or mine the top-$k$ entities (such as experts, items, or patterns) from data arranged as sequences, rather than making independent per-token or per-step selections. The sequence-level formulation allows dynamic resource allocation, joint privacy-preserving selection, and efficient mining across various domains including mixture-of-experts routing, differentially private data release, spatio-temporal event mining, and graph-based sequential pattern discovery. This article provides a comprehensive analysis of SeqTopK approaches, their algorithmic foundations, integration strategies, and empirical properties as documented in the referenced literature.

1. Algorithmic Foundations and Mathematical Formulation

SeqTopK generalizes the traditional TopK selection paradigm by treating the sequence as a unified entity. In mixture-of-experts (MoE) architectures, per-token TopK routing allocates exactly $K$ experts to each token $t$ in a sequence of length $T$, activating $T \cdot K$ experts in total. SeqTopK instead selects the top $T \cdot K$ (token, expert) pairs from the complete $T \times E$ score matrix, performing a budget-constrained optimization across the sequence (Wen et al., 9 Nov 2025).

Formally, for routing logits $L_t \in \mathbb{R}^E$ per token, softmax produces scores $S_t$; the top $T \cdot K$ entries of $\mathrm{vec}(S) \in \mathbb{R}^{TE}$ are selected, yielding a sparse routing mask $R$:

$$R_{t,e} = \begin{cases} 1 & \text{if } (t,e) \text{ is among the top } T \cdot K \text{ entries of } S, \\ 0 & \text{otherwise.} \end{cases}$$
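A minimal numpy sketch of this mask construction (illustrative only; the function name `seqtopk_mask` is ours, not from the paper, and MoE frameworks would use the analogous tensor ops):

```python
import numpy as np

def seqtopk_mask(scores, K):
    """Sequence-level TopK: mark the top T*K (token, expert) pairs
    in the full T x E score matrix, rather than K per token."""
    T, E = scores.shape
    flat = scores.ravel()
    top = np.argsort(flat)[::-1][: T * K]   # indices of the top T*K scores
    mask = np.zeros(T * E, dtype=bool)
    mask[top] = True
    return mask.reshape(T, E)

rng = np.random.default_rng(0)
S = rng.random((4, 8))        # T=4 tokens, E=8 experts
R = seqtopk_mask(S, K=2)      # total budget T*K = 8 activations
```

Note that `R.sum(axis=1)` may now differ across tokens: the budget is enforced globally, not per token.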

In differentially private TopK release, SeqTopK employs a joint exponential mechanism over all length-$k$ sequences $s$ drawn from $d$ elements, weighting each sequence by its maximum loss relative to the true top-$k$ values. The mechanism samples $s$ with probability proportional to $\exp(-\epsilon L(s)/2)$, where $L(s) = \max_j \left[ h_{(j)} - h[s[j]] \right]$ and $h_{(j)}$ denotes the $j$-th largest count (Gillenwater et al., 2022, Wu et al., 2024).
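A brute-force sketch of this joint mechanism for tiny inputs (it enumerates all length-$k$ permutations, so it is only feasible for very small $d$ and $k$; the cited papers give the efficient samplers, and the function name here is ours):

```python
import itertools
import math
import random

def joint_em_topk(h, k, eps, rng=random.Random(0)):
    """Joint exponential mechanism over length-k sequences of indices:
    sample s with probability proportional to exp(-eps * L(s) / 2),
    where L(s) = max_j [ h_(j) - h[s[j]] ] is the max loss vs. the
    true top-k counts. Brute force, for illustration only."""
    d = len(h)
    sorted_h = sorted(h, reverse=True)    # h_(1) >= h_(2) >= ...
    seqs, weights = [], []
    for s in itertools.permutations(range(d), k):
        L = max(sorted_h[j] - h[s[j]] for j in range(k))
        seqs.append(s)
        weights.append(math.exp(-eps * L / 2))
    return rng.choices(seqs, weights=weights, k=1)[0]

counts = [50, 40, 30, 5, 1]
s = joint_em_topk(counts, k=2, eps=2.0)   # likely (0, 1) given the gaps
```

Because the whole sequence is sampled at once, there is no peeling-style composition over $k$ rounds.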

In pattern mining, SeqTopK denotes the discovery of the $K$ most significant sequential patterns over event instances or database graphs, leveraging global ranking metrics such as DensityRatio or support over induced sequences (Maciąg, 2017, Lei et al., 2018).

2. Integration in Mixture-of-Experts Models

SeqTopK routing is implemented as a two-line code change in MoE frameworks. Rather than selecting $K$ experts per token via `scores[t, :].topk(K)`, one performs a sequence-level top-k via `scores.flatten(0, 1).topk(T * K)`, mapping the flat indices back to their (token, expert) coordinates (Wen et al., 9 Nov 2025).

This modification is parameter-free: no additional gating logic, thresholds, or losses are introduced. Crucially, pretrained MoE models configured for standard TopK can be fine-tuned to exploit SeqTopK's dynamic allocation in hundreds to thousands of steps. The overhead in training, pre-training, and decoding is consistently below 1%, as confirmed by memory and computation analysis.
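The index mapping behind the two-line change can be sketched in numpy (the paper's implementation uses the analogous `torch.topk` calls; this variable naming is ours):

```python
import numpy as np

T, E, K = 3, 4, 2
rng = np.random.default_rng(1)
scores = rng.random((T, E))

# Standard routing: K experts per token, row by row.
per_token = np.argsort(scores, axis=1)[:, ::-1][:, :K]

# SeqTopK: top T*K entries of the flattened matrix, then map each
# flat index i = t*E + e back to (t, e) via divmod.
flat_idx = np.argsort(scores.ravel())[::-1][: T * K]
pairs = [divmod(int(i), E) for i in flat_idx]    # (token, expert) pairs
```

The total number of activated pairs is identical in both schemes; only their distribution over tokens changes.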

3. Dynamic Resource Allocation and Statistical Behavior

SeqTopK's key property is adaptive expert allocation within a global budget. Under sequence-level competition, tokens with higher complexity (quantified by the output entropy $H_t = -\sum_j p_{t,j} \log p_{t,j}$) capture more expert slots, while easy tokens may receive fewer than $K$ experts. Empirical studies report correlations above 0.7 between token entropy and expert allocation, with practical distributions spanning 1–2 experts for low-entropy tokens to 10–18 for high-entropy tokens (Wen et al., 9 Nov 2025).
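The entropy measure used here is straightforward to compute per token (a minimal sketch with made-up probability rows):

```python
import numpy as np

def token_entropy(p):
    """Per-token output entropy H_t = -sum_j p_{t,j} * log(p_{t,j})."""
    p = np.clip(p, 1e-12, None)      # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

probs = np.array([
    [0.97, 0.01, 0.01, 0.01],   # confident prediction -> low entropy
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain  -> high entropy
])
H = token_entropy(probs)         # H[1] = ln(4), the maximum for 4 classes
```

Under SeqTopK, the second (high-entropy) token would tend to win more of the shared expert budget than the first.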

In differentially private TopK release, sequence-level approaches avoid composition losses and allow utility functions that exploit order and joint structure, resulting in improved statistical fidelity for moderate $k$ values (Gillenwater et al., 2022, Wu et al., 2024).

SeqTopK pattern mining in spatio-temporal and database graph contexts utilizes global ranking (e.g., minimum DensityRatio across patterns, support over all induced sequences) combined with depth-first exploration and dynamic pruning (Maciąg, 2017, Lei et al., 2018).

4. Computational Complexity and Sampling Algorithms

Efficient sampling is central to SeqTopK. In DP TopK, Gillenwater et al. instantiate the exponential mechanism over $O(d^k)$ sequences but reduce sampling to $O(dk \log k + d \log d)$ time and $O(dk)$ space by leveraging utility value structuring and combinatorial arguments (Gillenwater et al., 2022). FastJoint (Wu et al., 2024) further improves this to $O\left(d + \frac{k^2}{\epsilon} \ln d\right)$ via group-by partitioning and loss truncation, with theoretical guarantees.

In graph pattern mining, SeqTopK overcomes the combinatorial explosion of induced sequence databases by using two-step sampling: (1) uniform path sampling based on vertex degree and reachability, and (2) uniform transaction-sequence sampling per path. The weighted frequency estimator $\hat{f}(s)$ provides unbiased estimates, and with $m = O((l+1)\ln|Q_l|/\epsilon^2)$ samples, simultaneous approximation of all pattern frequencies is guaranteed (Lei et al., 2018).
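The sampling-based estimation principle can be illustrated on a flat transaction database (a simplification of the paper's graph-specific two-step sampler: here transactions are sampled uniformly, whereas the paper additionally reweights non-uniform path samples; all names below are ours):

```python
import random

def estimate_freq(database, contains_pattern, m, rng=random.Random(0)):
    """Monte Carlo frequency estimate: sample m transactions uniformly
    and average the indicator of pattern containment. The estimator is
    unbiased, and its error concentrates as m grows."""
    hits = sum(contains_pattern(rng.choice(database)) for _ in range(m))
    return hits / m

db = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
contains_ab = lambda t: "a" in t and "b" in t   # pattern {a, b}, true freq 2/4
est = estimate_freq(db, contains_ab, m=2000)
```

With $m$ scaled as in the bound above, a union bound over the pattern space $Q_l$ gives the simultaneous approximation guarantee.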

Event-based spatio-temporal SeqTopK leverages spatial joins and recursive expansion, maintaining a top-K ranked list and applying dynamic thresholding (SeqIndex) for efficient pruning (Maciąg, 2017).
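The top-K list with dynamic-threshold pruning can be sketched with a min-heap (an illustration of the pruning idea; the actual SeqIndex structure in Maciąg (2017) is more involved, and the function name is ours):

```python
import heapq

def topk_with_threshold(candidates, K):
    """Maintain a running top-K list: a candidate survives only if its
    score beats the current K-th best (the dynamic pruning threshold)."""
    heap = []                             # min-heap of current top-K
    for score, item in candidates:
        if len(heap) < K:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:          # beats current threshold
            heapq.heapreplace(heap, (score, item))
        # else: candidate pruned without further expansion
    return sorted(heap, reverse=True)

cands = [(0.3, "p1"), (0.9, "p2"), (0.5, "p3"), (0.1, "p4"), (0.7, "p5")]
top2 = topk_with_threshold(cands, K=2)    # [(0.9, 'p2'), (0.7, 'p5')]
```

In the spatio-temporal setting, the same threshold also cuts off recursive expansion of candidate sequences whose best attainable score cannot enter the top-K.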

5. Empirical Performance and Use Cases

SeqTopK routing demonstrates consistent gains in LLM benchmarks on math, coding, law, and writing. On OLMoE-A1B-7B, SeqTopK yields a 5.9% improvement at $K=8$, rising to 12.5% at $K=4$ and 7.5% at $K=2$, with gains reaching up to 16.9% under extreme sparsity ($K=2$) (Wen et al., 9 Nov 2025).

In DP TopK, joint exponential mechanisms achieve lower $\ell_\infty$ and $\ell_1$ errors than peeling and approximate protocols, especially when natural count gaps are present, and runtime remains manageable up to $k \sim 200$ at $d \sim 10^5$ (Gillenwater et al., 2022, Wu et al., 2024).

SeqTopK pattern mining on synthetic and real datasets verifies near-perfect ranking similarity and sublinear runtime scaling in $K$, with practical applicability up to moderate sequence lengths and graph sizes (Maciąg, 2017, Lei et al., 2018).

6. Relation to Prior Work and Methodological Distinctions

SeqTopK differs substantively from per-token TopK routing, peeling-based private selection, and parameter-heavy adaptive gating schemes. It maintains strict budget compliance, requires zero new parameters, and is robust to extreme sparsity regimes, a critical trait for next-generation sparse models and privacy-preserving data release.

7. Scalability, Theoretical Guarantees, and Implementation Notes

SeqTopK methods scale linearly or near-linearly in the number of items/domains ($d$), experts ($E$), and sequence length ($T$). Overhead from sequence-level routing or sampling remains sublinear in practical scenarios, confirmed empirically and theoretically. Privacy proofs for joint mechanisms follow directly from exponential-mechanism theory; pattern estimation guarantees derive from concentration inequalities.

Implementation in MoE requires minimal adjustment. In the graph setting, precomputing reachability and leveraging standard pattern miners suffice; for private TopK, group-size computation and sample reconstruction use basic combinatorics, with pseudocode supplied in the references for direct use.

SeqTopK's sequence-centric perspective unifies adaptive allocation, privacy, and pattern discovery, offering demonstrable advantages over traditional TopK and parameter-heavy adaptive schemes. The approach has become integral in scalable language modeling, high-utility privacy release, and pattern mining in high-dimensional or relational settings (Wen et al., 9 Nov 2025, Gillenwater et al., 2022, Wu et al., 2024, Maciąg, 2017, Lei et al., 2018).
