SeqTopK: Sequence-Level Top-K Selection
- SeqTopK is a method that generalizes top-k selection to operate on sequences, leveraging temporal, spatial, or structural context for improved adaptivity.
- It is applied in diverse areas such as mixture-of-experts routing in language models, sequential pattern mining in spatio-temporal data and graph databases, and in differential privacy.
- SeqTopK enhances statistical efficiency and resource allocation, yielding empirical improvements in scalability and performance across complex data-driven tasks.
Sequence-level TopK (SeqTopK) covers a family of methods that generalize top-K selection to operate at the level of sequences, rather than individual tokens, transactions, or items. This paradigm arises in multiple research areas—including Mixture-of-Experts (MoE) routing for LLMs, mining sequential patterns in spatio-temporal or graph-structured data, and privacy-preserving selection algorithms—united by the central idea that selection should aggregate across temporal, spatial, or structural context for improved adaptivity, statistical efficiency, or privacy properties. While the term “SeqTopK” is used variably, implementations share the feature that the selection (or filtering or routing) budget is distributed over the whole sequence, enabling data-dependent or context-sensitive allocation. The following sections synthesize the leading SeqTopK approaches across the literature, focusing on their formal definitions, algorithmic frameworks, key analytic results, practical implications, and open areas for further research.
1. Formal Definitions and Motivating Problem Classes
SeqTopK methodologies emerge in several settings, distinguished by the nature of sequences, the selection targets, and problem goals.
Mixture-of-Experts Routing (MoE Transformers)
For an input sequence of length T presented to an MoE layer with N experts, let s_{t,i} denote the softmax-normalized routing score of expert i for token t.
- Standard TopK Routing: per token, exactly K experts are activated: expert i is gated on for token t if s_{t,i} is among the top-K entries of s_t, and gated to 0 otherwise. Total activations: T·K.
- Sequence-level TopK (SeqTopK) Routing: the T·K largest values in the flattened T×N score matrix are selected, so tokens compete globally for expert capacity.
Optional per-token bounds (e.g., a minimum of 1 and a maximum cap on experts per token) stabilize allocation (Wen et al., 9 Nov 2025).
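As a concrete illustration of the two budgets, here is a minimal NumPy sketch (shapes and values are arbitrary; this is not the cited implementation) contrasting per-token TopK with sequence-level selection:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 4, 8, 2                       # tokens, experts, per-token budget

logits = rng.normal(size=(T, N))
scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax

# Standard TopK: exactly K experts per token.
per_token = np.zeros((T, N), dtype=bool)
per_token[np.arange(T)[:, None], np.argsort(scores, axis=-1)[:, -K:]] = True

# SeqTopK: the T*K largest entries of the flattened score matrix.
seq_topk = np.zeros(T * N, dtype=bool)
seq_topk[np.argsort(scores.ravel())[-T * K:]] = True
seq_topk = seq_topk.reshape(T, N)

assert per_token.sum() == seq_topk.sum() == T * K   # identical total budget
print(per_token.sum(axis=1))   # exactly K experts for every token
print(seq_topk.sum(axis=1))    # counts may differ across tokens
```

Both schemes spend the same T·K activations; only SeqTopK lets the per-token counts vary with token difficulty.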
Sequential Pattern Mining in Event-based and Graph Databases
- Spatio-Temporal Event Sequences: given a multiset of event instances over a set of event types F embedded in a spatio-temporal space, a sequential pattern P has a significance φ(P) recursively defined using spatio-temporal density ratios. SeqTopK seeks the K patterns of length at least min_len maximizing φ (Maciąg, 2017).
- Database Graphs: each vertex carries a transaction database. SeqTopK seeks the top-K frequent sequential patterns over the transaction sequences induced by paths in the graph. Since the sequence space grows exponentially with path length, efficient approximation is required (Lei et al., 2018).
Differentially Private Sequence-level Top-k Selection
Given a universe of items, select a sequence of the k highest-score items (no repeats) in a way that preserves differential privacy with respect to the input histograms, using mechanisms such as the joint exponential mechanism with pruning (Wu et al., 14 Nov 2024).
2. Algorithmic Frameworks and Pseudocode
SeqTopK Routing in MoE Models
The minimal implementation changes from per-token TopK routing to SeqTopK are captured in the following pseudocode (PyTorch-style):
```python
def forward(router_scores, K):
    # router_scores: [B, T, N]; B = batch size, T = tokens, N = experts
    B, T, N = router_scores.shape
    scores = softmax(router_scores, dim=-1)        # [B, T, N]
    flat = scores.view(B, T * N)                   # [B, T*N]
    k_seq = T * K                                  # sequence-level budget
    vals, idx_flat = flat.topk(k_seq, dim=-1)      # [B, T*K]
    token_idx = idx_flat // N                      # token each pick belongs to
    expert_idx = idx_flat % N                      # expert each pick selects
    R = torch.zeros_like(scores)
    batch_idx = torch.arange(B).unsqueeze(-1)      # broadcast over the T*K picks
    R[batch_idx, token_idx, expert_idx] = 1
    # Optionally enforce per-token min/max bounds on R here
    outputs = sum(R[..., i].unsqueeze(-1) * experts[i](inputs) for i in range(N))
    return outputs
```
Spatio-Temporal Event-Pattern SeqTopK
A recursive depth-first expansion enumerates and prunes candidate sequential patterns:
```
for each event type f in F:
    P = (f)
    tailEventSet(P) = D(f)
    φ(P) = +∞
    Expand(P)

procedure Expand(P):
    for each f' in F:
        S = SpatialJoin(tailEventSet(P), D(f'), ℛ, 𝒯)
        if S is empty: continue
        φ' = min(φ(P), DR(last(P) → f'))
        if size(TopK) == K and φ' ≤ TopK.minKey(): continue
        if length(P) + 1 ≥ min_len:
            if size(TopK) < K or φ' > TopK.minKey():
                if size(TopK) == K: TopK.popMin()
                TopK.push((φ', P → f'))
        if φ' > 1:
            tailEventSet(P → f') = S
            φ(P → f') = φ'
            Expand(P → f')
```
Sampling-based SeqTopK in Database Graphs
SeqTopK in the context of graphs is addressed by a two-step sampling framework:
- Uniformly sample a set of fixed-length paths in the graph.
- For each path, uniformly sample a transaction sequence.
- Estimate pattern frequencies on the sample to extract approximate top-K patterns, using provably unbiased estimators with Chernoff-bounded error (Lei et al., 2018).
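The estimation step can be sketched as follows. All names (`hoeffding_sample_size`, `estimate_top_k`, the toy alphabet data) are hypothetical stand-ins; the sketch illustrates only the sample-size logic and the unbiased frequency estimate, not the paper's path-sampling procedure:

```python
import math
import random

def hoeffding_sample_size(eps, delta, m):
    # Samples needed so all m candidate-frequency estimates are within eps
    # of their true values with probability >= 1 - delta (union bound).
    return math.ceil(math.log(2 * m / delta) / (2 * eps ** 2))

def contains(seq, pat):
    # True if pat occurs as a (non-contiguous) subsequence of seq.
    it = iter(seq)
    return all(any(x == p for x in it) for p in pat)

def estimate_top_k(sampled_sequences, candidates, k):
    # Unbiased frequency estimate: the fraction of sampled sequences
    # containing each candidate pattern.
    n = len(sampled_sequences)
    freq = {pat: sum(contains(s, pat) for s in sampled_sequences) / n
            for pat in candidates}
    return sorted(freq, key=freq.get, reverse=True)[:k]

random.seed(1)
# Toy stand-in for "uniformly sampled transaction sequences along paths".
sample = ["".join(random.choice("abcd") for _ in range(6)) for _ in range(2000)]
print(hoeffding_sample_size(0.05, 0.01, m=100))                 # -> 1981
print(estimate_top_k(sample, [("a", "b"), ("a", "a"), ("d", "c")], k=2))
```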
Differentially Private SeqTopK (Joint Exponential with Pruning)
- Partition the space of sequences into loss-consistent groups.
- Sample a group index using the Gumbel-max trick for the exponential mechanism.
- Uniformly sample a sequence in the selected group. This strategy runs substantially faster than enumerating the full sequence space while preserving ε-DP (Wu et al., 14 Nov 2024).
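Step 2, sampling a group index with the Gumbel-max trick, can be sketched as below (`gumbel_max_exponential` is a hypothetical helper; a full mechanism would also perform the within-group uniform sampling of step 3):

```python
import math
import random

def gumbel_max_exponential(utilities, epsilon, sensitivity=1.0):
    # Exponential mechanism via the Gumbel-max trick: add independent
    # standard-Gumbel noise to epsilon*u/(2*sensitivity) and take the
    # argmax; index i wins with probability proportional to
    # exp(epsilon * u_i / (2 * sensitivity)).
    best, best_key = None, -math.inf
    for i, u in enumerate(utilities):
        g = -math.log(-math.log(random.random()))   # standard Gumbel draw
        key = epsilon * u / (2 * sensitivity) + g
        if key > best_key:
            best, best_key = i, key
    return best

random.seed(0)
group_utilities = [-5.0, -1.0, -0.2, -9.0]   # e.g., negative group losses
counts = [0] * 4
for _ in range(5000):
    counts[gumbel_max_exponential(group_utilities, epsilon=2.0)] += 1
print(counts)   # mass concentrates on the highest-utility group (index 2)
```

The Gumbel-max formulation avoids materializing the full probability vector, which is what makes sampling over a partitioned (grouped) sequence space tractable.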
3. Theoretical Foundations, Analysis, and Complexity
MoE SeqTopK Routing
- Budget is fixed at T·K activations per sequence, but allocation across tokens is dynamic.
- Overhead: negligible in both training cost and inference throughput compared to standard TopK.
- Empirically, routing entropy and expert utilization uniformity are increased, improving adaptivity to token “difficulty” (Wen et al., 9 Nov 2025).
Spatio-Temporal Pattern Mining
- For |F| event types, a pattern-length cap, and n total event instances, worst-case DFS enumeration grows exponentially in the pattern length; in practice, early pruning via dynamic top-K thresholding sharply reduces tree expansion.
- Use of spatial indexes (e.g., R-tree) decreases join complexity, enabling scalability to hundreds of thousands of events (Maciąg, 2017).
Sampling for Database Graphs
- The problem is proven #P-hard; the induced sequence database is exponential in the path length.
- Explicit sample-size bounds for ε-precision at confidence 1 − δ are derived; by a standard Chernoff plus union-bound argument they scale as O(ε⁻² · log(m/δ)), where m is the number of candidates (Lei et al., 2018).
- The mean-estimator error decays as O(1/√s) in the sample size s.
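The 1/√s decay is easy to check empirically. The sketch below (a hypothetical toy setup, not the cited experiments) estimates a Bernoulli frequency from s samples and measures the mean absolute error:

```python
import random
import statistics

random.seed(42)
TRUE_MEAN = 0.3   # true frequency of some fixed pattern

def mean_abs_error(s, trials=400):
    # Average |estimate - truth| when estimating a Bernoulli(TRUE_MEAN)
    # frequency from s uniform samples.
    errs = []
    for _ in range(trials):
        est = sum(random.random() < TRUE_MEAN for _ in range(s)) / s
        errs.append(abs(est - TRUE_MEAN))
    return statistics.fmean(errs)

e100, e400, e1600 = mean_abs_error(100), mean_abs_error(400), mean_abs_error(1600)
# Quadrupling the sample size roughly halves the error, as O(1/sqrt(s)) predicts.
print(e100, e400, e1600)
```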
Differential Privacy
- The exponential mechanism, when implemented over the sequence space with sensitivity-1 loss, achieves pure ε-DP.
- Pruning the loss beyond a threshold parameter controls tail probability failure, with negligible effect on average-case error (Wu et al., 14 Nov 2024).
- Runtime is an order of magnitude faster than prior work.
4. Empirical Results and Comparative Performance
| Approach | Setting | Key Empirical Results |
|---|---|---|
| MoE SeqTopK (Wen et al., 9 Nov 2025) | OLMoE-7B, Qwen1.5-MoE, others | +3–16.9% metric improvements over TopK; gains largest under high sparsity (up to 52% relative on GSM8K at the sparsest setting). |
| Spatio-Temp. Event Mining (Maciąg, 2017) | Synthetic, ~18k events | 10–56 s runtimes for K up to 90; near-linear scalability in dataset size. |
| Database Graph SeqTopK (Lei et al., 2018) | Large citation graphs | High approximation precision at practical sample sizes; runtime on the order of minutes where the exact baseline is infeasible. |
| Private SeqTopK (Wu et al., 14 Nov 2024) | Universes up to 166k items | 10–100× faster than the Joint mechanism while matching its accuracy. |
These results confirm the practical benefit and scalability of sequence-level TopK schemes across domains.
5. Practical Considerations and Implementation Guidelines
- MoE Routing: often implemented as a single `.topk()`-related code change; fully compatible with pretrained models, requiring only brief fine-tuning. Online decoding requires incremental expert allocation and a routing cache.
- Spatial Pattern Mining: the choice of min_len, K, the spatio-temporal radius, and the pattern-length cap should be guided by domain constraints and memory limits. Pre-indexing raw events is critical.
- Database Graphs: sample sizes should be set according to the error bounds, typically orders of magnitude smaller than the universe size. Pattern mining can use standard sequential-pattern miners on the sampled sequences.
- Differential Privacy: histogram pre-processing and explicit group partitioning enable fast, private, and accurate sequence selection that scales to large universes.
6. Limitations and Directions for Future Research
- Pathological Allocations: In MoE, unbounded token-to-expert assignments can cause degenerate routing, mitigated by minimal per-token bounds (Wen et al., 9 Nov 2025).
- Non-causal/global selection: the full-sequence SeqTopK oracle used in training is unavailable during causal (autoregressive) decoding, since scores for future tokens are unknown; online variants achieve near-oracle performance but lag slightly.
- Hardness of Exact Mining: Both spatio-temporal and graph sequential pattern mining problems are #P-hard, making sampling and efficient indexing essential (Lei et al., 2018).
- Differentiable Relaxations: current SeqTopK routers (excluding soft variants) are non-differentiable; future work could develop continuous relaxations for improved gradient flow.
- Combinatorial Extensions: Integrating sequence-level routing with other MoE mechanisms (grouping, merging, pruning), multimodal data, or multi-task objectives remains open.
- Theoretical Budget Allocation: Formal analysis of optimal expert-budgets or selection budgets at the sequence level (especially for deep architectures or compositional privacy) is still undeveloped.
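For the per-token bound mentioned under Pathological Allocations above, one simple mitigation (an assumed scheme for illustration, not necessarily the cited paper's exact method) is to force each token's single best expert into the sequence-level selection:

```python
import numpy as np

def seq_topk_with_min_one(scores, K):
    # Sequence-level top-(T*K) selection over a [T, N] score matrix, then
    # force each token's best expert into the selection so no token is
    # left with zero experts (a minimal per-token lower bound).
    T, N = scores.shape
    mask = np.zeros(T * N, dtype=bool)
    mask[np.argsort(scores.ravel())[-T * K:]] = True
    mask = mask.reshape(T, N)
    mask[np.arange(T), scores.argmax(axis=-1)] = True   # min-1 guarantee
    return mask

rng = np.random.default_rng(3)
scores = rng.random((6, 8))
mask = seq_topk_with_min_one(scores, K=2)
assert (mask.sum(axis=1) >= 1).all()   # every token keeps at least 1 expert
print(mask.sum())                       # may slightly exceed the T*K budget
```

Note the trade-off: forcing the per-token minimum can exceed the exact T·K budget, so a production router might instead rebalance by dropping the globally weakest surplus picks.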
7. Conceptual Significance and Comparative Synthesis
SeqTopK, in all its instantiations, embodies a shift from rigid, per-point resource allocation toward context-sensitive, sequence-aware selection. This leads to efficiency gains, improved adaptivity, tighter statistical guarantees (in the case of pattern mining and privacy), and empirical increases in utility for a fixed compute or privacy budget. The approach is realized with minimal code or algorithmic change but substantial impact across deep learning, data mining, and privacy applications. As sequence data and large models remain central in computational research, sequence-level selection mechanisms are likely to become increasingly influential in both theoretical and applied work.