SeqTopK: Sequence-Level Top-K Selection
- SeqTopK is a method that generalizes top-k selection to operate on sequences, leveraging temporal, spatial, or structural context for improved adaptivity.
- It is applied in diverse areas such as mixture-of-experts routing in language models, sequential pattern mining in spatio-temporal data and graph databases, and in differential privacy.
- SeqTopK enhances statistical efficiency and resource allocation, yielding empirical improvements in scalability and performance across complex data-driven tasks.
Sequence-level TopK (SeqTopK) covers a family of methods that generalize top-K selection to operate at the level of sequences, rather than individual tokens, transactions, or items. This paradigm arises in multiple research areas—including Mixture-of-Experts (MoE) routing for LLMs, mining sequential patterns in spatio-temporal or graph-structured data, and privacy-preserving selection algorithms—united by the central idea that selection should aggregate across temporal, spatial, or structural context for improved adaptivity, statistical efficiency, or privacy properties. While the term “SeqTopK” is used variably, implementations share the feature that the selection (or filtering or routing) budget is distributed over the whole sequence, enabling data-dependent or context-sensitive allocation. The following sections synthesize the leading SeqTopK approaches across the literature, focusing on their formal definitions, algorithmic frameworks, key analytic results, practical implications, and open areas for further research.
1. Formal Definitions and Motivating Problem Classes
SeqTopK methodologies emerge in several settings, distinguished by the nature of sequences, the selection targets, and problem goals.
Mixture-of-Experts Routing (MoE Transformers)
For an input sequence of length T presented to an MoE layer with N experts, let s_{t,i} denote the softmax-normalized routing score of expert i for token t.
- Standard TopK Routing: per token, exactly K experts are activated: expert i is gated on for token t if s_{t,i} is among the top-K entries of s_t, and gated to 0 otherwise. Total activations: T·K.
- Sequence-level TopK (SeqTopK) Routing: the T·K largest values in the flattened T×N score matrix are selected, so tokens compete globally for expert capacity.
Optional per-token bounds (e.g., a minimum of 1 and a maximum cap on experts per token) stabilize allocation (Wen et al., 9 Nov 2025).
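As a concrete illustration of the two budgets, here is a minimal NumPy sketch (shapes and values are arbitrary; this is not the cited implementation) contrasting per-token TopK with sequence-level selection:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 4, 8, 2                       # tokens, experts, per-token budget

logits = rng.normal(size=(T, N))
scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax

# Standard TopK: exactly K experts per token.
per_token = np.zeros((T, N), dtype=bool)
per_token[np.arange(T)[:, None], np.argsort(scores, axis=-1)[:, -K:]] = True

# SeqTopK: the T*K largest entries of the flattened score matrix.
seq_topk = np.zeros(T * N, dtype=bool)
seq_topk[np.argsort(scores.ravel())[-T * K:]] = True
seq_topk = seq_topk.reshape(T, N)

assert per_token.sum() == seq_topk.sum() == T * K   # identical total budget
print(per_token.sum(axis=1))   # exactly K experts for every token
print(seq_topk.sum(axis=1))    # counts may differ across tokens
```

Both schemes spend the same T·K activations; only SeqTopK lets the per-token counts vary with token difficulty.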
Sequential Pattern Mining in Event-based and Graph Databases
- Spatio-Temporal Event Sequences: given a multiset of event instances over a set of event types F embedded in a spatio-temporal space, a sequential pattern P has a significance φ(P) recursively defined using spatio-temporal density ratios. SeqTopK seeks the K patterns of length at least min_len maximizing φ (Maciąg, 2017).
- Database Graphs: each vertex carries a transaction database. SeqTopK seeks the top-K frequent sequential patterns over the transaction sequences induced by paths in the graph. Since the sequence space grows exponentially with path length, efficient approximation is required (Lei et al., 2018).
Differentially Private Sequence-level Top-k Selection
Given a universe of items, select a sequence of the k highest-score items (no repeats) in a way that preserves differential privacy with respect to the input histograms, using mechanisms such as the joint exponential mechanism with pruning (Wu et al., 14 Nov 2024).
2. Algorithmic Frameworks and Pseudocode
SeqTopK Routing in MoE Models
The minimal implementation changes from per-token TopK routing to SeqTopK are captured in the following pseudocode (PyTorch-style):
```python
def forward(router_scores, K):
    # router_scores: [B, T, N]; B = batch size, T = tokens, N = experts
    B, T, N = router_scores.shape
    scores = softmax(router_scores, dim=-1)        # [B, T, N]
    flat = scores.view(B, T * N)                   # [B, T*N]
    k_seq = T * K                                  # sequence-level budget
    vals, idx_flat = flat.topk(k_seq, dim=-1)      # [B, T*K]
    token_idx = idx_flat // N                      # token each pick belongs to
    expert_idx = idx_flat % N                      # expert each pick selects
    R = torch.zeros_like(scores)
    batch_idx = torch.arange(B).unsqueeze(-1)      # broadcast over the T*K picks
    R[batch_idx, token_idx, expert_idx] = 1
    # Optionally enforce per-token min/max bounds on R here
    outputs = sum(R[..., i].unsqueeze(-1) * experts[i](inputs) for i in range(N))
    return outputs
```
Spatio-Temporal Event-Pattern SeqTopK
A recursive depth-first expansion enumerates and prunes candidate sequential patterns:
```
for each event type f in F:
    P = (f)
    tailEventSet(P) = D(f)
    φ(P) = +∞
    Expand(P)

procedure Expand(P):
    for each f' in F:
        S = SpatialJoin(tailEventSet(P), D(f'), ℛ, 𝒯)
        if S is empty: continue
        φ' = min(φ(P), DR(last(P) → f'))
        if size(TopK) == K and φ' ≤ TopK.minKey(): continue
        if length(P) + 1 ≥ min_len:
            if size(TopK) < K or φ' > TopK.minKey():
                if size(TopK) == K: TopK.popMin()
                TopK.push((φ', P → f'))
        if φ' > 1:
            tailEventSet(P → f') = S
            φ(P → f') = φ'
            Expand(P → f')
```
Sampling-based SeqTopK in Database Graphs
SeqTopK in the context of graphs is addressed by a two-step sampling framework:
- Uniformly sample a set of fixed-length paths in the graph.
- For each path, uniformly sample a transaction sequence.
- Estimate pattern frequencies on the sample to extract approximate top-K patterns, using provably unbiased estimators with Chernoff-bounded error (Lei et al., 2018).
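The estimation step can be sketched as follows. All names (`hoeffding_sample_size`, `estimate_top_k`, the toy alphabet data) are hypothetical stand-ins; the sketch illustrates only the sample-size logic and the unbiased frequency estimate, not the paper's path-sampling procedure:

```python
import math
import random

def hoeffding_sample_size(eps, delta, m):
    # Samples needed so all m candidate-frequency estimates are within eps
    # of their true values with probability >= 1 - delta (union bound).
    return math.ceil(math.log(2 * m / delta) / (2 * eps ** 2))

def contains(seq, pat):
    # True if pat occurs as a (non-contiguous) subsequence of seq.
    it = iter(seq)
    return all(any(x == p for x in it) for p in pat)

def estimate_top_k(sampled_sequences, candidates, k):
    # Unbiased frequency estimate: the fraction of sampled sequences
    # containing each candidate pattern.
    n = len(sampled_sequences)
    freq = {pat: sum(contains(s, pat) for s in sampled_sequences) / n
            for pat in candidates}
    return sorted(freq, key=freq.get, reverse=True)[:k]

random.seed(1)
# Toy stand-in for "uniformly sampled transaction sequences along paths".
sample = ["".join(random.choice("abcd") for _ in range(6)) for _ in range(2000)]
print(hoeffding_sample_size(0.05, 0.01, m=100))                 # -> 1981
print(estimate_top_k(sample, [("a", "b"), ("a", "a"), ("d", "c")], k=2))
```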
Differentially Private SeqTopK (Joint Exponential with Pruning)
- Partition the space of sequences into loss-consistent groups.
- Sample a group index using the Gumbel-max trick for the exponential mechanism.
- Uniformly sample a sequence in the selected group. This strategy runs substantially faster than enumerating the full sequence space while preserving ε-DP (Wu et al., 14 Nov 2024).
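Step 2, sampling a group index with the Gumbel-max trick, can be sketched as below (`gumbel_max_exponential` is a hypothetical helper; a full mechanism would also perform the within-group uniform sampling of step 3):

```python
import math
import random

def gumbel_max_exponential(utilities, epsilon, sensitivity=1.0):
    # Exponential mechanism via the Gumbel-max trick: add independent
    # standard-Gumbel noise to epsilon*u/(2*sensitivity) and take the
    # argmax; index i wins with probability proportional to
    # exp(epsilon * u_i / (2 * sensitivity)).
    best, best_key = None, -math.inf
    for i, u in enumerate(utilities):
        g = -math.log(-math.log(random.random()))   # standard Gumbel draw
        key = epsilon * u / (2 * sensitivity) + g
        if key > best_key:
            best, best_key = i, key
    return best

random.seed(0)
group_utilities = [-5.0, -1.0, -0.2, -9.0]   # e.g., negative group losses
counts = [0] * 4
for _ in range(5000):
    counts[gumbel_max_exponential(group_utilities, epsilon=2.0)] += 1
print(counts)   # mass concentrates on the highest-utility group (index 2)
```

The Gumbel-max formulation avoids materializing the full probability vector, which is what makes sampling over a partitioned (grouped) sequence space tractable.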
3. Theoretical Foundations, Analysis, and Complexity
MoE SeqTopK Routing
- Budget is fixed at T·K activations per sequence, but allocation across tokens is dynamic.
- Overhead: negligible in both training cost and inference throughput compared to standard TopK.
- Empirically, routing entropy and expert utilization uniformity are increased, improving adaptivity to token “difficulty” (Wen et al., 9 Nov 2025).
Spatio-Temporal Pattern Mining
- For |F| event types, a pattern-length cap, and n total event instances, worst-case DFS enumeration grows exponentially in the pattern length; in practice, early pruning via dynamic top-K thresholding sharply reduces tree expansion.
- Use of spatial indexes (e.g., R-tree) decreases join complexity, enabling scalability to hundreds of thousands of events (Maciąg, 2017).
Sampling for Database Graphs
- The problem is proven #P-hard; the induced sequence database is exponential in the path length.
- Explicit sample-size bounds for ε-precision at confidence 1 − δ are derived; by a standard Chernoff plus union-bound argument they scale as O(ε⁻² · log(m/δ)), where m is the number of candidates (Lei et al., 2018).
- The mean-estimator error decays as O(1/√s) in the sample size s.
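The 1/√s decay is easy to check empirically. The sketch below (a hypothetical toy setup, not the cited experiments) estimates a Bernoulli frequency from s samples and measures the mean absolute error:

```python
import random
import statistics

random.seed(42)
TRUE_MEAN = 0.3   # true frequency of some fixed pattern

def mean_abs_error(s, trials=400):
    # Average |estimate - truth| when estimating a Bernoulli(TRUE_MEAN)
    # frequency from s uniform samples.
    errs = []
    for _ in range(trials):
        est = sum(random.random() < TRUE_MEAN for _ in range(s)) / s
        errs.append(abs(est - TRUE_MEAN))
    return statistics.fmean(errs)

e100, e400, e1600 = mean_abs_error(100), mean_abs_error(400), mean_abs_error(1600)
# Quadrupling the sample size roughly halves the error, as O(1/sqrt(s)) predicts.
print(e100, e400, e1600)
```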
Differential Privacy
- The exponential mechanism, when implemented over the sequence space with sensitivity-1 loss, achieves pure ε-DP.
- Pruning the loss beyond a threshold parameter controls tail probability failure, with negligible effect on average-case error (Wu et al., 14 Nov 2024).
- Runtime is an order of magnitude faster than prior work.
4. Empirical Results and Comparative Performance
| Approach | Setting | Key Empirical Results |
|---|---|---|
| MoE SeqTopK (Wen et al., 9 Nov 2025) | OLMoE-7B, Qwen1.5-MoE, others | +3–16.9% metric improvements over TopK; gains largest under high sparsity (up to 52% relative on GSM8K at the sparsest setting). |
| Spatio-Temp. Event Mining (Maciąg, 2017) | Synthetic, ~18k events | 10–56 s runtimes for K up to 90; near-linear scalability in dataset size. |
| Database Graph SeqTopK (Lei et al., 2018) | Large citation graphs | High approximation precision at practical sample sizes; runtime on the order of minutes where the exact baseline is infeasible. |
| Private SeqTopK (Wu et al., 14 Nov 2024) | Universes up to 166k items | 10–100× faster than the Joint mechanism while matching its accuracy. |
These results confirm the practical benefit and scalability of sequence-level TopK schemes across domains.
5. Practical Considerations and Implementation Guidelines
- MoE Routing: often implemented as a single `.topk()`-related code change; fully compatible with pretrained models, requiring only brief fine-tuning. Online decoding requires incremental expert allocation and a routing cache.
- Spatial Pattern Mining: the choice of min_len, K, the spatio-temporal radius, and the pattern-length cap should be guided by domain constraints and memory limits. Pre-indexing raw events is critical.
- Database Graphs: sample sizes should be set according to the error bounds, typically orders of magnitude smaller than the universe size. Pattern mining can use standard sequential-pattern miners on the sampled sequences.
- Differential Privacy: histogram pre-processing and explicit group partitioning enable fast, private, and accurate sequence selection that scales to large universes.
6. Limitations and Directions for Future Research
- Pathological Allocations: In MoE, unbounded token-to-expert assignments can cause degenerate routing, mitigated by minimal per-token bounds (Wen et al., 9 Nov 2025).
- Non-causal/global selection: the full-sequence SeqTopK oracle used in training is unavailable during causal (autoregressive) decoding, since scores for future tokens are unknown; online variants achieve near-oracle performance but lag slightly.
- Hardness of Exact Mining: Both spatio-temporal and graph sequential pattern mining problems are #P-hard, making sampling and efficient indexing essential (Lei et al., 2018).
- Differentiable Relaxations: current SeqTopK routers (excluding soft variants) are non-differentiable; future work could develop continuous relaxations for improved gradient flow.
- Combinatorial Extensions: Integrating sequence-level routing with other MoE mechanisms (grouping, merging, pruning), multimodal data, or multi-task objectives remains open.
- Theoretical Budget Allocation: Formal analysis of optimal expert-budgets or selection budgets at the sequence level (especially for deep architectures or compositional privacy) is still undeveloped.
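For the per-token bound mentioned under Pathological Allocations above, one simple mitigation (an assumed scheme for illustration, not necessarily the cited paper's exact method) is to force each token's single best expert into the sequence-level selection:

```python
import numpy as np

def seq_topk_with_min_one(scores, K):
    # Sequence-level top-(T*K) selection over a [T, N] score matrix, then
    # force each token's best expert into the selection so no token is
    # left with zero experts (a minimal per-token lower bound).
    T, N = scores.shape
    mask = np.zeros(T * N, dtype=bool)
    mask[np.argsort(scores.ravel())[-T * K:]] = True
    mask = mask.reshape(T, N)
    mask[np.arange(T), scores.argmax(axis=-1)] = True   # min-1 guarantee
    return mask

rng = np.random.default_rng(3)
scores = rng.random((6, 8))
mask = seq_topk_with_min_one(scores, K=2)
assert (mask.sum(axis=1) >= 1).all()   # every token keeps at least 1 expert
print(mask.sum())                       # may slightly exceed the T*K budget
```

Note the trade-off: forcing the per-token minimum can exceed the exact T·K budget, so a production router might instead rebalance by dropping the globally weakest surplus picks.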
7. Conceptual Significance and Comparative Synthesis
SeqTopK, in all its instantiations, embodies a shift from rigid, per-point resource allocation toward context-sensitive, sequence-aware selection. This leads to efficiency gains, improved adaptivity, tighter statistical guarantees (in the case of pattern mining and privacy), and empirical increases in utility for a fixed compute or privacy budget. The approach is realized with minimal code or algorithmic change but substantial impact across deep learning, data mining, and privacy applications. As sequence data and large models remain central in computational research, sequence-level selection mechanisms are likely to become increasingly influential in both theoretical and applied work.