Hybrid Top-k/Top-p Masking
- Hybrid Top-k/Top-p masking is a method that combines Top-k and Top-p procedures to select elements based on both fixed count and cumulative probability, promoting sparsity and adaptive control.
- The algorithm efficiently computes a minimal support set through sorting, cumulative summation, and renormalized softmax, making it practical for block-sparse attention and adaptive decoding.
- Applied in models like SpargeAttention2 and language decoders, hybrid masking improves diversity, accelerates attention computation, and maintains quality under extreme sparsity conditions.
Hybrid Top-k/Top-p Masking defines a class of algorithms that combine two common support-selection procedures—Top-k and Top-p (nucleus)—for promoting sparsity in neural attention and sampling. By enforcing both a minimum element count (Top-k) and a cumulative distribution mass threshold (Top-p), hybrid masking offers provably stricter or more adaptive constraints on the selection of tokens, attention blocks, or vocabulary elements, controlling diversity and precision even in regimes where pure Top-k or Top-p fail. This construct serves as a principled regularisation and efficiency tool, enabling both block-sparse attention in large diffusion models and robust, adaptive decoding in sequence generation tasks (Zhang et al., 13 Feb 2026, Ji et al., 20 Feb 2026).
1. Formal Definition and Construction
Hybrid Top-k/Top-p masking selects elements from a probability vector or attention score row by the union or intersection of two sets:
- The $K$ elements (or blocks) with the highest scores (Top-k)
- The minimal prefix of elements with cumulative probability at least $P$ (Top-p)
Let $p \in \mathbb{R}^n$ be a probability vector and $s$ the corresponding scores. The basic Top-k set identifies the $K$ largest entries; Top-p includes the smallest prefix covering at least $P$ of the total mass. The hybrid mask with union semantics is:
$$S_{\cup} = S_K \cup S_P,$$
where $S_K$ indexes the Top-k elements, and $S_P$ indexes the minimal Top-p prefix. The intersection form (used in LLM decoding (Ji et al., 20 Feb 2026)) restricts the support to at most $K$ elements drawn from the minimal prefix covering at least $P$ mass, i.e.,
$$S_{\cap} = \{\pi(1), \dots, \pi(m)\}, \qquad m = \min(K, m_P),$$
where $m_P$ is minimal such that $\sum_{r=1}^{m_P} p_{\pi(r)} \ge P$, and $\pi$ sorts $p$ in descending order.
In block-sparse attention (e.g., SpargeAttention2), the hybrid mask at row $i$ of the pooled attention matrix is:
$$M_i = S_K^{(i)} \cup S_P^{(i)},$$
with $S_K^{(i)}$ and $S_P^{(i)}$ defined per row (Zhang et al., 13 Feb 2026).
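The two set constructions can be sketched in NumPy as follows. This is a minimal illustration rather than code from either cited paper: the helper names `topk_set`, `topp_set`, and `hybrid_support` are hypothetical, and the `mode` argument selects between the union and intersection forms.

```python
import numpy as np

def topk_set(p, k):
    """Indices of the k largest entries of p."""
    order = np.argsort(-p, kind="stable")
    return set(order[:k].tolist())

def topp_set(p, mass):
    """Smallest descending-order prefix whose cumulative sum reaches `mass`."""
    order = np.argsort(-p, kind="stable")
    csum = np.cumsum(p[order])
    # First position where the running sum reaches the threshold (else all of p).
    hits = np.nonzero(csum >= mass)[0]
    m = int(hits[0]) + 1 if hits.size else len(p)
    return set(order[:m].tolist())

def hybrid_support(p, k, mass, mode="union"):
    """Combine the Top-k and Top-p index sets by union or intersection."""
    a, b = topk_set(p, k), topp_set(p, mass)
    return a | b if mode == "union" else a & b

p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
union = hybrid_support(p, k=2, mass=0.8, mode="union")         # {0, 1, 2}
inter = hybrid_support(p, k=2, mass=0.8, mode="intersection")  # {0, 1}
```

Because both sets are prefixes of the same descending order, the union is the longer prefix and the intersection is the shorter one, which is why the intersection form reduces to a support of size $\min(K, m_P)$.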
2. Algorithmic Implementation
The hybrid masking procedure consists of the following steps (notation from (Zhang et al., 13 Feb 2026, Ji et al., 20 Feb 2026)):
- Compute Softmax probabilities $p$ (or attention weights) for the current row or token logits $s$.
- Sort $p$ in descending order; accumulate cumulative sums to determine the minimal prefix that reaches the Top-p mass $P$.
- Derive the prefix length $m_P$ and set the support size as $m = \min(K, m_P)$, where $K$ is the Top-k cap.
- Select the support set $S$ as the first $m$ indices in the sorted order.
- Assign zero probability to all elements outside the support; inside, compute entropic weights via normalised exponentiation.
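The following Python code implements the hybrid mask used in decoding (Ji et al., 20 Feb 2026):

```python
import numpy as np

def hybrid_topk_p_mask(s, p, K, P, temperature):
    """Return renormalised probabilities supported on the hybrid Top-k/Top-p set."""
    s = np.asarray(s, dtype=float)
    p = np.asarray(p, dtype=float)
    # Indices of p in descending order of probability.
    idx = np.argsort(-p, kind="stable")
    # Smallest prefix length whose cumulative mass reaches P.
    cumsum = 0.0
    mP = 0
    for r in range(len(p)):
        cumsum += p[idx[r]]
        if cumsum >= P:
            mP = r + 1
            break
    if mP == 0:
        # Threshold never reached (e.g. due to rounding): keep every element.
        mP = len(p)
    # Support size: at most K elements, and no longer than the Top-p prefix.
    m = min(K, mP)
    S = idx[:m]
    # Temperature-scaled softmax on the support, stabilised by max subtraction.
    logits = s[S] / temperature
    max_logit = logits.max()
    expw = np.exp(logits - max_logit)
    w = expw / expw.sum()
    # Everything outside the support receives zero probability.
    q = np.zeros_like(p)
    q[S] = w
    return q
```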
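For illustration, the same intersection-form computation can be written without the explicit loop. The sketch below is a rewrite under stated assumptions rather than the cited implementation: the name `hybrid_mask_vectorised` is hypothetical, `np.searchsorted` on the cumulative sums stands in for the loop that finds the Top-p prefix length, and the example distributions and parameter values are arbitrary.

```python
import numpy as np

def hybrid_mask_vectorised(s, p, K, P, temperature=1.0):
    s = np.asarray(s, dtype=float)
    p = np.asarray(p, dtype=float)
    idx = np.argsort(-p, kind="stable")
    csum = np.cumsum(p[idx])
    # First position whose cumulative mass is >= P; fall back to len(p) if none.
    mP = min(int(np.searchsorted(csum, P, side="left")) + 1, len(p))
    m = min(K, mP)
    S = idx[:m]
    # Temperature-scaled softmax restricted to the support.
    logits = s[S] / temperature
    w = np.exp(logits - logits.max())
    q = np.zeros_like(p)
    q[S] = w / w.sum()
    return q

# Peaky row: the Top-p prefix is a single element, so the support has size 1.
p_peaky = np.array([0.9, 0.05, 0.03, 0.02])
q_peaky = hybrid_mask_vectorised(np.log(p_peaky), p_peaky, K=3, P=0.85)

# Flat row: the Top-p prefix would need all 4 elements, so K=3 caps the support.
p_flat = np.full(4, 0.25)
q_flat = hybrid_mask_vectorised(np.log(p_flat), p_flat, K=3, P=0.9)
```

Passing `np.log(p)` as the scores with `temperature=1.0` makes the output the original probabilities renormalised on the support, which is a convenient sanity check.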
For block-sparse attention, the hybrid mask computation is embedded in CUDA kernels that fuse masking and sparse softmax computation, skipping all masked-out blocks and eliminating unnecessary memory and compute (Zhang et al., 13 Feb 2026).
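The fused kernels themselves are not reproduced here. As a rough CPU reference for what such a kernel computes, the NumPy sketch below builds a union-form block mask from pooled queries and keys and then evaluates attention only over the kept blocks. Everything in it is an illustrative assumption: mean pooling, the block size, the helper names, and the reading of `k_frac` and `p_mass` as a per-row Top-k fraction and Top-p mass are not taken from SpargeAttention2.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_mask(q, k, block, k_frac, p_mass):
    """Union-form hybrid mask over (query block, key block) pairs."""
    d = q.shape[-1]
    qb = q.reshape(-1, block, d).mean(axis=1)   # pooled queries (assumed mean pooling)
    kb = k.reshape(-1, block, d).mean(axis=1)   # pooled keys
    pooled = softmax(qb @ kb.T / np.sqrt(d))    # pooled attention rows
    nq, nk = pooled.shape
    keep = np.zeros((nq, nk), dtype=bool)
    n_top = max(1, int(np.ceil(k_frac * nk)))
    for i in range(nq):
        order = np.argsort(-pooled[i], kind="stable")
        csum = np.cumsum(pooled[i, order])
        m_p = min(int(np.searchsorted(csum, p_mass)) + 1, nk)
        # The union of two prefixes of the same order is the longer prefix.
        keep[i, order[:max(n_top, m_p)]] = True
    return keep

def block_sparse_attention(q, k, v, keep, block):
    """Attention in which each query block only sees its kept key/value blocks."""
    d = q.shape[-1]
    out = np.zeros_like(q)
    for i in range(keep.shape[0]):
        cols = np.nonzero(keep[i])[0]
        # Token indices covered by the kept key blocks; masked blocks are never read.
        tok = (cols[:, None] * block + np.arange(block)).ravel()
        qi = q[i * block:(i + 1) * block]
        attn = softmax(qi @ k[tok].T / np.sqrt(d))
        out[i * block:(i + 1) * block] = attn @ v[tok]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
keep = block_mask(q, k, block=8, k_frac=0.25, p_mass=0.5)
out = block_sparse_attention(q, k, v, keep, block=8)
```

With an all-`True` mask the function reduces to ordinary dense attention, which gives a simple way to validate a sparse implementation against a dense reference.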
3. Theoretical Motivation and Failure Modes
Hybrid Top-k/Top-p masking is motivated by the complementary failure modes of pure Top-k and Top-p selection (Zhang et al., 13 Feb 2026):
- Uniform distributions: If attention or output probabilities are flat, Top-k drops a large number of plausibly useful items, since mass is dispersed; Top-p includes most tokens, reducing sparsity.
- Peaky (skewed) distributions: If a few elements dominate, Top-p often collapses to those "sinks," discarding secondary yet still important mass; Top-k enforces a minimal set size, recovering diversity.
The hybrid union avoids both pitfalls: the Top-k component prevents undercoverage when mass is concentrated, while the Top-p component ensures sufficient mass is retained when needed. Analysis of the resulting error in attention summaries demonstrates that the hybrid selection yields uniformly lower error across both distributional regimes (Zhang et al., 13 Feb 2026).
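A small numerical experiment makes the two failure modes concrete. The sketch below is illustrative only: the distributions, `k = 4`, and the mass threshold `0.9` are arbitrary choices, and `prefix_len` and `report` are hypothetical helpers. For a flat and a peaky distribution it reports how many elements each rule keeps and how much probability mass they cover.

```python
import numpy as np

def prefix_len(p_sorted, mass):
    """Length of the shortest descending prefix whose sum reaches `mass`."""
    csum = np.cumsum(p_sorted)
    return min(int(np.searchsorted(csum, mass)) + 1, len(p_sorted))

def report(p, k, mass):
    """Map each rule to (support size, covered probability mass)."""
    p_sorted = np.sort(p)[::-1]
    sizes = {"top_k": k, "top_p": prefix_len(p_sorted, mass)}
    # Union of two prefixes of the same ordering is the longer one.
    sizes["hybrid_union"] = max(sizes["top_k"], sizes["top_p"])
    return {name: (m, float(p_sorted[:m].sum())) for name, m in sizes.items()}

n = 64
flat = np.full(n, 1.0 / n)
peaky = np.array([0.95] + [0.05 / (n - 1)] * (n - 1))

flat_report = report(flat, k=4, mass=0.9)
peaky_report = report(peaky, k=4, mass=0.9)
for name, rep in (("flat", flat_report), ("peaky", peaky_report)):
    print(name, rep)
```

On the flat row, Top-k alone covers only a small fraction of the mass while Top-p keeps most elements; on the peaky row, Top-p keeps a single element and the union restores the Top-k floor.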
4. Applications in Attention and Decoding
Block-Sparse Attention
Hybrid Top-k/Top-p masking is a central component of SpargeAttention2 for video diffusion transformers. By combining both constraints, the hybrid mask enables extremely high sparsity (e.g., 95%) in the attention map while maintaining score coverage required for high-fidelity generation. The corresponding CUDA implementation, built upon FlashAttention2, ensures that masked blocks are skipped in both forward and backward passes, yielding significant latency and memory reductions (Zhang et al., 13 Feb 2026).
LLM Decoding
In decoding, the hybrid mask is used to construct an adaptive sampler that interpolates between Top-k and Top-p regimes. Explicitly, the hybrid sampler:
- Never selects more than $K$ tokens per step,
- Retains at least $P$ cumulative probability mass whenever this can be done with $K$ or fewer tokens.
This dual constraint enables fine-grained control of the diversity–precision trade-off in generation, preventing runaway support growth at high temperature (as occurs with Top-p alone) and preserving tail mass when confidence is low (mitigating Top-k narrowness) (Ji et al., 20 Feb 2026).
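As an illustration of this behaviour, the sketch below compares support sizes under Top-p alone and under the intersection-form hybrid as temperature rises. It is a toy setup, not the cited sampler: the random logits, `K = 20`, `P = 0.9`, and the helper names are assumptions, and probabilities are computed from temperature-scaled logits before masking.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def support_sizes(logits, K, P, temperature):
    """Return (Top-p support size, hybrid support size) at a given temperature."""
    p = np.sort(softmax(logits / temperature))[::-1]
    m_p = min(int(np.searchsorted(np.cumsum(p), P)) + 1, len(p))
    return m_p, min(K, m_p)

def sample_hybrid(logits, K, P, temperature, rng):
    """Draw one token index from the renormalised hybrid support."""
    p = softmax(logits / temperature)
    order = np.argsort(-p, kind="stable")
    _, m = support_sizes(logits, K, P, temperature)
    keep = order[:m]
    q = p[keep] / p[keep].sum()   # renormalise on the support
    return int(rng.choice(keep, p=q))

rng = np.random.default_rng(0)
logits = rng.standard_normal(1000) * 3.0
sizes = {t: support_sizes(logits, K=20, P=0.9, temperature=t)
         for t in (0.5, 1.0, 2.0, 4.0)}
token = sample_hybrid(logits, K=20, P=0.9, temperature=2.0, rng=rng)
for t, (top_p_size, hybrid_size) in sizes.items():
    print(f"T={t}: top-p keeps {top_p_size}, hybrid keeps {hybrid_size}")
```

The Top-p support can only grow as the temperature flattens the distribution, whereas the hybrid support stays bounded by the cap.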
5. Training and Fine-Tuning Protocols
For block-sparse attention, direct fine-tuning on sparse masks using the base diffusion objective can degrade model quality, especially when data and mask distributions are mismatched (Zhang et al., 13 Feb 2026). SpargeAttention2 addresses this via velocity distillation: a teacher–student framework where the student (sparse attention) matches the full-attention teacher's velocity field predictions, using only noisy inputs from the fine-tuning set. No standard MSE on sample reconstructions is minimised; rather, the student is optimised to reproduce the teacher's flow-matching vector field. This approach stabilises training, ensuring that extreme sparsity (e.g., 95%) does not degrade output quality.
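Assuming the mismatch between the two velocity fields is measured by a squared error (the description above does not specify the distance), the distillation objective can be written schematically as below. The symbols are introduced here for illustration: $v_\theta^{\text{sparse}}$ is the student using hybrid-masked attention, $v^{\text{full}}$ is the frozen full-attention teacher, $x_t$ is a noised input from the fine-tuning set at time $t$, and $c$ is the conditioning signal.

```latex
\mathcal{L}_{\text{distill}}(\theta)
  = \mathbb{E}_{x_t,\, t,\, c}
    \left\| v_\theta^{\text{sparse}}(x_t, t, c) - v^{\text{full}}(x_t, t, c) \right\|_2^2
```

The regression target is the teacher's prediction rather than a velocity derived from the data, which is what separates this objective from fine-tuning on the base diffusion loss.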
6. Complexity and Practical Trade-Offs
The computational complexity of hybrid Top-k/Top-p masking is dominated by the sort operation ($O(n \log n)$ for support selection over $n$ elements). The additional steps, cumulative sum and per-support renormalised Softmax, incur $O(n)$ and $O(m)$ cost, respectively, where $m$ is the support size. For practical efficiency, a $K$-heap can be maintained on $p$ to achieve $O(n \log K)$ selection time in typical settings (Ji et al., 20 Feb 2026).
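Because the intersection-form support never exceeds the Top-k cap, only the largest entries up to that cap need to be ordered. A minimal sketch using Python's standard `heapq.nlargest`, which maintains a bounded heap instead of sorting the full vector, is shown below. The function name `hybrid_support_heap` is hypothetical, and this is one possible realisation rather than the cited implementation.

```python
import heapq
import numpy as np

def hybrid_support_heap(p, K, P):
    """Indices of the intersection-form support, found without a full sort."""
    # (index, probability) pairs for the K most probable entries, descending.
    top = heapq.nlargest(K, enumerate(p), key=lambda pair: pair[1])
    support, csum = [], 0.0
    for index, prob in top:
        support.append(index)
        csum += prob
        if csum >= P:   # the Top-p prefix ends inside the top K
            break
    return support      # length is min(K, m_P)

p = np.array([0.05, 0.4, 0.1, 0.3, 0.15])
print(hybrid_support_heap(p, K=3, P=0.6))   # [1, 3]  (mass threshold binds)
print(hybrid_support_heap(p, K=2, P=0.99))  # [1, 3]  (cap K binds)
```

For batched tensors, a partial-selection primitive such as `np.argpartition` would play the same role; the heap version is shown because it mirrors the description above most directly.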
Hybrid masking introduces a two-dimensional parameter space $(K, P)$, with two regimes depending on the Top-p prefix length $m_P$:
- Model is “peaky”: $m_P \le K$; support is small, quality resembles Top-p, and tail risk is low.
- Model is flat: $m_P > K$; support is capped by $K$, constraining computational cost and diversity (like Top-k).
Empirically, well-chosen $(K, P)$ configurations provide high-quality generations, often outperforming either constraint alone. Tuning $(K, P)$ offers control over the diversity–faithfulness spectrum, enabling robust adaptation to model uncertainty as measured by local confidence (Ji et al., 20 Feb 2026, Zhang et al., 13 Feb 2026).
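Which of the two constraints is active at a given step can be checked directly from the sorted probabilities. The sketch below is illustrative: `binding_constraint` is a hypothetical helper, and the example distributions and the values of `K` and `P` are arbitrary rather than recommended settings.

```python
import numpy as np

def binding_constraint(p, K, P):
    """Return (support size, name of the constraint that determines it)."""
    p_sorted = np.sort(np.asarray(p, dtype=float))[::-1]
    csum = np.cumsum(p_sorted)
    m_p = min(int(np.searchsorted(csum, P)) + 1, len(p_sorted))
    if m_p <= K:
        return m_p, "top_p"   # peaky: the mass threshold is met within K elements
    return K, "top_k"         # flat: the cap K cuts the Top-p prefix short

peaky = [0.7, 0.2, 0.05, 0.03, 0.02]
flat = [0.2] * 5
print(binding_constraint(peaky, K=3, P=0.85))  # (2, 'top_p')
print(binding_constraint(flat, K=3, P=0.85))   # (3, 'top_k')
```

Logging this label during generation is one way to see how often each regime occurs for a given model and setting.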
7. Empirical Results
On Wan2.1 video diffusion models, SpargeAttention2 with hybrid masking achieves:
- 95% attention sparsity,
- Attention latency speedup of 16.2× (e.g., 97 s to 6 s on a 1.3B model),
- End-to-end video generation speedup: 2.3× for 1.3B, 4.7× for 14B,
- Generation metrics (IQ, OC, AQ, VR, VQA-a, VQA-t) equal to or exceeding the full-attention baseline and outperforming prior sparse methods (SpargeAttention, VSA, VMoBA, SLA).
Example results:
| Model | Sparsity | Full-attention IQ | Hybrid IQ | Full-attention OC | Hybrid OC |
|---|---|---|---|---|---|
| 1.3B @ 480p | 95% | 63.7 | 67.7 | 20.3 | 21.6 |
| 14B @ 720p | 95% | 68.0 | 69.1 | 22.4 | 21.6 |
Qualitative outputs indicate superior text–video alignment and temporal coherence under extreme sparsity (Zhang et al., 13 Feb 2026). In decoding, hybrid masking adapts dynamically, maintaining quality in both high- and low-confidence regimes (Ji et al., 20 Feb 2026).