Hybrid Top-k/Top-p Masking

Updated 7 March 2026

Hybrid Top-k/Top-p masking is a method that combines Top-k and Top-p procedures to select elements based on both fixed count and cumulative probability, promoting sparsity and adaptive control.
The algorithm efficiently computes a minimal support set through sorting, cumulative summation, and renormalized softmax, making it practical for block-sparse attention and adaptive decoding.
Applied in models like SpargeAttention2 and language decoders, hybrid masking improves diversity, accelerates attention computation, and maintains quality under extreme sparsity conditions.

Hybrid Top-k/Top-p Masking defines a class of algorithms that combine two common support-selection procedures, Top-k and Top-p (nucleus), to promote sparsity in neural attention and sampling. By pairing an element-count constraint (Top-k) with a cumulative-mass threshold (Top-p), hybrid masking imposes stricter or more adaptive constraints on the selection of tokens, attention blocks, or vocabulary elements, controlling diversity and precision in regimes where pure Top-k or pure Top-p fails. The construct serves as a principled regularisation and efficiency tool, enabling both block-sparse attention in large diffusion models and robust, adaptive decoding in sequence generation tasks (Zhang et al., 13 Feb 2026, Ji et al., 20 Feb 2026).

1. Formal Definition and Construction

Hybrid Top-k/Top-p masking selects elements from a probability vector or attention score row by the union or intersection of two sets:

The $k\%$ elements (or blocks) with highest scores (Top-k)
The minimal prefix of elements with cumulative probability at least $p\%$ (Top-p)

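Let $p \in \mathbb{R}^n$ be a probability vector and $s \in \mathbb{R}^n$ the corresponding scores. The Top-k set contains the indices of the $k$ largest entries; the Top-p set is the shortest prefix of the descending-sorted entries whose cumulative mass reaches the Top-p threshold. Because both sets are prefixes of the same sorted order, their union is the longer prefix and their intersection the shorter one. The hybrid mask with union semantics is: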

$\mathcal{S}_{\rm hybrid} = \mathcal{K} \cup \mathcal{P}$

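where $\mathcal{K}$ indexes the Top-k elements, and $\mathcal{P}$ indexes the minimal Top-p prefix. The intersection form (used in LLM decoding (Ji et al., 20 Feb 2026)) restricts the support to at most $K$ elements taken from the prefix that covers at least $\tau$ of the mass, where $\tau \in (0, 1]$ denotes the Top-p threshold (written $\tau$ to keep it distinct from the vector $p$), i.e.,

$\mathcal{S}_{\rm hybrid} = \{\sigma(1), \dots, \sigma(\min(K, m))\}$

where $m$ is minimal such that $\sum_{i=1}^{m} p_{\sigma(i)} \ge \tau$, and $\sigma$ sorts $p$ in descending order.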
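The two semantics can be illustrated with a minimal Python sketch. The function names (`top_p_prefix_len`, `hybrid_support`) and the `mode` argument are hypothetical, chosen for this example and not taken from either cited paper. The sketch relies on the fact that the Top-k set and the Top-p set are both prefixes of the same descending order, so the union has size max(k, m) and the intersection has size min(k, m), where m is the Top-p prefix length.

```python
def top_p_prefix_len(probs, order, tau):
    """Smallest m such that the m largest probabilities sum to at least tau."""
    total = 0.0
    for m, idx in enumerate(order, start=1):
        total += probs[idx]
        if total >= tau:
            return m
    return len(order)  # tau not reached (e.g. rounding): keep everything


def hybrid_support(probs, k, tau, mode="union"):
    """Indices kept by the hybrid mask, in descending order of probability.

    mode="union"        -> Top-k set united with the Top-p prefix
    mode="intersection" -> at most k elements of the Top-p prefix
    """
    # Stable sort, so ties keep their original index order.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    m = top_p_prefix_len(probs, order, tau)
    size = max(k, m) if mode == "union" else min(k, m)
    return order[:size]
```

For probabilities (0.5, 0.3, 0.15, 0.05) with k = 2 and a mass threshold of 0.9, the Top-p prefix has length 3, so the union keeps indices [0, 1, 2] while the intersection keeps [0, 1].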

In block-sparse attention (e.g., SpargeAttention2), the hybrid mask at row $i$ of the pooled attention matrix $P$ is:

$M_i = \mathrm{TopK}(P_{i,:}, k\%) \cup \mathrm{TopP}(P_{i,:}, p\%)$

with the Top-k set and the Top-p set both computed independently for each row (Zhang et al., 13 Feb 2026).
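A row-wise union mask of this kind could be sketched as follows in plain Python. This is an illustration, not the SpargeAttention2 kernel: the names `row_keep_mask` and `hybrid_block_mask` are hypothetical, each row of the pooled matrix is assumed to be normalised to sum to 1, and rounding the k% budget up to a whole number of blocks is an assumption of this sketch.

```python
import math


def row_keep_mask(row, k_frac, p_frac):
    """Boolean keep-mask for one row of a pooled (block-level) attention matrix.

    row    : non-negative block weights, assumed to sum to 1
    k_frac : fraction of blocks kept by the Top-k rule (e.g. 0.05 for 5%)
    p_frac : cumulative mass required by the Top-p rule (e.g. 0.5 for 50%)
    """
    n = len(row)
    order = sorted(range(n), key=lambda j: -row[j])  # descending, stable on ties
    k_count = min(n, math.ceil(k_frac * n))          # assumption: round up
    cumulative, m = 0.0, n
    for rank, j in enumerate(order, start=1):
        cumulative += row[j]
        if cumulative >= p_frac:
            m = rank
            break
    keep = [False] * n
    for j in order[:max(k_count, m)]:                # union = longer prefix
        keep[j] = True
    return keep


def hybrid_block_mask(pooled, k_frac, p_frac):
    """Apply the row-wise hybrid mask to every row of the pooled matrix."""
    return [row_keep_mask(row, k_frac, p_frac) for row in pooled]
```

On a concentrated row such as (0.7, 0.1, 0.1, 0.1) with a 25% block budget and a 50% mass target, one block is kept; on a uniform row the Top-p rule extends the selection to two blocks.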

2. Algorithmic Implementation

The hybrid masking procedure consists of the following steps (notation from (Zhang et al., 13 Feb 2026, Ji et al., 20 Feb 2026)):

Compute Softmax probabilities or attention weights $p = \mathrm{softmax}(s)$ for the current row or token logits.
Sort $p$ in descending order with permutation $\sigma$; accumulate cumulative sums $c_j = \sum_{i=1}^{j} p_{\sigma(i)}$ to determine the minimal prefix for the Top-p mass.
Derive $m = \min\{j : c_j \ge \tau\}$, where $\tau$ is the Top-p threshold, and set the support size as $r = \min(K, m)$, where $K$ is the Top-k cap (the union form uses $\max$ in place of $\min$).
Select the support set $\mathcal{S} = \{\sigma(1), \dots, \sigma(r)\}$.
Assign zero probability to all elements outside the support; inside, compute entropic weights via normalised exponentiation.

In decoding (Ji et al., 20 Feb 2026), these steps run on the token logits at every generation step, and the next token is sampled from the renormalised distribution over the selected support.
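The steps above can be sketched for decoding in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `temperature` argument, and the guard that keeps at least one token are all choices made for this example. The support size follows the intersection rule (at most `top_k` tokens, stopping earlier once `top_p` mass is covered).

```python
import math
import random


def hybrid_decode_distribution(logits, top_k, top_p, temperature=1.0):
    """Return {token_index: probability} after hybrid Top-k/Top-p masking."""
    # Step 1: softmax over temperature-scaled logits (max-subtracted for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    z = sum(weights)
    probs = [w / z for w in weights]

    # Step 2: sort descending and find the minimal Top-p prefix length m.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    cumulative, m = 0.0, len(order)
    for rank, idx in enumerate(order, start=1):
        cumulative += probs[idx]
        if cumulative >= top_p:
            m = rank
            break

    # Steps 3-4: support size min(K, m); keep at least one token (assumption).
    support = order[:max(1, min(top_k, m))]

    # Step 5: zero outside the support, renormalise inside it.
    mass = sum(probs[i] for i in support)
    return {i: probs[i] / mass for i in support}


def sample_token(logits, top_k, top_p, temperature=1.0, rng=random):
    """Draw one token index from the masked, renormalised distribution."""
    dist = hybrid_decode_distribution(logits, top_k, top_p, temperature)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Renormalising the retained probabilities is equivalent to applying a softmax restricted to the support, which is why no second exponentiation is needed here.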

For block-sparse attention, the hybrid mask computation is embedded in CUDA kernels that fuse masking and sparse softmax computation, skipping all masked-out blocks and eliminating unnecessary memory and compute (Zhang et al., 13 Feb 2026).

3. Theoretical Motivation and Failure Modes

Hybrid Top-k/Top-p masking is motivated by the complementary failure modes of pure Top-k and Top-p selection (Zhang et al., 13 Feb 2026):

Uniform distributions: If attention or output probabilities are flat, Top-k drops a large number of plausibly useful items, since mass is dispersed; Top-p includes most tokens, reducing sparsity.
Peaky (skewed) distributions: If a few elements dominate, Top-p often collapses to those "sinks," discarding secondary yet still important mass; Top-k enforces a minimal set size, recovering diversity.

The hybrid union avoids both pitfalls: the Top-k component prevents undercoverage when mass is concentrated, while the Top-p component ensures sufficient mass is retained when needed. Analysis of the resulting approximation error in attention summaries indicates that the hybrid selection yields lower error in both distributional regimes (Zhang et al., 13 Feb 2026).
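The two failure modes can be made concrete with a small Python sketch that reports, for each rule, how many elements are kept and how much probability mass they carry. The helper `selection_report` and the toy distributions are hypothetical and serve only as an illustration.

```python
def selection_report(probs, k, tau):
    """Return {rule: (support_size, retained_mass)} for Top-k, Top-p and their union."""
    ordered = sorted(probs, reverse=True)
    cumulative, m = 0.0, len(ordered)
    for rank, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative >= tau:
            m = rank
            break
    sizes = {"top_k": min(k, len(ordered)), "top_p": m}
    sizes["union"] = max(sizes["top_k"], sizes["top_p"])
    return {name: (size, sum(ordered[:size])) for name, size in sizes.items()}


flat = [0.1] * 10                  # mass spread evenly over 10 elements
peaky = [0.97, 0.01, 0.01, 0.01]   # one dominant "sink" element

flat_report = selection_report(flat, k=2, tau=0.85)
peaky_report = selection_report(peaky, k=2, tau=0.85)
```

On the flat distribution, Top-k alone keeps 2 of 10 elements (about 20% of the mass) whereas the union keeps 9; on the peaky one, Top-p alone collapses to the single dominant element whereas the union keeps 2.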

4. Applications in Attention and Decoding

Block-Sparse Attention

Hybrid Top-k/Top-p masking is a central component of SpargeAttention2 for video diffusion transformers. By combining both constraints, the hybrid mask enables extremely high sparsity (e.g., 95%) in the attention map while maintaining score coverage required for high-fidelity generation. The corresponding CUDA implementation, built upon FlashAttention2, ensures that masked blocks are skipped in both forward and backward passes, yielding significant latency and memory reductions (Zhang et al., 13 Feb 2026).

LLM Decoding

In decoding, the hybrid mask is used to construct an adaptive sampler that interpolates between Top-k and Top-p regimes. Explicitly, the hybrid sampler:

Never selects more than $K$ tokens per step, where $K$ is the Top-k cap,
Retains at least $\tau$ cumulative probability mass, where $\tau$ is the Top-p threshold, whenever the cap $K$ is not binding.

This dual constraint enables fine-grained control of the diversity–precision trade-off in generation, preventing runaway support growth at high temperature (as occurs with Top-p alone) and preserving tail mass when confidence is low (mitigating Top-k narrowness) (Ji et al., 20 Feb 2026).
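The effect of temperature on support size can be checked with a short Python sketch. The helper `support_sizes` and the toy logits are hypothetical; the hybrid size follows the capped (intersection) rule described above.

```python
import math


def support_sizes(logits, top_k, top_p, temperature):
    """Return (top_p_size, hybrid_size) for temperature-scaled logits."""
    peak = max(logits)
    weights = sorted(
        (math.exp((x - peak) / temperature) for x in logits), reverse=True
    )
    total = sum(weights)
    cumulative, m = 0.0, len(weights)
    for rank, w in enumerate(weights, start=1):
        cumulative += w / total
        if cumulative >= top_p:
            m = rank
            break
    return m, min(top_k, m)


# Toy logits decreasing by 1 per token over a 50-token vocabulary.
toy_logits = [float(-i) for i in range(50)]
low_temperature = support_sizes(toy_logits, top_k=5, top_p=0.9, temperature=0.5)
high_temperature = support_sizes(toy_logits, top_k=5, top_p=0.9, temperature=5.0)
```

In this toy setting the pure Top-p support grows from 2 tokens at temperature 0.5 to 12 tokens at temperature 5.0, while the hybrid support stays at 2 and then stops at the cap of 5.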

5. Training and Fine-Tuning Protocols

For block-sparse attention, direct fine-tuning on sparse masks using the base diffusion objective can degrade model quality, especially when data and mask distributions are mismatched (Zhang et al., 13 Feb 2026). SpargeAttention2 addresses this via velocity distillation: a teacher–student framework where the student (sparse attention) matches the full-attention teacher's velocity field predictions, using only noisy inputs from the fine-tuning set. No standard MSE on sample reconstructions is minimised; rather, the student is optimised to reproduce the teacher's flow-matching vector field. This approach stabilises training, ensuring that extreme sparsity (e.g., 95%) does not degrade output quality.
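A velocity-matching objective of this kind could be sketched as below. The source states only that the student reproduces the teacher's velocity field; the squared-error distance, the flat-list representation of a velocity prediction, and the names `velocity_distillation_loss` and `distillation_step_loss` are assumptions made for illustration.

```python
def velocity_distillation_loss(student_velocity, teacher_velocity):
    """Mean squared difference between two flattened velocity predictions.

    The squared-error form is an assumption of this sketch.
    """
    if len(student_velocity) != len(teacher_velocity):
        raise ValueError("velocity predictions must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_velocity, teacher_velocity)]
    return sum(squared) / len(squared)


def distillation_step_loss(student, teacher, noisy_input, timestep):
    """Loss for one noisy input.

    student, teacher: callables (noisy_input, timestep) -> flat list of floats.
    The teacher uses full attention and is treated as a fixed target; the
    student uses sparse (hybrid-masked) attention. No clean-sample
    reconstruction term appears in the loss.
    """
    target = teacher(noisy_input, timestep)
    prediction = student(noisy_input, timestep)
    return velocity_distillation_loss(prediction, target)
```

Only the noisy input and timestep enter the computation, which matches the description that no reconstruction target from the fine-tuning data is required.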

6. Complexity and Practical Trade-Offs

The computational complexity of hybrid Top-k/Top-p masking is dominated by the sort operation ($O(n \log n)$ for support selection). The additional steps, cumulative sum and per-support renormalised Softmax, incur $O(n)$ and $O(|\mathcal{S}_{\rm hybrid}|)$ cost, respectively. For practical efficiency, a $K$-heap can be maintained on $p$ to achieve $O(n \log K)$ selection time in typical settings (Ji et al., 20 Feb 2026).
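The heap-based shortcut can be sketched with Python's standard `heapq` module. Because the capped (intersection) support never exceeds K elements, only the K largest probabilities need to be found and scanned. The function name `capped_hybrid_support` is hypothetical, and `probs` is assumed to be a list that already sums to 1.

```python
import heapq


def capped_hybrid_support(probs, top_k, top_p):
    """Indices of the capped hybrid support, found without a full sort.

    heapq.nlargest keeps a heap of at most top_k items, giving O(n log K)
    selection, and returns the candidates in descending order.
    """
    candidates = heapq.nlargest(top_k, range(len(probs)), key=probs.__getitem__)
    cumulative = 0.0
    for rank, idx in enumerate(candidates, start=1):
        cumulative += probs[idx]
        if cumulative >= top_p:
            return candidates[:rank]  # Top-p prefix is shorter than the cap
    return candidates                 # cap reached before top_p mass was covered
```

For probabilities [0.05, 0.5, 0.15, 0.3] with a cap of 3, a mass threshold of 0.7 returns [1, 3], while a threshold of 0.99 cannot be met within the cap and returns all three candidates [1, 3, 2].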

Hybrid masking introduces a two-dimensional parameter space $(K, \tau)$, where $K$ is the Top-k cap and $\tau$ the Top-p threshold. Writing $m$ for the length of the minimal Top-p prefix, two regimes arise:

Model is “peaky”: $m \le K$; support is small, quality resembles Top-p, and tail risk is low.
Model is flat: $m > K$; support is capped by $K$, constraining computational cost and diversity (like Top-k).

Empirically, jointly tuned configurations of the Top-k cap $K$ and the Top-p threshold $\tau$ provide high-quality generations, often outperforming either constraint alone. Tuning $(K, \tau)$ offers control over the diversity–faithfulness spectrum, enabling robust adaptation to model uncertainty as measured by local confidence (Ji et al., 20 Feb 2026, Zhang et al., 13 Feb 2026).

7. Empirical Results

On Wan2.1 video diffusion models, SpargeAttention2 with hybrid masking achieves:

95% attention sparsity,
Attention latency speedup of 16.2× (e.g., 97 s to 6 s on a 1.3B model),
End-to-end video generation speedup: 2.3× for 1.3B, 4.7× for 14B,
Generation metrics (IQ, OC, AQ, VR, VQA-a, VQA-t) that match or exceed the full-attention baseline in most cases (the 14B OC score in the table below is a slight exception) and outperform prior sparse methods (SpargeAttention, VSA, VMoBA, SLA).

Example results:

Model	Sparsity	IQ (full attention)	IQ (hybrid)	OC (full attention)	OC (hybrid)
1.3B @ 480p	95%	63.7	67.7	20.3	21.6
14B @ 720p	95%	68.0	69.1	22.4	21.6

Qualitative outputs indicate superior text–video alignment and temporal coherence under extreme sparsity (Zhang et al., 13 Feb 2026). In decoding, hybrid masking adapts dynamically, maintaining quality in both high- and low-confidence regimes (Ji et al., 20 Feb 2026).

References

Zhang et al. (2026). SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning.

Ji et al. (2026). Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers.
