Multi-Block Decoding with Rejection Recycling
- The paper introduces a multi-block decoding method that leverages rejection recycling to significantly increase tokens accepted per iteration compared to traditional autoregressive and vanilla Jacobi approaches.
- The method maintains key-value cache efficiency and uses parallel decoding under causal attention, making it compatible with modern GPU hardware and exploiting hardware-level parallelism.
- Empirical evaluations report up to a 4× improvement in token acceptance and near 4× end-to-end speedup, demonstrating practical gains in efficiency and latency reduction.
Multi-block decoding with rejection recycling is an advanced methodology for accelerating inference in transformer-based LLMs, particularly those trained via Jacobi Forcing. This approach extends the Jacobi fixed-point paradigm by leveraging several blocks of parallel decoding with token-level recycling of rejected suffixes, thus significantly increasing the number of tokens accepted per iteration and reducing wall-clock latency relative to standard autoregressive (AR) or vanilla Jacobi schemes. The method is compatible with causal attention, preserves key-value (KV) cache efficiency, and is specifically tuned to exploit hardware-level parallelism in modern GPUs (Hu et al., 16 Dec 2025).
1. Formal Definition and Mechanism
Let $L$ be the total number of tokens to generate given a prompt. Choose a block size $n$ and a maximum number of in-flight blocks $K$. At any iteration:
- Maintain a set of blocks $b = 1, \dots, K$, where each block stores:
  - $q_b$: the current draft (proposed tokens),
  - $a_b$: the accepted prefix.
- RA: the index of the real-active block; all others are pseudo-active.
- In each step:
- All blocks are packed into a batch under causal attention.
- A single forward pass produces logits for each position in every block.
- For each block, the draft is "verified" using greedy decoding; the longest prefix matching the true AR trajectory is accepted.
- For the real-active (RA) block, this accepted prefix is committed to the global KV-cache.
- Rejected suffixes of RA are recycled into a candidate pool for subsequent proposals.
- Pseudo-active blocks that reach a progress (spawn) threshold can trigger the creation of new pseudo-active blocks.
- Blocks may be promoted to RA when RA's accepted prefix is full ($|a_{\mathrm{RA}}| = n$).
Termination occurs when the RA block accepts an <eos> token or all blocks reach full acceptance.
This approach allows multi-block parallelism under strict causal constraints, reusing rejected token sequences to maximize acceptance per iteration (Hu et al., 16 Dec 2025).
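A minimal sketch of the per-block state described above; the field and method names here are our own, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class BlockState:
    # q_b: current draft of proposed tokens (length <= n)
    draft: list = field(default_factory=list)
    # a_b: prefix already verified against the AR trajectory
    accepted: list = field(default_factory=list)
    # True only for the real-active (RA) block, whose accepted prefix
    # is committed to the global KV-cache; pseudo-active blocks are False
    is_real_active: bool = False

    def is_full(self, n: int) -> bool:
        # A block is complete when its accepted prefix spans the block size
        return len(self.accepted) >= n

ra = BlockState(draft=[7, 3, 9, 1], is_real_active=True)
ra.accepted.extend([7, 3])   # two tokens verified this iteration
```

Promotion to RA then amounts to flipping `is_real_active` on a pseudo-active block once the current RA block's `is_full` check passes.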
2. Pseudocode for Rejection Recycling
A condensed form of the algorithm is as follows:
```
Algorithm MultiBlock+RejectionRecycling
Input:  p_θ, block size n, max blocks K, spawn ratio r, max iters T_max
Init:   RA ← 1
        for b = 1…K:
            q_b ← random tokens of length n if b == RA else []
            a_b ← []
        N ← ∅,  s ← ⌈r·n⌉
for t = 1…T_max do
    batch  ← [prompt; a_RA; q_RA] ∥ for each b ≠ RA: (a_b; q_b)
    logits ← p_θ.forward(batch)
    for each b:
        g_b ← GreedyDecode(logits for q_b)
        m_b ← LongestPrefixMatch(g_b, AR_reference)
        a_b ← a_b ∥ g_b[1:m_b]
        if b == RA:
            reject_suffix ← g_b[m_b+1:]
            if reject_suffix ≠ []: N.insert(reject_suffix)
    CommitToKV(a_RA)
    if <eos> ∈ a_RA: return GeneratedSequence
    for each b:
        if b == RA: q_RA ← RefillDraftFromPoolOrRandom(N, n − |a_RA|)
        else:       q_b ← pad(a_b) to n or []
    if |a_RA| ≥ s and #blocks < K: spawn new pseudo-active block from RA
    if |a_RA| ≥ n: promote a pseudo-active block to RA
end for
return fallback AR decoding
```
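The pseudocode above can be exercised with a toy, single-block Python sketch. A deterministic `next_token` function stands in for the model's greedy argmax, and the function names and zero-token refill are illustrative only, not the paper's implementation:

```python
from collections import deque

def next_token(prefix):
    # Stand-in for argmax over p_theta: any deterministic function of
    # the prefix suffices to illustrate the control flow.
    return (len(prefix) * 17 + sum(prefix) * 31 + 7) % 50

def jacobi_step(committed, draft):
    """One 'parallel pass': greedy output at every draft position at once."""
    return [next_token(committed + draft[:i]) for i in range(len(draft))]

def decode(prompt, total_len, n=8, max_iters=100):
    committed = list(prompt)
    pool = deque()                    # FIFO pool of rejected suffixes
    draft = [0] * n                   # arbitrary initial draft
    iters = 0
    while len(committed) - len(prompt) < total_len and iters < max_iters:
        iters += 1
        greedy = jacobi_step(committed, draft)
        # Longest draft prefix matching the greedy outputs; the token at
        # the first mismatch is itself AR-correct, so commit it as well.
        m = 0
        while m < len(draft) and draft[m] == greedy[m]:
            m += 1
        committed += greedy[: min(m + 1, len(draft))]
        rejected = draft[m + 1:]
        if rejected:
            pool.append(rejected)     # recycle instead of discarding
        # Refill the draft from the recycled pool when possible, else zeros
        refill = list(pool.popleft()) if pool else []
        draft = (refill + [0] * n)[:n]
    return committed[len(prompt):], iters
```

Because every committed token equals the AR-greedy token at that position, the output matches ordinary greedy decoding exactly, while recycled suffixes that happen to match later save forward passes.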
3. Key Formulas and Theoretical Analysis
At the heart of multi-block decoding with rejection recycling are several operational and theoretical formulas:
- Jacobi fixed-point update for block $b$ at iteration $j$:

$$y_{b,i}^{(j+1)} = \arg\max_{y} \; p_\theta\!\left(y \,\middle|\, \left[\text{prompt};\ a_b^{(j)};\ q_{b,<i}^{(j)}\right]\right)$$

for $i = 1, \dots, n$.
- Greedy verification: accept the longest draft prefix agreeing with the greedy outputs,

$$m_b = \max\left\{\, m \le n : y_{b,1:m}^{(j+1)} = \hat{y}_{b,1:m} \,\right\},$$

where $\hat{y}_b$ is the AR-greedy continuation.
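The prefix-match step can be written directly; this helper mirrors the `LongestPrefixMatch` call in the pseudocode:

```python
def longest_prefix_match(draft, greedy):
    """m_b: length of the longest prefix of the draft that agrees
    with the greedy outputs from the same forward pass."""
    m = 0
    while m < min(len(draft), len(greedy)) and draft[m] == greedy[m]:
        m += 1
    return m
```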
- Accepted-token count per iteration, summed over all in-flight blocks: $A^{(t)} = \sum_{b=1}^{K} m_b^{(t)}$.
- Expected tokens-per-forward (TPF): $\mathrm{TPF} = \frac{L}{N_{\text{iter}}}$, the mean number of tokens accepted per forward pass.
- End-to-end wall-clock speedup over AR decoding (which needs one forward pass per token, assuming comparable per-pass cost):

$$\text{Speedup} \approx \frac{L}{N_{\text{iter}}},$$

where $L$ is the sequence length and $N_{\text{iter}}$ is the number of iterations used by the method.
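As a sanity check on these formulas, plugging in the paper's headline TPF (the sequence length $L = 512$ is our illustrative choice):

```python
L = 512                  # tokens to generate (illustrative value)
TPF = 4.09               # reported tokens accepted per forward pass
ar_passes = L            # AR baseline: one forward pass per token
jf_passes = L / TPF      # multi-block decoding with recycling
ideal_speedup = ar_passes / jf_passes
# With equal per-pass cost, the ideal speedup equals the TPF
assert abs(ideal_speedup - TPF) < 1e-6
```

The measured end-to-end speedups (3.95–3.97×) landing just below the TPF of 4.09 is consistent with a multi-block pass being slightly more expensive than a single AR pass.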
For typical settings (block sizes on the order of $n = 64$–$128$), with empirical acceptance probabilities up to $0.07$, a factor 4–5 improvement in token acceptance is observed compared to vanilla Jacobi, until limited by GPU parallelism constraints (Hu et al., 16 Dec 2025).
4. Compute–Latency Trade-offs and Hardware Considerations
Multi-block decoding with rejection recycling intentionally trades additional FLOPs (spent verifying and managing multiple concurrent and recycled proposals) for lower token-level latency. Each pass processes the $K \cdot n$ proposal tokens plus up to $|N|$ candidates from the rejection pool. This extra computation is efficient and justifiable as long as the GPU has excess FLOPs budget below its "roofline knee": about 256 parallel tokens processed on H200/B200 GPUs and 128 on an A100.
As the number of in-flight blocks $K$ increases, the per-iteration accepted-token count improves super-linearly up to hardware limits, after which marginal speedup is diminished by roofline constraints. For typical Jacobi Forcing models under standard hyperparameters, this approach achieves a practical tokens-per-forward (TPF) increase from 1.00 (AR) to 4.09 (multi-block with recycling), with measured end-to-end speedups close to 4.0× (Hu et al., 16 Dec 2025).
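A back-of-the-envelope budget check makes the roofline constraint concrete; the specific parameter values below are illustrative, only the 256-token knee comes from the text:

```python
def tokens_per_pass(n, K, pool_budget):
    # Tokens entering one forward pass: K in-flight blocks of up to
    # n draft tokens each, plus recycled candidates batch-verified
    # from the rejection pool.
    return K * n + pool_budget

KNEE = 256   # approximate roofline knee cited for H200/B200 GPUs

# Within budget: extra parallel tokens are nearly free
assert tokens_per_pass(n=64, K=3, pool_budget=32) <= KNEE
# Over-saturated: latency per pass starts growing with token count
assert tokens_per_pass(n=64, K=4, pool_budget=32) > KNEE
```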
5. Empirical Performance and Comparisons
Evaluations on HumanEval and math benchmarks validate the effectiveness of multi-block decoding with rejection recycling:
| Setting | AR | Jacobi (vanilla) | CLLM | Multi-block + Recycling (JFMR) |
|---|---|---|---|---|
| TPF (HumanEval) | 1.0 | 1.03 | 2.7 | 4.09 |
| Speedup (A100/B200) | 1.0 | 1.03 | 2.5 | 3.95–3.97 |
| pass@1 (HumanEval) | 87.8 | not stated | – | 83.5 |
| Math solve rate (GSM8K) | – | – | – | >91% |
| Speedup (math, Qwen2.5) | – | – | – | 3.7–3.8 |
In all tested scenarios, multi-block decoding with rejection recycling exceeds the throughput of both vanilla Jacobi and state-of-the-art diffusion LLMs (≤2.5× TPF, ≤1.8× speedup), while retaining a high fraction of the AR baseline's accuracy (Hu et al., 16 Dec 2025).
6. Hyperparameter Tuning
The efficiency of the method depends critically on several hyperparameters:
- $n$: Block size; optimal near the hardware throughput knee (64 for H200/B200, 128 for A100).
- $K$: Number of in-flight blocks; a small $K$ is generally optimal, and larger values yield diminishing returns.
- $r$: Spawn threshold, expressed as a ratio of $n$ (a new block may spawn once $|a_{\mathrm{RA}}| \geq \lceil r \cdot n \rceil$).
- Maximum candidate verification size: balances output quality against compute.
- $|N|$: Candidate pool size; should be bounded to fit GPU capacity.
Careful selection and dynamic adaptation of these parameters maximize throughput while avoiding inefficiencies or hardware over-saturation (Hu et al., 16 Dec 2025).
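The spawn threshold from the pseudocode's initialization, $s = \lceil r \cdot n \rceil$, is simple to compute; the $r$ values below are illustrative, not the paper's tuned setting:

```python
import math

def spawn_threshold(n: int, r: float) -> int:
    # s = ceil(r * n): RA-block progress at which a new
    # pseudo-active block may be spawned
    return math.ceil(r * n)
```

For example, `spawn_threshold(64, 0.5)` yields 32: with a 64-token block and a spawn ratio of 0.5, a new pseudo-active block may be created once the RA block has accepted 32 tokens.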
7. Implementation Guidelines and Practical Considerations
Effective deployment of multi-block decoding with rejection recycling requires:
- Batching all blocks with a noise-aware causal mask in each pass (avoid bidirectional attention).
- Ensuring KV-cache reuse by keeping the causal mask unchanged for RA blocks.
- Maintaining a FIFO structure for the rejected suffix pool; batch-verify candidates efficiently.
- Spawning new pseudo-active blocks only when the RA block surpasses its progress threshold, minimizing block fragmentation.
- Promoting blocks to RA only upon full acceptance in the prior RA slot.
- Tuning the hyperparameters ($n$, $K$, $r$, and the pool size) in accordance with hardware roofline studies; saturate, but do not exceed, the available FLOPs capacity.
- Utilizing mixed-precision and fused-kernel implementations to reduce verification overhead.
- Instituting a fallback to greedy AR decoding in case the maximum iteration limit ($T_{\max}$) is reached, to avoid potential infinite loops.
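The FIFO suffix pool from the guidelines above can be sketched with a bounded deque; the class name, the eviction policy for a full pool, and the size bound are our assumptions, not specified by the paper:

```python
from collections import deque

class RejectPool:
    """FIFO pool of rejected suffixes, drained oldest-first for
    batched verification; bounded to stay within the GPU's spare
    token budget (oldest entries are evicted when full)."""

    def __init__(self, max_size: int = 32):   # bound is illustrative
        self.q = deque(maxlen=max_size)

    def push(self, suffix):
        if suffix:                            # drop empty suffixes
            self.q.append(list(suffix))

    def pop_batch(self, k: int):
        # Take up to k candidates for batch verification this pass
        batch = []
        while self.q and len(batch) < k:
            batch.append(self.q.popleft())
        return batch
```

A `deque(maxlen=...)` gives both O(1) FIFO operations and automatic eviction of the stalest suffixes, which are the least likely to still lie on the decoding trajectory.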
This suite of guidelines ensures both compatibility with the pretrained causal inference properties and practical speed and throughput benefits (Hu et al., 16 Dec 2025).