Multi-Block Decoding with Rejection Recycling
- The paper introduces a multi-block decoding method that leverages rejection recycling to significantly increase tokens accepted per iteration compared to traditional autoregressive and vanilla Jacobi approaches.
- The method maintains key-value cache efficiency and uses parallel decoding under causal attention, making it compatible with modern GPU hardware and exploiting hardware-level parallelism.
- Empirical evaluations report up to a 4× improvement in token acceptance and near 4× end-to-end speedup, demonstrating practical gains in efficiency and latency reduction.
Multi-block decoding with rejection recycling is an advanced methodology for accelerating inference in transformer-based LLMs, particularly those trained via Jacobi Forcing. This approach extends the Jacobi fixed-point paradigm by leveraging several blocks of parallel decoding with token-level recycling of rejected suffixes, thus significantly increasing the number of tokens accepted per iteration and reducing wall-clock latency relative to standard autoregressive (AR) or vanilla Jacobi schemes. The method is compatible with causal attention, preserves key-value (KV) cache efficiency, and is specifically tuned to exploit hardware-level parallelism in modern GPUs (Hu et al., 16 Dec 2025).
1. Formal Definition and Mechanism
Let $L$ be the total number of tokens to generate given a prompt. Choose a block size $n$ and a maximum number of in-flight blocks $K$. At any iteration:
- Maintain a set of blocks $b = 1, \dots, K$, where each block stores:
  - $q_b$: the current draft (proposed tokens),
  - $a_b$: the accepted prefix.
- RA: the index of the real-active block; all others are pseudo-active.
- In each step:
- All blocks are packed into a batch under causal attention.
- A single forward pass produces logits for each position in every block.
- For each block, the draft is "verified" using greedy decoding; the longest prefix matching the true AR trajectory is accepted.
- For the real-active (RA) block, this accepted prefix is committed to the global KV-cache.
- Rejected suffixes of RA are recycled into a candidate pool for subsequent proposals.
- Pseudo-active blocks that reach a progress (spawn) threshold can trigger the creation of new pseudo-active blocks.
- Blocks may be promoted to RA when RA's accepted prefix is full ($|a_{\mathrm{RA}}| = n$).
Termination occurs when the RA block accepts an <eos> token or all blocks reach full acceptance.
This approach allows multi-block parallelism under strict causal constraints, reusing rejected token sequences to maximize acceptance per iteration (Hu et al., 16 Dec 2025).
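A minimal sketch of the per-block state described above; the field and method names here are our own, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class BlockState:
    # q_b: current draft of proposed tokens (length <= n)
    draft: list = field(default_factory=list)
    # a_b: prefix already verified against the AR trajectory
    accepted: list = field(default_factory=list)
    # True only for the real-active (RA) block, whose accepted prefix
    # is committed to the global KV-cache; pseudo-active blocks are False
    is_real_active: bool = False

    def is_full(self, n: int) -> bool:
        # A block is complete when its accepted prefix spans the block size
        return len(self.accepted) >= n

ra = BlockState(draft=[7, 3, 9, 1], is_real_active=True)
ra.accepted.extend([7, 3])   # two tokens verified this iteration
```

Promotion to RA then amounts to flipping `is_real_active` on a pseudo-active block once the current RA block's `is_full` check passes.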
2. Pseudocode for Rejection Recycling
A condensed form of the algorithm is as follows:
```
Algorithm MultiBlock+RejectionRecycling
Input:  p_θ, block size n, max blocks K, spawn ratio r, max iters T_max
Init:   RA ← 1
        for b = 1…K:
            q_b ← random tokens of length n if b == RA else []
            a_b ← []
        N ← ∅,  s ← ⌈r·n⌉
for t = 1…T_max do
    batch  ← [prompt; a_RA; q_RA] ∥ for each b ≠ RA: (a_b; q_b)
    logits ← p_θ.forward(batch)
    for each b:
        g_b ← GreedyDecode(logits for q_b)
        m_b ← LongestPrefixMatch(g_b, AR_reference)
        a_b ← a_b ∥ g_b[1:m_b]
        if b == RA:
            reject_suffix ← g_b[m_b+1:]
            if reject_suffix ≠ []: N.insert(reject_suffix)
    CommitToKV(a_RA)
    if <eos> ∈ a_RA: return GeneratedSequence
    for each b:
        if b == RA: q_RA ← RefillDraftFromPoolOrRandom(N, n − |a_RA|)
        else:       q_b ← pad(a_b) to n or []
    if |a_RA| ≥ s and #blocks < K: spawn new pseudo-active block from RA
    if |a_RA| ≥ n: promote a pseudo-active block to RA
end for
return fallback AR decoding
```
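The pseudocode above can be exercised with a toy, single-block Python sketch. A deterministic `next_token` function stands in for the model's greedy argmax, and the function names and zero-token refill are illustrative only, not the paper's implementation:

```python
from collections import deque

def next_token(prefix):
    # Stand-in for argmax over p_theta: any deterministic function of
    # the prefix suffices to illustrate the control flow.
    return (len(prefix) * 17 + sum(prefix) * 31 + 7) % 50

def jacobi_step(committed, draft):
    """One 'parallel pass': greedy output at every draft position at once."""
    return [next_token(committed + draft[:i]) for i in range(len(draft))]

def decode(prompt, total_len, n=8, max_iters=100):
    committed = list(prompt)
    pool = deque()                    # FIFO pool of rejected suffixes
    draft = [0] * n                   # arbitrary initial draft
    iters = 0
    while len(committed) - len(prompt) < total_len and iters < max_iters:
        iters += 1
        greedy = jacobi_step(committed, draft)
        # Longest draft prefix matching the greedy outputs; the token at
        # the first mismatch is itself AR-correct, so commit it as well.
        m = 0
        while m < len(draft) and draft[m] == greedy[m]:
            m += 1
        committed += greedy[: min(m + 1, len(draft))]
        rejected = draft[m + 1:]
        if rejected:
            pool.append(rejected)     # recycle instead of discarding
        # Refill the draft from the recycled pool when possible, else zeros
        refill = list(pool.popleft()) if pool else []
        draft = (refill + [0] * n)[:n]
    return committed[len(prompt):], iters
```

Because every committed token equals the AR-greedy token at that position, the output matches ordinary greedy decoding exactly, while recycled suffixes that happen to match later save forward passes.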
3. Key Formulas and Theoretical Analysis
At the heart of multi-block decoding with rejection recycling are several operational and theoretical formulas:
- Jacobi fixed-point update for block $b$ at iteration $j$:

$$y_{b,i}^{(j+1)} = \arg\max_{y} \; p_\theta\!\left(y \,\middle|\, \left[\text{prompt};\ a_b^{(j)};\ q_{b,<i}^{(j)}\right]\right)$$

for $i = 1, \dots, n$.
- Greedy verification: accept the longest draft prefix agreeing with the greedy outputs,

$$m_b = \max\left\{\, m \le n : y_{b,1:m}^{(j+1)} = \hat{y}_{b,1:m} \,\right\},$$

where $\hat{y}_b$ is the AR-greedy continuation.
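The prefix-match step can be written directly; this helper mirrors the `LongestPrefixMatch` call in the pseudocode:

```python
def longest_prefix_match(draft, greedy):
    """m_b: length of the longest prefix of the draft that agrees
    with the greedy outputs from the same forward pass."""
    m = 0
    while m < min(len(draft), len(greedy)) and draft[m] == greedy[m]:
        m += 1
    return m
```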
- Accepted-token count per iteration, summed over all in-flight blocks: $A^{(t)} = \sum_{b=1}^{K} m_b^{(t)}$.
- Expected tokens-per-forward (TPF): $\mathrm{TPF} = \frac{L}{N_{\text{iter}}}$, the mean number of tokens accepted per forward pass.
- End-to-end wall-clock speedup over AR decoding (which needs one forward pass per token, assuming comparable per-pass cost):

$$\text{Speedup} \approx \frac{L}{N_{\text{iter}}},$$

where $L$ is the sequence length and $N_{\text{iter}}$ is the number of iterations used by the method.
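As a sanity check on these formulas, plugging in the paper's headline TPF (the sequence length $L = 512$ is our illustrative choice):

```python
L = 512                  # tokens to generate (illustrative value)
TPF = 4.09               # reported tokens accepted per forward pass
ar_passes = L            # AR baseline: one forward pass per token
jf_passes = L / TPF      # multi-block decoding with recycling
ideal_speedup = ar_passes / jf_passes
# With equal per-pass cost, the ideal speedup equals the TPF
assert abs(ideal_speedup - TPF) < 1e-6
```

The measured end-to-end speedups (3.95–3.97×) landing just below the TPF of 4.09 is consistent with a multi-block pass being slightly more expensive than a single AR pass.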
For typical settings (block sizes on the order of $n = 64$–$128$), with empirical acceptance probabilities up to $0.07$, a factor 4–5 improvement in token acceptance is observed compared to vanilla Jacobi, until limited by GPU parallelism constraints (Hu et al., 16 Dec 2025).
4. Compute–Latency Trade-offs and Hardware Considerations
Multi-block decoding with rejection recycling intentionally trades additional FLOPs (spent verifying and managing multiple concurrent and recycled proposals) for lower token-level latency. Each pass processes the $K \cdot n$ proposal tokens plus up to $|N|$ candidates from the rejection pool. This extra computation is efficient and justifiable as long as the GPU has excess FLOPs budget below its "roofline knee": about 256 parallel tokens processed on H200/B200 GPUs and 128 on an A100.
As the number of in-flight blocks $K$ increases, the per-iteration accepted-token count improves super-linearly up to hardware limits, after which marginal speedup is diminished by roofline constraints. For typical Jacobi Forcing models under standard hyperparameters, this approach achieves a practical tokens-per-forward (TPF) increase from 1.00 (AR) to 4.09 (multi-block with recycling), with measured end-to-end speedups close to 4.0× (Hu et al., 16 Dec 2025).
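A back-of-the-envelope budget check makes the roofline constraint concrete; the specific parameter values below are illustrative, only the 256-token knee comes from the text:

```python
def tokens_per_pass(n, K, pool_budget):
    # Tokens entering one forward pass: K in-flight blocks of up to
    # n draft tokens each, plus recycled candidates batch-verified
    # from the rejection pool.
    return K * n + pool_budget

KNEE = 256   # approximate roofline knee cited for H200/B200 GPUs

# Within budget: extra parallel tokens are nearly free
assert tokens_per_pass(n=64, K=3, pool_budget=32) <= KNEE
# Over-saturated: latency per pass starts growing with token count
assert tokens_per_pass(n=64, K=4, pool_budget=32) > KNEE
```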
5. Empirical Performance and Comparisons
Evaluations on HumanEval and math benchmarks validate the effectiveness of multi-block decoding with rejection recycling:
| Setting | AR | Jacobi (vanilla) | CLLM | Multi-block + Recycling (JFMR) |
|---|---|---|---|---|
| TPF (HumanEval) | 1.0 | 1.03 | 2.7 | 4.09 |
| Speedup (A100/B200) | 1.0 | 1.03 | 2.5 | 3.95–3.97 |
| pass@1 (HumanEval) | 87.8 | not stated | – | 83.5 |
| Math solve rate (GSM8K) | – | – | – | >91% |
| Speedup (math, Qwen2.5) | – | – | – | 3.7–3.8 |
In all tested scenarios, multi-block decoding with rejection recycling exceeds the throughput of both vanilla Jacobi and state-of-the-art diffusion LLMs (≤2.5× TPF, ≤1.8× speedup), while retaining a high fraction of the AR baseline's accuracy (Hu et al., 16 Dec 2025).
6. Hyperparameter Tuning
The efficiency of the method depends critically on several hyperparameters:
- $n$: Block size; optimal near the hardware throughput knee (64 for H200/B200, 128 for A100).
- $K$: Number of in-flight blocks; a small $K$ is generally optimal, and larger values yield diminishing returns.
- $r$: Spawn threshold, expressed as a ratio of $n$ (a new block may spawn once $|a_{\mathrm{RA}}| \geq \lceil r \cdot n \rceil$).
- Maximum candidate verification size: balances output quality against compute.
- $|N|$: Candidate pool size; should be bounded to fit GPU capacity.
Careful selection and dynamic adaptation of these parameters maximize throughput while avoiding inefficiencies or hardware over-saturation (Hu et al., 16 Dec 2025).
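The spawn threshold from the pseudocode's initialization, $s = \lceil r \cdot n \rceil$, is simple to compute; the $r$ values below are illustrative, not the paper's tuned setting:

```python
import math

def spawn_threshold(n: int, r: float) -> int:
    # s = ceil(r * n): RA-block progress at which a new
    # pseudo-active block may be spawned
    return math.ceil(r * n)
```

For example, `spawn_threshold(64, 0.5)` yields 32: with a 64-token block and a spawn ratio of 0.5, a new pseudo-active block may be created once the RA block has accepted 32 tokens.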
7. Implementation Guidelines and Practical Considerations
Effective deployment of multi-block decoding with rejection recycling requires:
- Batching all blocks with a noise-aware causal mask in each pass (avoid bidirectional attention).
- Ensuring KV-cache reuse by keeping the causal mask unchanged for RA blocks.
- Maintaining a FIFO structure for the rejected suffix pool; batch-verify candidates efficiently.
- Spawning new pseudo-active blocks only when the RA block surpasses its progress threshold, minimizing block fragmentation.
- Promoting blocks to RA only upon full acceptance in the prior RA slot.
- Tuning the hyperparameters ($n$, $K$, $r$, and the pool size) in accordance with hardware roofline studies; saturate, but do not exceed, the available FLOPs capacity.
- Utilizing mixed-precision and fused-kernel implementations to reduce verification overhead.
- Instituting a fallback to greedy AR decoding in case the maximum iteration limit ($T_{\max}$) is reached, to avoid potential infinite loops.
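The FIFO suffix pool from the guidelines above can be sketched with a bounded deque; the class name, the eviction policy for a full pool, and the size bound are our assumptions, not specified by the paper:

```python
from collections import deque

class RejectPool:
    """FIFO pool of rejected suffixes, drained oldest-first for
    batched verification; bounded to stay within the GPU's spare
    token budget (oldest entries are evicted when full)."""

    def __init__(self, max_size: int = 32):   # bound is illustrative
        self.q = deque(maxlen=max_size)

    def push(self, suffix):
        if suffix:                            # drop empty suffixes
            self.q.append(list(suffix))

    def pop_batch(self, k: int):
        # Take up to k candidates for batch verification this pass
        batch = []
        while self.q and len(batch) < k:
            batch.append(self.q.popleft())
        return batch
```

A `deque(maxlen=...)` gives both O(1) FIFO operations and automatic eviction of the stalest suffixes, which are the least likely to still lie on the decoding trajectory.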
This suite of guidelines ensures both compatibility with the pretrained causal inference properties and practical speed and throughput benefits (Hu et al., 16 Dec 2025).