
Multi-Block Decoding with Rejection Recycling

Updated 17 December 2025
  • The paper introduces a multi-block decoding method that leverages rejection recycling to significantly increase tokens accepted per iteration compared to traditional autoregressive and vanilla Jacobi approaches.
  • The method maintains key-value cache efficiency and uses parallel decoding under causal attention, making it compatible with modern GPU hardware and exploiting hardware-level parallelism.
  • Empirical evaluations report up to a 4× improvement in token acceptance and near 4× end-to-end speedup, demonstrating practical gains in efficiency and latency reduction.

Multi-block decoding with rejection recycling is an advanced methodology for accelerating inference in transformer-based LLMs, particularly those trained via Jacobi Forcing. This approach extends the Jacobi fixed-point paradigm by leveraging several blocks of parallel decoding with token-level recycling of rejected suffixes, thus significantly increasing the number of tokens accepted per iteration and reducing wall-clock latency relative to standard autoregressive (AR) or vanilla Jacobi schemes. The method is compatible with causal attention, preserves key-value (KV) cache efficiency, and is specifically tuned to exploit hardware-level parallelism in modern GPUs (Hu et al., 16 Dec 2025).

1. Formal Definition and Mechanism

Let L be the total number of tokens to generate given a prompt. Choose a block size n and a maximum number of in-flight blocks K. At any iteration:

  • Maintain a set B = {b_1, ..., b_K} of blocks, where each block b stores:
    • q_b ∈ V^n: the current draft (proposed tokens),
    • a_b ∈ V^{≤n}: the accepted prefix.
  • Maintain an index RA ∈ [1, K] identifying the real-active block; all other blocks are pseudo-active.
  • In each step:

    1. All K blocks are packed into a batch under causal attention.
    2. A single forward pass produces logits for each position in every block.
    3. For each block, the draft is verified using greedy decoding; the longest prefix matching the true AR trajectory is accepted.
    4. For the real-active (RA) block, this accepted prefix is committed to the global KV-cache.
    5. Rejected suffixes of the RA block are recycled into a candidate pool N for subsequent proposals.
    6. Pseudo-active blocks that reach a progress (spawn) threshold can trigger the creation of new pseudo-active blocks.
    7. A pseudo-active block may be promoted to RA once the RA block's accepted prefix is full (|a_RA| ≥ n).
  • Termination occurs when the RA block accepts an <eos> token or all blocks reach full acceptance.

This approach allows multi-block parallelism under strict causal constraints, reusing rejected token sequences to maximize acceptance per iteration (Hu et al., 16 Dec 2025).
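The per-block state and the promotion rule above can be sketched in Python. This is a minimal illustration, not the paper's implementation; the names `Block` and `step_bookkeeping` are ours, and the promotion rule is simplified to pick the first pseudo-active block:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """State of one in-flight block (field names are ours, not the paper's)."""
    draft: list                                   # q_b: proposed tokens, |q_b| <= n
    accepted: list = field(default_factory=list)  # a_b: accepted prefix

def step_bookkeeping(blocks, ra, n):
    """Promotion rule from step 7: once the real-active block's accepted
    prefix is full (|a_RA| >= n), promote a pseudo-active block to RA.
    Simplified: promotes the first block that is not currently RA."""
    if len(blocks[ra].accepted) >= n:
        ra = next(i for i, b in enumerate(blocks) if i != ra)
    return ra

blocks = [Block(draft=[], accepted=[3, 1, 4, 1]),   # RA: full for n = 4
          Block(draft=[5, 9], accepted=[2, 6])]     # pseudo-active
ra = step_bookkeeping(blocks, ra=0, n=4)
# ra == 1: the pseudo-active block is promoted to real-active
```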

2. Pseudocode for Rejection Recycling

A condensed form of the algorithm is as follows:

Algorithm MultiBlock+RejectionRecycling
Input: p_θ, block size n, max blocks K, spawn ratio r, max iters T_max
Init: RA ← 1
      for b = 1…K: q_b ← random tokens of length n if b == RA else []
                   a_b ← []
      N ← ∅, s ← r·n

for t = 1…T_max do
    batch ← [prompt; a_RA; q_RA; for each b ≠ RA: (a_b; q_b)]
    logits ← p_θ.forward(batch)

    for each b:
        g_b ← GreedyDecode(logits for q_b)
        m_b ← LongestPrefixMatch(g_b, AR_reference)
        a_b ← a_b ⊕ g_b[1:m_b]
        if b == RA:
            reject_suffix ← g_b[m_b+1:]
            if reject_suffix ≠ []:
                N.insert(reject_suffix)
    CommitToKV(a_RA)
    if <eos> ∈ a_RA: return GeneratedSequence

    for each b:
        if b == RA: q_RA ← RefillDraftFromPoolOrRandom(N, n − |a_RA|)
        else: q_b ← pad(a_b) to n or []
    if |a_RA| ≥ s and #blocks < K:
        spawn new pseudo-active block from RA
    if |a_RA| ≥ n:
        promote a pseudo-active block to RA
end for
return fallback AR decoding

This pseudocode demonstrates the integration of parallel decoding, suffix recycling, candidate pool management, and block spawning/promotion (Hu et al., 16 Dec 2025).
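The candidate pool N and the RefillDraftFromPoolOrRandom step can be sketched as follows. This is a sketch under assumptions: the class name, FIFO eviction policy, and defaults are ours; the |N| ≤ 256 cap follows the pool-size guideline stated later in this article:

```python
from collections import deque
import random

class RejectionPool:
    """FIFO pool N of rejected suffixes, capped so the batch stays within
    the GPU budget (illustrative API, not the paper's)."""
    def __init__(self, capacity=256):
        self.pool = deque()
        self.capacity = capacity

    def insert(self, suffix):
        if suffix:
            self.pool.append(list(suffix))
            while len(self.pool) > self.capacity:
                self.pool.popleft()  # evict the oldest candidates first

    def refill_draft(self, length, vocab_size=32000, rng=random):
        """Build a fresh draft of `length` tokens, preferring recycled
        suffixes over random tokens (the RefillDraftFromPoolOrRandom step)."""
        draft = []
        while self.pool and len(draft) < length:
            draft.extend(self.pool.popleft())
        draft = draft[:length]
        while len(draft) < length:
            draft.append(rng.randrange(vocab_size))
        return draft

pool = RejectionPool(capacity=4)
pool.insert([9, 4])
pool.insert([1, 2, 3])
draft = pool.refill_draft(length=4, vocab_size=100)
# draft == [9, 4, 1, 2]: recycled tokens fill the draft before any random ones
```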

3. Key Formulas and Theoretical Analysis

At the heart of multi-block decoding with rejection recycling are several operational and theoretical formulas:

  • Jacobi fixed-point update per block b, iteration j:

$y_{i,b}^{(j+1)} = \arg\max_{y} \; p_\theta\big(y \,\big|\, [\text{prompt};\, a^{(j)}_{b,<i};\, q^{(j)}_{b,\geq i}]\big)$

for i = 1 … n.

  • Greedy verification:

$g_b = [g_1, ..., g_n] := \arg\max \text{ over each position of the logits}$

$m_b = \max\{m \leq n : g_{1:m} = r_{1:m}\}$

where r is the AR-greedy continuation.
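The two verification formulas can be exercised on toy values. A pure-Python sketch with made-up shapes (the function name and toy logits are ours):

```python
def greedy_verify(logits, reference):
    """g_b: per-position argmax of the logits; m_b: length of the longest
    prefix of g_b matching the AR-greedy reference continuation r."""
    g = [max(range(len(row)), key=row.__getitem__) for row in logits]
    m = 0
    for tok, ref in zip(g, reference):
        if tok != ref:
            break
        m += 1
    return g, m

# Toy logits for n = 3 positions over a 4-token vocabulary.
logits = [[0.1, 2.0, 0.0, 0.0],   # argmax -> token 1
          [0.0, 0.0, 3.0, 0.0],   # argmax -> token 2
          [1.5, 0.0, 0.0, 0.2]]   # argmax -> token 0
g, m = greedy_verify(logits, reference=[1, 2, 3])
# g == [1, 2, 0]; m == 2 (first mismatch at the third position)
```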

  • Accepted-token count per iteration:

$\Delta T = \sum_{b=1}^K m_b$

  • Expected tokens-per-forward (TPF):

$\mathbb{E}[m_b] = \sum_{i=1}^n p_{b,i}$

$\mathbb{E}[\Delta T] = \sum_{b=1}^K \sum_{i=1}^n p_{b,i}$

  • End-to-end wall-clock speedup:

$\text{Speedup} \approx \frac{L}{T}$

where L is the sequence length and T is the number of iterations used by the method.
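These expectations are easy to check numerically. A sketch assuming, for simplicity, that every block shares a flat per-position acceptance probability (the probability value 0.05 below is hypothetical, merely within the paper's reported 0.03–0.07 range):

```python
def expected_tokens_per_forward(p, n, K):
    """E[ΔT] = K * Σ_i p_i under the simplifying assumption that all K
    blocks share the same per-position acceptance probabilities p_i."""
    return K * sum(p[:n])

def approx_speedup(L, iterations):
    """Speedup ≈ L / T for L generated tokens over T iterations."""
    return L / iterations

# Flat p ≈ 0.05 over n = 64 positions, K = 2 blocks:
e_dt = expected_tokens_per_forward([0.05] * 64, n=64, K=2)
# e_dt ≈ 6.4 expected accepted tokens per forward pass
```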

For typical settings (e.g., n = 64–128, K = 2), the empirical per-position acceptance probability is p ≈ 0.03–0.07, and a factor-4–5 improvement in token acceptance is observed compared to vanilla Jacobi, until limited by GPU parallelism constraints (Hu et al., 16 Dec 2025).

4. Compute–Latency Trade-offs and Hardware Considerations

Multi-block decoding with rejection recycling intentionally trades additional FLOPs (spent verifying and managing multiple concurrent and recycled proposals) for lower token-level latency. Each pass processes K·n proposal tokens plus up to |N| candidates from the rejection pool. This extra computation is justifiable as long as the GPU has spare FLOPs budget below its "roofline knee": about 256 parallel tokens on H200/B200 GPUs and 128 on an A100.

As K increases, the per-iteration accepted token count ΔT improves super-linearly up to hardware limits, after which marginal speedup is diminished by roofline constraints. For typical Jacobi Forcing models under standard hyperparameters, this approach raises tokens-per-forward (TPF) from 1.00 (AR) to 4.09 (multi-block with recycling), with measured end-to-end speedups close to 4.0× (Hu et al., 16 Dec 2025).
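The per-pass token budget described above can be checked with simple arithmetic. A sketch using the article's stated batch composition (K·n proposals plus up to |N| pool candidates) and knee values; the function names are ours:

```python
def tokens_per_pass(n, K, pool_size):
    """Tokens processed per forward pass: K·n proposal tokens plus up to
    |N| recycled candidates from the rejection pool."""
    return K * n + pool_size

def within_roofline(n, K, pool_size, knee=256):
    """True if the batch stays at or below the GPU's roofline knee
    (~256 parallel tokens on H200/B200, ~128 on A100, per the article)."""
    return tokens_per_pass(n, K, pool_size) <= knee

budget = tokens_per_pass(n=64, K=2, pool_size=64)
# budget == 192: fits under a 256-token knee, but over-saturates a 128-token one
```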

5. Empirical Performance and Comparisons

Evaluations on HumanEval and math benchmarks validate the effectiveness of multi-block decoding with rejection recycling:

Setting                  | AR   | Jacobi (vanilla) | CLLM       | Multi-block + Recycling (JFMR)
-------------------------|------|------------------|------------|-------------------------------
TPF (HumanEval)          | 1.0  | 1.03             | 2.7        | 4.09
Speedup (A100/B200)      | 1.0  | 1.03             | 2.5        | 3.95–3.97
pass@1 (HumanEval)       | 87.8 | not stated       | not stated | 83.5
Math solve rate (GSM8K)  | —    | —                | —          | >91%
Speedup (math, Qwen2.5)  | —    | —                | —          | 3.7–3.8

In all tested scenarios, multi-block decoding with rejection recycling exceeds the throughput of both vanilla Jacobi and state-of-the-art diffusion LLMs (≤2.5× TPF, ≤1.8× speedup), while retaining a high fraction of the AR baseline's accuracy (Hu et al., 16 Dec 2025).

6. Hyperparameter Tuning

The efficiency of the method depends critically on several hyperparameters:

  • n: Block size; optimal near the hardware throughput knee (64 for H200/B200, 128 for A100).
  • K: Number of in-flight blocks; K = 2 is generally optimal, and higher K yields diminishing returns.
  • r: Spawn threshold as a ratio of n; best results for r ≈ 0.85.
  • m: Maximum candidate verification size; m = 4 balances quality and compute.
  • |N|: Candidate pool size; |N| ≲ 256 is recommended to stay within GPU capacity.

Careful selection and dynamic adaptation of these parameters maximize throughput while avoiding inefficiencies or hardware over-saturation (Hu et al., 16 Dec 2025).
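The hyperparameters above can be bundled into a single configuration object with the recommended defaults. A sketch (the class and field names are ours; the default values are the ones listed in this section):

```python
from dataclasses import dataclass

@dataclass
class DecodeConfig:
    """Hyperparameters from the list above (names are ours, values the paper's)."""
    n: int = 64          # block size; near the H200/B200 throughput knee
    K: int = 2           # in-flight blocks; K = 2 is generally optimal
    r: float = 0.85      # spawn threshold as a ratio of n
    m: int = 4           # maximum candidate verification size
    pool_cap: int = 256  # |N|; keep ≲ 256 for GPU capacity

    @property
    def spawn_threshold(self):
        """s = r·n: accepted tokens needed before spawning a new block."""
        return int(self.r * self.n)

cfg = DecodeConfig()
# cfg.spawn_threshold == 54 (0.85 * 64, truncated to an integer)
```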

7. Implementation Guidelines and Practical Considerations

Effective deployment of multi-block decoding with rejection recycling requires:

  • Batching all K blocks with a noise-aware causal mask in each pass (avoid bidirectional attention).
  • Ensuring KV-cache reuse by keeping the causal mask unchanged for RA blocks.
  • Maintaining a FIFO structure for the rejected-suffix pool and batch-verifying candidates efficiently.
  • Spawning new pseudo-active blocks only when the RA block surpasses its progress threshold, minimizing block fragmentation.
  • Promoting blocks to RA only upon full acceptance in the prior RA slot.
  • Tuning (n, m, K, r) in accordance with hardware roofline studies; saturate but do not exceed FLOPs capacity.
  • Utilizing mixed-precision and fused-kernel implementations to reduce verification overhead.
  • Falling back to greedy AR decoding when the maximum iteration limit (T_max) is reached, to avoid potential infinite loops.

This suite of guidelines ensures both compatibility with the pretrained causal inference properties and practical speed and throughput benefits (Hu et al., 16 Dec 2025).
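The batched causal mask mentioned in the first guideline can be illustrated on toy sizes. This is a deliberately simplified sketch: the real mask must also account for committed prefixes and pool candidates, and the function name and layout are ours. Each block's tokens attend to the full prompt and causally within their own block, but never to other blocks' drafts:

```python
def multiblock_causal_mask(prompt_len, block_lens):
    """Boolean attention mask for one packed pass: mask[i][j] is True when
    position i may attend to position j. Blocks are packed sequentially
    after the prompt."""
    total = prompt_len + sum(block_lens)
    mask = [[False] * total for _ in range(total)]
    for i in range(prompt_len):                 # prompt: plain causal attention
        for j in range(i + 1):
            mask[i][j] = True
    offset = prompt_len
    for blen in block_lens:
        for i in range(blen):
            row = offset + i
            for j in range(prompt_len):         # see the whole prompt
                mask[row][j] = True
            for j in range(offset, row + 1):    # causal within the block only
                mask[row][j] = True
        offset += blen
    return mask

m = multiblock_causal_mask(prompt_len=2, block_lens=[2, 2])
# Row 4 (the second block's first token) sees the prompt and itself only:
# m[4] == [True, True, False, False, True, False]
```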
