Blockwise Parallel Transformer (BPT)

Updated 18 January 2026
  • Blockwise Parallel Transformer (BPT) is a family of methods that partition computations into blocks to enhance scalability, memory efficiency, and inference speed.
  • Blockwise parallel decoding interleaves parallel draft prediction with serial verification, achieving 3–4× speedups with minimal BLEU loss in autoregressive tasks.
  • Blockwise tiling and sparse/hybrid attention reduce memory overhead and computational costs, enabling efficient processing of very long sequences.

The Blockwise Parallel Transformer (BPT) family constitutes a set of architectural and algorithmic innovations aimed at improving the scalability, memory efficiency, and inference speed of Transformer models. These methods partition computation into blocks—either over tokens, positions, or decoding steps—to facilitate parallelism, memory reduction, and/or sub-quadratic operations. Implementations and naming conventions for BPT are not uniform; the term refers to distinct, rigorously documented approaches including blockwise parallel decoding for autoregressive inference, block-based computational tiling for efficient attention/FFN, hybrid blockwise sparse attention mechanisms, and locally recurrent Transformer variants.

1. Blockwise Parallel Decoding: Algorithmic Foundations

Blockwise Parallel Decoding (BPD), sometimes referred to as Blockwise Parallel Transformer inference, targets the inherent sequential bottleneck in autoregressive Transformer decoders. The method interleaves blockwise parallel draft prediction with a serial verification step to accelerate generation without sacrificing output consistency with greedy decoding.

Given input $x$ and target sequence $y = (y_1, \ldots, y_m)$, BPD employs a base predictor $p_1(y_{j+1} \mid y_{\leq j}, x)$ together with $k-1$ auxiliary proposal heads $p_2, \ldots, p_k$, each attempting to predict further steps ahead conditioned only on the same prefix $y_{\leq j}$. At each decoding cycle:

  • All $k$ proposals $\{\hat{y}_{j+i}\}_{i=1}^k$ are predicted in parallel.
  • The base model, conditioning on the growing prefix (including prior accepted proposals in the block), verifies each proposal by recomputing the $p_1$ logits and comparing them to the proposals.
  • The process identifies the largest prefix $i'$ for which every $\hat{y}_{j+i}$, $i \leq i'$, matches the greedy prediction, defining the accepted block size $\hat{k}$ for that step.
  • The accepted tokens are appended, advancing the decoding position by $\hat{k}$.

The essential equations for verification and acceptance are:

$$\hat{k} = \max \left\{ i \in \{1, \dots, k\} : \forall\, 1 \leq i' \leq i,\ \hat{y}_{j+i'} = \arg\max_{n}\, p_1(y_{j+i'} = n \mid \hat{y}_{\leq j+i'-1}, x) \right\},$$

where, because the first proposal is produced by the base head $p_1$ itself, at least one token per iteration is guaranteed. A joint output head can fold the prediction and verification passes together, roughly halving the number of model invocations per sequence (Stern et al., 2018, Kim et al., 2024).

Representative BPD pseudocode is as follows:

function BPD_Decode(M_base, M_block, x, T, h):
    t ← 1
    y ← []
    while t ≤ T:
        # Draft: one forward pass of the block predictor proposes h tokens
        z_block[1..h] ← M_block(x, y)
        for j in 1..h:
            draft[j] ← argmax(z_block[j])
        # Verify: recompute base-model logits for every draft position in parallel
        for j in 1..h parallel:
            z_base[j] ← M_base(x, y + draft[1..(j-1)])
            verify_ok[j] ← (argmax(z_base[j]) == draft[j])
        # Accept the longest verified prefix (n ≥ 1 when head 1 coincides with the base head)
        n ← max {n' : verify_ok[1..n'] are all True}
        y += draft[1..n]
        t += n
    return y
(Kim et al., 2024)
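
To make the acceptance rule concrete, the following is a minimal NumPy sketch of the verify-and-accept step, using toy logits in place of real model calls; the function name accept_length, the array shapes, and the hand-crafted logits are illustrative assumptions rather than code from the cited papers.

import numpy as np

def accept_length(draft_tokens, base_logits):
    """Return k_hat: the longest draft prefix that matches the base model's
    greedy predictions (the acceptance rule from Section 1)."""
    greedy = base_logits.argmax(axis=-1)        # greedy token at each draft position
    k_hat = 0
    for ok in (draft_tokens == greedy):         # stop at the first mismatch
        if not ok:
            break
        k_hat += 1
    return k_hat

# Toy example: a block of h = 4 draft positions over a vocabulary of size 6.
rng = np.random.default_rng(0)
draft_tokens = np.array([2, 5, 1, 3])           # proposals from the draft heads
base_logits = rng.normal(size=(4, 6))           # verification logits from the base model
base_logits[0, 2] += 10.0                       # base model agrees at position 1
base_logits[1, 5] += 10.0                       # base model agrees at position 2
base_logits[2, 0] += 10.0                       # base model disagrees at position 3

print(accept_length(draft_tokens, base_logits))  # prints 2: two tokens accepted this cycle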

2. Blockwise Partitioning in Self-Attention and FFN

Separate from blockwise decoding, several works use blockwise partitioning to reduce memory/compute overhead during training or inference:

  • Double-blocking for Standard Attention: (Liu et al., 2023) divides the input sequence into $B_q$ query blocks of size $c_q$ and $B_{kv}$ key/value blocks of size $c_k$. The $c_q \times c_k$ attention scores are computed for each $(Q_i, K_j, V_j)$ triplet, accumulating softmax numerators and denominators blockwise. After all key/value blocks are processed, the partial numerators and denominators are combined using running-max (log-sum-exp) stabilization.
  • Per-block FFN Fusion: After each attention block, the two-layer FFN is immediately applied to each block, further reducing activation memory.

This approach changes the per-layer activation memory profile from $O(s^2)$ (vanilla attention) or $8bsh$ (FlashAttention-style memory-efficient attention) to $2bsh$, thus scaling to up to 32× longer sequences than standard Transformers, and up to 4× longer than previous memory-efficient exact attention methods, even for LLMs in the 1B–70B parameter range (Liu et al., 2023).
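
The accumulation behind this memory profile can be illustrated with a short NumPy sketch. This is a single-head, non-fused simplification under assumed block sizes (c_q = c_kv = 128), not the reference implementation of (Liu et al., 2023); the point is that only per-block running statistics (row maximum, numerator, denominator) are kept, so the full $s \times s$ score matrix is never materialized.

import numpy as np

def blockwise_attention(Q, K, V, c_q=128, c_kv=128):
    """Exact single-head attention computed one (query block, key/value block)
    pair at a time, with running-max softmax stabilization per query block."""
    s, d = Q.shape
    out = np.empty_like(V)
    for qs in range(0, s, c_q):
        Qi = Q[qs:qs + c_q]                                  # (c_q, d) query block
        m = np.full(Qi.shape[0], -np.inf)                    # running row maximum
        num = np.zeros((Qi.shape[0], V.shape[1]))            # running numerator
        den = np.zeros(Qi.shape[0])                          # running denominator
        for ks in range(0, s, c_kv):
            Kj, Vj = K[ks:ks + c_kv], V[ks:ks + c_kv]
            scores = Qi @ Kj.T / np.sqrt(d)                  # (c_q, c_kv) score block
            m_new = np.maximum(m, scores.max(axis=1))
            scale = np.exp(m - m_new)                        # rescale earlier partial sums
            p = np.exp(scores - m_new[:, None])
            num = num * scale[:, None] + p @ Vj
            den = den * scale + p.sum(axis=1)
            m = m_new
        out[qs:qs + c_q] = num / den[:, None]
    return out

# Matches dense attention up to floating-point error:
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(512, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
dense = np.exp(S - S.max(axis=1, keepdims=True))
dense = (dense / dense.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), dense)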

3. Blockwise Sparse and Hybrid Attention Mechanisms

The BP-Transformer (BP-T) (Ye et al., 2019) and similar architectures leverage blockwise or hierarchical graph sparsity to support very long-range context:

  • Binary Partition Graph: The input sequence (length $n$) is represented as the leaves of a perfect binary tree, with internal span nodes. Each token is connected at multiple scales through "contextual edges" (attending to up to $k$ neighbors at each tree level), fused with "affiliated edges".
  • Graph Self-Attention Computation: Each node performs multi-head attention over only its selected neighbors rather than the full $n$-token sequence, so each self-attention layer costs $O(hdkn\log(n/k))$ rather than $O(hdn^2)$.
  • CUDA Implementation: The adjacency information is precomputed; sparse gather-softmax-scatter operations are fused in custom CUDA kernels, holding only $O(kn\log(n/k))$ edges.

This mechanism enables sequence lengths unattainable by quadratic-cost attention, with empirical evidence of linear memory scaling and strong performance on both document-level translation and language modeling (Ye et al., 2019).
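
As a rough illustration of how the edge budget scales, the sketch below builds a simplified set of contextual edges (each token linked to up to $k$ spans at every tree level) and compares the resulting edge count with $kn\log_2(n/k)$. The neighbor-selection rule here is an illustrative approximation, not the exact outward-expanding construction of (Ye et al., 2019).

import math

def contextual_edges(n, k):
    """Connect every token to (up to) k spans at each level of a binary
    partition of the sequence; spans at level l have length 2**l."""
    edges = []
    level, span = 0, 1
    while span * k < 2 * n:                       # stop once k spans of this size cover the sequence
        num_spans = math.ceil(n / span)
        for i in range(n):
            center = i // span                    # span containing token i at this level
            lo = max(0, min(center - k // 2, num_spans - k))
            for s in range(lo, min(lo + k, num_spans)):
                edges.append((i, level, s))       # token i attends to span s at this level
        level, span = level + 1, 2 * span
    return edges

n, k = 4096, 8
edges = contextual_edges(n, k)
print(len(edges), int(k * n * math.log2(n / k)))  # both on the order of k * n * log(n / k)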

4. Blockwise Parallelism with Local-Global Mixing

Variants such as those in ViRanker (Dang et al., 11 Sep 2025) and related work implement blockwise parallel computation in which each layer is split into $B$ independent blocks of size $m$, each handling local self-attention and FFN in parallel. To enable global information flow:

  • A summary vector $s^{(b)}$ is computed for each block (typically via mean pooling or projection).
  • All blocks can attend to this global summary matrix via additional key/value projections, providing cross-block mixing with $O(B^2 p)$ memory, where $p \ll m$.
  • Local and global attention outputs are merged and passed to the FFN.

Extensions include swapping absolute positional encoding for within-block rotary position encoding, which can improve model capacity for morphologically complex and low-resource languages (Dang et al., 11 Sep 2025).
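
A minimal single-head sketch of this local-plus-summary mixing is given below, using mean-pooled block summaries and no learned projections or FFN; the function and parameter names are illustrative assumptions, not the ViRanker implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_local_global(X, B):
    """Split an (s, d) sequence into B blocks, attend locally inside each block,
    then let every token attend to the B mean-pooled block summaries."""
    s, d = X.shape
    m = s // B                                          # block size (assumes s divisible by B)
    blocks = X.reshape(B, m, d)
    # Local path: (B, m, m) attention within each block
    local_out = softmax(blocks @ blocks.transpose(0, 2, 1) / np.sqrt(d)) @ blocks
    # Global path: one summary vector per block, attended to by every token
    summaries = blocks.mean(axis=1)                     # (B, d)
    global_out = softmax(blocks @ summaries.T / np.sqrt(d)) @ summaries   # (B, m, d)
    # Merge local and global outputs (here: a simple sum) and restore (s, d)
    return (local_out + global_out).reshape(s, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
print(block_local_global(X, B=8).shape)                 # (256, 32)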

5. Rescoring and Draft Refinement in Blockwise Decoding

Recent analyses of blockwise parallel decoding drafts identify two phenomena affecting block efficiency:

  • Consecutive-token Repetition: Consecutive proposal heads, using the same context, frequently output identical tokens, leading to long repeated sequences in block drafts.
  • Confidence Decay: The entropy of each head's softmax distribution increases across heads (i.e., draft quality decays deeper in the block).

To address these, refinement procedures are used:

  • Neural Rescoring: A small local autoregressive model scores the top-$k$ proposals of each head, and its logits are interpolated with the head logits before reselecting drafts.
  • n-gram Rescoring: Global sausage-lattice enumeration is used with n-gram models via composition in OpenFST, enabling selection of globally consistent blocks.

Empirical evaluation indicates $+5\%$ to $+21\%$ increases in block efficiency on hard datasets, with minor increases in compute/memory overhead and no regression in output quality, since the verification subroutine retains exact greedy equivalence (Kim et al., 2024).
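
The neural-rescoring idea can be sketched as a simple logit interpolation restricted to each head's top-$k$ candidates; the interpolation weight, the top-$k$ restriction, and the names below are assumptions for illustration, not the exact procedure of (Kim et al., 2024).

import numpy as np

def rescore_drafts(head_logits, rescorer_logits, alpha=0.5, top_k=5):
    """Interpolate each draft head's logits with a small rescoring model's
    logits over the head's top_k candidates, then reselect one draft per head.
    Both inputs have shape (h, vocab)."""
    h, _ = head_logits.shape
    new_drafts = np.empty(h, dtype=int)
    for j in range(h):
        cand = np.argpartition(head_logits[j], -top_k)[-top_k:]   # head's top-k token ids
        mixed = alpha * head_logits[j, cand] + (1 - alpha) * rescorer_logits[j, cand]
        new_drafts[j] = cand[mixed.argmax()]
    return new_drafts

rng = np.random.default_rng(0)
head_logits = rng.normal(size=(8, 100))        # h = 8 draft heads, vocabulary of 100
rescorer_logits = rng.normal(size=(8, 100))    # scores from a small local model
print(rescore_drafts(head_logits, rescorer_logits))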

6. Training, Memory Complexity, and Experimental Results

A range of BPT instantiations show strong empirical improvements:

  • Blockwise Decoding: On WMT'14 En→De, block sizes of $k = 4$ to $8$ yield mean accepted block sizes $\bar{k} = 3.27$ to $4.69$ and 2.7×–3.3× wall-clock speedups over standard greedy decoding, with a BLEU drop of less than 1.2 points (Stern et al., 2018).
  • Blockwise Tiling for Long Contexts: The single-GPU training sequence length increases from 16K (vanilla) to 131K (BPT) for a 1B-parameter GPT on an A100, using up to 79 GB of activation memory at the 131K context length (Liu et al., 2023).
  • BP-Transformer: On IMDB and enwik8/text8, BPT achieves both higher accuracy and better bits-per-character for long sequences, with memory remaining linear and throughput steady for sequence lengths up to 8K (Ye et al., 2019).
  • Practical Recommendations: For blockwise decoding, block sizes $k = 4$ to $8$ work best, and tuning via sequence-level distillation or rescoring yields the best block efficiency. For blockwise attention, the block-size choice is hardware- and task-dependent.
The main BPT variants are summarized below.

BPT Variant | Key Advantage | Complexity
Blockwise parallel decoding (BPD) | Wall-clock speedup (3–4×) | $\sim m/\bar{k}$ sequential steps
Blockwise tiling (Liu et al., 2023) | Long context, memory savings | $O(s)$ activation memory per layer
BP-Transformer (Ye et al., 2019) | Multi-scale context, sparse attention | $O(kn\log(n/k))$ per layer

7. Further Blockwise Variants and Extensions

Several further variants exploit blockwise parallelism:

  • Block-Recurrent Transformers (Hutchins et al., 2022) apply gated recurrence over blocks, processing length-$W$ token segments fully in parallel while updating block-level state vectors recurrently. This method achieves $O(NW)$ complexity, supporting very long-range dependencies at constant or reduced runtime cost compared to Transformer-XL; a minimal sketch of the segment-wise recurrence pattern follows this list.
  • Extensions and Open Problems: The blockwise pattern generalizes to encoder-decoder architectures and cross-attention, with additional gains plausible through fusion with mixture-of-experts (MoE) FFN or hybrid sparse-dense attention patterns (Liu et al., 2023). Optimizing block partitioning per layer or for hardware remains an open research direction.
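
To make the segment-wise control flow concrete, the sketch below shows only the recurrence pattern (parallel within a segment, gated state update across segments); it uses a placeholder mixing step rather than the attention-based state update of (Hutchins et al., 2022), and all names are illustrative.

import numpy as np

def block_recurrent_pass(X, W, state, gate=0.9):
    """Process an (s, d) sequence in segments of length W: each segment is
    handled in one shot (placeholder mixing), while a block-level state vector
    is updated recurrently with a simple scalar gate."""
    outputs = []
    for start in range(0, X.shape[0], W):
        seg = X[start:start + W]                        # current length-W segment
        seg_out = seg + state                           # segment sees the carried state
        summary = seg_out.mean(axis=0)                  # segment-level summary
        state = gate * state + (1 - gate) * summary     # gated recurrent state update
        outputs.append(seg_out)
    return np.concatenate(outputs, axis=0), state

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))
Y, final_state = block_recurrent_pass(X, W=128, state=np.zeros(16))
print(Y.shape, final_state.shape)                       # (1024, 16) (16,)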

8. Conclusion

Blockwise Parallel Transformer methods collectively enable large reductions in attention and FFN memory scaling, substantial sequential speedups for autoregressive decoding, and support for extremely long context windows. Their practical design principles include local blockwise attention with global summary routing, independent draft heads with verification, and minimal-to-moderate modifications to standard Transformer architectures. These advances have been rigorously validated both on language modeling and sequence generation tasks, as well as in multilingual retrieval and large-scale reinforcement learning, and can be implemented in standard deep learning frameworks using blockwise tensor tiling and sparse CUDA kernels (Stern et al., 2018, Ye et al., 2019, Liu et al., 2023, Kim et al., 2024, Dang et al., 11 Sep 2025).
