Blockwise Transformers

Updated 3 March 2026

Blockwise Transformers are architectures that partition input sequences into fixed-size blocks to reduce quadratic memory and compute costs.
They make self-attention efficient by processing contiguous token blocks in parallel while still computing the softmax exactly, which makes long contexts tractable.
These designs underpin advancements in language modeling, vision, streaming, and model compression, offering substantial speedups and memory savings.

Blockwise Transformers are a class of architectures and algorithmic strategies that reorganize standard Transformer computations, especially the self-attention and feedforward sublayers, into operations over contiguous blocks of tokens. The restructuring is motivated by the prohibitive memory and computational demands of vanilla Transformers on long sequences. It brings practical linear scaling and new model-parallelization capabilities while permitting exact or controlled-approximate computation. Blockwise designs underpin a variety of recent advances in long-context language modeling, scalable vision architectures, streaming sequence processing, and efficient model compression.

1. Core Concepts and Problem Motivation

Transformers compute self-attention as a dense $O(N^2d)$ operation over sequences of length $N$, incurring quadratic memory and compute costs that grow rapidly with context size (Liu et al., 2023, Liu et al., 2023, Qiu et al., 2019). This limitation constrains both training and inference, especially for use cases requiring extended context (e.g., document-level or video understanding, RL with long experience windows). Blockwise Transformers address this by partitioning the input sequence into $T=N/B$ contiguous, fixed-size blocks of $B$ tokens each, and re-expressing attention and feed-forward networks as operations over these blocks.

Key strategies include:

Blockwise computation: Partition $Q,K,V$ projections into blocks to process attention in slices, never materializing the full $N \times N$ score matrix.
Blockwise memory and streaming: Retain only block-local intermediate states, reducing activation storage from $O(N^2)$ to $O(N)$ or $O(BN)$ per layer.
Blockwise parallelism: Schedule computation and communication over device meshes or rings, enabling large contexts across distributed resources (Liu et al., 2023).
Blockwise sparsity: Employ block-diagonal or block-permuted attention patterns for further memory and compute reduction, at modest modeling cost (Qiu et al., 2019).

2. Blockwise Attention Mechanisms and Exactness

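Blockwise self-attention variants preserve exact softmax attention by leveraging incremental, block-local updates to softmax numerators and denominators. For example, in the Blockwise Parallel Transformer (BPT), for each query block $Q_i$ of size $B$, attention proceeds by iteratively accumulating the contribution of every key-value block $(K_j, V_j)$. With block scores $S_{ij} = Q_i K_j^\top / \sqrt{d}$, the running statistics are updated as $m_i \leftarrow \max(m_i^{\text{old}}, \mathrm{rowmax}(S_{ij}))$, $\mathrm{num}_i \leftarrow e^{m_i^{\text{old}} - m_i}\,\mathrm{num}_i + e^{S_{ij} - m_i} V_j$, and $\mathrm{den}_i \leftarrow e^{m_i^{\text{old}} - m_i}\,\mathrm{den}_i + \mathrm{rowsum}(e^{S_{ij} - m_i})$, where the running row-wise maximum $m_i$ ensures numerical stability. After all blocks, $O_i = \mathrm{num}_i / \mathrm{den}_i$ (row-wise). This form is memory-equivalent to FlashAttention-style kernels and retains full expressivity (Liu et al., 2023, Liu et al., 2023).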
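The incremental numerator and denominator updates can be illustrated with a minimal pure-Python sketch. It is not the BPT implementation: the function names are hypothetical, attention is non-causal and single-head, and queries are handled one row at a time for readability (a real kernel would process a whole query block at once). It only aims to show that visiting the key-value blocks one by one reproduces dense softmax attention.

```python
import math

def dense_attention(Q, K, V):
    """Reference softmax attention that materializes every score row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def blockwise_attention(Q, K, V, block):
    """Exact attention that visits K/V one block at a time (online softmax)."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        m = -math.inf        # running maximum of the scores seen so far
        den = 0.0            # running softmax denominator
        num = [0.0] * dv     # running softmax numerator
        for start in range(0, len(K), block):
            Kj, Vj = K[start:start + block], V[start:start + block]
            s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kj]
            m_new = max(m, max(s))
            scale = math.exp(m - m_new)          # rescales the old accumulators
            p = [math.exp(x - m_new) for x in s]
            den = den * scale + sum(p)
            num = [n * scale + sum(pi * v[c] for pi, v in zip(p, Vj))
                   for c, n in enumerate(num)]
            m = m_new
        out.append([n / den for n in num])
    return out

# Toy inputs (arbitrary values chosen for illustration only).
Q = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
K = [[0.2 * (i - j) for j in range(4)] for i in range(6)]
V = [[float(i * j % 5) for j in range(3)] for i in range(6)]

ref = dense_attention(Q, K, V)
blk = blockwise_attention(Q, K, V, block=2)
max_err = max(abs(a - b) for r, s in zip(ref, blk) for a, b in zip(r, s))
```

Only one block of scores exists at any time, so working memory is set by the block size rather than by the number of keys. The block size does not need to divide the sequence length, because the last slice is simply shorter.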

Blockwise attention unlocks critical memory savings:

Vanilla Transformer: $O(N^2)$ attention-score activations per layer.
Blockwise: $O(Nd)$ for projections plus $O(Bd)$ for accumulators. For moderate $B$, this is a decisive reduction.
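To give a feel for the size of the saving, the following sketch counts the scalars held for attention scores in each case. The counting model is an assumption made for illustration: one head, a single block-by-block score tile, and one set of per-row accumulators. The sizes `n`, `block`, and `d` are arbitrary example values, not figures from the cited papers.

```python
def dense_score_entries(n):
    """Entries in the full n x n attention score matrix."""
    return n * n

def blockwise_working_entries(block, d):
    """One (block x block) score tile, a (block x d) numerator,
    plus a running max and a denominator per query row."""
    return block * block + block * d + 2 * block

n, block, d = 32768, 512, 128
dense = dense_score_entries(n)
tile = blockwise_working_entries(block, d)
ratio = dense / tile   # how many times smaller the blockwise working set is
```

Under these assumptions the dense score matrix grows with the square of the sequence length, while the blockwise working set does not depend on the sequence length at all.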

Blockwise attention can also be fused with the feedforward computation for the same block, so that peak activation memory scales with the block size rather than with the full sequence length (Liu et al., 2023).

3. Ring Attention and Device-Parallel Blockwise Scaling

Ring Attention extends blockwise Transformers by sharding blocks across devices, each hosting one query block $Q_i$ together with its local key-value block $(K_i, V_i)$. To compute self-attention over all key-value blocks, devices are arranged in a logical ring; at each step (one per device in the ring), every device concurrently:

Processes the current KV block and updates accumulators,
Sends its current key-value block to its successor, and receives a new key-value block from its predecessor,
Fully overlaps communication and blockwise compute.

This design scales the attainable sequence length linearly with device count, while per-device memory and communication costs depend on the block size rather than on the total sequence length. The block size is chosen so that computing on a block takes at least as long as transferring it, that is, the block size must be at least the ratio of device FLOP/s to communication bandwidth; modern accelerators meet this condition at practical block sizes (Liu et al., 2023).
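The ring schedule can be simulated on a single machine. The sketch below is an illustration under stated assumptions, not the Ring Attention implementation: "devices" are list indices, communication is a list rotation, attention is non-causal and single-head, and all function names are hypothetical. It checks that after every device has seen every key-value block, the result matches dense attention.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update(state, q_rows, k_rows, v_rows):
    """Fold one K/V block into the per-row (max, denominator, numerator) state."""
    d = len(q_rows[0])
    for r, q in enumerate(q_rows):
        m, den, num = state[r]
        s = [dot(q, k) / math.sqrt(d) for k in k_rows]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)
        p = [math.exp(x - m_new) for x in s]
        num = [n * scale + sum(pi * v[c] for pi, v in zip(p, v_rows))
               for c, n in enumerate(num)]
        state[r] = (m_new, den * scale + sum(p), num)

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Simulate P devices; device i owns Q_blocks[i] and starts with K/V block i."""
    P = len(Q_blocks)
    dv = len(V_blocks[0][0])
    states = [[(-math.inf, 0.0, [0.0] * dv) for _ in qb] for qb in Q_blocks]
    kv = list(zip(K_blocks, V_blocks))   # kv[i] is the block currently held by device i
    for _ in range(P):
        for i in range(P):               # conceptually concurrent across devices
            update(states[i], Q_blocks[i], kv[i][0], kv[i][1])
        # Each device sends to its successor and receives from its predecessor.
        # (The final rotation is redundant; P - 1 transfers would suffice.)
        kv = [kv[(i - 1) % P] for i in range(P)]
    return [[[n / den for n in num] for (_, den, num) in st] for st in states]

def dense_attention(Q, K, V):
    """Reference implementation used only to check the simulation."""
    d = len(Q[0])
    out = []
    for q in Q:
        s = [dot(q, k) / math.sqrt(d) for k in K]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

# Toy data: 8 tokens split over 4 simulated devices, 2 tokens each.
Q = [[math.sin(i + j) for j in range(4)] for i in range(8)]
K = [[math.cos(i * j) for j in range(4)] for i in range(8)]
V = [[float((i + 2 * j) % 7) for j in range(3)] for i in range(8)]
split = lambda X: [X[i:i + 2] for i in range(0, 8, 2)]

ring_out = [row for blk in ring_attention(split(Q), split(K), split(V)) for row in blk]
ring_err = max(abs(a - b)
               for r, s in zip(dense_attention(Q, K, V), ring_out)
               for a, b in zip(r, s))
```

Each simulated device stores only its own query block, its accumulators, and one key-value block at a time, which is why per-device memory in this scheme is governed by the block size.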

Empirically, this enables training and inference for millions of tokens (e.g., 4M tokens on 1024 TPUv4 with a 13B model) without reliance on attention approximations or increased overhead, while maintaining near-ideal model FLOP utilization and scalable throughput.

4. Blockwise Variants and Hybrid Architectures

Blockwise designs are employed in several complementary approaches:

Sparse Blockwise Attention: Techniques such as BlockBERT replace the dense $N \times N$ attention matrix with a union of local, block-diagonal submatrices and a small set of global or off-diagonal connections. This reduces time and space complexity by a factor equal to the number of blocks, with empirical pretraining speedups of up to 25% and peak memory savings of up to 36% at a negligible accuracy drop (Qiu et al., 2019).
Block-Recurrent Transformers: Models process blocks sequentially, maintaining recurrent states across blocks. Each block applies vertical (self/cross) attention over the block's tokens and horizontal (self/cross) attention over the states. LSTM-style gating provides memory, and parallelism is retained within each block. Perplexity improvements and order-of-magnitude context length scaling are observed over baseline XL-like models (Hutchins et al., 2022).
Block-State Transformers (BST): Hybridize blockwise self-attention (for local context) with state-space models (for global/infinite context). Block-local context states are integrated via cross-attention, and extensive parallelism is achieved across sequence blocks. BST outperforms comparable architectures on language modeling of long sequences and delivers substantial layer-level speedups for long contexts on modern hardware (Fathi et al., 2023).
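For the sparse variant, a block-permuted attention mask is easy to construct directly. The sketch below is a simplified illustration in the spirit of BlockBERT rather than its actual code: the helper `block_mask` and its `perm` argument are hypothetical names, and the sequence length is assumed to be divisible by the number of blocks.

```python
def block_mask(n, num_blocks, perm):
    """mask[i][j] is True when query i may attend to key j.
    Query block b attends only to key block perm[b]."""
    size = n // num_blocks
    return [[perm[i // size] == j // size for j in range(n)] for i in range(n)]

n, num_blocks = 8, 4
diag = block_mask(n, num_blocks, perm=[0, 1, 2, 3])      # block-diagonal (local) pattern
shifted = block_mask(n, num_blocks, perm=[1, 2, 3, 0])   # off-diagonal connections
kept = sum(map(sum, diag))                               # number of allowed (query, key) pairs
```

Each query row keeps only `n / num_blocks` keys, so the number of score entries falls from `n * n` to `n * n / num_blocks`. Assigning different permutations to different heads lets information cross block boundaries while every head stays sparse.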

The following table summarizes key characteristics of selected blockwise strategies:

Method/Variant	Context Scale	Memory Complexity	Blockwise Exact	Parallelism	Reference
Blockwise Parallel Transformer	Longer than vanilla at equal memory	$O(N)$ to $O(BN)$ per layer	Yes	Per-block	(Liu et al., 2023)
Ring Attention	Device count $\times$ baseline	Independent of $N$ per device	Yes	Device/Ring-wise	(Liu et al., 2023)
BlockBERT	Sequence split into blocks	$O(N^2)$ reduced by the number of blocks	Sparse	Per-block/free head	(Qiu et al., 2019)
Block-Recurrent Transformer	Order of magnitude over XL-like baselines	Linear in $N$	Yes	Per-block	(Hutchins et al., 2022)
Block-State Transformer	Subquadratic (via SSM)	Subquadratic	Yes	Per-block, SSM global	(Fathi et al., 2023)

5. Blockwise Transformers in Streaming and Compression

Blockwise schemes also underpin streaming and blockwise-compressed Transformer designs:

Blockwise Streaming Processing: Contextual block processing with inheritance is foundational for online ASR, SLU, and simultaneous speech translation. At each layer and block, a short "context embedding" from the previous block is appended, enabling modeling of global information with minimal added memory. Masks ensure no future context is seen, and left-context inheritance supports distributed, low-latency deployment (Deng et al., 2022, Tsunoo et al., 2019).
Blockwise Model Compression (BCT): Compression frameworks such as BCT partition all weights and activations into small blocks, applying independent low-bit quantization schemes per block (e.g., 4/8-bit elements plus per-block scales). All nonlinearities (GELU, Softmax, LayerNorm) are compressed accordingly using lookup tables and interpolation. BCT reports substantial compression with only small GLUE accuracy drops and, unlike layerwise quantization, requires no retraining (Dong et al., 2023).
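The idea of per-block scales can be shown with a generic symmetric quantizer. This is a minimal sketch under assumptions, not the BCT scheme itself: the function names, the symmetric rounding rule, and the example weights are all hypothetical and chosen only to show why a separate scale per block helps.

```python
def quantize_blockwise(values, block, bits=8):
    """Symmetric per-block quantization: integer codes plus one float scale per block."""
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        # Fall back to 1.0 so an all-zero block does not produce a zero scale.
        scale = max(abs(x) for x in chunk) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(x / scale))) for x in chunk)
    return codes, scales

def dequantize_blockwise(codes, scales, block):
    """Map integer codes back to floats using the scale of their block."""
    return [c * scales[i // block] for i, c in enumerate(codes)]

# The first block holds small values, the second holds large ones.
weights = [0.02, -0.01, 0.03, 0.015, 4.0, -3.5, 2.25, 1.0]
codes, scales = quantize_blockwise(weights, block=4, bits=8)
restored = dequantize_blockwise(codes, scales, block=4)
errors = [abs(a - b) for a, b in zip(weights, restored)]
```

Because each block carries its own scale, the small values in the first block are not crushed by the large values in the second, which is what would happen with a single scale for the whole tensor. The rounding error of each element is bounded by half of its block's scale.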

6. Blockwise Self-Supervised and Learning Dynamics

Blockwise self-supervised learning (BWSSL) applies local objectives per block without end-to-end gradient flow. In masked video transformers, the encoder is split into several blocks, each with a local decoder and loss. Gradients are stopped between blocks, and each loss updates only the parameters of its own block. Empirical findings on VideoMAE-style ViTs show:

BWSSL converges reliably and matches end-to-end masked autoencoding on linear-probe and retrieval proxies,
High-level features become linearly accessible at mid-depth earlier under BWSSL than under end-to-end training, as shown via probing and centered kernel alignment (CKA) analyses,
Saturation and geometry preservation arise in later blocks, with diminished marginal gains—implicating stabilization interfaces as limiting factors for further local objective-driven progress (Römer et al., 14 Jan 2026).

7. Blockwise Parallel Decoding

Blockwise strategies accelerate autoregressive decoding by predicting a block of $k$ tokens per forward pass via parallel "proposal" heads, then verifying and accepting the longest correct prefix. This substantially reduces the number of decode steps in practice (with minimal BLEU or image quality loss) and yields real wall-clock speedups in both MT and image super-resolution tasks. The core algorithmic change requires only a small multi-output projection in the decoder and a modified inference loop, without modifications to the encoder or attention (Stern et al., 2018).
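The propose-then-verify loop can be sketched with a toy deterministic "model". Everything here is a stand-in chosen for illustration: `next_token` plays the role of the base greedy decoder, `propose` plays the role of the proposal heads (its first two guesses are right and later ones are deliberately wrong), and verification is written as sequential calls even though a real model would check all proposals in one parallel forward pass.

```python
def greedy_decode(next_token, prompt, length):
    """Baseline: one token per step."""
    seq = list(prompt)
    while len(seq) < length:
        seq.append(next_token(seq))
    return seq

def blockwise_decode(next_token, propose, prompt, length, k):
    """Propose k tokens, then keep the longest prefix the base model agrees with."""
    seq, steps = list(prompt), 0
    while len(seq) < length:
        accepted = []
        for tok in propose(seq, k):
            # In a real model these k checks share one parallel forward pass.
            if tok != next_token(seq + accepted):
                break
            accepted.append(tok)
        if not accepted:                      # always make progress
            accepted = [next_token(seq)]
        seq.extend(accepted)
        steps += 1
    return seq[:length], steps

def next_token(seq):
    """Toy base model: a deterministic function of the prefix."""
    return (seq[-1] * 3 + len(seq)) % 11

def propose(seq, k):
    """Toy proposal heads: the first two guesses are right, later ones are wrong."""
    out, cur = [], list(seq)
    for i in range(k):
        tok = next_token(cur)
        if i >= 2:
            tok = (tok + 1) % 11
        out.append(tok)
        cur.append(tok)
    return out

baseline = greedy_decode(next_token, [1], length=9)
fast, steps = blockwise_decode(next_token, propose, [1], length=9, k=4)
```

Because every accepted token is one the base model would have produced anyway, the output is identical to greedy decoding; only the number of steps changes. In this toy setup two tokens are accepted per step, so eight tokens are generated in four steps instead of eight.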

Blockwise Transformers comprise a broad methodological axis encompassing parallel memory-efficient attention, model partitioning, hybridization with recurrent or state-space modules, and block-level quantization or streaming. Recent advances achieve near-infinite context scaling, high-throughput training, low-latency streaming, and memory footprints that avoid quadratic growth in sequence length, enabling previously intractable sequence modeling workloads across modalities and application domains (Liu et al., 2023, Liu et al., 2023, Qiu et al., 2019, Fathi et al., 2023, Hutchins et al., 2022, Deng et al., 2022, Dong et al., 2023, Römer et al., 14 Jan 2026, Stern et al., 2018).