Blockwise Transformer Compression

Updated 6 May 2026
  • The paper demonstrates that blockwise quantization achieves up to 8× compression with under 1% accuracy loss on benchmarks like GLUE.
  • BCT partitions weights, activations, and embeddings into small blocks, applying adaptive low-bit quantization tailored to local statistics.
  • The method requires no retraining, supports integer-only inference, and streamlines deployment for large-scale Transformer models.

Blockwise Compression of Transformers (BCT) is a post-training quantization and compression framework designed to address the computation and memory bottlenecks inherent in large-scale Transformer models such as BERT, GPT-3, and GPT-4. Unlike traditional layerwise quantization methods, BCT operates at the granularity of small sub-tensor “blocks,” thereby significantly improving the efficiency-accuracy trade-off without requiring retraining or fine-tuning of model parameters. BCT achieves high compression ratios by partitioning all tensors—weights, activations, embeddings, and nonlinearities—into blocks and compressing each block using adaptive, low-bit quantization, optionally enhanced by entropy coding. Experimental results indicate that BCT can compress models by up to 8× with a sub-1% accuracy drop on benchmarks like GLUE, surpassing conventional uniform quantization in both fidelity and deployment simplicity (Dong et al., 2023, Dong et al., 2023).

1. Motivation and Principle of Blockwise Compression

Transformers’ state-of-the-art performance in natural language processing comes at the cost of heavy computation and memory use. Large models typically require hundreds of gigabytes of storage and at least $10^{23}$ floating-point operations for inference (Dong et al., 2023). Layerwise post-training quantization (PTQ) methods, which assign a single scale or bit-width to an entire tensor, are prone to significant accuracy degradation because they cannot adapt to the nonuniform statistics of different tensor subregions: under a scale chosen for the whole tensor, regions with a small dynamic range are quantized too coarsely, while high-variance regions are poorly served by a scale that is not tuned to them. BCT mitigates these distribution mismatches by partitioning tensors into small, non-overlapping blocks and quantizing each block separately, which matches local statistics more closely and reduces quantization error.

BCT is explicitly designed for post-training application—no gradient-based retraining or output calibration is needed. The approach is fully compatible with integer-only or low-bit floating-point inference, tailored for minimal latency and maximal throughput.

2. Blockwise Partitioning and Quantization Methodology

Block Partitioning Strategy

Given a weight matrix $W\in\mathbb{R}^{M\times N}$ or activation tensor, a block size $b$ is selected (e.g., $b=64$), balancing hardware efficiency and statistical representation. The matrix is partitioned into blocks $W_{i,j} \in \mathbb{R}^{b\times b}$, creating up to $\frac{M}{b}\times\frac{N}{b}$ independent quantization domains. Activations are partitioned analogously.
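The partitioning step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `partition_blocks` and the handling of edge blocks (which are simply left smaller when $M$ or $N$ is not a multiple of $b$) are assumptions made for the example.

```python
def partition_blocks(matrix, b):
    """Split a 2-D list of lists into non-overlapping b x b blocks.

    Returns a dict mapping the block index (i, j) to the block itself.
    Assumption: edge blocks are left smaller when the matrix dimensions
    are not multiples of b (the source does not specify padding).
    """
    rows, cols = len(matrix), len(matrix[0])
    blocks = {}
    for i, r0 in enumerate(range(0, rows, b)):
        for j, c0 in enumerate(range(0, cols, b)):
            blocks[(i, j)] = [row[c0:c0 + b] for row in matrix[r0:r0 + b]]
    return blocks


# A 4 x 6 toy matrix with b = 2 yields a 2 x 3 grid of blocks.
W = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
blocks = partition_blocks(W, 2)
print(len(blocks))      # 6
print(blocks[(1, 2)])   # [[16.0, 17.0], [22.0, 23.0]]
```

Each entry of `blocks` is then an independent quantization domain with its own scale or shift.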

Quantization Process

Each block $W_{i,j}$ undergoes separate quantization using either a scale-round or a shift-based method:

  • Scale-round: For each block, compute a local scale $\alpha_{i,j}$:

$$\alpha_{i,j} = \frac{\max_{u,v} |W_{i,j}[u,v]|}{2^{k-1} - 1}$$

The entries are divided by $\alpha_{i,j}$, rounded, and clipped to $k$ bits, giving integers $Q_{i,j}$; dequantization is $\hat{W}_{i,j} = \alpha_{i,j}\, Q_{i,j}$.

  • Shift-based (hardware-friendly): For each block, derive a bit-shift parameter reflecting the maximum representable range; quantization becomes a simple bit shift and clipping operation, improving integer inference throughput (Dong et al., 2023).
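The two per-block quantizers can be sketched as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical, the shift rule (smallest power-of-two step that covers the block's maximum) is one plausible reading of "a bit-shift parameter reflecting the maximum representable range", and floating-point division stands in for what would be a bit shift on integer hardware.

```python
import math


def quantize_scale_round(block, k):
    """Scale-round: alpha = max|w| / (2^(k-1) - 1), then round and clip."""
    qmax = 2 ** (k - 1) - 1
    max_abs = max(abs(w) for row in block for w in row)
    alpha = max_abs / qmax if max_abs > 0 else 1.0  # guard all-zero blocks
    q = [[max(-qmax, min(qmax, round(w / alpha))) for w in row] for row in block]
    return q, alpha


def quantize_shift(block, k):
    """Shift-based: restrict the step size to a power of two, 2**shift.

    Assumption: shift is the smallest integer with max|w| <= qmax * 2**shift.
    """
    qmax = 2 ** (k - 1) - 1
    max_abs = max(abs(w) for row in block for w in row)
    shift = math.ceil(math.log2(max_abs / qmax)) if max_abs > 0 else 0
    step = 2.0 ** shift
    q = [[max(-qmax, min(qmax, round(w / step))) for w in row] for row in block]
    return q, shift


def dequantize(q, step):
    """Map integer codes back to real values with the block's step size."""
    return [[v * step for v in row] for row in q]


block = [[10.0, -3.2], [0.6, 5.4]]
q_scale, alpha = quantize_scale_round(block, 4)
q_shift, shift = quantize_shift(block, 4)
print(q_scale)          # [[7, -2], [0, 4]]
print(q_shift, shift)   # [[5, -2], [0, 3]] 1
```

The scale-round variant uses the full 4-bit range (the largest entry maps to 7), while the shift variant gives up part of the range in exchange for a step size that needs no multiplier. Note that Python's built-in `round` rounds ties to the nearest even integer; a hardware implementation may use a different tie rule.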

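Blockwise scales or shifts make fuller use of the available bit-width in each block, reducing the mean squared quantization error relative to layerwise approaches. For a block with dequantized values $\hat{W}_{i,j}$, this error is

$$\mathrm{MSE}_{i,j} = \frac{1}{b^2} \sum_{u,v} \left( W_{i,j}[u,v] - \hat{W}_{i,j}[u,v] \right)^2$$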

This local adaptation ensures each block’s dynamic range is fully covered by the available bit levels.
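A small numerical experiment makes the effect concrete. The sketch below is illustrative only: the two toy blocks (one with a narrow range, one with a wide range) and all function names are assumptions chosen for the example, and each block is stored as a flat list for brevity.

```python
def max_abs_scale(values, k):
    """Scale that maps the largest magnitude onto the top k-bit level."""
    return max(abs(v) for v in values) / (2 ** (k - 1) - 1)


def quant_dequant(values, alpha, k):
    """Round to k-bit signed integers with scale alpha, then map back."""
    qmax = 2 ** (k - 1) - 1
    return [alpha * max(-qmax, min(qmax, round(v / alpha))) for v in values]


def mse(original, restored):
    return sum((x - y) ** 2 for x, y in zip(original, restored)) / len(original)


K = 4
small_block = [0.01 * i for i in range(-8, 8)]   # narrow dynamic range
large_block = [1.0 * i for i in range(-8, 8)]    # wide dynamic range
tensor = small_block + large_block

# Layerwise: one scale for the whole tensor, dominated by large_block.
layer_alpha = max_abs_scale(tensor, K)
mse_layer = mse(tensor, quant_dequant(tensor, layer_alpha, K))

# Blockwise: one scale per block.
restored = []
for block in (small_block, large_block):
    restored += quant_dequant(block, max_abs_scale(block, K), K)
mse_block = mse(tensor, restored)

print(mse_block < mse_layer)   # True
```

Under the layerwise scale every entry of `small_block` rounds to zero, so its values are lost entirely; with its own scale the same block keeps all 4-bit levels.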

Comparison of Quantization Granularities

| Quantization Granularity | Bit-width Adaptivity | Need for Retraining | Quantization Error |
|---|---|---|---|
| Layerwise | Global (Low) | Often Needed | High |
| Blockwise (BCT) | Local (High) | Not Required | Low |

Blockwise approaches track local tensor statistics, in contrast to the coarse, global scaling of layerwise quantization.

3. Compression of Transformer Components

BCT applies blockwise quantization across all model components:

  • Embeddings: Token and positional embeddings are partitioned and quantized.
  • Attention Weights: Each of the multi-head attention matrices (Q, K, V, output) is blockwise quantized.
  • Feed-forward Networks: All linear and bias parameters in intermediate MLPs are processed identically.
  • Activations: Intermediate tensors—including those passed to Softmax, GELU, and LayerNorm—are re-quantized at block granularity before each subsequent operation.
  • Nonlinearities: Function approximations (e.g., GELU and the exponential in Softmax) use low-bit lookup tables with possible linear interpolation, further reducing memory and compute.
  • Accumulator Alignment: Blockwise quantized matrix products require local exponent (shift) alignment before addition, implemented via simple bit shifts (Dong et al., 2023).
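The lookup-table treatment of nonlinearities can be illustrated with the exponential used in Softmax. This is a sketch under assumptions, not the paper's table design: the input range $[-8, 0]$, the 256-entry size, and the names `lut_exp` and `lut_softmax` are chosen for the example, and the table stores ordinary floats where a low-bit table would store quantized entries.

```python
import math

LUT_SIZE = 256
X_MIN, X_MAX = -8.0, 0.0   # assumed input range after subtracting the row max
STEP = (X_MAX - X_MIN) / (LUT_SIZE - 1)
EXP_LUT = [math.exp(X_MIN + i * STEP) for i in range(LUT_SIZE)]


def lut_exp(x):
    """Approximate exp(x) by table lookup with linear interpolation."""
    x = max(X_MIN, min(X_MAX, x))          # clamp to the table's range
    pos = (x - X_MIN) / STEP
    i = min(int(pos), LUT_SIZE - 2)        # left neighbour in the table
    frac = pos - i
    return EXP_LUT[i] + frac * (EXP_LUT[i + 1] - EXP_LUT[i])


def lut_softmax(logits):
    """Softmax that replaces exp() with the table lookup above."""
    m = max(logits)
    e = [lut_exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


probs = lut_softmax([1.0, 2.0, 3.0])
print(abs(lut_exp(-1.0) - math.exp(-1.0)) < 1e-3)   # True
```

Subtracting the row maximum keeps every table input at or below zero, which is why a one-sided range is enough here.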

This unified approach achieves model-wide compression while ensuring numerical stability and execution efficiency.

4. Procedural Workflow and Inference Integration

BCT operates in two phases:

  • Model Compression:
  1. Partition each tensor into $b\times b$ blocks.
  2. For each block, compute the scaling or shift parameter.
  3. Quantize and store the integerized parameters along with per-block metadata (scales/shifts).
  • Inference:
  1. Inputs are cast or quantized into blocks.
  2. Each layer operates on blockwise integer (or low-bit float) representations.
  3. In matrix multiplications, local shifts are aligned; elementwise operations reference quantized function tables.
  4. Final outputs are decoded to floating-point as required (Dong et al., 2023).
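The shift alignment in inference step 3 can be sketched for a single dot product whose operands are stored as blocks of integers with power-of-two scales. This is an illustrative sketch: the function name and data layout are hypothetical, and aligning to the smallest exponent with left shifts is one possible choice (it is exact in integer arithmetic but widens the accumulator).

```python
def aligned_block_dot(blocks_a, blocks_b):
    """Dot product of two blockwise shift-quantized vectors.

    Each argument is a list of (int_values, shift) pairs, where the real
    value of an entry is q * 2**shift. Returns (accumulator, exponent).
    """
    partials = []
    for (qa, sa), (qb, sb) in zip(blocks_a, blocks_b):
        p = sum(x * y for x, y in zip(qa, qb))   # integer multiply-accumulate
        partials.append((p, sa + sb))            # exponents add under multiplication
    e_min = min(e for _, e in partials)
    acc = 0
    for p, e in partials:
        acc += p << (e - e_min)                  # align to the common exponent
    return acc, e_min


a = [([3, -2], 0), ([1, 4], -2)]    # real values: 3, -2, 0.25, 1.0
b = [([5, 1], -1), ([2, -3], 2)]    # real values: 2.5, 0.5, 8, -12
acc, exponent = aligned_block_dot(a, b)
print(acc, exponent)                # -7 -1
print(acc * 2.0 ** exponent)        # -3.5
```

The final multiplication by `2.0 ** exponent` corresponds to step 4, decoding the integer result to floating point.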

No retraining or calibration is performed. The model graph structure is unchanged; only the parameter storage and arithmetic are modified, which facilitates drop-in deployment through standard inference toolchains (e.g., ONNX or TensorRT backends).

5. Empirical Results and Comparative Analysis

Empirical evaluation over the General Language Understanding Evaluation (GLUE) tasks demonstrates the efficacy of blockwise compression:

| Model | Average Bit-width | Compression Ratio | SST-2 Acc. | MNLI Acc. | ΔAcc (SST-2) |
|---|---|---|---|---|---|
| BCT_int4/8 | 4.5 | nearly 8× | 90.94% | 80.08% | −0.80% |
| Q8BERT | 8 | 4× | 91.61% | N/A | −0.13% |
| BERT | 32 | 1× | 91.74% | 83.61% | 0 (baseline) |

BCT is reported to match the fully quantized FQ-BERT baseline and exceeds Q8BERT in compression ratio, without the need for retraining. On SST-2 the accuracy drop stays under 1% (−0.80%) even at nearly 8× reduction in model size, although Q8BERT retains slightly more accuracy (−0.13%) at a lower compression ratio, and the MNLI gap in the table is larger (80.08% versus 83.61%) (Dong et al., 2023). Further, BCT_fp8 (full model quantized to fp8) exhibits no measurable accuracy loss.

Latency measurements suggest a CPU inference latency reduction of approximately 40% (3.5 ms to 2.1 ms) and a GPU latency reduction of 25% (1.2 ms to 0.9 ms) for BBCT-style compression (Dong et al., 2023).
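The percentages quoted for these timings are relative latency reductions; expressed as speed-up factors the same measurements look different. The short calculation below uses only the four timings given above, with hypothetical helper names.

```python
def latency_reduction(before_ms, after_ms):
    """Fraction of the original latency that was removed."""
    return (before_ms - after_ms) / before_ms


def speedup_factor(before_ms, after_ms):
    """How many times faster a single inference becomes."""
    return before_ms / after_ms


for name, before, after in (("CPU", 3.5, 2.1), ("GPU", 1.2, 0.9)):
    print(f"{name}: {latency_reduction(before, after):.0%} lower latency, "
          f"{speedup_factor(before, after):.2f}x speed-up")
# CPU: 40% lower latency, 1.67x speed-up
# GPU: 25% lower latency, 1.33x speed-up
```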

6. Advantages, Limitations, and Theoretical Considerations

Advantages

  • Distribution-Matched Scaling: Per-block scales or shifts match local tensor statistics, reducing both the quantization error and the shift in data distribution that quantization introduces.
  • No Retraining Required: Empirical results indicate sufficiently low quantization noise that model accuracy is retained without fine-tuning.
  • Hardware Compatibility: The approach supports integer-only or low-bit float inference, enabling efficient deployment.
  • Universality: BCT extends to all transformer-style architectures due to the uniformity of their computational primitives (Dong et al., 2023).

Limitations

  • Block Size Trade-off: Excessively small blocks increase metadata overhead; large blocks lose statistical adaptation. A block size of 64 balances these effects.
  • Quantization of Softmax: Activation quantization within the self-attention Softmax below 4-bit can be numerically unstable.
  • Lookup Table Overhead: Nonlinearities require lookup tables, but these are compact (typically 256 entries).
  • Hardware Tiling Constraints: Efficient deployment requires block sizes matched to backend memory tiling.
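The block size trade-off can be quantified with a simple storage model. The sketch assumes each $b\times b$ block stores $k$ bits per element plus one piece of metadata of `meta_bits` bits; the 16-bit metadata size and the 32-bit baseline are assumptions for illustration, not values taken from the paper.

```python
def effective_bits(k, b, meta_bits):
    """Average stored bits per element: k payload bits plus per-block metadata."""
    return k + meta_bits / (b * b)


def compression_ratio(k, b, meta_bits, baseline_bits=32):
    """Storage ratio relative to a baseline_bits-per-element model."""
    return baseline_bits / effective_bits(k, b, meta_bits)


for b in (4, 16, 64):
    print(b, effective_bits(4, b, 16), round(compression_ratio(4, b, 16), 2))
# 4 5.0 6.4
# 16 4.0625 7.88
# 64 4.00390625 7.99
```

Under this model the metadata cost is already negligible at $b=64$, so the remaining pressure toward larger blocks comes from hardware tiling rather than storage, while the pressure toward smaller blocks comes from statistical adaptation.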

7. Extensions and Application Scope

BCT is compatible with extensions such as mixed-precision quantization (e.g., fp8 for sensitive layers) and optional entropy coding (Huffman or arithmetic coding) after blockwise quantization, enabling minor additional compression gains at the cost of modest decode complexity (Dong et al., 2023). BCT is well-suited for deployment in resource-constrained environments, facilitating real-time inference on large models without retraining prerequisites. The method is readily integrated into major model deployment toolchains through quantized linear operator kernels that decode blockwise parameters on-the-fly, with minimal hardware or software refactoring.
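Whether entropy coding is worth its decode cost depends on how peaked the distribution of quantized codes is. A quick way to estimate the headroom is to compute the Shannon entropy of the codes, which lower-bounds the average bits per symbol of any lossless symbol-wise coder such as Huffman coding. The sketch below uses an artificially peaked toy distribution and a hypothetical function name; real quantized tensors may be much closer to uniform, consistent with the minor gains described above.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy of the empirical symbol distribution, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy 4-bit codes with probabilities 1/2, 1/4, 1/8, 1/8.
codes = [0] * 8 + [1] * 4 + [-1] * 2 + [2] * 2
h = entropy_bits(codes)
print(h)   # 1.75
```

If the entropy is close to the fixed bit-width, entropy coding adds decode complexity for little benefit.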


Blockwise Compression of Transformers, as exemplified by BCT and BBCT, demonstrates that fine-grained, post-training quantization provides a scalable and practical solution to the challenge of deploying resource-intensive Transformer models while preserving accuracy and minimizing engineering overhead (Dong et al., 2023, Dong et al., 2023).
