Blockwise Quantization Formats
- Blockwise quantization is a numerical method that partitions vectors or tensors into fixed blocks, applying a shared scale to reduce quantization error.
- The approach optimizes compression and hardware efficiency by adapting bit allocation based on local statistics and using methodologies like block floating-point and adaptive transforms.
- Applications include deep neural network inference, training, and communications, achieving low-bit deployment with minimal accuracy degradation.
Blockwise quantization data formats are a class of numerical representations for vectors and tensors that group contiguous elements into blocks, applying shared quantization parameters within each block. This structure is designed to optimize the balance between quantization error, compression ratio, and hardware efficiency, with broad application across deep neural network inference, training, and communication. Blockwise quantization includes formats such as block floating-point (BFP), block-int with shared scale, block-scaled number formats, and methods integrating blockwise clustering or adaptive transforms. The approach excels in minimizing dynamic range problems and stochastic quantization error by leveraging local statistics, enabling efficient low-bit deployment without significant accuracy loss.
1. Core Principles and Formal Definition
In blockwise quantization, a high-dimensional tensor (e.g., weight matrix, activation vector, or gradient) is partitioned into non-overlapping blocks of fixed size $B$. Each block is quantized independently using a shared scale parameter ("block scale"), typically determined by RMS, absmax, or another tailored statistic:
- Blockwise quantization mapping:
- For a block $\theta = (\theta_1, \dots, \theta_B)$, compute a scale $s$ (absmax, RMS, etc.).
- Normalize and quantize each entry: $q_i = \arg\min_k |\theta_i / s - c_k|$, where $\{c_k\}$ is the codebook, producing e.g. $b$-bit element codes.
- Store the codes and the per-block scale.
The total bits per parameter are $b + b_s / B$, where $b_s$ is the bitwidth for scale encoding.
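As a worked instance of the bits-per-parameter formula (symbols $b$, $b_s$, $B$ as defined above; the concrete numbers are illustrative):

```python
def bits_per_param(b: int, b_s: int, B: int) -> float:
    """Total storage cost per element: code bits plus amortized scale bits."""
    return b + b_s / B

# int4 codes with a float16 scale shared over a 32-element block:
print(bits_per_param(4, 16, 32))  # 4.5
```

Doubling the block size to 64 would amortize the same float16 scale down to 4.25 bits/param, at the cost of a coarser shared scale.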
This construct generalizes to diverse blockwise formats:
- BFP/SBFP: Block floating point with exact or power-of-two shared scales (Soloveychik et al., 2022).
- Block-integer + scale: As in blockwise int4/int8 (Orr et al., 19 May 2025, Khodamoradi et al., 2024, Dong et al., 2023).
- Blockwise adaptive transforms: Blockwise preconditioning or orthogonal transforms followed by quantization (Chen et al., 30 Nov 2025).
- Blockwise clustering/codebooks: Individual codebook assignment per block for improved representational efficiency (Elangovan et al., 7 Feb 2025).
Variable-length encoding naturally emerges at block granularity—blocks with large or outlier-dominated values require more bits on average due to their scale representation, effectively providing adaptive bit allocation (Orr et al., 19 May 2025).
2. Theoretical Framework and Error Analysis
The design and analysis of blockwise formats rely on a statistical and information-theoretic foundation:
- KL divergence and squared-error: Optimal quantization minimizes the KL divergence between the original and quantized model's output distributions, constrained by a total bit budget (Orr et al., 19 May 2025). In the local quadratic regime, this reduces to minimizing Fisher-weighted squared error.
- Block scaling and error: Blockwise scaling exploits the local value distribution to reduce mean-squared quantization error at a fixed bit rate, outperforming fixed-scale quantization under typical data distributions (Gaussian, Laplace, Student-t). Analytically, SBFP with optimal scaling yields mean-squared error that decays exponentially in the mantissa bitwidth; practical block scaling in blockwise int4 formats achieves low KL divergence and small accuracy degradation in LLMs (Soloveychik et al., 2022, Orr et al., 19 May 2025, Khodamoradi et al., 2024).
- Bit allocation across tensors: The optimal average bits per tensor in a multi-tensor model follows a Fisher-weighted allocation rule, yielding up to $0.25$ bits/param savings for large models (Orr et al., 19 May 2025).
- Block size optimization: For a given mantissa width, the optimal block size is determined by balancing per-element quantization noise against shared-scale error, with the optimum shifting toward larger blocks as mantissa width grows (Soloveychik et al., 2022).
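The benefit of block-local scaling can be checked numerically. The following Monte-Carlo sketch (illustrative, not taken from the cited papers) compares one global absmax scale against per-block absmax scales for a symmetric int4-style quantizer on Gaussian data:

```python
import random

def absmax_quant_mse(values, block_size, levels=15):
    """Symmetric uniform quantization with a shared absmax scale per block;
    returns the mean-squared reconstruction error."""
    err = 0.0
    half = levels // 2  # e.g. 15 levels -> codes in [-7, 7] (int4 symmetric)
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        s = max(abs(v) for v in block) or 1.0  # block scale (absmax)
        for v in block:
            q = round(v / s * half)            # integer code
            err += (q / half * s - v) ** 2     # decode and accumulate error
    return err / len(values)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(4096)]
mse_tensor = absmax_quant_mse(data, block_size=len(data))  # one global scale
mse_block = absmax_quant_mse(data, block_size=32)          # per-block scales
print(mse_tensor, mse_block)
```

On this data the per-block variant has clearly lower MSE: the global scale is set by the single largest sample, inflating the step size everywhere, while each 32-element block sees a much smaller local absmax.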
3. Data Formats, Metadata, and Storage Layout
Blockwise formats are distinguished by their local metadata structure and exact bitfield organization. The most common patterns are illustrated in the table:
| Format (example) | Data Payload | Per-Block Metadata |
|---|---|---|
| BFP, SBFP | $B$ $b$-bit mantissas | shared exponent or scale |
| Block integer + scale | $B$ $b$-bit integers | scale ($b_s$ bits) |
| Clustered quant. | $B$ codebook indices | codebook selector, scale |
| Multi-level block FP | $B$ low-bit float-like codes | 1-3 scales, micro-exponents |
Memory layouts typically pack all code fields contiguously, followed by per-block scales (and, if needed, codebook selectors or further micro-exponents) (Luo et al., 11 Feb 2026, Dong et al., 2023, Elangovan et al., 7 Feb 2025).
Floating-point block formats (e.g., HiFloat4, BFP, SBFP) may add multiple scaling levels—HiF4, for example, uses global/8-local/16-finer exponents for 64 elements, blending inter/intra-group dynamic range compression (Luo et al., 11 Feb 2026). Codebook-based blockwise formats (e.g., BCQ) store scalar-to-codebook indices and per-block codebook IDs for locally optimal codebook utilization (Elangovan et al., 7 Feb 2025).
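A minimal sketch of the "codes first, then per-block scales" layout for a block-int4 format (function names and the exact nibble order are illustrative assumptions, not a published spec):

```python
import struct

def pack_block_int4(codes, scale):
    """Pack signed int4 codes two per byte (low nibble first), followed by
    the block scale as a 2-byte IEEE half-precision float."""
    assert len(codes) % 2 == 0 and all(-8 <= c <= 7 for c in codes)
    payload = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        payload.append((lo & 0xF) | ((hi & 0xF) << 4))
    payload += struct.pack('<e', scale)  # per-block scale after the codes
    return bytes(payload)

def unpack_block_int4(buf, block_size):
    """Inverse of pack_block_int4: sign-extend each nibble, then read the scale."""
    codes = []
    for byte in buf[:block_size // 2]:
        for nib in (byte & 0xF, byte >> 4):
            codes.append(nib - 16 if nib >= 8 else nib)  # 4-bit two's complement
    scale, = struct.unpack('<e', buf[block_size // 2:block_size // 2 + 2])
    return codes, scale
```

A block of 32 codes thus occupies 16 bytes of payload plus 2 bytes of metadata, matching the 4.5 bits/param figure discussed elsewhere in this article.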
4. Quantization, Dequantization, and Algorithmic Implementation
The quantization and dequantization process in blockwise formats is uniform:
- Quantization:
- Compute per-block statistics (absmax, RMS, etc.).
- Normalize entries, quantize to codebook or integer field.
- Store codes + scale (+ metadata as needed).
- Dequantization:
- Read scale per block.
- Decode integer/codebook indices.
- Multiply or look up value, reconstructing the original domain.
Canonical pseudocode for block-absmax quantization:

```python
def quantise_block_absmax(theta_block, codebook):
    # Per-block scale: absolute maximum of the block
    n = max(abs(x) for x in theta_block)
    # Normalize entries into [-1, 1]
    u = [x / n for x in theta_block]
    # Nearest-codeword index for each normalized entry
    q = [min(range(len(codebook)), key=lambda k: abs(ui - codebook[k]))
         for ui in u]
    s = n  # encoded as a float16 scale in practice
    return s, q
```
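The matching dequantization is a lookup-and-rescale; a minimal sketch, assuming the same codebook and the (scale, codes) pair produced by the quantizer above:

```python
def dequantise_block(s, q, codebook):
    """Reconstruct a block: look up each code and rescale by the block scale."""
    return [s * codebook[k] for k in q]

# Round-trip usage with a toy 5-entry codebook:
codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(dequantise_block(2.0, [0, 4, 2], codebook))  # [-2.0, 2.0, 0.0]
```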
Hardware-oriented formats (e.g., M²XFP, HiF4) often fuse scale metadata into fast add/shift operations, enabling fixed-point accumulation and deferred floating-point rescaling, while micro-exponent layers or blockwise codebook assignments are incorporated using small, vectorizable lookup tables (Luo et al., 11 Feb 2026, Hu et al., 27 Jan 2026).
5. Empirical Results and Applications
Blockwise quantization data formats are empirically validated across LLMs, vision models, and communication systems:
- LLMs: Direct-cast blockwise int4/BFP yields minimal accuracy loss at $4.5$ bits/param, outperforming both per-tensor fixed-scale schemes and non-blocked integer formats (e.g., Llama 3.1-8B, DeepSeek-V3.1, Mistral-7B) (Orr et al., 19 May 2025, Luo et al., 11 Feb 2026, Hu et al., 27 Jan 2026).
- Vision models: On ImageNet, blockwise int4 with error-diffusion matches or exceeds non-blocked int4 accuracy (Khodamoradi et al., 2024).
- Communications: Complex block floating-point with box encoding yields memory savings with only a modest EVM increase in QAM transceivers (Choo et al., 2017).
- Training/gradient compression: Blockwise 1-bit quantization with shared per-block scale and error-feedback achieves 32× communication reduction with preserved convergence in distributed ResNet/ImageNet (Zheng et al., 2019).
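A minimal sketch of blockwise 1-bit compression with error feedback (names and the mean-absolute scale choice are illustrative assumptions, not the cited paper's exact scheme): the quantization error of each step is retained locally and added to the next gradient before quantizing, which preserves convergence.

```python
def onebit_compress(grad, residual):
    """1-bit quantization with error feedback for one block: transmit signs
    plus one shared scale; keep the quantization error as the new residual."""
    corrected = [g + r for g, r in zip(grad, residual)]       # error feedback
    scale = sum(abs(v) for v in corrected) / len(corrected)   # shared block scale
    signs = [1 if v >= 0 else -1 for v in corrected]          # 1 bit per entry
    decoded = [scale * s for s in signs]                      # receiver's view
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return signs, scale, new_residual
```

Only `signs` (1 bit/entry) and `scale` cross the wire, giving the 32x reduction over float32 gradients noted above; `new_residual` stays on the worker.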
Advanced variations, such as BCQ (W4A4), adapt codebooks and scaling factors per block, achieving only a small accuracy drop even on large LLMs (Elangovan et al., 7 Feb 2025). M²XFP demonstrates reduced accuracy loss over vanilla MXFP4 at the same bit rate (Hu et al., 27 Jan 2026).
6. Parameter Selection, Block Size, and Format Design Principles
Selecting block size and per-block bit budget is a trade-off between representational efficiency, quantization error, and computational complexity:
- Block size: Small $B$ reduces quantization error but increases scaling-metadata overhead per element; large $B$ amortizes scale bits but may incur increased clipping. The empirically optimal range extends up to $B = 128$, depending on model sensitivity and hardware constraints (Soloveychik et al., 2022, Luo et al., 11 Feb 2026, Khodamoradi et al., 2024).
- Scale encoding: bfloat16 or higher-mantissa block scales are generally preferred over exponent-only scales, especially for low-precision formats (Orr et al., 19 May 2025).
- Micro-exponents and hierarchical scaling: Hierarchical scaling (e.g., HiF4's three-level metadata) improves intra-block dynamic range coverage without excessive metadata cost, matching PE data widths and enabling all-fixed-point inner loops (Luo et al., 11 Feb 2026).
- Adaptive codebooks and transforms: Learned/local codebooks and blockwise data-aware linear transforms further minimize quantization error at the expense of small lookup table or transform kernel storage (Elangovan et al., 7 Feb 2025, Chen et al., 30 Nov 2025).
- Per-tensor bit allocation: Fisher-weighted allocation techniques optimize the global KL or MSE objective under a total bit constraint (Orr et al., 19 May 2025).
Format design principles derived from empirical studies advocate for using 3-bit significands, hierarchical scaling, large block sizes tailored to hardware PE granularity, and minimal yet judiciously structured metadata (Luo et al., 11 Feb 2026, Hu et al., 27 Jan 2026, Orr et al., 19 May 2025).
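The per-tensor allocation idea can be sketched with a generic greedy allocator (not the cited paper's exact procedure): under an error model where each extra bit roughly quarters a tensor's weighted MSE (error proportional to $w \cdot 4^{-b}$, with $w$ standing in for a Fisher-style sensitivity weight), repeatedly grant one bit to the tensor with the largest marginal error reduction.

```python
import heapq

def allocate_bits(weights, total_bits, b_min=2, b_max=8):
    """Greedy bit allocation under the error model w * 4**-b:
    grant one bit at a time to the tensor with the best marginal gain."""
    bits = [b_min] * len(weights)
    budget = total_bits - b_min * len(weights)
    gain = lambda w, b: w * (4.0 ** -b) - w * (4.0 ** -(b + 1))
    heap = [(-gain(w, b_min), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        budget -= 1
        if bits[i] < b_max:  # tensor can still accept more bits later
            heapq.heappush(heap, (-gain(weights[i], bits[i]), i))
    return bits

bits = allocate_bits([8.0, 1.0], total_bits=10)
print(bits)  # more bits go to the more sensitive tensor
```

With an 8:1 sensitivity ratio the allocator gives the first tensor roughly $\log_4 8 = 1.5$ extra bits, rounded by the greedy steps.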
7. Use Cases, Extensions, and Limitations
Blockwise quantization is a unifying principle for model compression in LLMs, vision, communications, and distributed optimization:
- Deployment and Inference: Enables direct-cast quantization compatible with high-throughput, low-power accelerator architectures.
- Training/Distributed computation: Supports communication-efficient training via blockwise compressed gradients with theoretical convergence guarantees (Zheng et al., 2019).
- Format Extensibility: Naturally integrates with error-diffusion, adaptive or data-aware block transforms (e.g., WUSH), sparse-outlier encoding, and locally optimal codebooks.
- Limitations: Blockwise formats with too-large $B$ may degrade when data heterogeneity within blocks increases; fine control over metadata and blockwise statistics is essential to avoid accuracy loss (Soloveychik et al., 2022, Khodamoradi et al., 2024). Fixed-length codes are consistently outperformed by block-adaptive or entropy-coded blockwise approaches (Orr et al., 19 May 2025).
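The heterogeneity caveat is easy to see with a toy example (numbers illustrative): a single outlier inflates the shared absmax scale and collapses the resolution available to every other entry in the block.

```python
def absmax_int4(block):
    """Quantize a block to symmetric int4 with a shared absmax scale, then decode."""
    s = max(abs(v) for v in block)           # one outlier sets the whole scale
    return [round(v / s * 7) / 7 * s for v in block]

homogeneous = [0.9, -0.8, 0.7, 0.85]
outlier = [0.9, -0.8, 0.7, 100.0]
print(absmax_int4(homogeneous))  # all entries keep fine resolution
print(absmax_int4(outlier))      # small entries round to 0.0
```

This is why outlier-aware variants (sparse-outlier encoding, smaller blocks, micro-exponents) matter in practice.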
Blockwise quantization forms the current foundation for bit-efficient, accurate, and hardware-aware model deployment across the spectrum of deep learning systems (Orr et al., 19 May 2025, Luo et al., 11 Feb 2026, Khodamoradi et al., 2024, Elangovan et al., 7 Feb 2025, Soloveychik et al., 2022).