Blockwise Quantization Formats

Updated 25 February 2026
  • Blockwise quantization is a numerical method that partitions vectors or tensors into fixed blocks, applying a shared scale to reduce quantization error.
  • The approach optimizes the trade-off between compression and hardware efficiency by adapting bit allocation to local statistics, using methods such as block floating-point and adaptive transforms.
  • Applications include deep neural network inference, training, and communications, achieving low-bit deployment with minimal accuracy degradation.

Blockwise quantization data formats are a class of numerical representations for vectors and tensors that group contiguous elements into blocks, applying shared quantization parameters within each block. This structure is designed to optimize the balance between quantization error, compression ratio, and hardware efficiency, with broad application across deep neural network inference, training, and communication. Blockwise quantization includes formats such as block floating-point (BFP), block-int with shared scale, block-scaled number formats, and methods integrating blockwise clustering or adaptive transforms. By leveraging local statistics, the approach mitigates dynamic-range problems and reduces stochastic quantization error, enabling efficient low-bit deployment without significant accuracy loss.

1. Core Principles and Formal Definition

In blockwise quantization, a high-dimensional tensor (e.g., weight matrix, activation vector, or gradient) is partitioned into non-overlapping blocks of fixed size $B$. Each block is quantized independently using a shared scale parameter ("block scale"), typically determined by RMS, absmax, or another tailored statistic:

  • Blockwise quantization mapping:
    • For a block $\{\theta_i\}_{i=1}^B$, compute the scale $n = \|\theta_\text{block}\|$ (absmax, RMS, etc.).
    • Normalize and quantize each entry: $q_i = Q(\theta_i / n)$, where $Q$ is the codebook, producing e.g. $b_\text{elem}$-bit element codes.
    • Store the $B$ codes and the per-block scale.

The total bits per parameter are $b_\text{avg} = b_\text{elem} + \frac{b_\text{scale}}{B}$, where $b_\text{scale}$ is the bitwidth of the scale encoding.
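The bit-budget formula can be checked with a short helper (the block size and bitwidths below are illustrative choices, not prescriptions):

```python
def avg_bits_per_param(b_elem: float, b_scale: float, block_size: int) -> float:
    """Average storage cost per parameter: element bits plus amortized scale bits."""
    return b_elem + b_scale / block_size

# Hypothetical example: 4-bit element codes with a 16-bit scale shared by a block of 32.
print(avg_bits_per_param(4, 16, 32))  # -> 4.5
```

Halving the amortized scale cost by doubling the block size (e.g., $B=64$) brings the average down to 4.25 bits/param, at the cost of a coarser shared scale.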

This construct generalizes to diverse blockwise formats.

Variable-length encoding naturally emerges at block granularity—blocks with large or outlier-dominated values require more bits on average due to their scale representation, effectively providing adaptive bit allocation (Orr et al., 19 May 2025).

2. Theoretical Framework and Error Analysis

The design and analysis of blockwise formats rely on a statistical and information-theoretic foundation:

  • KL divergence and squared error: Optimal quantization minimizes $\mathbb{E}_x[\mathrm{D}_{KL}(p_\theta(\cdot\mid x) \,\Vert\, p_{\tilde\theta}(\cdot\mid x))]$ subject to a total bit budget (Orr et al., 19 May 2025). In the local quadratic regime, this reduces to minimizing Fisher-weighted $\ell_2$ error.
  • Block scaling and error: Blockwise scaling exploits the local value distribution to reduce mean-squared quantization error at a fixed bit rate, outperforming fixed-scale quantization under typical data distributions (Gaussian, Laplace, Student-t). Analytically, SBFP with optimal scaling yields $\operatorname{Var}(\Delta E) = O(B \ln B / 2^{2p})$ for $p$-bit mantissas; practical block scaling in blockwise int4 formats achieves low KL divergence and $<1\%$ accuracy degradation in LLMs (Soloveychik et al., 2022, Orr et al., 19 May 2025, Khodamoradi et al., 2024).
  • Bit allocation across tensors: The optimal average bitwidth for tensor $t$ in a multi-tensor model is

$$b_t^* = b^0 + \log_2(\operatorname{RMS}(\theta_{T_t})) + \tfrac{1}{2}\log_2(\bar f_t),$$

yielding up to $0.25$ bits/param savings for large models (Orr et al., 19 May 2025).

  • Block size optimization: For a given mantissa width, the optimal block size balances quantization noise against scaling error; e.g., for $p=4$ bits, $B^* \approx 64$ (Soloveychik et al., 2022).
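The bit-allocation rule above can be transcribed directly; tensors with larger typical magnitudes or higher Fisher sensitivity receive more bits. The baseline $b^0$, RMS, and mean-Fisher values below are illustrative inputs, not values from the cited work:

```python
import math

def optimal_bits(b0: float, rms: float, mean_fisher: float) -> float:
    """b_t* = b0 + log2(RMS(theta_t)) + 0.5 * log2(mean Fisher diagonal)."""
    return b0 + math.log2(rms) + 0.5 * math.log2(mean_fisher)

# A tensor with 2x the RMS and 4x the Fisher sensitivity of the baseline
# is allocated two extra bits on average.
print(optimal_bits(4.0, 2.0, 4.0))  # -> 6.0
```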

3. Data Formats, Metadata, and Storage Layout

Blockwise formats are distinguished by their local metadata structure and exact bitfield organization. The most common patterns are illustrated in the table:

| Format (example) | Data payload | Per-block metadata |
|---|---|---|
| BFP, SBFP | $B$ $p$-bit mantissas | Shared exponent or scale |
| Block integer + scale | $B$ $b_\text{elem}$-bit integers | $b_\text{scale}$-bit scale |
| Clustered quantization | $B$ $b_\text{elem}$-bit codes | Codebook selector, scale |
| Multi-level block FP | $B$ $b$-bit float-like codes | 1-3 scales, micro-exponents |

Memory layouts typically pack all code fields contiguously, followed by per-block scales (and, if needed, codebook selectors or further micro-exponents) (Luo et al., 11 Feb 2026, Dong et al., 2023, Elangovan et al., 7 Feb 2025).

Floating-point block formats (e.g., HiFloat4, BFP, SBFP) may add multiple scaling levels—HiF4, for example, uses global/8-local/16-finer exponents for 64 elements, blending inter/intra-group dynamic range compression (Luo et al., 11 Feb 2026). Codebook-based blockwise formats (e.g., BCQ) store scalar-to-codebook indices and per-block codebook IDs for locally optimal codebook utilization (Elangovan et al., 7 Feb 2025).
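The "codes first, then per-block scale" layout can be sketched as a packing routine. Four-bit codes with a float16 scale are one common configuration; the helper below is an illustrative sketch, not a specific format's layout:

```python
import struct
import numpy as np

def pack_block(codes: np.ndarray, scale: float) -> bytes:
    """Pack B 4-bit codes contiguously (two per byte), then a float16 block scale."""
    assert codes.size % 2 == 0 and int(codes.max()) < 16
    lo, hi = codes[0::2], codes[1::2]
    payload = ((hi << 4) | lo).astype(np.uint8).tobytes()
    return payload + struct.pack("<e", float(scale))  # "<e" = little-endian float16

blob = pack_block(np.arange(32, dtype=np.uint8) % 16, 0.125)
print(len(blob) * 8 / 32)  # -> 4.5 bits per element (16 payload bytes + 2 scale bytes)
```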

4. Quantization, Dequantization, and Algorithmic Implementation

The quantization and dequantization process in blockwise formats is uniform:

  • Quantization:
    • Compute per-block statistics (absmax, RMS, etc.).
    • Normalize entries, quantize to codebook or integer field.
    • Store codes + scale (+ metadata as needed).
  • Dequantization:
    • Read scale per block.
    • Decode integer/codebook indices.
    • Multiply or look up value, reconstructing the original domain.

Canonical pseudocode for block-absmax quantization (rendered here as runnable Python with NumPy):

import numpy as np

def quantise_block_absmax(theta_block, codebook):
    # Shared per-block scale: the largest magnitude in the block (absmax).
    n = np.max(np.abs(theta_block))
    # Normalize entries by the scale, then assign each to the nearest
    # entry of the precomputed codebook.
    u = theta_block / n
    q = np.argmin(np.abs(u[:, None] - codebook[None, :]), axis=1)
    # Encode the per-block scale in 16 bits.
    s = np.float16(n)
    return s, q
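The inverse mapping, shown with a small round trip. The uniform 16-entry codebook here is an illustrative stand-in for whichever codebook a given format defines:

```python
import numpy as np

def dequantise_block(s, q, codebook):
    """Invert blockwise quantization: codebook lookup, then rescale by the block scale."""
    return np.float32(s) * codebook[q]

codebook = np.linspace(-1.0, 1.0, 16)   # illustrative uniform 4-bit codebook
theta = np.array([0.5, -2.0, 1.0, 0.25])
n = np.max(np.abs(theta))               # absmax block scale
q = np.argmin(np.abs((theta / n)[:, None] - codebook[None, :]), axis=1)
theta_hat = dequantise_block(np.float16(n), q, codebook)
# Reconstruction error is bounded by the scale times half the codebook spacing.
print(np.max(np.abs(theta_hat - theta)) <= n * (2 / 15) / 2 + 1e-6)  # -> True
```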
Extensions incorporate blockwise error-diffusion (Khodamoradi et al., 2024), cluster assignment (BCQ (Elangovan et al., 7 Feb 2025)), or adaptive transforms (WUSH (Chen et al., 30 Nov 2025)) in the quantization loop.

Hardware-oriented formats (e.g., M²XFP, HiF4) often fuse scale metadata into fast add/shift operations, enabling fixed-point accumulation and deferred floating-point rescaling, while micro-exponent layers or blockwise codebook assignments are incorporated via small, vectorizable lookup tables (Luo et al., 11 Feb 2026, Hu et al., 27 Jan 2026).

5. Empirical Results and Applications

Blockwise quantization data formats are empirically validated across LLMs, vision models, and communication systems:

  • LLMs: Direct-cast blockwise int4/BFP yields $<1\%$ accuracy loss at $4.5$ bits/param, outperforming both per-tensor fixed-scale schemes and non-blocked integer formats (e.g., Llama 3.1-8B, DeepSeek-V3.1, Mistral-7B) (Orr et al., 19 May 2025, Luo et al., 11 Feb 2026, Hu et al., 27 Jan 2026).
  • Vision models: On ImageNet, blockwise int4 with error-diffusion matches or exceeds non-blocked int4 accuracy (Khodamoradi et al., 2024).
  • Communications: Complex block floating-point with box encoding yields $\sim 20\%$ memory savings and $<0.01\%$ EVM increase in QAM transceivers (Choo et al., 2017).
  • Training/gradient compression: Blockwise 1-bit quantization with shared per-block scale and error-feedback achieves 32× communication reduction with preserved convergence in distributed ResNet/ImageNet (Zheng et al., 2019).
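The gradient-compression scheme in the last bullet reduces to a few lines for a single block and worker. Using the mean absolute value as the shared scale is our illustrative choice; the cited work's exact scale statistic may differ:

```python
import numpy as np

def onebit_compress(grad: np.ndarray, residual: np.ndarray):
    """Blockwise 1-bit quantization with error feedback: transmit signs plus one
    shared scale per block; the quantization error is carried to the next step."""
    g = grad + residual                  # fold in the previous step's error
    scale = np.mean(np.abs(g))           # shared per-block scale
    compressed = scale * np.sign(g)      # 1 bit/element + one scale value
    return compressed, g - compressed    # new residual (error feedback)

g = np.array([0.3, -0.1, 0.8, -0.4])
c, r = onebit_compress(g, np.zeros_like(g))
print(c)  # signs scaled by the shared scale (about 0.4 here)
```

Because the residual is fed back, the compression error does not accumulate across steps, which is what preserves convergence.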

Advanced variations, such as BCQ (W4A4), adapt codebooks and scaling factors per block, achieving $\leq 1\%$ accuracy drop even on large LLMs (Elangovan et al., 7 Feb 2025). M²XFP demonstrates a $70.6\%$ reduction in accuracy loss over vanilla MXFP4 at the same bit rate (Hu et al., 27 Jan 2026).

6. Parameter Selection, Block Size, and Format Design Principles

Selecting block size and per-block bit budget is a trade-off between representational efficiency, quantization error, and computational complexity:

  • Block size: Small $B$ reduces quantization error but increases per-element scaling metadata overhead; large $B$ amortizes scale bits but may incur increased clipping. The empirically optimal range is $B=32$–$128$ for $p=4$, depending on model sensitivity and hardware constraints (Soloveychik et al., 2022, Luo et al., 11 Feb 2026, Khodamoradi et al., 2024).
  • Scale encoding: bfloat16 or higher-mantissa block scales are generally preferred over exponent-only scales, especially for low-precision formats (Orr et al., 19 May 2025).
  • Micro-exponents and hierarchical scaling: Hierarchical scaling (e.g., HiF4's three-level metadata) improves intra-block dynamic range coverage without excessive metadata cost, matching PE data widths and enabling all-fixed-point inner loops (Luo et al., 11 Feb 2026).
  • Adaptive codebooks and transforms: Learned/local codebooks and blockwise data-aware linear transforms further minimize quantization error at the expense of small lookup table or transform kernel storage (Elangovan et al., 7 Feb 2025, Chen et al., 30 Nov 2025).
  • Per-tensor bit allocation: Fisher-weighted allocation techniques optimize the global KL or MSE objective under a total bit constraint (Orr et al., 19 May 2025).

Format design principles derived from empirical studies advocate for using 3-bit significands, hierarchical scaling, large block sizes tailored to hardware PE granularity, and minimal yet judiciously structured metadata (Luo et al., 11 Feb 2026, Hu et al., 27 Jan 2026, Orr et al., 19 May 2025).

7. Use Cases, Extensions, and Limitations

Blockwise quantization is a unifying principle for model compression in LLMs, vision, communications, and distributed optimization:

  • Deployment and Inference: Enables direct-cast quantization compatible with high-throughput, low-power accelerator architectures.
  • Training/Distributed computation: Supports communication-efficient training via blockwise compressed gradients with theoretical convergence guarantees (Zheng et al., 2019).
  • Format Extensibility: Naturally integrates with error-diffusion, adaptive or data-aware block transforms (e.g., WUSH), sparse-outlier encoding, and locally optimal codebooks.
  • Limitations: Blockwise formats with overly large $B$ may degrade as data heterogeneity within blocks increases; fine control over metadata and blockwise statistics is essential to avoid accuracy loss (Soloveychik et al., 2022, Khodamoradi et al., 2024). Fixed-length codes are consistently outperformed by block-adaptive or entropy-coded blockwise approaches (Orr et al., 19 May 2025).

Blockwise quantization forms the current foundation for bit-efficient, accurate, and hardware-aware model deployment across the spectrum of deep learning systems (Orr et al., 19 May 2025, Luo et al., 11 Feb 2026, Khodamoradi et al., 2024, Elangovan et al., 7 Feb 2025, Soloveychik et al., 2022).
