
Fine-Grained Quantization

Updated 27 December 2025
  • Fine-Grained Quantization (FGQ) quantizes neural network weights in small blocks, each with its own scale, to adapt to local variations in value range.
  • It employs adaptive group sizing and mixed-precision strategies to balance quantization error with compression and hardware efficiency.
  • FGQ drives modern hardware optimizations by enabling fused dequantization and efficient integer scaling, enhancing inference in large language and vision models.

Fine-grained quantization (FGQ) refers to quantization strategies in which each small block, group, cluster, or even individual element of a neural network’s weights and/or activations is quantized using its own scale or precision. This is in contrast to coarser schemes such as per-tensor or per-channel quantization, where all elements in a large block share a single scale and format. FGQ is motivated by the observation that local dynamic range and outlier distributions vary significantly within neural network parameters, and thus adaptive, spatially refined quantization granularity can achieve superior accuracy, higher compression rates, and improved energy efficiency. FGQ is widely adopted in contemporary LLMs, vision models, and edge deployments, driving efficient inference pipelines and specialized hardware designs.

1. Quantization Granularity and Mathematical Foundations

FGQ partitions neural network tensors along the reduction or data (dot-product) dimension into contiguous, non-overlapping blocks or clusters of fixed or adaptive size. Each block is quantized independently.

The canonical quantization function for a block $B$ is:

$$Q(w) = \operatorname{clamp}\left(\mathrm{round}(w / s_B),\, Q_{\min},\, Q_{\max}\right) \times s_B,$$

where $s_B$ is the block scale and the clamping interval $[Q_{\min}, Q_{\max}]$ matches the target quantizer. Mixed or custom formats (integer or floating-point) are also supported depending on the deployment (Chen et al., 29 Oct 2025, Jang et al., 2 Jan 2025, Hooper et al., 19 Apr 2025).
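
A minimal NumPy sketch of this block-wise quantizer is given below. It is illustrative only; the absmax scale choice, the symmetric integer range, and the block layout are assumptions, not taken from any particular paper.

```python
import numpy as np

def quantize_blockwise(w, block_size=64, n_bits=4):
    """Symmetric block-wise quantization: each contiguous block gets its own
    absmax-derived scale s_B (assumed scheme; illustrative only)."""
    q_max = 2 ** (n_bits - 1) - 1                 # e.g. 7 for 4-bit symmetric
    q_min = -q_max
    blocks = w.reshape(-1, block_size)            # assumes w.size % block_size == 0
    scales = np.abs(blocks).max(axis=1, keepdims=True) / q_max   # s_B per block
    scales = np.where(scales == 0, 1.0, scales)   # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), q_min, q_max)         # integer codes
    w_hat = (q * scales).reshape(w.shape)         # dequantized approximation
    return q.astype(np.int8), scales, w_hat

# Example: a 128x256 weight matrix quantized in blocks of 64 along each row.
w = np.random.randn(128, 256).astype(np.float32)
codes, scales, w_hat = quantize_blockwise(w, block_size=64, n_bits=4)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```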

2. Adaptive Granularity Selection and Outlier Mitigation

  • Adaptive Group Size: Algorithms select the block size $G$ by recursively partitioning along the reduction dimension, using a “range-growth” threshold $\tau$ to balance quantization error against overhead (Kim et al., 2023). Typically, the block size is halved (e.g., from per-column to 64, 32, …) until the dynamic-range growth for any group exceeds $\tau$. This yields finer granularity where outliers or high variation are present and coarser granularity otherwise (a simplified sketch follows this list).
  • Sensitivity/Variance-based Assignment: In mixed-precision FGQ, a block's precision is determined by the expected perturbation to the loss, often via Fisher information. High-sensitivity blocks are retained at higher precision, while the majority are aggressively quantized (Hooper et al., 19 Apr 2025, Cheng et al., 2018). For federated learning, parameter-level quantization is optimized by simulated annealing to minimize variance under a total bit budget (Li et al., 16 Aug 2024).
  • Outlier Protection: FGQ schemes such as FineQ flag clusters exhibiting excessive dynamic range ratios (e.g., $\max(C)/\min(C) > 4$) and apply higher-precision quantization or selective dropping/zeroing (Xie et al., 28 Apr 2025, Chen et al., 29 Oct 2025). Hadamard rotation may be invoked to reduce block crest-factor and enable more aggressive quantization (Chen et al., 29 Oct 2025).
  • Per-Channel Vectorization: For activation quantization, vectorized per-channel or per-block scales ensure efficient exploitation of activation statistics, particularly in QAT or zero-shot regimes (Hong et al., 24 Mar 2025).
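
The adaptive-granularity and outlier-flagging ideas above can be combined into a short sketch. This is a simplified reading, not the FineQuant or FineQ algorithm: the crest-factor criterion (max|w| / mean|w|), the threshold value, and the minimum group size are illustrative assumptions.

```python
import numpy as np

def choose_group_size(w_row, tau=4.0, start=256, minimum=16):
    """Halve the group size until no group looks outlier-prone (simplified
    interpretation of adaptive group sizing; not the exact published rule).
    A group is flagged when its crest factor max|w| / mean|w| exceeds tau."""
    g = start
    while g > minimum:
        groups = np.abs(w_row).reshape(-1, g)
        crest = groups.max(axis=1) / (groups.mean(axis=1) + 1e-12)
        if crest.max() <= tau:        # every group is well-behaved at this size
            return g
        g //= 2                       # go finer where outliers persist
    return minimum

row = np.random.randn(1024).astype(np.float32)
row[37] *= 40.0                       # inject a single outlier
print("chosen group size:", choose_group_size(row))
```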

3. Mixed-Precision and Formatbook Approaches

FGQ schemes often go beyond uniform quantization, dynamically selecting the precision and number format of each block.

  • Block-wise Mixed Precision: Each block is scored for quantization “hardness”; blocks likely to cause large accuracy degradation are assigned higher-precision formats (e.g., FP8), while the rest are quantized to low precision (e.g., FP4) (Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025, Sun et al., 1 May 2024, Cheng et al., 2018). A simplified sketch follows this list.
  • Dialect/Formatbook-based Assignment: BlockDialect selects, for each block, a format from a set (“formatbook”) of flexible floating-point number formats (DialectFP4), choosing the dialect that best fits the observed or anticipated value distribution and thereby minimizing quantization error without excessive bit overhead (Jang et al., 2 Jan 2025). A formatbook-selection sketch appears after the table below.
  • Encoding and Memory Layout: FineQ clusters encode mode bits and quantized data for each cluster in an index-data concatenation layout, aligning data for efficient on-chip fetch and pipelined decoding (Xie et al., 28 Apr 2025).
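
A toy sketch of block-wise mixed-precision assignment follows. The diagonal-Fisher-style sensitivity proxy (squared gradient times squared weight) and the 10% high-precision budget are assumptions chosen for illustration, not the exact FGMP criterion.

```python
import numpy as np

def assign_block_precision(weights, grads, block_size=64, hi_frac=0.10):
    """Rank blocks by a sensitivity proxy and keep the top hi_frac in a
    higher-precision format (e.g. FP8); the rest stay low precision (e.g. FP4).
    The grad^2 * weight^2 proxy is an assumption, not a published formula."""
    w = weights.reshape(-1, block_size)
    g = grads.reshape(-1, block_size)
    sensitivity = (g ** 2 * w ** 2).sum(axis=1)      # one score per block
    n_hi = max(1, int(hi_frac * sensitivity.size))
    hi_blocks = np.argsort(sensitivity)[-n_hi:]      # most sensitive blocks
    fmt = np.full(sensitivity.size, "FP4", dtype=object)
    fmt[hi_blocks] = "FP8"
    return fmt

w = np.random.randn(4096).astype(np.float32)
g = np.random.randn(4096).astype(np.float32)
print(assign_block_precision(w, g)[:8])
```
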
| Approach | Block granularity | Format assignment | Outlier handling |
|---|---|---|---|
| FineQuant (Kim et al., 2023) | Fixed/adaptive | Uniform integer, b = 4, 8 | Adaptive block size via τ |
| FGMP (Hooper et al., 19 Apr 2025) | Block, <100 | FP4/FP8 mixed via Fisher ranking | Fisher threshold |
| FineQ (Xie et al., 28 Apr 2025) | 3-weight cluster | 2b/3b, index encoding | Per-cluster bit-width increase |
| BlockDialect (Jang et al., 2 Jan 2025) | 32 | DialectFP4 selection per block | Dynamic dialect |
| Ternary FGQ (Mellempudi et al., 2017) | N = 4–64 | Ternary per group (per-group α, Δ) | N tuning |
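
A formatbook-style selection step can likewise be sketched as a per-block search over candidate value grids. The grids below are made up for illustration and are not the actual DialectFP4 formats.

```python
import numpy as np

# Hypothetical 4-bit magnitude grids; NOT the real DialectFP4 formatbook.
FORMATBOOK = {
    "dialect_a": np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]),
    "dialect_b": np.array([0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]),
}

def pick_dialect(block):
    """Choose the format whose grid gives the lowest round-to-nearest MSE
    for this block (sketch of per-block formatbook selection)."""
    best_name, best_err = None, np.inf
    amax = np.abs(block).max() or 1.0                 # guard all-zero blocks
    for name, grid in FORMATBOOK.items():
        levels = np.concatenate([-grid[::-1], grid]) * (amax / grid.max())
        q = levels[np.abs(block[:, None] - levels[None, :]).argmin(axis=1)]
        err = np.square(block - q).mean()
        if err < best_err:
            best_name, best_err = name, err
    return best_name

print(pick_dialect(np.random.randn(32).astype(np.float32)))
```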

4. Hardware and Inference Efficiency Considerations

FGQ directly informs accelerator design and system-level throughput by leveraging blockwise data structure and mixed-precision support.

  • Fused Dequantization and GEMM: On GPUs, blockwise (FGQ) schemes support fused matrix multiplication plus on-the-fly dequantization kernels, with each tile or submatrix using the relevant per-block scales (Kim et al., 2023, Zhang et al., 2023). This maximizes arithmetic intensity and enables direct utilization of low-precision (e.g., INT8, INT4) tensor cores.
  • Temporal Coding and PE Sizing: FineQ’s temporal coding replaces conventional MACs with narrow PEs operating at the block level, reducing array area by up to 61.2% and boosting energy efficiency by up to 1.79× (Xie et al., 28 Apr 2025).
  • Blockwise Formatbooks for Alignment: BlockDialect’s integer MACs use statically-aligned block and index structures for high-speed fetch and minimal control logic (Jang et al., 2 Jan 2025).
  • Plug-and-Play Integerization: Integer Scale transforms FGQ’s block scales from floating-point to integer, removing per-block floating-point multiplies and consolidating scaling into a single division at the layer output, yielding 1.85–2.31× speedups over FP16 (Li et al., 23 May 2024). A functional sketch follows the table below.
  • Inference/Training Crossover: At 8 bits, blockwise INT formats (e.g., MXINT8) are strictly superior to FP (MXFP8) in both algorithmic and hardware metrics. At 4 bits, crossovers are crest-factor-dependent; outlier-mitigated INT4 can surpass FP4 (Chen et al., 29 Oct 2025).
| Hardware feature | Supporting approach | Empirical impact |
|---|---|---|
| Fused GEMM + dequantization | FineQuant, DGQ, FPTQ | 3.2–3.65× speedup, 2–4× memory reduction |
| Temporal coding, narrow PEs | FineQ | 61.2% area and 62.9% power reduction |
| Integer scaling | Integer Scale (plug-and-play) | 1.85–2.31× end-to-end speedup |
| Blockwise MACs/dialects | BlockDialect | 2.3× energy efficiency vs. INT8/FP4, ≤5.45% accuracy drop |
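
The first and third rows of the table can be mimicked functionally in NumPy: a matmul that applies per-block scales inside the accumulation loop, with the scales integerized so that a single floating-point division remains at the output. This is a stand-in for a fused kernel, not an actual GPU implementation; the 15-bit scale quantization and the scale layout are assumptions.

```python
import numpy as np

def fused_gemm_blockwise(x, q_codes, scales, block_size=64):
    """Functional model of fused GEMM + per-block dequantization with
    integerized scales (illustrative; not a real tensor-core kernel).
    q_codes: [K, N] int8 weight codes; scales: [K // block_size, N] floats."""
    # Integer-Scale-style trick (assumed mechanics): per-block FP scales become
    # integer multipliers over one shared denominator, leaving a single FP
    # division at the layer output.
    denom = (2 ** 15 - 1) / scales.max()
    int_scales = np.round(scales * denom).astype(np.int64)

    K, N = q_codes.shape
    acc = np.zeros((x.shape[0], N), dtype=np.float64)
    for b in range(K // block_size):
        rows = slice(b * block_size, (b + 1) * block_size)
        # Partial product for this block, scaled once per (block, column)
        # instead of dequantizing the whole weight matrix up front.
        acc += (x[:, rows] @ q_codes[rows, :].astype(np.float64)) * int_scales[b]
    return (acc / denom).astype(np.float32)       # the only FP division

# Sanity check against explicit dequantize-then-matmul.
K, N, M, B = 256, 128, 8, 64
q = np.random.randint(-7, 8, size=(K, N)).astype(np.int8)
s = np.random.rand(K // B, N).astype(np.float32) * 0.1 + 0.01
x = np.random.randn(M, K).astype(np.float32)
w_hat = (q.astype(np.float32).reshape(K // B, B, N) * s[:, None, :]).reshape(K, N)
print(np.abs(fused_gemm_blockwise(x, q, s, B) - x @ w_hat).max())  # small (scale-rounding error only)
```

In a real deployment the activations would themselves be quantized and the loop fused into the tensor-core GEMM; the NumPy loop only models where the per-block scales enter the accumulation.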

5. Empirical Performance, Accuracy, and Compression Trade-offs

FGQ enables ultra-low-precision deployments (4–8 bits) with minimal accuracy loss compared to full-precision baselines and consistently outperforms coarse-quantization at comparable compression ratios.

  • LLMs: FineQuant achieves up to 3.65× real-world throughput and 4× memory reduction for OPT-175B with <0.5 perplexity loss using int4(64)-block quantization (Kim et al., 2023). FGMP keeps perplexity degradation below 1% on LLaMA-2-7B (Hooper et al., 19 Apr 2025). FineQ reduces LLaMA-2-7B C4 perplexity from 39.45 (OWQ, 2.25 bpp) to 14.95 (2.33 bpp) (Xie et al., 28 Apr 2025).
  • Vision: Ternary FGQ (N=4) maintains Top-1 within 3.7–4.3% of FP32 on ResNet-50/101, removing up to 75% of multiplications, and up to 99% at N=64 (with retraining) (Mellempudi et al., 2017).
  • Federated Learning: Per-coordinate FGQ (FedFQ) achieves 27–63× communication compression and matches full-precision accuracy across CIFAR-10 and Shakespeare FL benchmarks (Li et al., 16 Aug 2024).
  • Hardware Co-design: BlockDialect achieves up to 11.4 percentage points higher accuracy over MXFP4 at equivalent bitrates, with ≤5.45% drop from FP16 in full-path matrix multiplications at ~4.3 bits/block (Jang et al., 2 Jan 2025).
  • Mixed-Precision Pareto: Gradient-optimized, per-weight bit allocation (HGQ) reaches up to 20× resource and 5× latency reduction versus uniform quantization, without loss in accuracy, in FPGA targets (Sun et al., 1 May 2024).

6. Limitations and Best-Practice Recommendations

  • Kernel and Tile Size Coupling: Hardware kernels often lock block size to hardware tile (e.g., 64 rows); group sizes >16 may require new kernels (Kim et al., 2023).
  • FP32 Fusion Overhead: Prior to Integer Scale, FGQ’s per-block float scaling causes significant inference bottlenecks; integerization remedies this, making FGQ speed-competitive with FP16 (Li et al., 23 May 2024).
  • Scale and Bit-Overhead: Very fine blocks increase meta-data (scales/indices); optimal group sizes (e.g., 32–128) balance accuracy and memory (Jang et al., 2 Jan 2025, Kim et al., 2023).
  • Sensitivity to Outliers: Uniform schemes risk catastrophic collapse in outlier-prone blocks; mixed-precision assignment, outlier-aware clustering, or Hadamard rotation mitigate this (Chen et al., 29 Oct 2025, Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025).
  • Data Dependence: Some methods (e.g., GranQ) require synthetic or real data to estimate block/channel extrema; if these are underestimated, range clipping may cause loss (Hong et al., 24 Mar 2025).

Recommended Practices (collected in the configuration sketch after this list):

  • Use adaptive or per-block scale selection with τ ∈ [0.75,1.0] for 4-bit (Kim et al., 2023).
  • For LLMs, int4(64)-block or adaptive FGQ recovers >90% of full-precision accuracy.
  • In mixed precision, retain <10% of blocks at high precision for <1% PPL degradation (Hooper et al., 19 Apr 2025).
  • Integerize per-block scales wherever possible for inference efficiency (Li et al., 23 May 2024).
  • Exploit fused GEMM+Dequant kernels and hardware block-alignment for throughput (Zhang et al., 2023, Xie et al., 28 Apr 2025).
  • For federated or edge learning, optimizer-driven per-parameter bit allocation achieves state-of-the-art communication efficiency (Li et al., 16 Aug 2024).
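
These recommendations can be collected into a small configuration sketch; the field names below are hypothetical and not tied to any real library's API.

```python
# Illustrative FGQ configuration reflecting the recommendations above.
# Field names are hypothetical, not from an actual library.
fgq_config = {
    "weight_bits": 4,                # int4 weights
    "block_size": 64,                # int4(64)-block for LLMs
    "adaptive_groups": True,         # range-growth splitting enabled
    "range_growth_tau": 0.85,        # τ within the recommended [0.75, 1.0]
    "mixed_precision": {
        "hi_format": "FP8",
        "lo_format": "FP4",
        "hi_block_fraction": 0.10,   # keep <10% of blocks at high precision
    },
    "integer_block_scales": True,    # integerize scales for inference speed
    "fused_dequant_gemm": True,      # rely on fused GEMM + dequant kernels
}
```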

7. Outlook and Research Frontiers

FGQ is the current best practice for quantization in state-of-the-art LLMs, vision models, and hardware-aware deployment. Increasingly, its integration spans:

  • Hypergranular trainable bit allocation: Continuous/gradient-based assignment for optimal accuracy-resource tradeoff (Sun et al., 1 May 2024, Cheng et al., 2018).
  • Formatbook and assignment logic: BlockDialect and similar techniques suggest future co-evolution of number representation and quantizer at block scale (Jang et al., 2 Jan 2025).
  • INT vs FP hardware co-design: Emerging evidence contradicts the exclusive focus on low-precision FP formats, advocating for mixed INT + FP pipelines at fine granularity (Chen et al., 29 Oct 2025).
  • Plug-and-play acceleration: Integer Scale and fast blockwise implementations set new expectations for combining accuracy and inference speed in production systems (Li et al., 23 May 2024).
  • Adaptive, learning-driven quantizer design: Integration of feedback from loss surfaces and Fisher information for per-block or per-channel quantization sensitivity (Hooper et al., 19 Apr 2025).

A plausible implication is that future models, training regimes, and hardware will increasingly use FGQ as the standard, leveraging mixed-format, mixed-precision blockwise representations as the optimal trade-off between accuracy, efficiency, and deployability.
