Fine-grained Quantization in Neural Networks
- FGQ is a neural network quantization method that partitions weights and activations into small groups to assign tailored quantization parameters, improving precision over coarse methods.
- It employs techniques such as heuristic block partitioning and Fisher-weighted assignment to minimize quantization error and guide mixed-precision allocation, enabling reduced bit-widths without significant accuracy loss.
- FGQ facilitates hardware-software co-design by using block-level metadata to dynamically select optimally formatted datapaths, yielding substantial memory, energy, and throughput improvements in LLMs and CNNs.
Fine-grained Quantization (FGQ) is an approach to neural network quantization in which quantization parameters (such as scale, zero-point, or even number format) are assigned to small groups of coefficients, typically blocks of 2–128 elements, rather than per tensor or per channel. FGQ originated to address the limitations of coarse-grained quantization on highly non-uniform data distributions, as seen in LLMs and deep convolutional networks, where local outliers can degrade the signal-to-noise ratio and cause severe accuracy loss under global quantization schemes. By partitioning weights and activations into independently quantized blocks, FGQ enables large reductions in bit-width (as low as 1.25–4 bits/element) while maintaining high model fidelity, and it can be co-optimized with digital hardware through block-level mixed-precision and format selection.
1. Mathematical Foundations and Quantization Algorithms
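FGQ formalizes the quantization operator at the block or group level. A tensor $X$ (weight or activation) is partitioned into non-overlapping contiguous blocks $B_1, \dots, B_K$, each of size $g$. For each block, a dedicated quantization function is applied, most commonly uniform symmetric or asymmetric quantization. In the symmetric case:

$$\hat{x}_i = s_k \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x_i}{s_k}\right),\, -q_{\max},\, q_{\max}\right), \quad x_i \in B_k,$$

where $s_k$ is the per-block scale, and $q_{\max}$ depends on the bit-width $b$ (for example, $q_{\max} = 2^{b-1} - 1$ for signed symmetric quantization).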
This per-block quantization ensures that the local dynamic range is used efficiently, reducing the maximum quantization error in each block and thus improving the mean squared error (MSE) and quantization signal-to-noise ratio (QSNR) relative to per-tensor or per-channel quantization (Kim et al., 2023, Chen et al., 29 Oct 2025, Li et al., 2024).
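The effect of per-block scaling can be illustrated with a minimal NumPy sketch. It assumes symmetric max-abs scaling and a synthetic weight vector with a few injected outliers; the function name `fake_quant_blockwise` and all numeric settings are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def fake_quant_blockwise(x, bits, block_size):
    """Symmetric uniform quantize-dequantize with one max-abs scale per block."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Guard against all-zero blocks so the division below is always defined.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
w[::512] = 1.0  # assumed synthetic outliers, one every 512 elements

# Per-tensor quantization is the special case of a single block.
mse_tensor = np.mean((w - fake_quant_blockwise(w, 4, w.size)) ** 2)
mse_block = np.mean((w - fake_quant_blockwise(w, 4, 32)) ** 2)
print(f"per-tensor MSE: {mse_tensor:.2e}, block-32 MSE: {mse_block:.2e}")
```

With a single scale, the outliers force a step size so coarse that most small values round to zero; with block-32 scales, only the blocks that actually contain an outlier pay that cost.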
Algorithmic refinements to basic FGQ include:
- Heuristic block partitioning: Instead of fixed sizes, some methods recursively split blocks to minimize the per-block range as long as quantization error is reduced by a threshold (Kim et al., 2023).
- Fisher-weighted assignment: Quantization decisions are guided by second-order Taylor expansion of loss, weighting quantization-induced perturbations by estimated per-element Fisher information to retain high-precision only for loss-sensitive blocks (Hooper et al., 19 Apr 2025).
- Blockwise mixed-precision: Bit-width and even number format (e.g., INT4/FP4/FP8 per block) are assigned adaptively, optimizing a composite metric of quantization error, memory cost, and/or hardware efficiency (Xie et al., 28 Apr 2025, Jang et al., 2 Jan 2025).
- Hybrid representations (ternary, outlier-aware): Blocks may be ternarized or sparsified with specialized schemes (e.g., 3:4 sparsity in Sherry (Huang et al., 12 Jan 2026), outlier protection in clusters (Xie et al., 28 Apr 2025)), reducing bit-width to or below 2 bits per element.
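As a rough illustration of the Fisher-weighted assignment idea in the list above, the sketch below scores each block by a second-order proxy, the sum of per-element Fisher estimates times squared quantization error, and keeps the highest-scoring blocks in a higher bit-width. The function names, the fixed `high_fraction` budget, and the synthetic Fisher values are assumptions made for illustration; this is not the FGMP implementation.

```python
import numpy as np

def fake_quant(blocks, bits):
    """Symmetric quantize-dequantize with one max-abs scale per row (block)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    return np.clip(np.round(blocks / scale), -qmax, qmax) * scale

def assign_precision(w, fisher, block_size, low_bits, high_bits, high_fraction):
    """Return per-block bit-widths; the most loss-sensitive blocks get high_bits."""
    wb = w.reshape(-1, block_size)
    fb = fisher.reshape(-1, block_size)
    err = wb - fake_quant(wb, low_bits)
    # Proxy for the loss increase of quantizing a block at low_bits:
    # sum_i F_i * (delta w_i)^2
    sensitivity = np.sum(fb * err ** 2, axis=1)
    n_high = int(round(high_fraction * len(sensitivity)))
    bits = np.full(len(sensitivity), low_bits)
    if n_high > 0:
        bits[np.argsort(sensitivity)[-n_high:]] = high_bits
    return bits, sensitivity

rng = np.random.default_rng(0)
w = rng.normal(size=256)
fisher = np.ones(256)
fisher[48:64] = 1e3  # assume block 3 (elements 48..63) is highly loss-sensitive

bits, sensitivity = assign_precision(w, fisher, block_size=16,
                                     low_bits=4, high_bits=8, high_fraction=0.125)
print(bits)
```

In practice the Fisher estimates would come from squared gradients on calibration data, and the fraction of high-precision blocks would be chosen against an accuracy or memory target rather than fixed in advance.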
2. Hardware–Software Co-Design and Inference Optimization
The granular application of quantization parameters enables novel hardware–software codesigns that exploit block-level metadata for efficient computation:
- Mixed-precision datapath selection: Each block carries a metadata flag that steers computation lanes to the multiplier/adder unit matching the block's assigned format (e.g., FP8, FP4, INT4) during vector-matrix multiplication. Units that are not selected can be clock- or data-gated for further energy reduction (Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025).
- On-the-fly activation quantization: Hardware modules can dynamically quantize activation blocks using sensitivity measures or a two-stage process (e.g., a formatbook lookup in BlockDialect (Jang et al., 2 Jan 2025)) following accumulation, ensuring that the quantized representation matches runtime statistics without introducing significant latency.
- Temporal coding: In ultra-low-bit hardware, temporal coding implements multiplication by repeated addition cycles, allowing small bit-width multiplications and efficient area/power scaling in systolic array architectures (Xie et al., 28 Apr 2025).
- Integer-only scale evaluation: Integer Scale (Li et al., 2024) eliminates float-to-integer conversion bottlenecks in groupwise quantization, replacing floating scales with integer surrogates, thus boosting integer tensor-core utilization and overall throughput.
Collectively, these advances enable near-lossless accuracy in LLMs and other deep models using as little as 2–4 bits per element in weights and activations, with >30% model size reduction and 10–30% energy savings (Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025, Li et al., 2024).
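The integer-only scale evaluation mentioned above can be sketched as follows. The sketch assumes that each floating-point group scale is replaced by a rounded integer multiple of a shared power-of-two `amplifier`, so that per-group partial sums are combined entirely in integer arithmetic and a single division is applied at the end. The function names and the value 1024 are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def group_accumulate(xq, wq, group_size):
    """Integer partial sums of the dot product, one per group."""
    prod = xq.astype(np.int64) * wq.astype(np.int64)
    return prod.reshape(-1, group_size).sum(axis=1)

def dot_float_scale(xq, wq, scales, group_size):
    """Baseline: every group's partial sum is converted to float and scaled."""
    return float(np.sum(group_accumulate(xq, wq, group_size) * scales))

def dot_integer_scale(xq, wq, scales, group_size, amplifier=1024):
    """Integer surrogate scales: integer math per group, one division at the end."""
    int_scales = np.round(scales * amplifier).astype(np.int64)
    total = int(np.sum(group_accumulate(xq, wq, group_size) * int_scales))
    return total / amplifier

rng = np.random.default_rng(0)
xq = rng.integers(-128, 128, size=128)   # INT8 activations
wq = rng.integers(-8, 8, size=128)       # INT4 weights
scales = rng.uniform(0.01, 0.1, size=4)  # one scale per group of 32

ref = dot_float_scale(xq, wq, scales, 32)
approx = dot_integer_scale(xq, wq, scales, 32)
print(ref, approx)
```

Because each surrogate differs from `scale * amplifier` by at most 0.5, the deviation from the float-scale result is bounded by `0.5 / amplifier` times the sum of the absolute partial sums; a larger amplifier tightens the bound at the cost of wider integer accumulators.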
3. Precision–Accuracy–Efficiency Trade-offs
FGQ unlocks a Pareto frontier between model accuracy, memory/computation cost, and inference throughput by tuning block size and bit-width:
| Block Size | Precision | Typical Use | Memory vs. FP | Throughput vs. FP |
|---|---|---|---|---|
| 32, 16 | INT4/FP4 | LLMs, vision | 25–35% | 2–4× |
| 4, 8 | INT2, Ternary | Edge/MCU, low-power | <10% | 9–15× |
| 128 | INT4, INT8 | LLMs (FGQ+IS) | 25% | 2–3× |
- Block size: Smaller blocks yield higher accuracy by better managing outliers, but they increase storage for scales and metadata and can create conversion/dequantization bottlenecks. Typical sweet spots are block-16 (NVIDIA's NVFP4) and block-32 (MX formats).
- Number format: At 8 bits, blockwise integer quantization (MXINT8) matches or outperforms blockwise FP8 in both accuracy and hardware efficiency (Chen et al., 29 Oct 2025). At 4 bits, floating-point (FP4) formats (NVFP4, MXFP4, BlockDialect) provide higher accuracy unless further outlier mitigation (Hadamard rotation, symmetric clipping) is employed for INT4 (Chen et al., 29 Oct 2025, Jang et al., 2 Jan 2025).
- Mixed-precision, mixed-format: Assigning a higher bit-width or a floating-point format to a fraction of highly sensitive blocks further narrows the accuracy–efficiency gap. Fisher-information–guided assignments in FGMP (Hooper et al., 19 Apr 2025) and fine-grained outlier protection in FineQ (Xie et al., 28 Apr 2025) provide <1% accuracy loss at 2–2.3 bits/element.
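The storage side of the block-size trade-off is simple arithmetic: each block amortizes its scale (and any format metadata) over its elements. The helper below is a small sketch; the 8-bit scale width is an assumed value, and `effective_bits` is a hypothetical name.

```python
def effective_bits(element_bits, block_size, scale_bits, metadata_bits=0):
    """Average storage per element, including per-block scale and metadata."""
    return element_bits + (scale_bits + metadata_bits) / block_size

# 4-bit elements with an assumed 8-bit scale per block
for block_size in (16, 32, 128):
    print(block_size, effective_bits(4, block_size, scale_bits=8))
# 16  -> 4.5
# 32  -> 4.25
# 128 -> 4.0625
```

Halving the block size doubles the per-element overhead, which is why very small blocks are mostly paired with narrow scales or shared metadata.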
4. Applications and Specializations
FGQ generalizes across a wide range of neural architectures and deployment scenarios:
- LLMs: Groupwise INT4/FP4, blockwise mixed-precision, and fine-mixed format approaches allow scaling to tens or hundreds of billions of parameters on commodity accelerators, maintaining perplexity and zero-shot QA performance within 1–2% of FP16 baselines (Kim et al., 2023, Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025).
- Vision (CNN, ResNet, VGG): FGQ enables sub-8-bit quantization (as low as ternary, e.g. 1.25 bits/element) with ≤4% drop in top-1 ImageNet accuracy, and provides plug-and-play pipelines for model compression (Mellempudi et al., 2017, Huang et al., 12 Jan 2026).
- Federated Learning: Parameter-level FGQ with adaptive bit allocation minimizes communication overhead, allowing up to 60× compression with lossless convergence compared to uniform quantization (Li et al., 2024).
- Activation quantization: GranQ (Hong et al., 24 Mar 2025) demonstrates that channelwise and blockwise activation quantization with vectorized scaling reduces distortion by 2–3× at low bits, achieving significant accuracy improvements in QAT and zero-shot scenarios.
5. Advanced Techniques: Outlier Mitigation and Mixed Format Schemes
FGQ research emphasizes several strategies for minimizing the impact of rare, large-magnitude coefficients:
- Outlier handling: Hadamard rotations and block-based outlier-switching manage outliers within blocks, enabling low bit-width quantization without severe error amplification (Chen et al., 29 Oct 2025).
- Sensitivity-aware and Fisher-informed quantization: FGMP (Hooper et al., 19 Apr 2025) ranks blocks by their Fisher-weighted loss impact, preserving accuracy-critical regions in higher precision.
- Mixed format selection: BlockDialect assigns each block its own 4-bit floating-point dialect from a formatbook, fully leveraging format diversity to match local statistics and decrease average blockwise quantization error (Jang et al., 2 Jan 2025).
- Hierarchical and dual-scale representations: DGQ (Zhang et al., 2023) and Integer Scale (Li et al., 2024) compose fine-grain quantization (e.g., INT4 groupwise) with a secondary coarser scaling (e.g., INT8 channelwise or integer scaling), preserving hardware efficiency while keeping errors minimal.
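To make the outlier-handling item above concrete, the following sketch applies an orthonormal Hadamard rotation to a single block before quantization and rotates back afterwards. It assumes a power-of-two block size, a Sylvester-constructed Hadamard matrix, and a synthetic block with one outlier; it illustrates the general technique rather than the procedure of any cited paper.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def fake_quant(x, bits):
    """Symmetric quantize-dequantize of one block with a single max-abs scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.05, size=32)
block[5] = 1.0  # assumed single large outlier

h = hadamard(32)
rotated = h @ block                      # outlier energy is spread over all 32 entries
direct = fake_quant(block, 4)
via_rotation = h.T @ fake_quant(rotated, 4)

err_direct = np.sum((block - direct) ** 2)
err_rotated = np.sum((block - via_rotation) ** 2)
print(f"max |x| before: {np.max(np.abs(block)):.3f}, after: {np.max(np.abs(rotated)):.3f}")
print(f"squared error direct: {err_direct:.4f}, with rotation: {err_rotated:.4f}")
```

Because the rotation is orthonormal, the quantization error measured in the rotated domain equals the error after rotating back, so shrinking the block's dynamic range translates directly into a finer step size for the same bit-width.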
6. Experimental Benchmarks and Deployment Considerations
State-of-the-art FGQ methods achieve superior accuracy-efficiency trade-offs across LLM and vision benchmarks:
- LLMs (e.g., Llama-2-7B/13B, Mixtral-8x7B):
  - FGMP (Hooper et al., 19 Apr 2025): <1% Δ perplexity at 14% energy and 30% memory savings versus FP8.
  - FineQ (Xie et al., 28 Apr 2025): 2.33 bits/weight, <7 PPL drop on C4 versus FP16, 61% hardware area reduction, 1.79× energy efficiency improvement.
  - Integer Scale (Li et al., 2024): 2.13–2.31× speedup over FP16 at negligible accuracy loss.
  - BlockDialect (Jang et al., 2 Jan 2025): 7.87–7.05 PPL at 4.28 bits on LLaMA2-7B and LLaMA3-8B; 5.45%–2.69% below FP16 when quantizing full matrix multiplications.
- Vision (e.g., ResNet, AlexNet): FGQ with N=4 yields ≤4% accuracy loss and 75% multiply elimination in 2-bit weight and 8-bit activation settings; N=64 plus additional training recovers most accuracy while nearly eliminating multiplications (Mellempudi et al., 2017).
- Edge hardware: Sherry’s 3:4 ternary (1.25 bits/element) packing achieves 10–18% speedup over prior work with zero accuracy loss and 25% bit savings (Huang et al., 12 Jan 2026).
Deployment is further facilitated by vectorized quantization (GranQ), integer-only quantization kernels (Integer Scale), and hardware-aligned bit-packing for SIMD and systolic arrays. Most methods are compatible with existing GPU/TPU inference engines and require little to no additional training or calibration data (Hong et al., 24 Mar 2025, Kim et al., 2023).
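The 1.25 bits/element figure quoted for 3:4 ternary packing is consistent with a simple counting argument: if every group of four ternary values contains exactly one zero, there are 4 × 2³ = 32 possible patterns, which fit in 5 bits per group. The sketch below shows one such encoding. Both the reading of "3:4" as three nonzero values per group and the bit layout are assumptions for illustration, not Sherry's documented format.

```python
def encode_3of4(group):
    """Pack 4 ternary values with exactly one zero into a 5-bit code."""
    assert len(group) == 4 and sorted(abs(v) for v in group) == [0, 1, 1, 1]
    zero_pos = group.index(0)                 # 2 bits: position of the zero
    signs = [v for v in group if v != 0]      # 3 bits: one sign per nonzero value
    sign_bits = sum((1 if s < 0 else 0) << i for i, s in enumerate(signs))
    return (zero_pos << 3) | sign_bits

def decode_3of4(code):
    """Inverse of encode_3of4: recover the 4 ternary values from a 5-bit code."""
    zero_pos, sign_bits = code >> 3, code & 0b111
    out = [-1 if (sign_bits >> i) & 1 else 1 for i in range(3)]
    out.insert(zero_pos, 0)
    return out

code = encode_3of4([1, 0, -1, 1])
print(code, decode_3of4(code))  # 10 [1, 0, -1, 1]
print(5 / 4)                    # 1.25 bits per element
```

A deployed kernel would additionally pack these 5-bit codes into machine words aligned to the target SIMD width, which this sketch leaves out.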
7. Theoretical Insights and Open Directions
FGQ’s efficacy is attributed to its ability to strictly dominate coarse-grained quantization in terms of quantization variance, as formalized in parameter-level and blockwise error analyses (Li et al., 2024, Chen et al., 29 Oct 2025). Under these analyses, FGQ yields lower quantization noise for the same total bit budget and enables provably faster convergence in distributed and federated learning. The accuracy advantage of integer versus floating-point formats in FGQ depends on block size and bit-width: INT8 is consistently superior at block-32, whereas FP4 is favored at 4 bits unless outlier processing is strong (Chen et al., 29 Oct 2025). Future AI hardware is therefore projected to benefit most from flexible, blockwise, mixed-precision integer datapaths with adaptive formatting, which are expected to improve energy, area, and throughput scalability without sacrificing accuracy.
Ongoing research directions include differentiable FGQ for fully automated hardware-aware bit-width allocation (Cheng et al., 2018), further integration with sparsity and pruning schemes for multiplicative gains, and advanced error modeling for format selection in mixed-signal and quantized communication pipelines.
References:
- (Hooper et al., 19 Apr 2025)
- (Li et al., 2024)
- (Kim et al., 2023)
- (Chen et al., 29 Oct 2025)
- (Huang et al., 12 Jan 2026)
- (Xie et al., 28 Apr 2025)
- (Mellempudi et al., 2017)
- (Li et al., 2024)
- (Hong et al., 24 Mar 2025)
- (Cheng et al., 2018)
- (Li et al., 2023)
- (Zhang et al., 2023)
- (Jang et al., 2 Jan 2025)