Fine-Grained Low-Bit Quantization Formats
- Fine-grained low-bit quantization formats are advanced techniques that assign quantization parameters at block, group, or element levels to enhance resource efficiency.
- They combine specialized integer and floating-point methods with adaptive outlier handling to optimize accuracy, memory, and compute trade-offs in large models.
- Leveraging local statistics and hardware-aware strategies like Hadamard rotation and structured masking, these formats achieve near-lossless precision under aggressive bitwidth constraints.
Fine-grained low-bit quantization formats constitute a crucial class of quantization strategies that enable neural networks—especially in deep learning—to retain high accuracy even under aggressive bitwidth constraints (typically 2–4 bits, and in extreme settings, sub-2 bits), with minimal resource consumption. Unlike coarse-grained methods (e.g., per-tensor or per-channel quantization), fine-grained schemes target granularity at the level of blocks, groups, clusters, or even elements, and are increasingly hardware- and application-aware. State-of-the-art approaches carefully design the number format, scale assignment, and outlier management at these fine granularities, leading to improved accuracy-memory-compute trade-offs, particularly in large models such as Transformers, LLMs, and ViT/SR architectures.
1. Fundamental Principles and Design of Fine-Grained Low-Bit Quantization
Fine-grained low-bit quantization formats are characterized by several intertwined dimensions:
- Granularity: The assignment of scaling/zero-point and quantizer parameters at block (e.g., 16/32 elements), group, per-element, or hybrid levels (e.g., group-wise for weights, per-channel for activations, as in (Zhang et al., 2023)).
- Format Type: The numerical representation used—typically integer (INT4/INT2/INT1), floating point (FP8/FP4), or specialized forms (e.g., StudentFloat/NormalFloat, PoT/APoT, Flint, Abfloat).
- Mixed Precision/Outlier Handling: Allocation of higher bitwidth or special encoding to "salient" or outlier channels/groups, with the remainder aggressively quantized (e.g., PTQ1.61 uses a structured mask to keep error-sensitive channels at 4 bits while binarizing the rest (Zhao et al., 18 Feb 2025)).
- Distribution Alignment: Whether the quantization grid is uniform or tailored to match the empirical distribution (e.g., NF, SF, quantization-aware training, or fine-grained BNS alignment in FDDA (Zhong et al., 2021)).
- Blockwise Transformations: Techniques such as Hadamard rotation to reduce blockwise crest factor and equalize value distribution before INT quantization (see (Chen et al., 29 Oct 2025)).
Fine-grained quantization exploits local statistics for scale assignment or format selection, dramatically reducing quantization error relative to coarse-grained approaches—especially critical in layers or architectures with high activation or weight variance and frequent outliers.
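To make the granularity dimension concrete, the following minimal NumPy sketch (the block size, bitwidth, and helper names are illustrative choices, not taken from any single cited scheme) assigns one symmetric scale per block of 32 weights, so each block's step size tracks its own local statistics:

```python
import numpy as np

def quantize_blockwise_int(w, block_size=32, n_bits=4):
    """Symmetric block-wise integer quantization: one scale per block,
    derived from that block's local absolute maximum."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    w = w.reshape(-1, block_size)                     # assumes len(w) % block_size == 0
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)       # guard against all-zero blocks
    q = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

# Usage: per-block scales follow local statistics, so an outlier in one block
# does not inflate the quantization step of the others.
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blockwise_int(w)
w_hat = dequantize_blockwise(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```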
2. Quantization Formats: INT, FP, and Hybrid Schemes
Integer Formats
- INT4/INT2/INT1 Block-wise: Assigns a fixed scale/zero-point per block/group of elements, yielding low computational overhead and memory cost. INT4 is a widely adopted choice for LLM deployment given hardware support (e.g., NVIDIA's tensor cores). Ultra-low bit INT1/2 is gaining attention, motivated by new scaling law analyses (Liu et al., 4 Feb 2025).
- Power-of-Two (PoT), Additive PoT (APoT): Constrains weights to powers of two or sums thereof, enabling multiplications to be replaced by shift operations, vastly simplifying hardware (Przewlocka-Rus et al., 2022; Yin et al., 2016); a sketch of the PoT idea follows this list. APoT increases representation fidelity at the cost of modest computational complexity.
- Structured Masking & Binarization: PTQ1.61 achieves sub-2 bit effective quantization by designating salient channels (by input activation) for 4-bit, and binarizing the rest with block-scale adaptation, using a structured mask requiring just 0.0002 bits/weight (Zhao et al., 18 Feb 2025).
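A minimal sketch of the power-of-two idea under simplifying assumptions (already-quantized integer activations, no global scale handling; real PoT/APoT schemes add learned clipping ranges and additive terms, and the function names here are ad hoc): each weight is snapped to a signed power of two, so every multiply in a dot product becomes a bit shift.

```python
import numpy as np

def quantize_pot(w, min_exp=-6, max_exp=0):
    """Snap each weight to sign(w) * 2**e, with the exponent clipped to a small range."""
    sign = np.sign(w).astype(int)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp).astype(int)
    return sign, exp

def pot_dot(sign, exp, x_int):
    """Dot product in which every multiply is a bit shift of the integer activation.
    Negative exponents become right shifts (truncating, as minimal hardware would)."""
    acc = 0
    for s, e, x in zip(sign, exp, x_int):
        term = (x << e) if e >= 0 else (x >> -e)
        acc += int(s) * term
    return int(acc)

w = np.array([0.31, -0.07, 0.5, -0.9], dtype=np.float32)
x_int = np.array([12, 40, 7, 3])              # already-quantized integer activations
sign, exp = quantize_pot(w)
print(pot_dot(sign, exp, x_int))              # shift-only approximation of the FP dot product
```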
Floating Point Formats
- FP8, FP4, E5M2, E4M3, and Derivatives: FP8/FP4 schemes distribute bits over exponent and mantissa, allowing robust representation of outliers and improving effective quantization in outlier-dense or heavy-tailed distributions (Kuzmin et al., 2022; Chen et al., 13 Aug 2024); a small FP4 example follows this list.
- Mantissa–exponent allocation can be dynamically optimized per tensor to match local statistics (Kuzmin et al., 2022).
- StudentFloat, NormalFloat, Flint, Abfloat: Custom quantization grids matching empirical distributions (e.g., Student's t, Normal) further reduce error in LLMs, especially at ultra-low bits (Gong et al., 25 Sep 2024).
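The sketch below illustrates floating-point-style quantization on a small grid (an E2M1-like FP4 value set with a per-block scale; the exact value set, scale format, and block size differ across the cited schemes, and the helper names are ad hoc). The non-uniform spacing of the grid gives finer resolution near zero and coarser steps for outliers, which is why FP formats cope better with heavy-tailed distributions.

```python
import numpy as np

# Representable magnitudes of an E2M1-style FP4 format (1 sign, 2 exponent, 1 mantissa bits)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_fp4_block(w, block_size=16):
    """Scale each block so its max magnitude maps to the top of the FP4 grid,
    then snap every element to the nearest representable FP4 value."""
    w = w.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = w / scale
    grid = np.concatenate([-FP4_GRID[::-1], FP4_GRID])        # signed grid
    idx = np.abs(scaled[..., None] - grid).argmin(axis=-1)    # nearest-neighbour lookup
    return (grid[idx] * scale).reshape(-1)                    # dequantized values

w = np.random.standard_t(df=4, size=256).astype(np.float32)   # heavy-tailed weights
w_hat = quantize_fp4_block(w)
print("mean squared error:", np.mean((w - w_hat) ** 2))
```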
Hybrid and Adaptive Formats
- Mixture-of-Formats Quantization (MoFQ): Selects the best format (INT or FP) per layer based on local quantization error, exploiting their complementary strengths (Zhang et al., 2023).
- Dual-Grained Quantization (DGQ): Pre-dequantizes groupwise INT4 weights to channelwise INT8 at inference, enabling efficient matmuls (A8W4) on standard hardware (Zhang et al., 2023).
- FineQ: Employs fine-grained clusters of three weights, with efficient in-cluster outlier protection (a 3-bit encoding inside each cluster) and hardware-aligned memory packing (Xie et al., 28 Apr 2025).
- Hadamard/Blockwise Rotation: Used to mitigate outlier impact in blocks before INT quantization, enabling INT4 formats to rival FP4 in accuracy, as seen in (Chen et al., 29 Oct 2025).
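A sketch of the block-rotation idea (the block size and orthonormal Hadamard transform here are illustrative, and SciPy is assumed for the Hadamard matrix): rotating each block spreads a single outlier across all of its elements, lowering the crest factor so the INT block scale is no longer dominated by one value; the rotation is inverted after dequantization.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_then_quantize(w, block_size=16, n_bits=4):
    """Apply an orthonormal Hadamard rotation per block, quantize with a
    symmetric per-block INT scale, then rotate back after dequantization."""
    H = hadamard(block_size).astype(np.float32) / np.sqrt(block_size)  # orthonormal
    qmax = 2 ** (n_bits - 1) - 1
    blocks = w.reshape(-1, block_size)
    rotated = blocks @ H
    scale = np.abs(rotated).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(rotated / scale), -qmax - 1, qmax)
    return (q * scale) @ H.T                     # H is orthogonal, so H.T undoes the rotation

# A block with one large outlier: rotation equalizes magnitudes before quantization.
w = np.zeros(16, dtype=np.float32); w[0] = 8.0; w[1:] = 0.1
w_hat = rotate_then_quantize(w).reshape(-1)
print("MSE with rotation:", np.mean((w - w_hat) ** 2))
```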
Representative Table: Common Fine-Grained Formats
| Format/Approach | Data Type | Granularity | Outlier Handling |
|---|---|---|---|
| MXINT8/MXFP8 (Chen et al., 29 Oct 2025) | INT8 / FP8 | Block-32 | None (symmetric clipping only) |
| NVINT4/NVFP4 (Chen et al., 29 Oct 2025) | INT4 / FP4 | Block-16 | Block Hadamard rotation |
| PTQ1.61 (Zhao et al., 18 Feb 2025) | 1.61-bit hybrid | Channel/blockwise | 4b for salient, binarized rest |
| MoFQ (Zhang et al., 2023) | FP8, INT8, FP4 | Layerwise | Layerwise selection |
| FineQ (Xie et al., 28 Apr 2025) | 2.33b mixed | Cluster-of-3 | In-cluster 3b outlier encode |
3. Algorithmic and Calibration Techniques
- Distribution-Aware Initialization & Adaptation: 2DQuant applies Distribution-Oriented Bound Initialization (DOBI) for setting quantizers based on symmetry/asymmetry and calibrates with output/feature distillation (Liu et al., 10 Jun 2024).
- Distillation and Data-Free Fine-Tuning: EfficientDM uses quantization-aware distilled fine-tuning against FP models, learning quantization parameters for ultra-low bits in diffusion models (He et al., 2023).
- Progressive Fine-to-Coarse Reconstruction: PFCR quantizes ViTs by recursively reconstructing at increasingly coarse granularities, addressing multi-scale information loss in block-wise low-bit quantization (Ding et al., 19 Dec 2024).
Gradient-aware learning, scale optimization (learned per block/timestep/group), and direct optimization of mantissa/exponent allocation (e.g., via grid or gradient search) further advance performance under aggressive quantization.
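As a concrete (and deliberately simplified) instance of learned scale optimization, the PyTorch sketch below learns one scale per weight block by gradient descent, using a straight-through estimator so the rounding step still passes a useful gradient to the scale; the setup and names are illustrative and do not reproduce any specific cited method.

```python
import torch

def fake_quant(w, scale, n_bits=4):
    """Fake-quantize w with a learnable per-block scale. round() is applied in the
    forward pass but treated as identity in the backward (straight-through), so the
    scale receives the rounding-residual gradient."""
    qmax = 2 ** (n_bits - 1) - 1
    x = w / scale
    x_q = torch.clamp(x + (torch.round(x) - x).detach(), -qmax - 1, qmax)
    return x_q * scale

# Learn one scale per block of 32 weights by minimizing reconstruction error.
torch.manual_seed(0)
w = torch.randn(8, 32)                                    # 8 blocks of 32 weights each
scale = (w.abs().amax(dim=1, keepdim=True) / 7).clone().requires_grad_(True)
opt = torch.optim.Adam([scale], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((fake_quant(w, scale) - w) ** 2)
    loss.backward()
    opt.step()
print("post-optimization reconstruction MSE:", loss.item())
```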
4. Hardware Implications and Accelerator Design
- Fine-grained quantization formats (especially sparse-by-design) enable aggressive bit-packing, reduced memory bandwidth, and high-throughput, SIMD/SIMT-aligned computation; an INT4 bit-packing sketch follows this list.
- Temporal Coding: FineQ's temporal coding encodes cluster weights as unary bitstreams, eliminating multipliers in MAC units and decreasing chip area and power (Xie et al., 28 Apr 2025).
- Alignment for Efficient Kernels: DGQ's pre-dequantization to channelwise INT8 ensures all matmuls are highly hardware-friendly, critical for LLM-serving with A8W4 (Zhang et al., 2023).
- Comparison of INT vs FP: MXINT8 uses 37% less energy and 21% less area than MXFP8 at block size 32, with higher accuracy (Chen et al., 29 Oct 2025).
- FP and INT Hardware Convergence: On modern AI accelerators (NVIDIA H100, Blackwell), INT8 and FP8 operation cost is virtually identical, enabling per-layer or per-block format flexibility (Kuzmin et al., 2022; Chen et al., 13 Aug 2024).
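As an illustration of the bit-packing point above (a plain NumPy sketch, not the layout of any particular kernel): two signed 4-bit values are stored per byte, halving weight memory relative to INT8 storage.

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of signed 4-bit integers (range [-8, 7]) into single bytes."""
    assert q.size % 2 == 0
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)     # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)  # low nibble stored first

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)                  # sign-extend each nibble
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

q = np.random.randint(-8, 8, size=64, dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)     # lossless round trip, half the bytes
```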
5. Accuracy, Memory, and Efficiency Trade-offs
- Accuracy-Model Size Pareto Front: Well-designed 1.58-, 2-, and 3-bit models reach or surpass the accuracy of 4-bit models at much smaller model sizes, as established by ParetoQ (Liu et al., 4 Feb 2025).
- Lower Bits with Outlier Handling: 4-bit INT blockwise quantization generally underperforms FP4 unless outlier-suppression (block rotation) is applied (Chen et al., 29 Oct 2025); under block rotation, INT4 can match or exceed FP4.
- Layer- and Group-wise Format Selection: MoFQ achieves SOTA accuracy and speed in 4b and 8b scenarios by selecting INT or FP formats per layer (Zhang et al., 2023).
- Compression and Inference Speed: Fine-grained INT4 quantization (with DGQ, SplitQuantV2) can yield up to 3× speedups and 2–4× reduced memory in LLMs (Zhang et al., 2023; Song et al., 7 Mar 2025). High sparsity in fine-grained quantization further accelerates inference and reduces power (Yin et al., 2016; Przewlocka-Rus et al., 2022). A worked memory-accounting example follows this list.
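A worked example of the memory accounting behind such figures (the parameter count, group size, and scale format below are illustrative): per-group scales and other side information add a small amortized overhead on top of the nominal weight bits, which is what effective-bitwidth figures such as 1.61 b or 2.33 b account for. The sketch here counts only per-group scales.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16):
    """Nominal weight bits plus the amortized cost of one scale per group."""
    return weight_bits + scale_bits / group_size

# Example: a 7B-parameter model quantized to INT4 with one FP16 scale per 128 weights.
params = 7e9
bits = effective_bits_per_weight(weight_bits=4, group_size=128)   # 4.125 bits/weight
print(f"effective bits/weight: {bits:.3f}")
print(f"weight memory: {params * bits / 8 / 2**30:.2f} GiB "
      f"(vs {params * 16 / 8 / 2**30:.2f} GiB at FP16)")          # roughly a 3.9x reduction
```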
6. Challenges, Limitations, and Future Directions
- Bitwidth Limitations and Transition Points: There is an empirical "learning transition" between 2 and 3 bits; networks at ≤2 bits require larger deviation from initialization (a "reconstruction" regime), while those at 3 bits and above stay in a "compensation" regime with more stable adaptation (Liu et al., 4 Feb 2025).
- Format Adaptation and Hardware Co-Design: A single fixed format is suboptimal; future accelerators should allow block-level FP/INT selection, block/group size flexibility, and format-specific optimizations (Chen et al., 29 Oct 2025; Gong et al., 25 Sep 2024).
- Ultra-Low Bitwidth (INT1/INT2/ternary): These remain research challenges due to information bottlenecks, but specially designed quantization functions and schedule adaptation are promising (Liu et al., 4 Feb 2025; Salishev et al., 19 Aug 2025).
- Broader Applicability: Methods such as DOBI/DQC in 2DQuant (Liu et al., 10 Jun 2024) and POS in PFCR (Ding et al., 19 Dec 2024) demonstrate extensibility to vision transformers, super-resolution, and other non-LLM architectures.
- Automated Per-Block/Learned Granularity: Ongoing research is pushing for learned or auto-tuned per-block quantizer parameters, format type, or granularity, integrated with deployment hardware (Gong et al., 25 Sep 2024; Salishev et al., 19 Aug 2025).
Emerging approaches systematically tune quantization granularity, format, and outlier management in a data-driven or learned manner, and combine these with hardware/software co-design for maximal efficiency—all with the aim of breaking previous accuracy-memory-compute trade-off limitations at the fine scale.
7. Summary Table: Key Fine-Grained Quantization Advances
| Approach | Main Innovation | Bitwidth(s) | Hardware Implication | Accuracy Effect |
|---|---|---|---|---|
| PTQ1.61 (Zhao et al., 18 Feb 2025) | 1D structured mask; binarization+4b split | 1.61 | Sub-2b enabled with negligible mask overhead | SOTA at <2b, incl. first/last layers |
| FineQ (Xie et al., 28 Apr 2025) | In-cluster outlier protection, packed encode | 2.33 avg | Custom PE with temporal coding; 1.79× energy efficiency | Beats SOTA at similar bitwidth |
| SplitQuantV2 (Song et al., 7 Mar 2025) | Layer splitting (k-means) for INT4 | 4 | No GPU, fast CPU-only deployment, model-agnostic | Recovers FP accuracy at INT4 |
| Atom (Zhao et al., 2023) | Channel outlier reordering & group quantization | 4 | Fused INT4 kernels for matmul & KV-cache | ≤1.4% accuracy drop; up to 7.7× throughput vs FP16 |
| DGQ (Zhang et al., 2023) | Groupwise→channelwise scale folding (A8W4) | 4/8 | Efficient INT8 matmul for W4, memory 1.12×, 3.24× speed | Best or tied, <0.3 PPL loss |
| ParetoQ (Liu et al., 4 Feb 2025) | Unified quantization scaling frontier across 1–4 bits | 1–4 | INT2 as optimal target for future HW; ternary optimal at highest compression | SOTA at 2–3b, surpassing 4b |
References
All findings, equations, and numerical results are verbatim from the cited arXiv sources, including (Zhong et al., 2021, Zhao et al., 18 Feb 2025, Song et al., 7 Mar 2025, Zhao et al., 2023, Liu et al., 4 Feb 2025, Liu et al., 10 Jun 2024, Xie et al., 28 Apr 2025, Przewlocka-Rus et al., 2022, Chen et al., 13 Aug 2024, Yin et al., 2016, Zhang et al., 2023, Guo et al., 17 Nov 2024, Ding et al., 19 Dec 2024, Chen et al., 29 Oct 2025, He et al., 2023, Choi et al., 28 Oct 2025, Salishev et al., 19 Aug 2025, Gong et al., 25 Sep 2024, Kuzmin et al., 2022, Zhang et al., 2023).
Fine-grained low-bit quantization formats leverage localized, adaptive numerical representations—often combined with outlier handling and hardware-aware design—to achieve near-lossless accuracy with minimal resource use. These methods set the empirical and theoretical groundwork for next-generation AI acceleration and large-model deployment at scale.