Group-wise Quantization Techniques
- Group-wise quantization is the technique of partitioning tensors into non-overlapping groups, each with its own scale and zero-point, thereby reducing quantization error in low-bit regimes.
- It employs configurable grouping methods, such as contiguous blocks, per-kernel groups, and adaptive clustering, to align with hardware efficiency and optimize numerical precision.
- Its applications in large language models, diffusion, and vision transformers demonstrate near full-precision accuracy with improved speed and reduced energy consumption.
Group-wise quantization is a quantization paradigm in which a weight or activation tensor is partitioned into multiple non-overlapping groups, and each group is quantized independently—typically with its own scale and/or zero-point. This fine-grained approach sharply reduces quantization error compared to layer-wise or channel-wise quantization, especially under low-bit regimes, and has become the prevailing method for quantizing large neural networks such as diffusion models, LLMs, and vision transformers, supporting both high-fidelity and efficient inference (Pan et al., 2024).
1. Mathematical Formulation and Core Principles
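Given any vectorized tensor $x \in \mathbb{R}^{N}$ (weight or activation), group-wise quantization partitions it into $G = N/g$ groups of size $g$:

$$x = [x_1, x_2, \dots, x_G], \qquad x_i \in \mathbb{R}^{g}.$$

For each group, a scale $s_i$ and (optionally) zero-point $z_i$ are adaptively computed. In the common asymmetric min-max form with bit-width $b$:

$$s_i = \frac{\max(x_i) - \min(x_i)}{2^{b} - 1}, \qquad z_i = \mathrm{round}\!\left(-\frac{\min(x_i)}{s_i}\right).$$

The quantization and dequantization equations per group are

$$q_i = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x_i}{s_i}\right) + z_i,\; 0,\; 2^{b} - 1\right), \qquad \hat{x}_i = s_i\,(q_i - z_i).$$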
This quantization is performed for each group independently; scale granularity can be per group, per block, or even adaptive across unstructured subsets as in recent binary quantization (Zheng et al., 3 Sep 2025).
Unlike traditional channel-wise (per-column) or layer-wise (whole tensor) quantization, group-wise schemes capitalize on the reduced intra-group variance, leading to better quantization error/accuracy trade-offs, particularly in low-precision settings (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026, Elangovan et al., 7 Feb 2025).
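The per-group quantize/dequantize step from this section can be sketched in NumPy as follows. This is a minimal illustration assuming asymmetric min-max parameters and contiguous groups over the flattened tensor; the function names `quantize_groupwise` and `dequantize_groupwise`, the toy weight matrix, and the injected outlier are all hypothetical and not taken from any cited work.

```python
import numpy as np

def quantize_groupwise(x, group_size, bits):
    """Asymmetric min-max quantization with one (scale, zero-point) per group."""
    flat = x.reshape(-1)
    assert flat.size % group_size == 0, "sketch assumes the length divides evenly"
    groups = flat.reshape(-1, group_size)            # (num_groups, group_size)
    qmax = 2 ** bits - 1
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / qmax        # guard against constant groups
    zero = np.round(-lo / scale)
    q = np.clip(np.round(groups / scale) + zero, 0, qmax)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero, shape):
    """Map integer codes back to real values, group by group."""
    return (scale * (q - zero)).reshape(shape)

# Toy weights with a single large outlier (illustrative data only).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
w[0, 0] = 40.0

def mse(group_size):
    q, s, z = quantize_groupwise(w, group_size, bits=4)
    return float(np.mean((w - dequantize_groupwise(q, s, z, w.shape)) ** 2))

mse_tensor = mse(w.size)   # one group covering the whole tensor (layer-wise)
mse_group = mse(64)        # 64-element groups
print(f"layer-wise MSE: {mse_tensor:.4f}, group-wise MSE: {mse_group:.4f}")
```

With a single shared scale, the outlier stretches the quantization step for every weight; with 64-element groups it only affects the one group that contains it, which is the variance-reduction effect described above.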
2. Implementation Methodologies
Tensor Partitioning
Group-wise quantization is highly configurable in how groups are defined:
- Contiguous blocks: Most methods flatten tensors and partition into consecutive groups aligned to hardware SIMD widths (e.g., 32/64/128 elements). This maximizes vectorized kernel efficiency (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026).
- Per-kernel groups: In convolutional layers, a group may correspond to each output-channel kernel (Lou et al., 2019), capturing kernel-specific statistics.
- Block and cluster: BCQ (Elangovan et al., 7 Feb 2025) slices tensors into small blocks, then clusters blocks across the tensor, applying optimized codebooks per cluster.
- Unstructured or adaptive grouping: Recent work (Zheng et al., 3 Sep 2025) introduces algorithms that adaptively partition (possibly non-contiguous) entries into groups based on statistical similarity.
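The grouping choices listed above differ only in which indices share a scale. The sketch below expresses three of them as index arrays over a flattened tensor. It is illustrative only: the helper names are hypothetical, the sorted grouping is a simple stand-in for adaptive methods rather than the algorithm of any cited paper, and the toy convolution weights are drawn from a heavy-tailed distribution purely for demonstration.

```python
import numpy as np

def contiguous_groups(n, group_size):
    """Consecutive index blocks over the flattened tensor."""
    return np.arange(n).reshape(-1, group_size)

def per_kernel_groups(conv_weight):
    """One group per output-channel kernel of a (C_out, C_in, kH, kW) tensor."""
    c_out = conv_weight.shape[0]
    return np.arange(conv_weight.size).reshape(c_out, -1)

def sorted_groups(flat, group_size):
    """Non-contiguous groups: entries with similar values share a group."""
    return np.argsort(flat, kind="stable").reshape(-1, group_size)

def mean_group_range(flat, groups):
    """Average (max - min) per group; a proxy for the min-max step size."""
    vals = flat[groups]
    return float(np.mean(vals.max(axis=1) - vals.min(axis=1)))

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=(8, 4, 3, 3))      # heavy-tailed toy weights
flat = w.reshape(-1)                             # 288 entries

range_contig = mean_group_range(flat, contiguous_groups(flat.size, 32))
range_kernel = mean_group_range(flat, per_kernel_groups(w))
range_sorted = mean_group_range(flat, sorted_groups(flat, 32))
print(range_contig, range_kernel, range_sorted)
```

Value-sorted groups have the smallest average range among equal-size partitions, but they require storing which entry belongs to which group, a cost this sketch does not model.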
Scale/Zero-point Computation
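Quantization parameters are derived by solving per-group objectives, typically minimizing MSE between the original and quantized weights/activations. For group $i$ with values $x_i$, candidate scale $s$, candidate zero-point $z$, and bit-width $b$:

$$(s_i, z_i) = \arg\min_{s,\,z} \left\| x_i - s\left(\mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x_i}{s}\right) + z,\; 0,\; 2^{b} - 1\right) - z\right) \right\|_2^2.$$

Advanced methods further refine these via two-stage or coordinate descent optimizations that account for inter-group correlations and input statistics (Kim et al., 2 Feb 2026), or use post-training stochastic relaxation (e.g., Gumbel-Softmax in GSQ (Dadgarnia et al., 20 Apr 2026)).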
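A simple way to approximate the per-group MSE objective is a one-dimensional search over candidate scales. The sketch below assumes symmetric quantization (no zero-point) and a grid of clipping ratios applied to the group's absolute maximum; the function names, the number of candidates, and the toy group are hypothetical choices, not the procedure of any cited method.

```python
import numpy as np

def quant_dequant_sym(g, scale, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return scale * np.clip(np.round(g / scale), -qmax - 1, qmax)

def mse_scale_search(g, bits, num_candidates=80):
    """Pick the scale (via a clipping ratio in (0, 1]) with the lowest group MSE."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(float(np.abs(g).max()), 1e-12)
    best_scale, best_err = absmax / qmax, np.inf
    for ratio in np.linspace(1.0 / num_candidates, 1.0, num_candidates):
        scale = ratio * absmax / qmax
        err = float(np.mean((g - quant_dequant_sym(g, scale, bits)) ** 2))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

# One 128-element group with a single outlier (illustrative data only).
rng = np.random.default_rng(0)
group = rng.normal(size=128)
group[5] = 12.0

absmax_scale = float(np.abs(group).max()) / (2 ** 3 - 1)
absmax_err = float(np.mean((group - quant_dequant_sym(group, absmax_scale, 4)) ** 2))
best_scale, best_err = mse_scale_search(group, bits=4)
print(f"abs-max MSE: {absmax_err:.4f}, searched MSE: {best_err:.4f}")
```

Because the ratio 1.0 is among the candidates, the searched scale can never do worse than the plain abs-max scale on this objective; clipping the outlier typically trades one large error for many smaller rounding errors.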
Hardware Alignment
Groups are chosen to align with inference kernel memory layouts, ensuring low execution overhead when switching between scale factors. For example, group sizes of 32, 64, or 128 are matched to the vector units of CPUs/GPUs (Pan et al., 2024).
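To see why aligned contiguous groups keep the overhead low, note that a dot product over a group factors as $s\,(\sum_k q_k x_k - z \sum_k x_k)$, so the scale is applied once per group rather than once per element. The NumPy sketch below checks this identity against an explicitly dequantized matrix. It is an illustration under simplifying assumptions (floating-point activations, random codes and scales, a hypothetical `groupwise_matvec` helper) and does not reproduce any real inference kernel.

```python
import numpy as np

def groupwise_matvec(q, scale, zero, x, group_size):
    """Compute W_hat @ x with one scale multiply per (row, group).

    q:     (rows, cols) integer codes
    scale: (rows, cols // group_size) per-group scales
    zero:  (rows, cols // group_size) per-group zero-points
    """
    rows, cols = q.shape
    n_groups = cols // group_size
    qg = q.reshape(rows, n_groups, group_size).astype(np.float64)
    xg = x.reshape(n_groups, group_size)
    acc = np.einsum("rgk,gk->rg", qg, xg)      # sum_k q * x inside each group
    xsum = xg.sum(axis=1)                      # sum_k x, shared by all rows
    return np.sum(scale * (acc - zero * xsum), axis=1)

rng = np.random.default_rng(0)
rows, cols, group_size = 8, 256, 64
q = rng.integers(0, 16, size=(rows, cols))                       # 4-bit codes
scale = rng.uniform(0.01, 0.1, size=(rows, cols // group_size))
zero = rng.integers(0, 16, size=(rows, cols // group_size)).astype(np.float64)
x = rng.normal(size=cols)

y = groupwise_matvec(q, scale, zero, x, group_size)

# Reference: dequantize every weight first, then multiply.
w_hat = np.repeat(scale, group_size, axis=1) * (q - np.repeat(zero, group_size, axis=1))
y_ref = w_hat @ x
print(np.allclose(y, y_ref))
```

When the group size matches the width of the hardware's vector lanes, the inner accumulation maps onto whole vector operations and the scale switch happens only at group boundaries.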
In specialized settings, such as quantized Winograd convolution, only the scales of transform matrices are learned, preserving domain-specific invariants (Pan et al., 2024).
3. Domain-Specific Applications
LLMs
For LLMs, group-wise quantization (often with group sizes 32–128) has become the dominant approach in W4A4, INT3, or even binary regimes (Dadgarnia et al., 20 Apr 2026, Kim et al., 2 Feb 2026, Zhang et al., 2023, Elangovan et al., 7 Feb 2025). Recent advances include:
- Per-group adaptive data types: M-ANT (Hu et al., 26 Feb 2025) introduces a parameterized numeral system per group, adaptively interpolating between grid types (e.g., uniform, power-of-two, flint) to match local value distributions.
- Block clustering: BCQ (Elangovan et al., 7 Feb 2025) clusters blocks by similarity, learning per-cluster codebooks, which yields state-of-the-art compression/accuracy at W4A4.
- Dynamic unstructured grouping: Binary quantization (Zheng et al., 3 Sep 2025) sorts weights and adaptively assigns groups minimizing within-group variance, enabling nearly lossless 1-bit quantization.
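The idea behind the dynamic unstructured grouping bullet can be illustrated with a small greedy procedure: sort the weight magnitudes, start from small adjacent segments, and repeatedly merge the neighboring pair whose merge adds the least within-group squared error. This is a simplified sketch in the same spirit, not the algorithm of Zheng et al.; the function names, the initial segment size, and the sign-times-mean-magnitude binarization are assumptions made for illustration.

```python
import numpy as np

def segment_sse(csum, csum_sq, i, j):
    """Within-segment sum of squared deviations for sorted entries [i, j)."""
    total = csum[j] - csum[i]
    return (csum_sq[j] - csum_sq[i]) - total * total / (j - i)

def greedy_merge(values, num_groups, init_size=4):
    """Sort, then merge adjacent segments until num_groups remain."""
    order = np.argsort(values, kind="stable")
    v = values[order]
    csum = np.concatenate([[0.0], np.cumsum(v)])
    csum_sq = np.concatenate([[0.0], np.cumsum(v * v)])
    bounds = list(range(0, len(v), init_size)) + [len(v)]
    while len(bounds) - 1 > num_groups:
        # Increase in squared error if segment k is merged with segment k + 1.
        costs = [
            segment_sse(csum, csum_sq, bounds[k], bounds[k + 2])
            - segment_sse(csum, csum_sq, bounds[k], bounds[k + 1])
            - segment_sse(csum, csum_sq, bounds[k + 1], bounds[k + 2])
            for k in range(len(bounds) - 2)
        ]
        del bounds[int(np.argmin(costs)) + 1]
    return order, bounds

def binarize(values, order, bounds):
    """1-bit reconstruction: sign(w) times the mean magnitude of w's group."""
    out = np.empty_like(values)
    for i, j in zip(bounds[:-1], bounds[1:]):
        idx = order[i:j]
        out[idx] = np.sign(values[idx]) * np.abs(values[idx]).mean()
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=512)

order, bounds = greedy_merge(np.abs(w), num_groups=8)   # group by magnitude
mse_grouped = float(np.mean((w - binarize(w, order, bounds)) ** 2))
mse_single = float(np.mean((w - binarize(w, np.arange(w.size), [0, w.size])) ** 2))
print(f"single scale: {mse_single:.4f}, 8 adaptive groups: {mse_grouped:.4f}")
```

Grouping by magnitude makes the binarization error of each group equal to the variance of its magnitudes, so low-variance groups translate directly into low reconstruction error. The cost of recording group membership is not modeled here.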
Vision and Diffusion Models
In large vision or diffusion models, group-wise quantization is essential for handling heavy-tailed distributions and outliers—channel-wise or layer-wise quantizers severely degrade output quality. DGQ (Ryu et al., 8 Jan 2025) and GDRQ (Yu et al., 2019) dynamically assign groups per channel or pixel dimension, with scales matched to the local value spread to faithfully preserve extremes affecting perceptual fidelity.
Medical Foundation Models
Permutation-COMQ (Chen et al., 9 Apr 2026) enhances per-channel quantization by permuting weights before quantization so that each group (column) comprises weights of similar magnitude, which reduces the impact of outliers on scale selection and benefits medical segmentation tasks.
Transformers & Vision Transformers
Instance-aware group quantization (Moon et al., 2024) dynamically clusters activation channels on a per-instance basis and applies group-specific quantization, greatly improving quantized ViT performance under 4/4-bit constraints.
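A rough picture of per-instance channel grouping: for each input, measure the dynamic range of every activation channel, put channels with similar ranges into the same group, and give each group its own scale. The sketch below is a simplified stand-in, assuming equal-size groups formed by sorting channel ranges and symmetric quantization; the cited method's actual assignment rule is not reproduced, and `instance_group_quant` and the toy activations are hypothetical.

```python
import numpy as np

def instance_group_quant(x, num_groups, bits):
    """Quantize-dequantize one input's activations, x of shape (tokens, channels).

    Channels are grouped by their dynamic range on THIS input, so the
    grouping can differ from one instance to the next.
    """
    qmax = 2 ** (bits - 1) - 1
    ch_absmax = np.abs(x).max(axis=0)                  # per-channel range
    order = np.argsort(ch_absmax, kind="stable")
    x_hat = np.empty_like(x)
    for ch_idx in np.array_split(order, num_groups):   # channels with similar range
        scale = max(float(ch_absmax[ch_idx].max()), 1e-12) / qmax
        x_hat[:, ch_idx] = scale * np.clip(
            np.round(x[:, ch_idx] / scale), -qmax - 1, qmax
        )
    return x_hat

# Toy activations with two high-magnitude channels (illustrative data only).
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
x[:, [3, 17]] *= 50.0

mse_one_group = float(np.mean((x - instance_group_quant(x, 1, bits=4)) ** 2))
mse_eight_groups = float(np.mean((x - instance_group_quant(x, 8, bits=4)) ** 2))
print(f"1 group: {mse_one_group:.4f}, 8 groups: {mse_eight_groups:.4f}")
```

With a single group, the two large channels set the step size for all 64 channels; with eight groups they only affect the group they land in.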
4. Optimization Algorithms
The methods for finding optimal groupings and quantization parameters have advanced significantly:
- Greedy, dynamic, and windowed grouping: Algorithms range from dynamic programming (exact, cubic complexity) to windowed greedy merge (efficient, scalable) for adaptively partitioning tensors into low-variance groups (Zheng et al., 3 Sep 2025).
- Clustered assignment: Blocks are clustered via K-means or Lloyd-Max to maximize codebook utility (BCQ (Elangovan et al., 7 Feb 2025)).
- Gumbel-Softmax relaxation: In GSQ (Dadgarnia et al., 20 Apr 2026), per-group scales and per-weight assignments are learned by differentiable, noise-annealed softmax relaxation of the discrete grid, then collapsed to hard assignments.
- Two-stage coordinate descent/refinement: Kim et al. (2 Feb 2026) propose a two-phase algorithm: initialization using group-wise input statistics, followed by closed-form coordinate-descent refinement using the full block Hessian to minimize the true layer-wise loss.
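The Gumbel-Softmax relaxation mentioned above replaces the hard choice of a grid point with a temperature-controlled probability distribution over grid points. The forward-pass sketch below only illustrates that mechanism: it assumes logits given by negative squared distance to each grid point, a hypothetical annealing schedule, and no gradient-based learning, so it should not be read as the GSQ training procedure.

```python
import numpy as np

def soft_quantize(w, scale, levels, tau, noise_scale, rng):
    """Relaxed assignment of each weight to the grid scale * levels.

    Returns the soft (expected) quantized values and the assignment
    probabilities of shape (num_weights, num_levels).
    """
    grid = scale * levels
    logits = -((w[:, None] - grid[None, :]) ** 2)     # nearer grid point -> larger logit
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                      # standard Gumbel samples
    z = (logits + noise_scale * gumbel) / tau
    z -= z.max(axis=1, keepdims=True)                 # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p @ grid, p

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=64)                    # one group of weights
levels = np.arange(-2, 2, dtype=np.float64)           # 2-bit grid {-2, -1, 0, 1}
scale = 0.5

# Hypothetical schedule: lower the temperature and the noise together.
for tau, noise_scale in [(1.0, 1.0), (0.1, 0.3), (0.01, 0.0)]:
    w_soft, p = soft_quantize(w, scale, levels, tau, noise_scale, rng)

# Collapse to hard assignments once the relaxation has been annealed.
w_hard = scale * levels[p.argmax(axis=1)]
print(float(np.mean((w - w_hard) ** 2)))
```

At high temperature the output is a smooth blend of grid points, which is what makes the assignment differentiable; with the noise removed, the most probable grid point for each weight is simply the nearest one.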
5. Comparative Empirical Performance
Group-wise quantization consistently narrows the performance gap to full-precision baselines at low bit-widths:
- LLMs (2–4 bits): M-ANT (Hu et al., 26 Feb 2025) at group size 64 achieves <0.2 PPL loss on LLaMA-1/2, with 2–4× throughput/energy improvement over fixed-type or INT-only schemes, and GSQ (Dadgarnia et al., 20 Apr 2026) achieves 4–5 point accuracy gains over prior scalar quantizers at 2b.
- Diffusion models: Fully quantized Winograd convolution with group-wise scales (Pan et al., 2024) achieves FID and CLIP scores within 1–4 points of FP16, outperforming all prior Winograd quantization.
- Computer vision: GDRQ (Yu et al., 2019) and DGQ (Ryu et al., 8 Jan 2025) maintain accuracy within 1% of float for classification/detection, and group quantization is indispensable for handling rare but impactful activation outliers.
A representative summary of empirical gains:
| Method / Paper | Architecture | Setting | Metric | Gap to FP16/32 / Notes |
|---|---|---|---|---|
| GSQ (Dadgarnia et al., 20 Apr 2026) | LLaMA-8B/70B | 2b/3b | Zero-shot accuracy | +4.8 points over GPTQ at 2b; gap to FP ≈1–2% |
| BCQ (Elangovan et al., 7 Feb 2025) | LLaMA2-70B | W4A4 | Perplexity | Δ=0.09, accuracy loss <1% |
| DGQ (Ryu et al., 8 Jan 2025) | Stable Diffusion, UNet | 8b W/A | FID, CLIP | FID gap −1.3, CLIP gap −0.001 |
| M-ANT (Hu et al., 26 Feb 2025) | LLaMA-1/2/OPT | INT4 | PPL | <0.2 PPL loss, up to 4x speedup |
| Permutation-COMQ (Chen et al., 9 Apr 2026) | MedSAM (ViT-B) | 2b/4b W | DSC/NSD (segmentation) | DSC: −0.07 at 4b; near-float at 2b |
6. Extensions and Theoretical Considerations
Outlier Handling and Distribution Reshaping
Group-wise quantization schemes often integrate outlier detection and distribution reshaping:
- Activation/weight outlier preservation: Cluster-based grouping dynamically isolates outlier vectors, preventing over-compression of extreme values that can compromise output fidelity (DGQ (Ryu et al., 8 Jan 2025), GDRQ (Yu et al., 2019)).
- Distribution reshaping: Scale-Clip in GDRQ drives each group toward a uniform distribution (by adaptive clipping), making quantization more robust at low bits.
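Adaptive clipping of the kind described above can be sketched by tying the clipping threshold to a group statistic. The example below assumes the threshold is a multiple `k` of the group's mean absolute value and uses symmetric quantization; that rule, the default `k`, and the function name `scale_clip_quantize` are assumptions for illustration and may differ from the Scale-Clip formulation in GDRQ.

```python
import numpy as np

def scale_clip_quantize(g, bits, k=2.5):
    """Clip a group at k * mean(|g|), then quantize uniformly inside the clip range.

    Returns the dequantized values, the integer codes, and the threshold.
    """
    qmax = 2 ** (bits - 1) - 1
    threshold = max(k * float(np.abs(g).mean()), 1e-12)   # assumed clipping rule
    scale = threshold / qmax
    q = np.round(np.clip(g, -threshold, threshold) / scale)
    return scale * q, q, threshold

# Heavy-tailed toy group (illustrative data only).
rng = np.random.default_rng(0)
group = rng.laplace(size=256)

g_hat, codes, threshold = scale_clip_quantize(group, bits=4)
levels_used = np.unique(codes).size
print(f"threshold: {threshold:.3f}, distinct levels used: {levels_used}")
```

Clipping the tails concentrates the available levels on the bulk of the distribution, so more of the integer grid is actually used, at the price of saturating the few values beyond the threshold.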
Mixed-Precision and Adaptive Types
Emerging work explores per-group adaptive numeric types beyond fixed INT grids (e.g., M-ANT’s parametric polynomial grid (Hu et al., 26 Feb 2025)). Methods also allow mixed-precision assignment per group, finding layer- and group-specific bit-widths to further optimize the accuracy/compression trade-off (Lou et al., 2019).
Dynamic/Instance-Aware Grouping
For architectures with highly instance-dependent activation distributions (e.g., ViTs), online grouping and group-size adaptation per input have been shown to substantially outperform static grouping (Moon et al., 2024).
7. Practical Implications and Limitations
Group-wise quantization is now standard in high-performance model deployment pipelines, offering:
- Enhanced accuracy at low bit-width with negligible additional inference cost when group sizes align with hardware data paths (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026).
- Robustness to outlier-induced quantization artifacts in both weights and activations (Ryu et al., 8 Jan 2025).
- High compatibility with modern PTQ workflows, including efficient calibration and minimal code change from channel-wise baselines (Ichikawa et al., 1 Dec 2025, Dadgarnia et al., 20 Apr 2026).
- Empirically, the total storage overhead for per-group metadata is marginal (e.g., ~0.007 bits/weight in WGM (Zheng et al., 3 Sep 2025)).
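The metadata cost of a plain group-wise scheme is easy to estimate: each group of $g$ weights carries one scale and, optionally, one zero-point, so the average storage per weight is the weight bit-width plus the metadata bits divided by $g$. The helper below is a back-of-the-envelope sketch; the 16-bit scale and the absence of a stored zero-point are assumed defaults, and it does not model the WGM scheme cited above.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage per weight once per-group metadata is amortized."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

for g in (32, 64, 128):
    print(g, effective_bits_per_weight(4, g))
```

Under these assumptions, 4-bit weights cost 4.5, 4.25, and 4.125 bits per weight at group sizes 32, 64, and 128, which shows how quickly the overhead shrinks as groups grow.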
Limitations include slight increases in parameter storage and kernel complexity when group sizes are very small, and the need for hardware support to efficiently handle scales and potential type heterogeneity. This suggests that further research on mixed-type and group-wise fused compute kernels could unlock even greater benefits.
In summary, group-wise quantization substantially improves quantized model fidelity, especially in low-precision and resource-constrained settings, and has become central to both training-free and hardware-optimized compression pipelines across domains (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026, Kim et al., 2 Feb 2026, Elangovan et al., 7 Feb 2025, Zheng et al., 3 Sep 2025).