
Group-wise Quantization Techniques

Updated 26 May 2026
  • Group-wise quantization is the technique of partitioning tensors into non-overlapping groups, each with its own scale and zero-point, thereby reducing quantization error in low-bit regimes.
  • It employs configurable grouping methods, such as contiguous blocks, per-kernel groups, and adaptive clustering, to align with hardware efficiency and optimize numerical precision.
  • Its applications in large language models, diffusion models, and vision transformers demonstrate near full-precision accuracy with improved speed and reduced energy consumption.

Group-wise quantization is a quantization paradigm in which a weight or activation tensor is partitioned into multiple non-overlapping groups, and each group is quantized independently—typically with its own scale and/or zero-point. This fine-grained approach sharply reduces quantization error compared to layer-wise or channel-wise quantization, especially under low-bit regimes, and has become the prevailing method for quantizing large neural networks such as diffusion models, LLMs, and vision transformers, supporting both high-fidelity and efficient inference (Pan et al., 2024).

1. Mathematical Formulation and Core Principles

Given any vectorized tensor $X\in\mathbb{R}^N$ (weight or activation), group-wise quantization partitions it into $M$ groups of size $G$:

$$X = \bigl[ X^{(1)},\, X^{(2)},\,\dots,\,X^{(M)} \bigr],\quad X^{(m)}\in\mathbb{R}^G,\quad M=N/G.$$

For each group, a scale $s_m$ and (optionally) zero-point $z_m$ are adaptively computed:

$$s_m = \frac{\max_i x_i - \min_i x_i}{q_{\max} - q_{\min}},\quad z_m = \min_i x_i.$$

The quantization and dequantization equations per group are

$$x^{(m)}_{\mathrm{int},i} = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x_i-z_m}{s_m}\right), q_{\min}, q_{\max}\right),$$

$$\hat{x}_i = s_m\, x^{(m)}_{\mathrm{int},i} + z_m.$$
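The quantize/dequantize equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any cited paper's implementation; the function names, the choice of $q_{\min}=0$, $q_{\max}=2^b-1$, and the guard for constant groups are assumptions made for the example.

```python
import numpy as np

def groupwise_quantize(x, group_size, n_bits=4):
    """Asymmetric min/max quantization of a 1-D array, one (scale, zero-point) per group."""
    q_min, q_max = 0, 2**n_bits - 1               # assumed unsigned integer grid
    groups = x.reshape(-1, group_size)            # requires len(x) % group_size == 0
    z = groups.min(axis=1, keepdims=True)         # zero-point z_m = min_i x_i
    s = (groups.max(axis=1, keepdims=True) - z) / (q_max - q_min)
    s = np.where(s == 0, 1.0, s)                  # guard: constant group would give s = 0
    x_int = np.clip(np.round((groups - z) / s), q_min, q_max).astype(np.int32)
    return x_int, s, z

def groupwise_dequantize(x_int, s, z):
    """Reconstruct x_hat = s_m * x_int + z_m and flatten back to 1-D."""
    return (s * x_int + z).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=256).astype(np.float32)
x_int, s, z = groupwise_quantize(x, group_size=32, n_bits=4)
x_hat = groupwise_dequantize(x_int, s, z)
```

With min/max scaling, no value is clipped, so the reconstruction error of every element is bounded by half the step size $s_m$ of its own group.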

This quantization is performed for each group independently; scale granularity can be per group, per block, or even adaptive across unstructured subsets as in recent binary quantization (Zheng et al., 3 Sep 2025).

Unlike traditional channel-wise (per-column) or layer-wise (whole tensor) quantization, group-wise schemes capitalize on the reduced intra-group variance, leading to better quantization error/accuracy trade-offs, particularly in low-precision settings (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026, Elangovan et al., 7 Feb 2025).
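The effect of reduced intra-group variance can be illustrated on synthetic data. The sketch below compares a single scale for the whole tensor against one scale per group of 64 values; the data generator (blocks with different standard deviations) is an assumption chosen to mimic heterogeneous local statistics and is not drawn from any cited experiment.

```python
import numpy as np

def minmax_quant_error(x, group_size, n_bits=4):
    """Mean squared error of min/max quantization with the given group size."""
    levels = 2**n_bits - 1
    g = x.reshape(-1, group_size)
    z = g.min(axis=1, keepdims=True)
    s = (g.max(axis=1, keepdims=True) - z) / levels
    s = np.where(s == 0, 1.0, s)                   # guard constant groups
    g_hat = s * np.clip(np.round((g - z) / s), 0, levels) + z
    return float(np.mean((g - g_hat) ** 2))

rng = np.random.default_rng(0)
# Synthetic weights: 64 blocks of 64 values, each block with its own spread.
block_std = rng.uniform(0.01, 1.0, size=(64, 1))
w = (rng.normal(size=(64, 64)) * block_std).reshape(-1)

err_tensor = minmax_quant_error(w, group_size=w.size)  # layer-wise: one scale
err_group = minmax_quant_error(w, group_size=64)       # group-wise: 64 scales
```

Because each group's range can never exceed the range of the full tensor, the per-group step sizes are at most as large as the single layer-wise step, which is where the error reduction comes from.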

2. Implementation Methodologies

Tensor Partitioning

Group-wise quantization is highly configurable in how groups are defined:

  • Contiguous blocks: Most methods flatten tensors and partition into consecutive groups aligned to hardware SIMD widths (e.g., 32/64/128 elements). This maximizes vectorized kernel efficiency (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026).
  • Per-kernel groups: In convolutional layers, a group may correspond to each output-channel kernel (Lou et al., 2019), capturing kernel-specific statistics.
  • Block and cluster: BCQ (Elangovan et al., 7 Feb 2025) slices tensors into small blocks, then clusters blocks across the tensor, applying optimized codebooks per cluster.
  • Unstructured or adaptive grouping: Recent work (Zheng et al., 3 Sep 2025) introduces algorithms that adaptively partition (possibly non-contiguous) entries into groups based on statistical similarity.
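The two simplest partitioning schemes in the list above amount to reshapes. The following sketch assumes a linear weight stored as (out_features, in_features) and a convolution weight stored as (out_channels, in_channels, kh, kw); the helper names are hypothetical.

```python
import numpy as np

def contiguous_groups(w, group_size):
    """View a 2-D weight (out, in) as (out, in // group_size, group_size)."""
    out_f, in_f = w.shape
    if in_f % group_size:
        raise ValueError("in_features must be divisible by group_size")
    return w.reshape(out_f, in_f // group_size, group_size)

def per_kernel_groups(w):
    """View a conv weight (out_c, in_c, kh, kw) as one group per output-channel kernel."""
    return w.reshape(w.shape[0], -1)

linear_w = np.arange(4 * 256, dtype=np.float32).reshape(4, 256)
conv_w = np.zeros((16, 3, 3, 3), dtype=np.float32)

g_lin = contiguous_groups(linear_w, 128)   # 2 groups of 128 per output row
g_conv = per_kernel_groups(conv_w)         # 16 groups of 27 weights each
absmax_lin = np.abs(g_lin).max(axis=-1)    # one statistic per group, shape (4, 2)
```

Keeping the group axis last means per-group statistics are a single reduction over `axis=-1`, which is the same layout vectorized kernels typically rely on.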

Scale/Zero-point Computation

Quantization parameters are derived by solving per-group objectives, typically minimizing MSE between the original and quantized weights/activations. For group $m$:

$$(s_m, z_m) = \arg\min_{s,\,z} \sum_{i} \bigl(x_i - \hat{x}_i(s, z)\bigr)^2,$$

where $\hat{x}_i(s, z)$ is the dequantized value obtained with scale $s$ and zero-point $z$, and $i$ runs over the entries of group $m$.

Advanced methods further refine these via two-stage or coordinate descent optimizations that account for inter-group correlations and input statistics (Kim et al., 2 Feb 2026), or use post-training stochastic relaxation (e.g., Gumbel-Softmax in GSQ (Dadgarnia et al., 20 Apr 2026)).
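A simple way to approximate a per-group MSE objective is a grid search over clipping ratios. The sketch below is a generic baseline under stated assumptions, not the two-stage or Gumbel-Softmax procedures cited above: it shrinks each group's [min, max] range symmetrically by a ratio r and keeps the ratio with the lowest reconstruction error. The function name and the ratio grid are hypothetical.

```python
import numpy as np

def mse_clip_search(x, group_size, n_bits=4, ratios=None):
    """Per-group search over clipping ratios; returns (scale, zero_point, mse) per group."""
    if ratios is None:
        ratios = np.linspace(0.5, 1.0, 26)      # r = 1.0 reproduces plain min/max
    levels = 2**n_bits - 1
    g = x.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    mid, half = (hi + lo) / 2, (hi - lo) / 2
    best_err = np.full((g.shape[0], 1), np.inf)
    best_s, best_z = np.ones_like(lo), lo.copy()
    for r in ratios:
        z = mid - r * half                       # clipped lower bound
        s = np.maximum(2 * r * half / levels, 1e-12)
        g_hat = s * np.clip(np.round((g - z) / s), 0, levels) + z
        err = np.mean((g - g_hat) ** 2, axis=1, keepdims=True)
        better = err < best_err                  # keep the best ratio per group
        best_err = np.where(better, err, best_err)
        best_s = np.where(better, s, best_s)
        best_z = np.where(better, z, best_z)
    return best_s, best_z, best_err.ravel()

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=1024)              # heavy-tailed synthetic data
s_opt, z_opt, err_opt = mse_clip_search(x, group_size=64)
_, _, err_minmax = mse_clip_search(x, group_size=64, ratios=np.array([1.0]))
```

Since the candidate set includes r = 1.0, the searched parameters can never be worse than min/max scaling on the data they were fitted to; the gain tends to be largest for groups containing a few extreme values.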

Hardware Alignment

Groups are chosen to align with inference kernel memory layouts, ensuring low execution overhead when switching between scale factors. For example, group sizes of 32, 64, or 128 are matched to the vector units of CPUs/GPUs (Pan et al., 2024).

In specialized settings, such as quantized Winograd convolution, only the scales of transform matrices are learned, preserving domain-specific invariants (Pan et al., 2024).

3. Domain-Specific Applications

LLMs

For LLMs, group-wise quantization (often with group sizes 32–128) has become the dominant approach in W4A4, INT3, or even binary regimes (Dadgarnia et al., 20 Apr 2026, Kim et al., 2 Feb 2026, Zhang et al., 2023, Elangovan et al., 7 Feb 2025). Recent advances include:

  • Per-group adaptive data types: M-ANT (Hu et al., 26 Feb 2025) introduces a parameterized numeral system per group, adaptively interpolating between grid types (e.g., uniform, power-of-two, flint) to match local value distributions.
  • Block clustering: BCQ (Elangovan et al., 7 Feb 2025) clusters blocks by similarity, learning per-cluster codebooks, which yields state-of-the-art compression/accuracy at W4A4.
  • Dynamic unstructured grouping: Binary quantization (Zheng et al., 3 Sep 2025) sorts weights and adaptively assigns groups minimizing within-group variance, enabling nearly lossless 1-bit quantization.

Vision and Diffusion Models

In large vision or diffusion models, group-wise quantization is essential for handling heavy-tailed distributions and outliers—channel-wise or layer-wise quantizers severely degrade output quality. DGQ (Ryu et al., 8 Jan 2025) and GDRQ (Yu et al., 2019) dynamically assign groups per channel or pixel dimension, with scales matched to the local value spread to faithfully preserve extremes affecting perceptual fidelity.

Medical Foundation Models

Permutation-COMQ (Chen et al., 9 Apr 2026) further enhances per-channel quantization by permuting weights prior to quantization so that each group (column) comprises weights of similar magnitude, reducing the impact of outliers on scale selection and benefiting medical segmentation tasks.
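The idea of permuting weights so that each group holds similar values can be illustrated with a plain sort. This is only a sketch of the general principle under simplifying assumptions (a flat weight vector, sorting by value, min/max scaling); it is not the Permutation-COMQ algorithm, and the helper names are hypothetical.

```python
import numpy as np

def quant_groups(g, levels):
    """Min/max quantize-dequantize each row of g independently."""
    z = g.min(axis=1, keepdims=True)
    s = np.maximum((g.max(axis=1, keepdims=True) - z) / levels, 1e-12)
    return s * np.clip(np.round((g - z) / s), 0, levels) + z

def permuted_group_quant(w, group_size, n_bits=4):
    """Sort weights, quantize contiguous groups of the sorted vector, then undo the sort."""
    levels = 2**n_bits - 1
    perm = np.argsort(w)                          # similar values become neighbours
    w_hat_sorted = quant_groups(w[perm].reshape(-1, group_size), levels).reshape(-1)
    w_hat = np.empty_like(w)
    w_hat[perm] = w_hat_sorted                    # scatter back to original positions
    return w_hat, perm

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
w_plain = quant_groups(w.reshape(-1, 64), 15).reshape(-1)
w_perm, perm = permuted_group_quant(w, group_size=64)
err_plain = np.mean((w - w_plain) ** 2)
err_perm = np.mean((w - w_perm) ** 2)
```

The trade-off is that the permutation must be stored or folded into adjacent layers, which is why practical schemes constrain how entries may be reordered.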

Transformers & Vision Transformers

Instance-aware group quantization (Moon et al., 2024) dynamically clusters activation channels on a per-instance basis and applies group-specific quantization, greatly improving quantized ViT performance under 4/4-bit constraints.

4. Optimization Algorithms

The methods for finding optimal groupings and quantization parameters have advanced significantly:

  • Greedy, dynamic, and windowed grouping: Algorithms span from dynamic programming (exact, cubic complexity) to windowed greedy merge (efficient, scalable) for adaptively partitioning tensors into low-variance groups (Zheng et al., 3 Sep 2025).
  • Clustered assignment: Blocks are clustered via K-means or Lloyd-Max to maximize codebook utility (BCQ (Elangovan et al., 7 Feb 2025)).
  • Gumbel-Softmax relaxation: In GSQ (Dadgarnia et al., 20 Apr 2026), per-group scales and per-weight assignments are learned by differentiable, noise-annealed softmax relaxation of the discrete grid, then collapsed to hard assignments.
  • Two-stage coordinate descent/refinement: Kim et al. (2 Feb 2026) propose a two-phase algorithm: initialization using group-wise input statistics, followed by closed-form coordinate-descent refinement using the full block Hessian to minimize the true layer-wise loss.
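To make the exact-partitioning idea from the first bullet concrete, the following sketch splits sorted values into k contiguous groups so that the total within-group squared deviation is minimal. It is a textbook dynamic program written for illustration, with O(k·n²) cost in this form; it is not claimed to match the formulation or complexity of the cited work, and the function name is hypothetical.

```python
import numpy as np

def optimal_sorted_partition(values, k):
    """Return (sorted values, group boundaries, total within-group squared deviation)."""
    v = np.sort(np.asarray(values, dtype=np.float64))
    n = len(v)
    p1 = np.concatenate([[0.0], np.cumsum(v)])        # prefix sums
    p2 = np.concatenate([[0.0], np.cumsum(v * v)])    # prefix sums of squares

    def cost(i, j):
        """Sum of squared deviations of v[i:j] from its own mean."""
        total = p1[j] - p1[i]
        return (p2[j] - p2[i]) - total * total / (j - i)

    dp = np.full((k + 1, n + 1), np.inf)              # dp[g, j]: best cost for v[:j] in g groups
    cut = np.zeros((k + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = dp[g - 1, i] + cost(i, j)
                if c < dp[g, j]:
                    dp[g, j], cut[g, j] = c, i
    bounds, j = [n], n                                # backtrack the chosen cut points
    for g in range(k, 0, -1):
        j = int(cut[g, j])
        bounds.append(j)
    return v, bounds[::-1], float(dp[k, n])

v, bounds, total = optimal_sorted_partition([5.1, 0.0, 9.0, 0.2, 5.0, 0.1], k=3)
# bounds == [0, 3, 5, 6]: groups {0.0, 0.1, 0.2}, {5.0, 5.1}, {9.0}
```

A greedy merge over a sliding window trades this exactness for near-linear cost, which is the scalability argument made in the bullet above.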

5. Comparative Empirical Performance

Group-wise quantization consistently narrows the performance gap to full-precision baselines at low bit-widths:

  • LLMs (2–4 bits): M-ANT (Hu et al., 26 Feb 2025) at group size 64 achieves <0.2 PPL loss on LLaMA-1/2, with 2–4× throughput/energy improvement over fixed-type or INT-only schemes, and GSQ (Dadgarnia et al., 20 Apr 2026) achieves 4–5 point accuracy gains over prior scalar quantizers at 2b.
  • Diffusion models: Fully quantized Winograd convolution with group-wise scales (Pan et al., 2024) achieves FID and CLIP scores within 1–4 points of FP16, outperforming all prior Winograd quantization.
  • Computer vision: GDRQ (Yu et al., 2019) and DGQ (Ryu et al., 8 Jan 2025) maintain accuracy within 1% of float for classification/detection, and group quantization is indispensable for handling rare but impactful activation outliers.

A representative summary of empirical gains:

| Method / Paper | Architecture | Setting | Metric | Gap to FP16/32 / Notes |
|---|---|---|---|---|
| GSQ (Dadgarnia et al., 20 Apr 2026) | LLaMA-8B/70B | 2b/3b | Zero-shot accuracy | 2b GSQ: +4.8p over GPTQ; Δ≈1–2% FP |
| BCQ (Elangovan et al., 7 Feb 2025) | LLaMA2-70B | W4A4 | Perplexity | Δ=0.09, accuracy loss <1% |
| DGQ (Ryu et al., 8 Jan 2025) | Stable Diffusion, UNet | 8b W/A | FID, CLIP | FID gap −1.3, CLIP gap −0.001 |
| M-ANT (Hu et al., 26 Feb 2025) | LLaMA-1/2/OPT | INT4 | PPL | <0.2 PPL loss, up to 4x speedup |
| Permutation-COMQ (Chen et al., 9 Apr 2026) | MedSAM (ViT-B) | 2b/4b W | DSC/NSD (segmentation) | DSC: −0.07 at 4b; near-float at 2b |

6. Extensions and Theoretical Considerations

Outlier Handling and Distribution Reshaping

Group-wise quantization schemes often integrate outlier detection and distribution reshaping:

  • Activation/weight outlier preservation: Cluster-based grouping dynamically isolates outlier vectors, preventing over-compression of extreme values that can compromise output fidelity (DGQ (Ryu et al., 8 Jan 2025), GDRQ (Yu et al., 2019)).
  • Distribution reshaping: Scale-Clip in GDRQ drives each group toward a uniform distribution (by adaptive clipping), making quantization more robust at low bits.

Mixed-Precision and Adaptive Types

Emerging work explores per-group adaptive numeric types beyond fixed INT grids (e.g., M-ANT’s parametric polynomial grid (Hu et al., 26 Feb 2025)). Methods also allow mixed-precision assignment per group, finding layer- and group-specific bit-widths to further optimize the accuracy/compression trade-off (Lou et al., 2019).

Dynamic/Instance-Aware Grouping

For architectures with highly instance-dependent activation distributions (e.g., ViTs), online grouping and group-size adaptation per input have been shown to substantially outperform static grouping (Moon et al., 2024).
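A rough sketch of per-instance grouping is given below. It assumes activations of shape (tokens, channels) for a single input, groups channels by their dynamic range computed on that input, and shares one scale and zero-point per group. This is a simplified stand-in for illustration only, not the method of Moon et al.; the function name and the equal-size split are assumptions.

```python
import numpy as np

def instance_grouped_quant(act, n_groups=4, n_bits=4):
    """Quantize-dequantize one instance's activations with channels grouped by range."""
    levels = 2**n_bits - 1
    ch_range = act.max(axis=0) - act.min(axis=0)   # per-channel range for this input
    order = np.argsort(ch_range)                   # similar-range channels become adjacent
    out = np.empty_like(act)
    for idx in np.array_split(order, n_groups):    # equal-size groups of channel indices
        block = act[:, idx]
        z = block.min()
        s = max((block.max() - z) / levels, 1e-12)
        out[:, idx] = s * np.clip(np.round((block - z) / s), 0, levels) + z
    return out

rng = np.random.default_rng(0)
channel_scale = np.logspace(-2, 0, 64)             # synthetic: channel magnitudes span 100x
act = rng.normal(size=(16, 64)) * channel_scale
err_single = np.mean((act - instance_grouped_quant(act, n_groups=1)) ** 2)
err_grouped = np.mean((act - instance_grouped_quant(act, n_groups=4)) ** 2)
```

Because the grouping is recomputed from each input's own statistics, the channel-to-group assignment can differ between instances, at the cost of extra work at inference time.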

7. Practical Implications and Limitations

Group-wise quantization is now standard in high-performance model deployment pipelines.

Limitations include slight increases in parameter storage and kernel complexity when group sizes are very small, and the need for hardware support to efficiently handle scales and potential type heterogeneity. This suggests that further research on mixed-type and group-wise fused compute kernels could unlock even greater benefits.
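The storage overhead of small groups is easy to quantify. The sketch below assumes each group stores one 16-bit scale and, optionally, a zero-point alongside its b-bit values; the 16-bit figure is an illustrative assumption, not a property of any specific format discussed above.

```python
def effective_bits(n_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits stored per element, including per-group metadata."""
    return n_bits + (scale_bits + zero_point_bits) / group_size

# 4-bit weights with a 16-bit scale per group (assumed metadata size).
overhead = {g: effective_bits(4, g) for g in (32, 64, 128)}
# overhead == {32: 4.5, 64: 4.25, 128: 4.125}
```

Halving the group size doubles the metadata cost per element, so a group of 32 at 4 bits already spends one ninth of its storage on scales under these assumptions.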

In summary, group-wise quantization delivers order-of-magnitude improvements in quantized model fidelity in deep learning, especially in low-precision and resource-constrained domains, and has reshaped the technical landscape of both training-free and hardware-optimized compression pipelines across domains (Pan et al., 2024, Dadgarnia et al., 20 Apr 2026, Kim et al., 2 Feb 2026, Elangovan et al., 7 Feb 2025, Zheng et al., 3 Sep 2025).
