Groupwise Quantization in Neural Networks

Updated 13 November 2025
  • Groupwise quantization is a method that divides neural network weights, activations, or latent representations into groups, each with its own quantization parameters, to minimize quantization error.
  • It is applied across CNNs, VQ-VAEs, transformers, and diffusion models, employing per-group scaling and tailored clipping techniques to capture local value distributions.
  • Empirical results show that groupwise quantization reduces error and boosts accuracy in low-bit models, while enabling efficient hardware implementation through optimal grouping strategies.

Groupwise quantization refers to quantization schemes in which neural network weights, activations, or latent representations are partitioned into groups, with each group assigned its own quantization parameters (such as scale and zero-point, or quantizer function). This approach contrasts with per-layer (coarse-grained) or per-channel quantization, offering a finer tradeoff between quantization error, computational overhead, and hardware implementation efficiency. Groupwise quantization has been applied to supervised learning (e.g., CNNs), generative models (e.g., VQ-VAEs), and transformers/LLMs, as well as specialized operations such as Winograd convolutions and text-to-image diffusion models. It underpins high-performance low-bit model deployment, is essential for pushing bitwidths below 8 bits, and makes extremely efficient inference practical.

1. Formal Definition and Core Principles

Groupwise quantization divides a set of values (weights, activations, or code vectors) into $G$ disjoint groups. For each group, a separate quantization function or set of parameters (scale/zero-point or more advanced quantizers) is derived; a code sketch follows the list:

  • Let $x \in \mathbb{R}^N$ be quantized to $b$-bit integers in $[q_{\min}, q_{\max}]$.
  • Partition: $\{G_g\}_{g=1}^{G}$ with $|G_g| \approx N/G$.
  • For group $g$, compute $s_g$ (scale) and (optionally) $z_g$ (zero-point).
  • Quantize: $q_{g,i} = \mathrm{clip}(\mathrm{round}((x_{g,i} - z_g)/s_g),\, q_{\min}, q_{\max})$ for $i \in G_g$.
  • Dequantize: $\hat{x}_{g,i} = s_g q_{g,i} + z_g$.
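
The following minimal sketch (NumPy, symmetric quantization of a padded 1-D tensor) shows these steps end to end. The group size, bitwidth, and helper name are illustrative assumptions rather than code from any cited paper.

```python
import numpy as np

def groupwise_quantize(x: np.ndarray, group_size: int = 64, bits: int = 4):
    """Symmetric groupwise quantization of a 1-D tensor (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    n = x.size
    pad = (-n) % group_size                 # pad so the tensor splits evenly into groups
    xp = np.concatenate([x, np.zeros(pad)]).reshape(-1, group_size)

    # One scale per group, chosen so the group's max magnitude maps to qmax.
    scales = np.abs(xp).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0               # avoid division by zero for all-zero groups

    q = np.clip(np.round(xp / scales), -qmax - 1, qmax)    # quantize
    x_hat = (q * scales).reshape(-1)[:n]                    # dequantize
    return q, scales, x_hat

# Example: quantize random weights and measure the reconstruction error.
w = np.random.randn(1000).astype(np.float32)
_, _, w_hat = groupwise_quantize(w, group_size=64, bits=4)
print("mean abs error:", np.abs(w - w_hat).mean())
```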

Groupwise quantization generalizes per-tensor (single scale), per-channel (one scale per channel), and other partitionings. The principal goal is to reduce the quantization error by matching scaling to value distribution variability, especially where global (coarse) quantization would be dominated by outlier values in a few groups.

In weight quantization, groups are typically contiguous rows (e.g., grouping filters in a conv layer (Yu et al., 2019), slices in LLM weight matrices (Zhang et al., 2023)), or arbitrary groupings aligned for hardware efficiency. In vector quantized models, groupwise codebook partitioning (with dedicated projectors per group) achieves high utilization and stability (Zheng et al., 15 Oct 2025). In activation quantization for diffusion models, groups are determined adaptively based on value distributions along channel or pixel axes to capture outlier behavior (Ryu et al., 8 Jan 2025).

2. Methodologies Across Model Families

CNNs and Vision Models

In convolutional networks, groupwise quantization often partitions the filters of a conv layer (dimension $C_\mathrm{out}$) into $L$ groups of size $g_s$:

  • For each group $G_l$, compute the mean absolute weight and set a symmetric clipping threshold $T_l^w = k_w \cdot \mathrm{mean}(|G_l|)$, with $k_w \approx 2$.
  • Clip weights: $\bar{w} = \mathrm{clip}(w, -T_l^w, T_l^w)$.
  • Compute the per-group scale: $s_l = T_l^w / (2^{n_w - 1} - 1)$.
  • Quantize: $Q(\bar{w}; T_l^w) = \mathrm{round}(\bar{w} / s_l) \cdot s_l$.

This process is repeated for activations, but with clipping bounds dynamically updated via gradient descent (Yu et al., 2019).
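
A minimal sketch of this clip-then-quantize scheme for one convolutional weight tensor is given below. The group size, the constant $k_w$, and the function name are illustrative assumptions, not the exact implementation of (Yu et al., 2019).

```python
import numpy as np

def scale_clip_quantize(weight: np.ndarray, group_size: int = 16,
                        n_w: int = 4, k_w: float = 2.0) -> np.ndarray:
    """Groupwise clip-and-quantize of conv weights, grouped along output filters.

    weight: array of shape (C_out, C_in, kH, kW). Illustrative sketch only.
    """
    c_out = weight.shape[0]
    flat = weight.reshape(c_out, -1)
    q_levels = 2 ** (n_w - 1) - 1
    out = np.empty_like(flat)

    for start in range(0, c_out, group_size):
        group = flat[start:start + group_size]
        t = k_w * np.abs(group).mean()           # symmetric clipping threshold T_l^w
        clipped = np.clip(group, -t, t)
        scale = t / q_levels if t > 0 else 1.0   # per-group scale s_l
        out[start:start + group_size] = np.round(clipped / scale) * scale

    return out.reshape(weight.shape)

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_q = scale_clip_quantize(w, group_size=16, n_w=4)
```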

Winograd Convolution

For acceleration, Winograd convolutions transform weights and activations into the Winograd domain. Groupwise quantization is applied to both the weights and the transformation matrices. Scales for the Winograd transformation (e.g., $S_G$, $S_B$) are finetuned in a data-free fashion using random input tiles, optimizing the mean squared error between float and quantized outputs. All quantization in the pipeline is per group (Pan et al., 27 Dec 2024).
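
The data-free scale finetuning step can be pictured with the generic sketch below: a base scale is multiplied by candidate factors, and the factor minimizing output MSE on random inputs is kept. The candidate grid, tile shape, and plain matrix product (standing in for the Winograd-domain computation) are assumptions for illustration, not the procedure of (Pan et al., 27 Dec 2024).

```python
import numpy as np

def fake_quant(x: np.ndarray, scale: float, bits: int = 8) -> np.ndarray:
    """Quantize-dequantize with a single symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def finetune_scale(weight: np.ndarray, num_tiles: int = 256, bits: int = 8,
                   candidates=np.linspace(0.5, 1.5, 41)) -> float:
    """Pick the scale multiplier minimizing output MSE on random calibration tiles."""
    base = np.abs(weight).max() / (2 ** (bits - 1) - 1)
    tiles = np.random.randn(num_tiles, weight.shape[1])      # random "data-free" inputs
    ref = tiles @ weight.T                                    # float reference outputs
    best_scale, best_err = base, np.inf
    for c in candidates:
        err = np.mean((tiles @ fake_quant(weight, base * c, bits).T - ref) ** 2)
        if err < best_err:
            best_scale, best_err = base * c, err
    return best_scale
```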

Vector Quantized VAEs

Groupwise quantization for VQ-VAEs (termed Group-VQ) splits the codebook $C$ into $k$ non-overlapping groups $G_j$, each parameterized as $G_j = \hat{G}_j W_j + b_j$ with groupwise projectors $W_j$ and a fixed base $\hat{G}_j \sim P$ (Zheng et al., 15 Oct 2025) (see the sketch after this list):

  • Codes within each group are updated jointly, with no dependencies across groups.
  • Enables independent post-training codebook resampling and extension.
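
A toy construction of such a grouped codebook in PyTorch is sketched below; the group count, code dimension, prior, and initialization are illustrative assumptions, not the Group-VQ training recipe.

```python
import torch
import torch.nn as nn

class GroupedCodebook(nn.Module):
    """Illustrative grouped codebook: k groups, each a frozen base projected by its own W_j, b_j."""

    def __init__(self, num_codes: int = 1024, dim: int = 64, num_groups: int = 16):
        super().__init__()
        assert num_codes % num_groups == 0
        per_group = num_codes // num_groups
        # Fixed (non-trainable) base codes sampled from a simple prior P.
        self.register_buffer("base", torch.randn(num_groups, per_group, dim))
        # One trainable linear projector (W_j, b_j) per group.
        self.W = nn.Parameter(torch.stack([torch.eye(dim) for _ in range(num_groups)]))
        self.b = nn.Parameter(torch.zeros(num_groups, 1, dim))

    def codebook(self) -> torch.Tensor:
        # G_j = base_j @ W_j + b_j, then concatenate all groups into one codebook.
        return (torch.bmm(self.base, self.W) + self.b).reshape(-1, self.base.shape[-1])

cb = GroupedCodebook()
codes = cb.codebook()   # shape (1024, 64)
```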

Transformers and LLMs

In LLMs, fine-grained groupwise quantization quantizes weights in small groups (e.g., 32-row slices) to INT4. To retain inference efficiency, a "dual grained" procedure re-packages groupwise INT4 weights into a coarse-grained INT8 tensor, enabling a single INT8 GEMM, while preserving the original groupwise scaling benefit (Zhang et al., 2023).
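
One way to picture the re-packaging step is sketched below: each fine-grained group scale is expressed (approximately) as a coarse per-column scale times a small integer multiplier, so the INT4 values times the multiplier still fit in INT8. The rounding of the scale ratio, the choice of the column minimum as the coarse scale, and all names are illustrative simplifications, not the DGQ kernel of (Zhang et al., 2023).

```python
import numpy as np

def repack_int4_groups_to_int8(q4: np.ndarray, group_scales: np.ndarray,
                               group_size: int = 32):
    """Fold per-group INT4 scales into one per-column scale so a single INT8 GEMM suffices.

    q4:           INT4 weights, shape (rows, cols), values in [-8, 7].
    group_scales: one float scale per (group, col), shape (rows // group_size, cols).
    Sketch only: each group scale is approximated as col_scale * integer multiplier,
    so q4 * multiplier (at most 8 * 8 = 64 in magnitude) still fits in INT8.
    """
    col_scale = group_scales.min(axis=0)                                       # coarse per-column scale
    mult = np.clip(np.round(group_scales / col_scale), 1, 8).astype(np.int8)   # small integer ratio
    q8 = np.empty_like(q4, dtype=np.int8)
    for g in range(group_scales.shape[0]):
        rows = slice(g * group_size, (g + 1) * group_size)
        q8[rows] = (q4[rows].astype(np.int16) * mult[g]).astype(np.int8)
    return q8, col_scale   # dequantize downstream as q8 * col_scale
```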

Diffusion Models

Distribution-aware groupwise quantization schemes adaptively select the grouping direction (channels or pixels), whichever axis concentrates outlier activations, and apply cluster-based grouping (e.g., $K$-means) on extremal (max, min) statistics. Each group then receives its own quantizer. Cross-attention scores are quantized with prompt-specific logarithmic quantization, maintaining high text-image alignment (Ryu et al., 8 Jan 2025).
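
A hedged sketch of the grouping step, clustering channels by their (max, min) calibration statistics with $K$-means and deriving per-group asymmetric 8-bit parameters, is shown below; the cluster count, bitwidth, and helper names are assumptions, not the exact procedure of (Ryu et al., 8 Jan 2025).

```python
import numpy as np
from sklearn.cluster import KMeans

def group_channels_by_extrema(acts: np.ndarray, num_groups: int = 16):
    """Cluster channels into groups by their (max, min) activation statistics.

    acts: calibration activations of shape (samples, channels). Illustrative sketch.
    """
    stats = np.stack([acts.max(axis=0), acts.min(axis=0)], axis=1)   # (channels, 2)
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(stats)

    # Per-group asymmetric quantization parameters (8-bit example).
    params = {}
    for g in range(num_groups):
        lo, hi = acts[:, labels == g].min(), acts[:, labels == g].max()
        scale = (hi - lo) / 255.0 if hi > lo else 1.0
        params[g] = (scale, lo)                                       # (scale, zero offset)
    return labels, params
```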

3. Groupwise Quantization Algorithms and Implementation Techniques

Key steps in a standard groupwise quantization workflow across models include:

  1. Grouping Strategy: Partition values according to structural or statistical criteria.
  2. Scale (and Zero-Point) Derivation:
    • For each group, compute dynamic range and set scale and zero-point, using symmetric or asymmetric quantization as needed. Scale can be min–max, percentile-based, or derived from statistical metrics.
  3. Clipping (Optional):
    • Groupwise clipping bounds manage distribution shape (e.g., "Scale-Clip" for making weight distributions uniform-like) (Yu et al., 2019).
    • Activation percentile clipping for outlier smoothing in LLMs (Zhang et al., 2023).
  4. Quantization/Dequantization:
    • Each group is quantized independently to its assigned bitwidth, then dequantized as required for further computation.
  5. Parameter Fusion / Inference Optimization:
    • Groupwise quantization parameters may be merged with downstream affine transforms (e.g., fusing the groupwise scale into Batch Normalization or a per-channel bias at inference time; see the sketch after this list) (Yu et al., 2019).
    • "Re-packaging" (in LLMs) converts groupwise INT4 into INT8 GEMM-compatible tensors, maintaining only a global scale for post-GEMM correction (Zhang et al., 2023).
    • In diffusion models, activation and cross-attention quantization is prompt/timestep specific, but all quantized representations are cached for efficient inference (Ryu et al., 8 Jan 2025).
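
As an example of the fusion in step 5, the sketch below folds a per-channel dequantization scale (expanded from the groupwise scales when groups align with output channels) into BatchNorm's affine parameters; it is an illustrative derivation, not a specific framework's fusion pass.

```python
import numpy as np

def fuse_group_scale_into_bn(scale_per_channel, bn_gamma, bn_beta, bn_mean, bn_var, eps=1e-5):
    """Fold a per-channel dequantization scale into BatchNorm's affine parameters.

    Assumes each weight group maps to a set of output channels, so the groupwise
    scale can be expanded to one value per channel before fusion. Sketch only.
    """
    s = scale_per_channel
    inv_std = 1.0 / np.sqrt(bn_var + eps)
    # BN(s * y) = gamma * (s*y - mean) * inv_std + beta
    #           = (gamma * inv_std * s) * y + (beta - gamma * inv_std * mean)
    fused_weight = bn_gamma * inv_std * s
    fused_bias = bn_beta - bn_gamma * inv_std * bn_mean
    return fused_weight, fused_bias
```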

The following table summarizes representative grouping strategies across application domains:

| Domain | Grouping Basis | Quantization Parameters | Scale Fusion/Optimization |
| --- | --- | --- | --- |
| CNNs (vision) | Output filters (conv) | $s_l$, $T_l^w$ per group | Fused into BN per channel |
| Winograd convs | Channel/flattened slices | $s_g$ per group, $S_G$ | Data-free scale finetuning |
| VQ-VAEs | Codebook partitions | Projectors $W_j$, $b_j$ | N/A (handled by VQ commitment loss) |
| LLMs | Input-channel slices | $s_g$ per group, $s^{(1)}$ per channel | INT4 $\to$ INT8 format conversion |
| Diffusion models | Pixel or channel clusters | $s_k^{(t)}$, $z_k^{(t)}$ | Outlier groups capture extreme values |

4. Tradeoffs and Implications for Model Deployment

Groupwise quantization increases quantizer flexibility, substantially reducing quantization error compared to per-layer or per-channel approaches, particularly at bitwidths $\leq 4$. Fine groupwise partitioning yields improved task accuracy (e.g., 2-bit quantized ResNet-18 achieves 71.3% on CIFAR-100, versus 64.9% for channel-wise quantization) (Yu et al., 2019), near-lossless FID and CLIP scores for diffusion models with groupwise+Winograd quantization (Pan et al., 27 Dec 2024), and perplexity within 0.2-0.3 of FP16 LLMs at INT4/INT8 (Zhang et al., 2023).

However, finer groupings increase the memory/computation required to store and manage multiple sets of scale and zero-point values, and can hinder inference speed if not mitigated by fusion or re-packaging. Practical systems therefore:

  • Restrict group size to a multiple of hardware vector width for efficient integer GEMM (e.g., 32 or 64 elements).
  • Employ fusion at compile time (e.g., merging groupwise scale into BN or post-GEMM dequantization), incurring at most a small one-time arithmetic cost.
  • In VQ codebooks, moderate group sizes (e.g., $n_j \approx 32\text{--}64$) are empirically optimal for code usage and expressivity, while excessive groups yield diminishing returns (Zheng et al., 15 Oct 2025).
  • In diffusion models, the chosen number of groups ($K = 16$) and timesteps ($T = 25$) leads to negligible memory overhead ($\approx 2.3$ MB), while yielding perceptually superior outputs even at very low bitwidths (Ryu et al., 8 Jan 2025).

5. Representative Performance and Empirical Evidence

The following table summarizes selected experimental findings on the impact of groupwise quantization:

| Model/Task | Quantization | Accuracy/FID/CLIP | Relative Drop |
| --- | --- | --- | --- |
| VGG-16-BN (ImageNet) (Yu et al., 2019) | [4,4]-bit GQ | 72.5% | -0.1 pt |
| ResNet-50 (ImageNet) (Yu et al., 2019) | [2,4]-bit GQ | 73.9% | -0.9 pt |
| InstaFlow-0.9B (COCO-5k) (Pan et al., 27 Dec 2024) | FP16 | FID=23.00, CLIP=30.19 | -- |
| InstaFlow-0.9B (COCO-5k) (Pan et al., 27 Dec 2024) | W8A8 GQ-Wino | FID=27.05, CLIP=29.58 | +4.05 / +0.61 |
| LLaMA-7B (WikiText-2) (Zhang et al., 2023) | A8W4 (DGQ) | PPL=5.85 | +0.17 |
| VQGAN (ImageNet-1k) (Zheng et al., 15 Oct 2025) | Group-VQ ($n = 65\,536$, $k = 64$) | rFID=1.86 | lower than SimVQ/VQGAN-LC |
| Stable Diffusion (MS-COCO) (Ryu et al., 8 Jan 2025) | 8W/8A TFMQ | FID=18.85, CLIP=0.286 | -- |
| Stable Diffusion (MS-COCO) (Ryu et al., 8 Jan 2025) | 8W/8A DGQ | FID=13.15, CLIP=0.297 | improved |

As these results indicate, groupwise quantization enables lossless or near-lossless quantization at aggressive bitwidths across architectures and tasks, provided that the grouping and fusion strategies are properly tuned.

6. Open Questions and Frontiers

Active research continues regarding the optimal degree of group granularity, adaptation strategies, and extension beyond basic uniform or linear quantizers:

  • For VQ-VAEs, open issues remain regarding the impact of grouped codebooks on downstream generative modeling, and the tradeoff curve among codebook expressivity, group size, and utilization (Zheng et al., 15 Oct 2025).
  • In diffusion models, further gains may be possible via combining groupwise quantization with advanced outlier handling or dynamic grouping policies (Ryu et al., 8 Jan 2025).
  • Hardware and inference overhead mitigation techniques—such as conversion between INT4 and INT8 formats for efficient computation while retaining groupwise accuracy—have enabled practical deployment for LLMs (Zhang et al., 2023).
  • Additional open directions include exploring nonlinear projectors or hierarchical groupings (in VQ), and data-driven group composition.

Groupwise quantization thus provides a foundational mechanism for low-bit, high-accuracy neural network quantization, balancing computational tractability and empirical performance across vision, language, and generative modeling domains.
