Groupwise Quantization in Neural Networks

Updated 13 November 2025
  • Groupwise quantization is a method that divides neural network weights, activations, or latent representations into groups, each with its own quantization parameters, to minimize quantization error.
  • It is applied across CNNs, VQ-VAEs, transformers, and diffusion models, employing per-group scaling and tailored clipping techniques to capture local value distributions.
  • Empirical results show that groupwise quantization reduces error and boosts accuracy in low-bit models, while enabling efficient hardware implementation through optimal grouping strategies.

Groupwise quantization refers to quantization schemes in which neural network weights, activations, or latent representations are partitioned into groups, with each group assigned its own quantization parameters (such as scale and zero-point, or quantizer function). This approach contrasts with per-layer (coarse-grained) or per-channel quantization, offering a finer tradeoff between quantization error, computational overhead, and hardware implementation efficiency. Groupwise quantization has been applied to supervised learning (e.g., CNNs), generative models (e.g., VQ-VAEs), and transformers/LLMs, as well as specialized operations such as Winograd convolutions and text-to-image diffusion models. It underpins high-performance low-bit model deployment, is essential for pushing bitwidths below 8 bits, and makes extremely efficient inference practical.

1. Formal Definition and Core Principles

Groupwise quantization divides a set of values (weights, activations, or code vectors) into $G$ disjoint groups. For each group, a separate quantization function or set of parameters (scale/zero-point or more advanced quantizers) is derived; a code sketch follows the list:

  • Let $x \in \mathbb{R}^N$ be quantized to $b$-bit integers in $[q_{\min}, q_{\max}]$.
  • Partition: $\{G_g\}_{g=1}^{G}$ with $|G_g| \approx N/G$.
  • For group $g$, compute $s_g$ (scale) and (optionally) $z_g$ (zero-point).
  • Quantize: $q_{g,i} = \mathrm{clip}(\mathrm{round}((x_{g,i} - z_g)/s_g),\, q_{\min}, q_{\max})$ for $i \in G_g$.
  • Dequantize: $\hat{x}_{g,i} = s_g q_{g,i} + z_g$.
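
The following minimal sketch (NumPy, symmetric quantization of a padded 1-D tensor) shows these steps end to end. The group size, bitwidth, and helper name are illustrative assumptions rather than code from any cited paper.

```python
import numpy as np

def groupwise_quantize(x: np.ndarray, group_size: int = 64, bits: int = 4):
    """Symmetric groupwise quantization of a 1-D tensor (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    n = x.size
    pad = (-n) % group_size                 # pad so the tensor splits evenly into groups
    xp = np.concatenate([x, np.zeros(pad)]).reshape(-1, group_size)

    # One scale per group, chosen so the group's max magnitude maps to qmax.
    scales = np.abs(xp).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0               # avoid division by zero for all-zero groups

    q = np.clip(np.round(xp / scales), -qmax - 1, qmax)    # quantize
    x_hat = (q * scales).reshape(-1)[:n]                    # dequantize
    return q, scales, x_hat

# Example: quantize random weights and measure the reconstruction error.
w = np.random.randn(1000).astype(np.float32)
_, _, w_hat = groupwise_quantize(w, group_size=64, bits=4)
print("mean abs error:", np.abs(w - w_hat).mean())
```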

Groupwise quantization generalizes per-tensor (single scale), per-channel (one scale per channel), and other partitionings. The principal goal is to reduce the quantization error by matching scaling to value distribution variability, especially where global (coarse) quantization would be dominated by outlier values in a few groups.

In weight quantization, groups are typically contiguous rows (e.g., grouping filters in a conv layer (Yu et al., 2019), slices in LLM weight matrices (Zhang et al., 2023)), or arbitrary groupings aligned for hardware efficiency. In vector quantized models, groupwise codebook partitioning (with dedicated projectors per group) achieves high utilization and stability (Zheng et al., 15 Oct 2025). In activation quantization for diffusion models, groups are determined adaptively based on value distributions along channel or pixel axes to capture outlier behavior (Ryu et al., 8 Jan 2025).

2. Methodologies Across Model Families

CNNs and Vision Models

In convolutional networks, groupwise quantization often partitions the filters of a conv layer (dimension $C_\mathrm{out}$) into $L$ groups of size $g_s$:

  • For each group $G_l$, compute the mean absolute weight and set a symmetric clipping threshold $T_l^w = k_w \cdot \mathrm{mean}(|G_l|)$, with $k_w \approx 2$.
  • Clip weights: $\bar{w} = \mathrm{clip}(w, -T_l^w, T_l^w)$.
  • Compute the per-group scale: $s_l = T_l^w / (2^{n_w - 1} - 1)$.
  • Quantize: $Q(\bar{w}; T_l^w) = \mathrm{round}(\bar{w} / s_l) \cdot s_l$.

This process is repeated for activations, but with clipping bounds dynamically updated via gradient descent (Yu et al., 2019).
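
A minimal sketch of this clip-then-quantize scheme for one convolutional weight tensor is given below. The group size, the constant $k_w$, and the function name are illustrative assumptions, not the exact implementation of (Yu et al., 2019).

```python
import numpy as np

def scale_clip_quantize(weight: np.ndarray, group_size: int = 16,
                        n_w: int = 4, k_w: float = 2.0) -> np.ndarray:
    """Groupwise clip-and-quantize of conv weights, grouped along output filters.

    weight: array of shape (C_out, C_in, kH, kW). Illustrative sketch only.
    """
    c_out = weight.shape[0]
    flat = weight.reshape(c_out, -1)
    q_levels = 2 ** (n_w - 1) - 1
    out = np.empty_like(flat)

    for start in range(0, c_out, group_size):
        group = flat[start:start + group_size]
        t = k_w * np.abs(group).mean()           # symmetric clipping threshold T_l^w
        clipped = np.clip(group, -t, t)
        scale = t / q_levels if t > 0 else 1.0   # per-group scale s_l
        out[start:start + group_size] = np.round(clipped / scale) * scale

    return out.reshape(weight.shape)

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_q = scale_clip_quantize(w, group_size=16, n_w=4)
```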

Winograd Convolution

For acceleration, Winograd convolutions transform weights and activations into the Winograd domain. Groupwise quantization is applied to both the weights and the transformation matrices. Scales for the Winograd transformation (e.g., $S_G$, $S_B$) are finetuned in a data-free fashion using random input tiles, optimizing the mean squared error between float and quantized outputs. All quantization in the pipeline is per group (Pan et al., 27 Dec 2024).
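
The data-free scale finetuning step can be pictured with the generic sketch below: a base scale is multiplied by candidate factors, and the factor minimizing output MSE on random inputs is kept. The candidate grid, tile shape, and plain matrix product (standing in for the Winograd-domain computation) are assumptions for illustration, not the procedure of (Pan et al., 27 Dec 2024).

```python
import numpy as np

def fake_quant(x: np.ndarray, scale: float, bits: int = 8) -> np.ndarray:
    """Quantize-dequantize with a single symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def finetune_scale(weight: np.ndarray, num_tiles: int = 256, bits: int = 8,
                   candidates=np.linspace(0.5, 1.5, 41)) -> float:
    """Pick the scale multiplier minimizing output MSE on random calibration tiles."""
    base = np.abs(weight).max() / (2 ** (bits - 1) - 1)
    tiles = np.random.randn(num_tiles, weight.shape[1])      # random "data-free" inputs
    ref = tiles @ weight.T                                    # float reference outputs
    best_scale, best_err = base, np.inf
    for c in candidates:
        err = np.mean((tiles @ fake_quant(weight, base * c, bits).T - ref) ** 2)
        if err < best_err:
            best_scale, best_err = base * c, err
    return best_scale
```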

Vector Quantized VAEs

Groupwise quantization for VQ-VAEs (termed Group-VQ) splits the codebook $C$ into $k$ non-overlapping groups $G_j$, each parameterized as $G_j = \hat{G}_j W_j + b_j$ with groupwise projectors $W_j$ and a fixed base $\hat{G}_j \sim P$ (Zheng et al., 15 Oct 2025) (see the sketch after this list):

  • Codes within each group are updated jointly, with no dependencies across groups.
  • Enables independent post-training codebook resampling and extension.
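
A toy construction of such a grouped codebook in PyTorch is sketched below; the group count, code dimension, prior, and initialization are illustrative assumptions, not the Group-VQ training recipe.

```python
import torch
import torch.nn as nn

class GroupedCodebook(nn.Module):
    """Illustrative grouped codebook: k groups, each a frozen base projected by its own W_j, b_j."""

    def __init__(self, num_codes: int = 1024, dim: int = 64, num_groups: int = 16):
        super().__init__()
        assert num_codes % num_groups == 0
        per_group = num_codes // num_groups
        # Fixed (non-trainable) base codes sampled from a simple prior P.
        self.register_buffer("base", torch.randn(num_groups, per_group, dim))
        # One trainable linear projector (W_j, b_j) per group.
        self.W = nn.Parameter(torch.stack([torch.eye(dim) for _ in range(num_groups)]))
        self.b = nn.Parameter(torch.zeros(num_groups, 1, dim))

    def codebook(self) -> torch.Tensor:
        # G_j = base_j @ W_j + b_j, then concatenate all groups into one codebook.
        return (torch.bmm(self.base, self.W) + self.b).reshape(-1, self.base.shape[-1])

cb = GroupedCodebook()
codes = cb.codebook()   # shape (1024, 64)
```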

Transformers and LLMs

In LLMs, fine-grained groupwise quantization quantizes weights in small groups (e.g., 32-row slices) to INT4. To retain inference efficiency, a "dual grained" procedure re-packages groupwise INT4 weights into a coarse-grained INT8 tensor, enabling a single INT8 GEMM, while preserving the original groupwise scaling benefit (Zhang et al., 2023).
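
One way to picture the re-packaging step is sketched below: each fine-grained group scale is expressed (approximately) as a coarse per-column scale times a small integer multiplier, so the INT4 values times the multiplier still fit in INT8. The rounding of the scale ratio, the choice of the column minimum as the coarse scale, and all names are illustrative simplifications, not the DGQ kernel of (Zhang et al., 2023).

```python
import numpy as np

def repack_int4_groups_to_int8(q4: np.ndarray, group_scales: np.ndarray,
                               group_size: int = 32):
    """Fold per-group INT4 scales into one per-column scale so a single INT8 GEMM suffices.

    q4:           INT4 weights, shape (rows, cols), values in [-8, 7].
    group_scales: one float scale per (group, col), shape (rows // group_size, cols).
    Sketch only: each group scale is approximated as col_scale * integer multiplier,
    so q4 * multiplier (at most 8 * 8 = 64 in magnitude) still fits in INT8.
    """
    col_scale = group_scales.min(axis=0)                                       # coarse per-column scale
    mult = np.clip(np.round(group_scales / col_scale), 1, 8).astype(np.int8)   # small integer ratio
    q8 = np.empty_like(q4, dtype=np.int8)
    for g in range(group_scales.shape[0]):
        rows = slice(g * group_size, (g + 1) * group_size)
        q8[rows] = (q4[rows].astype(np.int16) * mult[g]).astype(np.int8)
    return q8, col_scale   # dequantize downstream as q8 * col_scale
```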

Diffusion Models

Distribution-aware groupwise quantization schemes adaptively select the grouping direction (channels or pixels), whichever axis concentrates outlier activations, and apply cluster-based grouping (e.g., $K$-means) on extremal (max, min) statistics. Each group then receives its own quantizer. Cross-attention scores are quantized with prompt-specific logarithmic quantization, maintaining high text-image alignment (Ryu et al., 8 Jan 2025).
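
A hedged sketch of the grouping step, clustering channels by their (max, min) calibration statistics with $K$-means and deriving per-group asymmetric 8-bit parameters, is shown below; the cluster count, bitwidth, and helper names are assumptions, not the exact procedure of (Ryu et al., 8 Jan 2025).

```python
import numpy as np
from sklearn.cluster import KMeans

def group_channels_by_extrema(acts: np.ndarray, num_groups: int = 16):
    """Cluster channels into groups by their (max, min) activation statistics.

    acts: calibration activations of shape (samples, channels). Illustrative sketch.
    """
    stats = np.stack([acts.max(axis=0), acts.min(axis=0)], axis=1)   # (channels, 2)
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(stats)

    # Per-group asymmetric quantization parameters (8-bit example).
    params = {}
    for g in range(num_groups):
        lo, hi = acts[:, labels == g].min(), acts[:, labels == g].max()
        scale = (hi - lo) / 255.0 if hi > lo else 1.0
        params[g] = (scale, lo)                                       # (scale, zero offset)
    return labels, params
```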

3. Groupwise Quantization Algorithms and Implementation Techniques

Key steps in a standard groupwise quantization workflow across models include:

  1. Grouping Strategy: Partition values according to structural or statistical criteria.
  2. Scale (and Zero-Point) Derivation:
    • For each group, compute dynamic range and set scale and zero-point, using symmetric or asymmetric quantization as needed. Scale can be min–max, percentile-based, or derived from statistical metrics.
  3. Clipping (Optional):
    • Groupwise clipping bounds manage distribution shape (e.g., "Scale-Clip" for making weight distributions uniform-like) (Yu et al., 2019).
    • Activation percentile clipping for outlier smoothing in LLMs (Zhang et al., 2023).
  4. Quantization/Dequantization:
    • Each group is quantized independently to its assigned bitwidth, then dequantized as required for further computation.
  5. Parameter Fusion / Inference Optimization:
    • Groupwise quantization parameters may be merged with downstream affine transforms (e.g., fusing the groupwise scale into Batch Normalization or a per-channel bias at inference time; see the sketch after this list) (Yu et al., 2019).
    • "Re-packaging" (in LLMs) converts groupwise INT4 into INT8 GEMM-compatible tensors, maintaining only a global scale for post-GEMM correction (Zhang et al., 2023).
    • In diffusion models, activation and cross-attention quantization is prompt/timestep specific, but all quantized representations are cached for efficient inference (Ryu et al., 8 Jan 2025).
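
As an example of the fusion in step 5, the sketch below folds a per-channel dequantization scale (expanded from the groupwise scales when groups align with output channels) into BatchNorm's affine parameters; it is an illustrative derivation, not a specific framework's fusion pass.

```python
import numpy as np

def fuse_group_scale_into_bn(scale_per_channel, bn_gamma, bn_beta, bn_mean, bn_var, eps=1e-5):
    """Fold a per-channel dequantization scale into BatchNorm's affine parameters.

    Assumes each weight group maps to a set of output channels, so the groupwise
    scale can be expanded to one value per channel before fusion. Sketch only.
    """
    s = scale_per_channel
    inv_std = 1.0 / np.sqrt(bn_var + eps)
    # BN(s * y) = gamma * (s*y - mean) * inv_std + beta
    #           = (gamma * inv_std * s) * y + (beta - gamma * inv_std * mean)
    fused_weight = bn_gamma * inv_std * s
    fused_bias = bn_beta - bn_gamma * inv_std * bn_mean
    return fused_weight, fused_bias
```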

The following table summarizes representative grouping strategies across application domains:

| Domain | Grouping Basis | Quantization Parameters | Scale Fusion/Optimization |
| --- | --- | --- | --- |
| CNNs (vision) | Output filters (conv) | $s_l$, $T_l^w$ per group | Fused into BN per channel |
| Winograd convs | Channel/flattened slices | $s_g$ per group, $S_G$ | Data-free scale finetuning |
| VQ-VAEs | Codebook partitions | Projectors $W_j$, $b_j$ | N/A (handled by VQ commitment loss) |
| LLMs | Input-channel slices | $s_g$ per group, $s^{(1)}$ per channel | INT4 $\to$ INT8 format conversion |
| Diffusion models | Pixel or channel clusters | $s_k^{(t)}$, $z_k^{(t)}$ | Outlier groups capture extreme values |

4. Tradeoffs and Implications for Model Deployment

Groupwise quantization increases quantizer flexibility, substantially reducing quantization error compared to per-layer or per-channel approaches, particularly at bitwidths $\leq 4$. Fine groupwise partitioning yields improved task accuracy (e.g., 2-bit quantized ResNet-18 achieves 71.3% on CIFAR-100, versus 64.9% for channel-wise quantization) (Yu et al., 2019), near-lossless FID and CLIP scores for diffusion models with groupwise+Winograd quantization (Pan et al., 27 Dec 2024), and perplexity within 0.2-0.3 of FP16 LLMs at INT4/INT8 (Zhang et al., 2023).

However, finer groupings increase the memory/computation required to store and manage multiple sets of scale and zero-point values, and can hinder inference speed if not mitigated by fusion or re-packaging. Practical systems therefore:

  • Restrict group size to a multiple of hardware vector width for efficient integer GEMM (e.g., 32 or 64 elements).
  • Employ fusion at compile time (e.g., merging groupwise scale into BN or post-GEMM dequantization), incurring at most a small one-time arithmetic cost.
  • In VQ codebooks, moderate group sizes (e.g., $n_j \approx 32\text{--}64$) are empirically optimal for code usage and expressivity, while excessive groups yield diminishing returns (Zheng et al., 15 Oct 2025).
  • In diffusion models, the chosen number of groups ($K = 16$) and timesteps ($T = 25$) leads to negligible memory overhead ($\approx 2.3$ MB), while yielding perceptually superior outputs even at very low bitwidths (Ryu et al., 8 Jan 2025).

5. Representative Performance and Empirical Evidence

The following table summarizes selected experimental findings on the impact of groupwise quantization:

| Model/Task | Quantization | Accuracy/FID/CLIP | Relative Drop |
| --- | --- | --- | --- |
| VGG-16-BN (ImageNet) (Yu et al., 2019) | [4,4]-bit GQ | 72.5% | -0.1 pt |
| ResNet-50 (ImageNet) (Yu et al., 2019) | [2,4]-bit GQ | 73.9% | -0.9 pt |
| InstaFlow-0.9B (COCO-5k) (Pan et al., 27 Dec 2024) | FP16 | FID=23.00, CLIP=30.19 | -- |
| InstaFlow-0.9B (COCO-5k) (Pan et al., 27 Dec 2024) | W8A8 GQ-Wino | FID=27.05, CLIP=29.58 | +4.05 / +0.61 |
| LLaMA-7B (WikiText-2) (Zhang et al., 2023) | A8W4 (DGQ) | PPL=5.85 | +0.17 |
| VQGAN (ImageNet-1k) (Zheng et al., 15 Oct 2025) | Group-VQ ($n = 65\,536$, $k = 64$) | rFID=1.86 | lower than SimVQ/VQGAN-LC |
| Stable Diffusion (MS-COCO) (Ryu et al., 8 Jan 2025) | 8W/8A TFMQ | FID=18.85, CLIP=0.286 | -- |
| Stable Diffusion (MS-COCO) (Ryu et al., 8 Jan 2025) | 8W/8A DGQ | FID=13.15, CLIP=0.297 | improved |

As these results indicate, groupwise quantization enables lossless or near-lossless quantization at aggressive bitwidths across architectures and tasks, provided that the grouping and fusion strategies are properly tuned.

6. Open Questions and Frontiers

Active research continues regarding the optimal degree of group granularity, adaptation strategies, and extension beyond basic uniform or linear quantizers:

  • For VQ-VAEs, open issues remain regarding the impact of grouped codebooks on downstream generative modeling, and the tradeoff curve among codebook expressivity, group size, and utilization (Zheng et al., 15 Oct 2025).
  • In diffusion models, further gains may be possible via combining groupwise quantization with advanced outlier handling or dynamic grouping policies (Ryu et al., 8 Jan 2025).
  • Hardware and inference overhead mitigation techniques—such as conversion between INT4 and INT8 formats for efficient computation while retaining groupwise accuracy—have enabled practical deployment for LLMs (Zhang et al., 2023).
  • Additional open directions include exploring nonlinear projectors or hierarchical groupings (in VQ), and data-driven group composition.

Groupwise quantization thus provides a foundational mechanism for low-bit, high-accuracy neural network quantization, balancing computational tractability and empirical performance across vision, language, and generative modeling domains.
