Group-wise Weight Quantization

Updated 26 May 2026

Group-wise weight quantization is a method that divides weight tensors into small groups, enabling locally adaptive scaling to minimize quantization error.
It employs techniques such as K-means clustering, permutation-based grouping, and adaptive codebook approaches to optimize low-bit precision in diverse network architectures.
Optimized partition strategies balance accuracy, memory overhead, and hardware efficiency, making them integral for deploying efficient LLMs, vision models, and generative networks.

Group-wise weight quantization is a set of methodologies in low-precision neural network inference that partitions the weight tensor of a neural network into small groups, blocks, or clusters. Each group is quantized using distinct parameters—typically its own scale (and possibly zero-point/codebook)—allowing the quantizer to better match local data statistics and thereby reducing overall quantization error relative to global, layer-wise, or channel-wise quantization schemes. Group-wise quantization is now standard in efficient LLM, vision, and generative network deployment, owing to its favorable trade-off between accuracy, memory, compute efficiency, and hardware practicalities.

1. Mathematical Formulation and Partitioning Strategies

Group-wise quantization divides a weight tensor $W$ into $G$ non-overlapping groups, indexed as $\{ W_g \}_{g=1}^G$. The choice of partitioning depends on the layer topology and desired hardware compatibility:

Spatially contiguous blocks: 1D or 2D sub-blocks in flattened tensors.
Per-output channel: Each output channel (i.e., convolutional kernel or linear layer row) forms a group.
K-means/group clustering: Clustering based on statistics (e.g., per-channel max, variance) or range variability (Ryu et al., 8 Jan 2025, Nie et al., 2022).

For each group, a dedicated quantizer is constructed, typically characterized by a per-group scale $s_g$, offset/zero-point $z_g$, or (for VQ methods) a group codebook.

General framework:

For $b$-bit quantization, the typical affine quantizer first maps each weight to an integer code and then rescales it:

$q = \text{clip}\big(\text{round}\big(\frac{w}{s_g}\big)+z_g,\ a_\text{min},\ a_\text{max}\big), \qquad Q_g(w) = s_g\,(q - z_g)$

so that dequantization gives $w \approx s_g (q - z_g)$, with $q \in \mathbb{Z}$, $a_\text{min}\leq q \leq a_\text{max}$, and all parameters local to group $g$.

This design enables isolating outlier-heavy regions, exploiting local redundancy, and increasing quantization granularity where necessary.
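The affine quantizer above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: groups are contiguous 1D blocks of the flattened tensor, the code range is $[0, 2^b - 1]$, and the function names `quantize_groupwise` and `dequantize_groupwise` are hypothetical.

```python
import numpy as np

def quantize_groupwise(w, group_size=64, bits=4):
    """Affine quantization with one scale and one zero-point per group.

    Assumes w.size is divisible by group_size (contiguous 1D groups).
    """
    a_min, a_max = 0, 2**bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / (a_max - a_min)   # s_g
    zero = np.round(-lo / scale)                           # z_g (integer-valued)
    q = np.clip(np.round(groups / scale) + zero, a_min, a_max)
    return q.astype(np.int32), scale, zero

def dequantize_groupwise(q, scale, zero, shape):
    """Reconstruct weights as s_g * (q - z_g)."""
    return (scale * (q - zero)).reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 128))
q, s, z = quantize_groupwise(W, group_size=64, bits=4)
W_hat = dequantize_groupwise(q, s, z, W.shape)
max_err = np.abs(W - W_hat).max()   # bounded by half of the largest group step
```

Because each group derives its scale from its own minimum and maximum, the reconstruction error of any weight is bounded by half of that group's step size rather than by a step size set by the whole tensor.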

2. Group-wise Quantization Algorithms and Variants

2.1 Uniform and Asymmetric Group-wise Quantization

The baseline is uniform, symmetric (or asymmetric) quantization applied per group, with scales and possible zero-points adapted to the local statistics. For group $g$ with weights $W_g$:

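Uniform, symmetric: $s_g = \max_{w \in W_g} |w| / (2^{b-1}-1)$; $q = \text{clip}\big(\text{round}(w / s_g),\ -(2^{b-1}-1),\ 2^{b-1}-1\big)$ (Zhang et al., 2023, Nie et al., 2022, Ryu et al., 8 Jan 2025).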
Dynamic axis selection: In DGQ for diffusion models, the maximal range axis is automatically detected to capture outlier structure, and groups are formed by K-means on local range tuples, with quantization scales set accordingly (Ryu et al., 8 Jan 2025).
Permutation-based grouping: In Permutation-COMQ, rows are permuted to cluster similar-magnitude entries per column before independent quantization, minimizing within-group scale range and maximizing dynamic resolution (Chen et al., 9 Apr 2026).
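To illustrate why grouping similar-magnitude entries helps, the sketch below compares contiguous grouping with grouping after a magnitude sort. It is a generic toy illustration, not the Permutation-COMQ algorithm: the bimodal synthetic weights, the max-abs symmetric scale, and the helper name `sym_quant_groups` are all assumptions made for the example.

```python
import numpy as np

def sym_quant_groups(v, group_size, bits):
    """Symmetric per-group quantization of a 1D vector (quantize then dequantize)."""
    q_max = 2 ** (bits - 1) - 1
    g = v.reshape(-1, group_size)
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-12) / q_max
    q = np.clip(np.round(g / scale), -q_max, q_max)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
n = 4096
# Toy weights: roughly half the entries are 100x smaller than the rest,
# randomly interleaved so every contiguous group mixes both magnitudes.
w = rng.uniform(-1.0, 1.0, size=n)
w[rng.random(n) < 0.5] *= 0.01

# Baseline: contiguous groups in the original order.
mse_plain = np.mean((w - sym_quant_groups(w, 64, 4)) ** 2)

# Permuted: sort by magnitude so each group holds similar-sized entries.
perm = np.argsort(np.abs(w))
w_hat = np.empty_like(w)
w_hat[perm] = sym_quant_groups(w[perm], 64, 4)   # scatter back to original order
mse_perm = np.mean((w - w_hat) ** 2)
```

In this toy setting the permuted grouping gives a lower mean squared error, because small entries no longer share a scale with large ones. The permutation itself must be stored or absorbed into adjacent operations, which is part of the overhead trade-off.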

2.2 Non-uniform, Adaptive, and Codebook Approaches

Mathematically adaptive numeric types: MANT parameterizes the group-wise quantization grid by a per-group parameter $a$, allowing the quantizer to smoothly interpolate between log-uniform, power-of-two, and normal-float behavior (Hu et al., 26 Feb 2025). This parameter is learned or selected per group by minimizing the induced output error.
Block/codebook clustering: BCQ (LO-BCQ) partitions blocks of weights, clusters blocks into a small set of codebooks, and quantizes each block with its corresponding codebook, yielding near-optimal MSE (Elangovan et al., 7 Feb 2025). This is iteratively optimized via block–cluster re-assignment and Lloyd–Max codebook updates.
Gumbel-Softmax relaxation: GSQ directly learns discrete grid assignments for each weight (within a group having a shared scale) via differentiable Gumbel-Softmax relaxation, jointly optimizing per-group scales and per-coordinate grid points (Dadgarnia et al., 20 Apr 2026).
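The Lloyd–Max codebook update mentioned above can be sketched for the simplest case: a single 16-entry scalar codebook shared by all groups, fitted to weights normalized by their per-group max-abs scale. This is a simplified stand-in, not LO-BCQ itself (which uses several codebooks and block-to-codebook re-assignment); the function names are hypothetical.

```python
import numpy as np

def lloyd_max(x, init_levels, iters=20):
    """1D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    levels = init_levels.copy()
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            members = x[idx == k]
            if members.size:              # keep the old level if the cell is empty
                levels[k] = members.mean()
    return levels

def codebook_mse(x, levels):
    """Mean squared error when each value snaps to its nearest level."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return np.mean((x - levels[idx]) ** 2)

rng = np.random.default_rng(0)
groups = rng.normal(size=(64, 64)).reshape(-1, 64)
scale = np.abs(groups).max(axis=1, keepdims=True)    # per-group scale
x = (groups / scale).reshape(-1)                     # normalized to [-1, 1]

uniform = np.linspace(-1.0, 1.0, 16)                 # 4-bit uniform grid
learned = lloyd_max(x, uniform)                      # non-uniform 4-bit codebook
mse_uniform = codebook_mse(x, uniform)
mse_learned = codebook_mse(x, learned)
```

Initializing from the uniform grid means each Lloyd iteration can only lower (or keep) the distortion on the fitted data, so the learned codebook is never worse than the uniform grid on that data.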

2.3 Group-wise PTQ and QAT Extensions

Two-stage grid optimization: Stage 1 minimizes group-wise reconstruction loss using calibration activations; Stage 2 uses coordinate descent to jointly refine all group scales to minimize the full layer-wise output error, incorporating Hessian structure and propagation of quantization error from prior layers (Kim et al., 2 Feb 2026).
DL-QAT: Assigns each group a learnable scale (quantization magnitude) and corrects the remaining quantization error with a local low-rank LoRA update (trained with QAT, touching <1% of parameters), yielding state-of-the-art low-bit accuracy with high compute efficiency (Ke et al., 12 Apr 2025).
Kernel-wise quantization via DRL: AutoQ uses hierarchical RL to simultaneously allocate per-kernel QBN (bit number) given target accuracy/latency/energy, and can automatically discover non-uniformly quantized configurations per group (Lou et al., 2019).
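A rough sketch of a first-stage, group-wise scale search on calibration activations is given below. It is a plain grid search over clipping ratios of the max-abs scale, offered as an assumption-laden illustration rather than the procedure of Kim et al.; the function names and the ratio grid are hypothetical.

```python
import numpy as np

def quant_sym(w, scale, q_max):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

def group_output_error(W, W_hat, X, group_size):
    """Sum over groups of the squared output error each group causes on X."""
    total = 0.0
    for start in range(0, W.shape[1], group_size):
        sl = slice(start, start + group_size)
        total += np.sum(((W[:, sl] - W_hat[:, sl]) @ X[sl]) ** 2)
    return total

def quantize_with_scale_search(W, X, group_size, bits, ratios):
    """Per (row, group), try several clipping ratios of the max-abs scale and
    keep the one with the smallest output error on calibration inputs X."""
    q_max = 2 ** (bits - 1) - 1
    W_hat = np.empty_like(W)
    for start in range(0, W.shape[1], group_size):
        sl = slice(start, start + group_size)
        Xg = X[sl]                                   # activations feeding this group
        for r in range(W.shape[0]):
            w = W[r, sl]
            base = max(np.abs(w).max(), 1e-12) / q_max
            cands = [quant_sym(w, c * base, q_max) for c in ratios]
            errs = [np.sum(((w - w_q) @ Xg) ** 2) for w_q in cands]
            W_hat[r, sl] = cands[int(np.argmin(errs))]
    return W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))        # (out_features, in_features)
X = rng.normal(size=(64, 128))       # calibration activations (in_features, samples)
W_maxabs = quantize_with_scale_search(W, X, 32, 3, ratios=[1.0])
W_search = quantize_with_scale_search(W, X, 32, 3, ratios=np.linspace(0.5, 1.0, 11))
err_maxabs = group_output_error(W, W_maxabs, X, 32)
err_search = group_output_error(W, W_search, X, 32)
```

The per-group objective ignores interactions between groups of the same row, which is the gap a subsequent layer-wise refinement stage is meant to close.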

3. Hardware Implications, Efficiency, and Folding Strategies

3.1 Inference-Efficient Parameterization

Per-group scaling: Storing per-group scales accommodates efficient integer GEMM/conv kernels, since the scales can be folded into downstream components, such as BatchNorm parameters in vision nets (Yu et al., 2019), or fused as runtime per-block coefficients in transformer matmuls (Hu et al., 26 Feb 2025).
Adaptive decode efficiency: MANT's $a$-parameterized grid yields integer/shift MACs fused within systolic arrays, enabling high-throughput 4-bit inference (Hu et al., 26 Feb 2025). GSQ and group-wise scalar/clustered approaches are drop-in compatible with standard INT4/INT8 GEMM kernels (Dadgarnia et al., 20 Apr 2026, Elangovan et al., 7 Feb 2025).
Dual-grained recasting: In LLMs, dual-grained quantization dequantizes groupwise INT4 weights to INT8 and performs all inference with a single INT8 kernel (CUTLASS/GPU), leveraging both groupwise accuracy and coarse-grained hardware efficiency (Zhang et al., 2023).
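The idea of applying per-group scales as runtime coefficients can be sketched as follows: integer dot products are accumulated per group, and each partial sum is multiplied by its group scale once. This is a NumPy illustration of the arithmetic only, not a GPU kernel or any cited system's implementation; the shapes, the per-tensor activation scale `s_x`, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
out_f, in_f, gs = 4, 128, 32
n_groups = in_f // gs
q = rng.integers(-7, 8, size=(out_f, n_groups, gs))      # INT4-range weight codes
s = rng.uniform(0.01, 0.1, size=(out_f, n_groups, 1))    # one scale per (row, group)
x_int = rng.integers(-127, 128, size=in_f)               # INT8-range activation codes
s_x = 0.02                                               # per-tensor activation scale

# Reference: dequantize everything, then do a float matrix-vector product.
W_deq = (q * s).reshape(out_f, in_f)
y_ref = W_deq @ (x_int * s_x)

# Group-wise path: integer dot product per group, one float multiply per group.
partial = np.einsum("rgk,gk->rg", q, x_int.reshape(n_groups, gs))  # integer sums
y_grp = (partial * s[..., 0]).sum(axis=1) * s_x
```

Both paths give the same result up to floating-point rounding, but the second keeps the inner loop in integer arithmetic and touches each scale only once per group.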

3.2 Training-Time and Runtime Complexity

Group-wise parameter optimizations can be performed via:
- Closed-form/minimum error projections (Kim et al., 2 Feb 2026, Chen et al., 9 Apr 2026)
- Lightweight K-means (blocks/ranges/statistics) (Ryu et al., 8 Jan 2025, Nie et al., 2022)
- Multi-objective or RL-based search (when joint hardware and accuracy optimization required) (Lou et al., 2019).

Compared to layer-wise approaches, group-wise variants introduce modest additional parameters (e.g., per-group scale, small codebook/parameter tables), but have negligible runtime or memory impact on contemporary hardware at the group sizes used in practice (Hu et al., 26 Feb 2025, Elangovan et al., 7 Feb 2025).

4. Empirical Performance and Trade-offs

Extensive experiments across LLMs, vision, and generative models consistently establish that group-wise or block-quantization is a dominant regime for sub-8-bit and especially sub-4-bit quantization, with notable findings:

Vision: On ResNet-18 (CIFAR-100, 2-bit), per-filter (gs=1) groupwise quantization attains 71.3% top-1 (float: 73%) versus 64.9% for layerwise (Yu et al., 2019). AdderNet, with group-quantization and lossless clamp/outlier handling, recovers 66.5% top-1 at 4-bit PTQ, outperforming global scaling by 8.5 points (Nie et al., 2022).
LLMs: Groupwise INT4 in LLaMA-7B improves PPL from 6.85 (channelwise) to 5.8 (group size 128), while MANT-W4A8 further narrows the gap to FP16 (PPL 5.79) (Hu et al., 26 Feb 2025). Sophisticated grid or codebook approaches, as in BCQ (LO-BCQ) and GSQ, close the gap to vector quantization (Elangovan et al., 7 Feb 2025, Dadgarnia et al., 20 Apr 2026). Two-stage scale optimization provides PPL and accuracy improvements over GPTQ at 2–3 bits (Kim et al., 2 Feb 2026).
Diffusion & Generative Models: DGQ with dynamic axis/group selection and prompt-specific log quantization matches full-precision quality in W8A8 Stable Diffusion. At W4A6 with 16 groups, FID drops to 43.66 relative to the baseline, and group-wise DGQ reaches a CLIP score of 0.263 (vs. 0.127 for the baseline) (Ryu et al., 8 Jan 2025).

Key trade-offs:

Group size: Smaller groups (32–128) capture more local variability but increase metadata/storage; larger groups have lower overhead but higher error.
Codebook/parameter overhead: Codebook-based methods (BCQ) and per-group-adaptive types (MANT) trade extra per-group parameters for significant accuracy at ultra-low bitwidths.
Hardware interface: Selection of quantization format and group size must balance between inference kernel compatibility, scale lookup efficiency, and accelerator/ASIC resource usage.
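The group-size trade-off can be made concrete with a small storage calculation. The sketch below assumes one 16-bit scale per group and no stored zero-point; these bit widths and the function name `effective_bits` are illustrative assumptions, and real formats may differ.

```python
def effective_bits(bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage per weight: payload bits plus amortized group metadata."""
    return bits + (scale_bits + zero_point_bits) / group_size

# 4-bit weights with one 16-bit scale per group, for several group sizes.
table = {g: effective_bits(4, g) for g in (32, 64, 128, 256)}
# table == {32: 4.5, 64: 4.25, 128: 4.125, 256: 4.0625}
```

Under these assumptions, halving the group size doubles the metadata cost per weight, which is why very small groups are usually reserved for the lowest bitwidths where the accuracy gain justifies it.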

5. Algorithmic and Implementation Workflows

5.1 Unified Pipelines

The group-wise quantization process involves several typical algorithmic phases:

Phase	Description	Representative Approaches
1. Grouping/Clustering	Partition weights by axis, clustering, or contiguous blocks	(Ryu et al., 8 Jan 2025, Nie et al., 2022, Zhang et al., 2023, Chen et al., 9 Apr 2026)
2. Scale/Parameter search	Determine per-group quantizer scales or codebooks (possibly via output-layer error minimization)	(Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026, Elangovan et al., 7 Feb 2025)
3. Quantization	Apply group-wise quantization, coordinate/codebook lookup, and optional outlier correction or clamp	(Yu et al., 2019, Nie et al., 2022, Ke et al., 12 Apr 2025)
4. Post-processing	Fold scales into downstream normalization/bias (e.g., BatchNorm parameter adjustment); group-aware dequantization logic	(Yu et al., 2019, Hu et al., 26 Feb 2025)
5. (Optional) QAT	Fine-tune group magnitudes and LoRA low-rank correction for error minimization in the deployment regime	(Ke et al., 12 Apr 2025)

Pseudocode snippets for selected approaches can be found in (Yu et al., 2019, Ryu et al., 8 Jan 2025, Nie et al., 2022, Kim et al., 2 Feb 2026, Elangovan et al., 7 Feb 2025), formally specifying the groupwise quantization, block clustering, and scale/codebook updates.

5.2 Outlier Handling and Lossless Clamp

Outlier handling is central to preventing the representational collapse that a single global scale can cause. Approaches include group-wise clustering (K-means on range or magnitude permutation) (Ryu et al., 8 Jan 2025, Chen et al., 9 Apr 2026), lossless clamping plus bias compensation (Nie et al., 2022), and activation percentile-based squeezing (Zhang et al., 2023).
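The collapse caused by global scaling is easy to reproduce on synthetic data. In the sketch below, a single large outlier is planted in a tensor of small weights; the data, the 4-bit symmetric max-abs quantizer, and the helper name `sym_roundtrip` are assumptions chosen for illustration.

```python
import numpy as np

def sym_roundtrip(w, group_size, bits=4):
    """Symmetric max-abs quantization per group, returned in dequantized form."""
    q_max = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True), 1e-12) / q_max
    return (np.clip(np.round(g / scale), -q_max, q_max) * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.uniform(-0.05, 0.05, size=1024)
w[10] = 8.0                                      # a single large outlier

w_global = sym_roundtrip(w, group_size=w.size)   # one scale for the whole tensor
w_group = sym_roundtrip(w, group_size=64)        # one scale per 64 weights

zeros_global = int(np.sum(w_global == 0))        # 1023: every non-outlier weight is lost
zeros_group = int(np.sum(w_group == 0))          # damage mostly confined to one group
mse_global = np.mean((w - w_global) ** 2)
mse_group = np.mean((w - w_group) ** 2)
```

With one global scale the step size is set by the outlier, so all other weights round to zero. With groups of 64, only the group containing the outlier suffers this fate, and the remaining groups keep scales matched to their own range.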

5.3 Hardware Folding and Efficient Execution

Folding per-group scale factors into BatchNorm parameters removes the separate rescaling step at inference, so the scales add no runtime cost, as shown in GDRQ (Yu et al., 2019). Parallel systolic hardware paths, as in MANT, enable high-throughput 4-bit MACs via simultaneous accumulation/shift logic (Hu et al., 26 Feb 2025).
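The folding identity can be checked numerically for the simplest case, where each output channel forms one group so that its scale factors out of the accumulator. This sketch is an assumption-based illustration of the algebra, not GDRQ's exact procedure; the variable names and the random BatchNorm statistics are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 8, 32
u = rng.integers(-500, 500, size=(C, N)).astype(np.float64)  # integer accumulator outputs
s = rng.uniform(0.01, 0.1, size=(C, 1))                      # per-output-channel weight scale
gamma = rng.normal(size=(C, 1))                              # BatchNorm weight
beta = rng.normal(size=(C, 1))                               # BatchNorm bias
mean = rng.normal(size=(C, 1))                               # running mean
var = rng.uniform(0.5, 2.0, size=(C, 1))                     # running variance
eps = 1e-5

# Unfolded: rescale the accumulator, then apply BatchNorm.
y_ref = gamma * (s * u - mean) / np.sqrt(var + eps) + beta

# Folded: absorb the scale into one per-channel multiplier and one bias.
mult = gamma * s / np.sqrt(var + eps)
bias = beta - gamma * mean / np.sqrt(var + eps)
y_fold = mult * u + bias
```

After folding, inference needs a single multiply-add per channel on the integer accumulator output, and the quantization scale no longer appears as a separate operation.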

6. Applications, Impact, and Best Practices

Group-wise weight quantization is adopted across a diversity of application areas:

Computer vision: ResNet, VGG, and AdderNet architectures—group-wise approaches consistently deliver higher accuracy at 2–4 bit regimes than layer/row-wise approaches (Yu et al., 2019, Nie et al., 2022).
LLMs and transformers: Standard for sub-8-bit quantization in LLaMA, OPT, GPT-3, Kimi-K2.5, and Mixture-of-Experts models, with kernels and quantization toolkits directly supporting group/block formats (Elangovan et al., 7 Feb 2025, Dadgarnia et al., 20 Apr 2026, Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026).
Diffusion and generative models: Distribution-aware group selection achieves state-of-the-art low bitwidth deployment without architectural surgery or retraining (Ryu et al., 8 Jan 2025).
Medical foundation models: Permuted group-wise quantization delivers SOTA DSC and NSD in calibration-only settings (Chen et al., 9 Apr 2026).

Best practices:

For sub-8-bit deployment, a moderate group size (on the order of the 32–128 range noted earlier) is widely adopted, balancing metadata and quantization error (Hu et al., 26 Feb 2025, Elangovan et al., 7 Feb 2025).
Outlier-aware grouping and per-group parameterization are crucial for regimes below 4 bits (Ryu et al., 8 Jan 2025, Nie et al., 2022).
For mixed-precision and hardware-optimized inference, group-wise dequantization and folding should align with kernel organization (row/column blocks, scale folding).

7. Limitations, Extensions, and Ongoing Research

Overhead: Small group sizes increase metadata per parameter; however, this is often negligible (e.g., 0.1% model size at 16 groups/layer for UNet (Ryu et al., 8 Jan 2025); a single parameter $a$ and a scale per group for MANT (Hu et al., 26 Feb 2025)).
Activation quantization: Most approaches treat activations separately; joint group-wise schemes are an area of active research (Zhang et al., 2023, Elangovan et al., 7 Feb 2025).
Extensions: K-means and permutation strategies can generalize to blocks of arbitrary shapes and even non-linear group selection (e.g., attention outlier detection in DGQ) (Ryu et al., 8 Jan 2025). Unifying runtime scale/shift logic is ongoing in LLM accelerator design (Hu et al., 26 Feb 2025).
Codebook/parameter LUTs: Clustering methods with per-group codebooks introduce ROM/codebook lookup paths, which must be balanced against kernel simplicity (Elangovan et al., 7 Feb 2025).
Task transfer: Empirical evidence suggests that group-wise schemes often generalize across tasks and domains, and can be composed with QAT or LoRA/post-training finetuning for further accuracy boosts (Ke et al., 12 Apr 2025, Nie et al., 2022).

Group-wise and block quantization continue to evolve as the empirical and algorithmic backbone for ultra-low bitwidth neural network deployment across modalities and scales, underlying both academic innovation and industry platforms (Yu et al., 2019, Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026, Dadgarnia et al., 20 Apr 2026).