Group-wise Weight Quantization
- Group-wise weight quantization is a method that divides weight tensors into small groups, enabling locally adaptive scaling to minimize quantization error.
- It employs techniques such as K-means clustering, permutation-based grouping, and adaptive codebook approaches to optimize low-bit precision in diverse network architectures.
- Optimized partition strategies balance accuracy, memory overhead, and hardware efficiency, making them integral for deploying efficient LLMs, vision models, and generative networks.
Group-wise weight quantization is a set of methodologies in low-precision neural network inference that partitions the weight tensor of a neural network into small groups, blocks, or clusters. Each group is quantized using distinct parameters—typically its own scale (and possibly zero-point/codebook)—allowing the quantizer to better match local data statistics and thereby reducing overall quantization error relative to global, layer-wise, or channel-wise quantization schemes. Group-wise quantization is now standard in efficient LLM, vision, and generative network deployment, owing to its favorable trade-off between accuracy, memory, compute efficiency, and hardware practicalities.
1. Mathematical Formulation and Partitioning Strategies
Group-wise quantization divides a weight tensor $W$ into $G$ non-overlapping groups $W_g$, indexed by $g = 1, \dots, G$. The choice of partitioning depends on the layer topology and desired hardware compatibility:
- Spatially contiguous blocks: 1D or 2D sub-blocks in flattened tensors.
- Per-output channel: Each output channel (i.e., convolutional kernel or linear layer row) forms a group.
- K-means/group clustering: Clustering based on statistics (e.g., per-channel max, variance) or range variability (Ryu et al., 8 Jan 2025, Nie et al., 2022).
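For each group, a dedicated quantizer is constructed, typically characterized by a per-group scale $s_g$, an offset/zero-point $z_g$, or (for VQ methods) a group codebook.
General framework:
For $b$-bit quantization, the typical affine quantizer applied to a weight $w$ in group $g$ is:
$$q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{w}{s_g}\right) + z_g,\; 0,\; 2^b - 1\right)$$
with dequantization $\hat{w} = s_g\,(q - z_g)$, where $s_g$ and $z_g$ are local to group $g$.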
This design enables isolating outlier-heavy regions, exploiting local redundancy, and increasing quantization granularity where necessary.
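The affine quantizer described above can be sketched in NumPy as follows. This is a minimal illustration, assuming contiguous 1D groups over the flattened tensor and a min/max range estimate per group; the function names `quantize_groups` and `dequantize_groups` are hypothetical and do not come from any of the cited works.

```python
import numpy as np

def quantize_groups(w, group_size=4, bits=4):
    """Affine quantization with one scale and one zero-point per contiguous group."""
    groups = w.reshape(-1, group_size)              # one row per group
    qmax = 2**bits - 1
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    # Guard against constant groups, where max == min would give a zero scale.
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_groups(q, scale, zero_point, shape):
    """Map integer codes back to floating point using each group's own parameters."""
    return (scale * (q.astype(np.float64) - zero_point)).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
q, scale, zero_point = quantize_groups(w, group_size=4, bits=4)
w_hat = dequantize_groups(q, scale, zero_point, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Because every group derives its scale from its own range, a large value in one group does not coarsen the grid used by any other group.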
2. Group-wise Quantization Algorithms and Variants
2.1 Uniform and Asymmetric Group-wise Quantization
The baseline is uniform, symmetric (or asymmetric) quantization applied per group, with scales and possible zero-points adapted to the local statistics. For group $g$ with weights $W_g$ at $b$ bits:
- Uniform, symmetric: $s_g = \max|W_g| / (2^{b-1} - 1)$; $\hat{W}_g = s_g \cdot \mathrm{round}(W_g / s_g)$ (Zhang et al., 2023, Nie et al., 2022, Ryu et al., 8 Jan 2025).
- Dynamic axis selection: In DGQ for diffusion models, the maximal range axis is automatically detected to capture outlier structure, and groups are formed by K-means on local range tuples, with quantization scales set accordingly (Ryu et al., 8 Jan 2025).
- Permutation-based grouping: In Permutation-COMQ, rows are permuted to cluster similar-magnitude entries per column before independent quantization, minimizing within-group scale range and maximizing dynamic resolution (Chen et al., 9 Apr 2026).
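To make the benefit of per-group scales concrete, the sketch below compares symmetric max-abs quantization with one scale for the whole tensor against one scale per contiguous group, on a toy vector containing a single outlier. The helper `symmetric_quantize` and the toy data are assumptions for illustration only.

```python
import numpy as np

def symmetric_quantize(w, bits, group_size=None):
    """Symmetric uniform fake-quantization.

    group_size=None uses a single scale for the whole tensor;
    otherwise each contiguous group of that size gets its own scale.
    """
    flat = w.reshape(1, -1) if group_size is None else w.reshape(-1, group_size)
    qmax = 2**(bits - 1) - 1
    scale = np.maximum(np.abs(flat).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(flat / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)

# Toy weights: many small values plus one large outlier.
w = np.full(64, 0.1)
w[0] = 8.0

mse_tensor = np.mean((w - symmetric_quantize(w, bits=4)) ** 2)
mse_group = np.mean((w - symmetric_quantize(w, bits=4, group_size=16)) ** 2)
print("per-tensor MSE:", mse_tensor)
print("group-wise MSE:", mse_group)
```

With a single scale, the outlier stretches the grid so far that every small weight rounds to zero. With groups of 16, only the group containing the outlier is affected, and the other groups are represented essentially exactly.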
2.2 Non-uniform, Adaptive, and Codebook Approaches
- Mathematically adaptive numeric types: MANT parameterizes the group-wise quantization grid by a per-group parameter $a$, allowing the quantizer to smoothly interpolate between log-uniform, power-of-two, and normal-float behavior (Hu et al., 26 Feb 2025). This parameter is learned or selected per group by minimizing the induced output error.
- Block/codebook clustering: BCQ (LO-BCQ) partitions blocks of weights, clusters blocks into a small set of codebooks, and quantizes each block with its corresponding codebook, yielding near-optimal MSE (Elangovan et al., 7 Feb 2025). This is iteratively optimized via block–cluster re-assignment and Lloyd–Max codebook updates.
- Gumbel-Softmax relaxation: GSQ directly learns discrete grid assignments for each weight (within a group having a shared scale) via differentiable Gumbel-Softmax relaxation, jointly optimizing per-group scales and per-coordinate grid points (Dadgarnia et al., 20 Apr 2026).
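As a rough illustration of the non-uniform idea behind the codebook approaches above, the following sketch fits a small 1D codebook to a single group of weights with Lloyd's algorithm (1D k-means). It is deliberately simplified: it is not LO-BCQ, which clusters blocks into shared codebooks, nor MANT or GSQ. The name `lloyd_codebook` and the quantile initialisation are assumptions.

```python
import numpy as np

def lloyd_codebook(values, n_levels=4, n_iter=20):
    """Fit a 1-D codebook to one group of weights with Lloyd's algorithm."""
    # Initialise the levels at evenly spaced quantiles of the data.
    codebook = np.quantile(values, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iter):
        # Assignment step: nearest codebook level for every weight.
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        # Update step: each level moves to the mean of its members.
        for k in range(n_levels):
            members = values[idx == k]
            if members.size:              # leave empty levels where they are
                codebook[k] = members.mean()
    idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx

group = np.array([-1.0, -1.0, 0.0, 0.0, 0.5, 0.5, 2.0, 2.0])
codebook, idx = lloyd_codebook(group, n_levels=4)
reconstructed = codebook[idx]             # dequantization is a table lookup
print(codebook, idx)
```

Storing `idx` costs 2 bits per weight for four levels, plus the codebook itself as per-group metadata, which is the overhead that the codebook methods trade against accuracy.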
2.3 Group-wise PTQ and QAT Extensions
- Two-stage grid optimization: Stage 1 minimizes group-wise reconstruction loss using calibration activations; Stage 2 uses coordinate descent to jointly refine all group scales to minimize the full layer-wise output error, incorporating Hessian structure and propagation of quantization error from prior layers (Kim et al., 2 Feb 2026).
- DL-QAT: Assigns each group a learnable scale (quantization magnitude) and corrects remaining quantization error with a local low-rank LoRA update (trained with QAT, touching <1% of parameters), yielding state-of-the-art low-bit accuracy with extreme compute efficiency (Ke et al., 12 Apr 2025).
- Kernel-wise quantization via DRL: AutoQ uses hierarchical RL to simultaneously allocate per-kernel QBN (bit number) given target accuracy/latency/energy, and can automatically discover non-uniformly quantized configurations per group (Lou et al., 2019).
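The idea of refining group scales against the layer output, rather than against the weights alone, can be sketched with a single pass of coordinate descent over a small grid of candidate scales. This is an assumption-laden toy for one output column of a linear layer. It does not reproduce the two-stage method of Kim et al., and it uses neither Hessian information nor error propagation from earlier layers; `refine_scales` and its parameters are hypothetical.

```python
import numpy as np

def fake_quant(w, scale, bits):
    """Symmetric round-to-nearest fake-quantization with the given scale."""
    qmax = 2**(bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def refine_scales(x, w, group_size=8, bits=3, n_candidates=11):
    """One coordinate-descent pass over group scales to reduce ||x @ w - x @ w_hat||^2.

    x: calibration activations, shape (n_samples, in_features)
    w: weights of one output column, shape (in_features,)
    """
    qmax = 2**(bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.maximum(np.abs(groups).max(axis=1), 1e-12) / qmax  # max-abs start
    target = x @ w

    def output_error(s):
        w_hat = fake_quant(groups, s[:, None], bits).reshape(-1)
        return float(np.sum((target - x @ w_hat) ** 2))

    init_err = best_err = output_error(scales)
    for g in range(len(scales)):
        base = scales[g]
        for factor in np.linspace(0.5, 1.0, n_candidates):
            trial = scales.copy()
            trial[g] = base * factor      # shrink only this group's scale
            err = output_error(trial)
            if err < best_err:            # keep a candidate only if it helps
                best_err, scales = err, trial
    return scales, init_err, best_err

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
w = rng.normal(size=32)
scales, init_err, best_err = refine_scales(x, w)
print("output error before/after:", init_err, best_err)
```

Because a candidate is accepted only when it lowers the output error, the refined error can never exceed the max-abs starting point.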
3. Hardware Implications, Efficiency, and Folding Strategies
3.1 Inference-Efficient Parameterization
- Per-group scaling: The design accommodates efficient integer GEMM/conv kernels by storing per-group scales, which can be folded into downstream components, such as BatchNorm parameters in vision nets (Yu et al., 2019), or fused as runtime per-block coefficients in transformer matmuls (Hu et al., 26 Feb 2025).
- Adaptive decode efficiency: MANT's grid, parameterized by $a$, yields integer/shift MACs fused within systolic arrays, enabling high-throughput 4-bit inference (Hu et al., 26 Feb 2025). GSQ and group-wise scalar/clustered approaches are drop-in replacements for standard INT4/INT8 GEMM kernels (Dadgarnia et al., 20 Apr 2026, Elangovan et al., 7 Feb 2025).
- Dual-grained recasting: In LLMs, dual-grained quantization dequantizes groupwise INT4 weights to INT8 and performs all inference with a single INT8 kernel (CUTLASS/GPU), leveraging both groupwise accuracy and coarse-grained hardware efficiency (Zhang et al., 2023).
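One way to picture the recasting of group-wise INT4 weights into a single INT8 representation is to split each fine-grained floating-point group scale into a shared coarse scale and a small integer multiplier. The decomposition below is an assumption made for illustration and may differ from the scheme used by Zhang et al.; `int4_groups_to_int8` is a hypothetical name.

```python
import numpy as np

def int4_groups_to_int8(q4, group_scales):
    """Recast group-wise INT4 codes of one output channel as INT8 codes.

    q4:           integer codes in [-8, 7], shape (n_groups, group_size)
    group_scales: positive floating-point scale per group, shape (n_groups,)
    """
    coarse = group_scales.max() / 16.0                      # shared float scale
    mult = np.clip(np.round(group_scales / coarse), 1, 16)  # integer per-group multiplier
    # |q4| <= 8 and mult <= 16, so the product always fits in int8.
    q8 = (q4 * mult[:, None]).astype(np.int8)
    return q8, coarse, mult

rng = np.random.default_rng(0)
q4 = rng.integers(-8, 8, size=(4, 32))
group_scales = np.array([0.01, 0.02, 0.05, 0.08])
q8, coarse, mult = int4_groups_to_int8(q4, group_scales)

w_fine = q4 * group_scales[:, None]       # original group-wise dequantization
w_coarse = q8 * coarse                    # single-scale INT8 dequantization
print("max abs difference:", np.abs(w_fine - w_coarse).max())
```

After this step the whole channel shares one floating-point scale, so a standard INT8 GEMM kernel can consume the weights. The cost is the rounding of each group scale to an integer multiple of the coarse scale.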
3.2 Training-Time and Runtime Complexity
- Group-wise parameter optimization can be performed via:
  - Closed-form/minimum-error projections (Kim et al., 2 Feb 2026, Chen et al., 9 Apr 2026)
  - Lightweight K-means (blocks/ranges/statistics) (Ryu et al., 8 Jan 2025, Nie et al., 2022)
  - Multi-objective or RL-based search (when joint hardware and accuracy optimization is required) (Lou et al., 2019).
Compared to layer-wise approaches, group-wise variants introduce modest additional parameters (e.g., per-group scale, small codebook/parameter tables), but have negligible runtime or memory impact on contemporary hardware at commonly used group sizes (Hu et al., 26 Feb 2025, Elangovan et al., 7 Feb 2025).
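The metadata cost of per-group parameters is easy to estimate by amortising them over the group. The helper below is a back-of-the-envelope sketch; the 16-bit scale and the absence of a stored zero-point are assumptions, not figures taken from the cited works.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage per weight once per-group metadata is spread over the group."""
    return bits + (scale_bits + zero_point_bits) / group_size

for group_size in (32, 64, 128):
    print(group_size, effective_bits_per_weight(4, group_size))
# 32 4.5
# 64 4.25
# 128 4.125
```

Under these assumptions, 4-bit weights with one 16-bit scale per 128 weights cost 4.125 bits per weight, while shrinking the group to 32 raises this to 4.5 bits.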
4. Empirical Performance and Trade-offs
Extensive experiments across LLMs, vision, and generative models consistently establish that group-wise or block-quantization is a dominant regime for sub-8-bit and especially sub-4-bit quantization, with notable findings:
- Vision: On ResNet-18 (CIFAR-100, 2-bit), per-filter (gs=1) groupwise quantization attains 71.3% top-1 (float: 73%) versus 64.9% for layerwise (Yu et al., 2019). AdderNet, with group-quantization and lossless clamp/outlier handling, recovers 66.5% top-1 at 4-bit PTQ, outperforming global scaling by 8.5 points (Nie et al., 2022).
- LLMs: Groupwise INT4 in LLaMA-7B improves PPL from 6.85 (channelwise) to 5.8 (group size 128), while MANT-W4A8 further narrows the gap to FP16 (PPL 5.79) (Hu et al., 26 Feb 2025). Sophisticated grid or codebook approaches close the delta to “vector quantization” at 8 as in BCQ (LO-BCQ) and GSQ (2520.05376, Dadgarnia et al., 20 Apr 2026). Two-stage scale optimization provides PPL and accuracy enhancements over GPTQ at 2–3 bits (Kim et al., 2 Feb 2026).
- Diffusion & Generative Models: DGQ with dynamic axis/group selection and prompt-specific log quantization attains 9 in W8A8 StableDiffusion, matching FP. At W4A6, using 16 groups, FID drops from 0 to 43.66 (baseline), with group-wise DJQ at 0.263 CLIP (vs 0.127 baseline) (Ryu et al., 8 Jan 2025).
Key trade-offs:
- Group size: Smaller groups (32–128) capture more local variability but increase metadata/storage; larger groups have lower overhead but higher error.
- Codebook/parameter overhead: Codebook-based methods (BCQ) and per-group-adaptive types (MANT) trade extra per-group parameters for significant accuracy at ultra-low bitwidths.
- Hardware interface: Selection of quantization format and group size must balance between inference kernel compatibility, scale lookup efficiency, and accelerator/ASIC resource usage.
5. Algorithmic and Implementation Workflows
5.1 Unified Pipelines
The group-wise quantization process involves several typical algorithmic phases:
| Phase | Description | Representative Approaches |
|---|---|---|
| 1. Grouping/Clustering | Partition weights by axis, clustering, or contiguous blocks | (Ryu et al., 8 Jan 2025, Nie et al., 2022, Zhang et al., 2023, Chen et al., 9 Apr 2026) |
| 2. Scale/Parameter search | Determine per-group quantizer scales or codebooks (possibly via output-layer error minimization) | (Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026, Elangovan et al., 7 Feb 2025) |
| 3. Quantization | Apply group-wise quantization, coordinate/codebook lookup, and optional outlier correction or clamp | (Yu et al., 2019, Nie et al., 2022, Ke et al., 12 Apr 2025) |
| 4. Post-processing | Fold scales into downstream normalization/bias (e.g., BatchNorm parameter adjustment), group-aware dequantization logic | (Yu et al., 2019, Hu et al., 26 Feb 2025) |
| 5. (Optional) QAT | Fine-tune group magnitudes and LoRA low-rank correction for error minimization in the deployment regime | (Ke et al., 12 Apr 2025) |
Pseudocode snippets for selected approaches can be found in (Yu et al., 2019, Ryu et al., 8 Jan 2025, Nie et al., 2022, Kim et al., 2 Feb 2026, Elangovan et al., 7 Feb 2025), formally specifying the groupwise quantization, block clustering, and scale/codebook updates.
5.2 Outlier Handling and Lossless Clamp
Outlier handling is central to preventing the representational collapse that occurs under global scaling. Reported strategies include group-wise clustering, via K-means on range (Ryu et al., 8 Jan 2025) or magnitude permutation (Chen et al., 9 Apr 2026); lossless clamping plus bias compensation (Nie et al., 2022); and activation percentile-based squeezing (Zhang et al., 2023).
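Range-based clustering can be sketched as 1D k-means over the per-channel maximum magnitude, after which channels in the same cluster share a scale. This is a simplified illustration under stated assumptions (rows as channels, max-abs as the only statistic, symmetric quantization). It is not the DGQ or AdderNet procedure, and both function names are hypothetical.

```python
import numpy as np

def cluster_channels_by_range(w, n_groups=2, n_iter=20):
    """Group output channels (rows of w) by 1-D k-means on their max-abs range."""
    ranges = np.abs(w).max(axis=1)
    centers = np.quantile(ranges, np.linspace(0.0, 1.0, n_groups))
    for _ in range(n_iter):
        labels = np.abs(ranges[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(n_groups):
            if np.any(labels == k):       # skip clusters that lost all members
                centers[k] = ranges[labels == k].mean()
    return np.abs(ranges[:, None] - centers[None, :]).argmin(axis=1)

def quantize_by_cluster(w, labels, bits=4):
    """Symmetric fake-quantization with one max-abs scale per cluster of rows."""
    qmax = 2**(bits - 1) - 1
    w_hat = np.empty_like(w, dtype=np.float64)
    for k in np.unique(labels):
        rows = labels == k
        scale = max(np.abs(w[rows]).max(), 1e-12) / qmax
        w_hat[rows] = np.clip(np.round(w[rows] / scale), -qmax, qmax) * scale
    return w_hat

# Three small-range channels and one outlier channel.
w = np.array([[0.1, -0.1, 0.1, -0.1]] * 3 + [[8.0, -8.0, 8.0, -8.0]])
labels = cluster_channels_by_range(w, n_groups=2)
w_hat = quantize_by_cluster(w, labels, bits=4)
print(labels, np.abs(w - w_hat).max())
```

The outlier channel ends up in its own cluster, so its large range no longer dictates the scale used for the small-range channels.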
5.3 Hardware Folding and Efficient Execution
Folding per-group scale factors into BatchNorm parameters ahead of inference means the scales incur no additional parameters or runtime cost, as shown in GDRQ (Yu et al., 2019). Parallel systolic hardware paths, as in MANT, enable high-throughput 4-bit MACs via simultaneous accumulation/shift logic (Hu et al., 26 Feb 2025).
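The algebra behind scale folding is short when each group coincides with an output channel, which is the assumption made in this sketch. If the integer accumulator for a channel is `acc` and its weight scale is `scale`, then BatchNorm applied to `scale * acc` is itself an affine function of `acc`, so the scale can be absorbed into the BatchNorm gain. The function name and the `eps` default are illustrative choices, not taken from GDRQ.

```python
import numpy as np

def fold_scale_into_bn(scale, gamma, beta, mean, var, eps=1e-5):
    """Return (gain, bias) with BN(scale * acc) == gain * acc + bias per channel."""
    inv_std = 1.0 / np.sqrt(var + eps)
    gain = gamma * scale * inv_std        # weight scale absorbed into the BN gain
    bias = beta - gamma * mean * inv_std  # bias is unaffected by the weight scale
    return gain, bias

rng = np.random.default_rng(0)
acc = rng.integers(-100, 100, size=(5, 4)).astype(np.float64)  # integer accumulators
scale = rng.uniform(0.01, 0.1, size=4)
gamma, beta, mean = rng.normal(size=(3, 4))
var = rng.uniform(0.5, 2.0, size=4)

reference = gamma * (scale * acc - mean) / np.sqrt(var + 1e-5) + beta
gain, bias = fold_scale_into_bn(scale, gamma, beta, mean, var)
print(np.allclose(gain * acc + bias, reference))
```

When groups are smaller than a channel, the scales differ within one accumulation and cannot be folded this way, which is one reason group layout and kernel organisation need to be chosen together.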
6. Applications, Impact, and Best Practices
Group-wise weight quantization is adopted across a diversity of application areas:
- Computer vision: ResNet, VGG, and AdderNet architectures—group-wise approaches consistently deliver higher accuracy at 2–4 bit regimes than layer/row-wise approaches (Yu et al., 2019, Nie et al., 2022).
- LLMs and transformers: Standard for sub-8-bit quantization in LLaMA, OPT, GPT-3, Kimi-K2.5, and Mixture-of-Experts models, with kernels and quantization toolkits directly supporting group/block formats (Elangovan et al., 7 Feb 2025, Dadgarnia et al., 20 Apr 2026, Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026).
- Diffusion and generative models: Distribution-aware group selection achieves state-of-the-art low bitwidth deployment without architectural surgery or retraining (Ryu et al., 8 Jan 2025).
- Medical foundation models: Permuted group-wise quantization delivers SOTA DSC and NSD in calibration-only settings (Chen et al., 9 Apr 2026).
Best practices:
- For sub-8-bit deployment, a moderate group size (on the order of 32–128 weights per group) is widely adopted, balancing metadata and quantization error (Hu et al., 26 Feb 2025, Elangovan et al., 7 Feb 2025).
- Outlier-aware grouping and per-group parameterization are crucial for regimes below 4 bits (Ryu et al., 8 Jan 2025, Nie et al., 2022).
- For mixed-precision and hardware-optimized inference, group-wise dequantization and folding should align with kernel organization (row/column blocks, scale folding).
7. Limitations, Extensions, and Ongoing Research
- Overhead: Small group sizes increase metadata per parameter; however, the reported overhead is small (e.g., 0.1% of model size at 16 groups/layer for UNet (Ryu et al., 8 Jan 2025); a single parameter $a$ and a scale per group for MANT (Hu et al., 26 Feb 2025)).
- Activation quantization: Most approaches treat activations separately; joint group-wise schemes are an area of active research (Zhang et al., 2023, Elangovan et al., 7 Feb 2025).
- Extensions: K-means and permutation strategies can generalize to blocks of arbitrary shapes and even non-linear group selection (e.g., attention outlier detection in DGQ) (Ryu et al., 8 Jan 2025). Unifying runtime scale/shift logic is ongoing in LLM accelerator design (Hu et al., 26 Feb 2025).
- Codebook/parameter LUTs: Clustering methods with per-group codebooks introduce ROM/codebook lookup paths, which must be balanced against kernel simplicity (Elangovan et al., 7 Feb 2025).
- Task transfer: Empirical evidence suggests that group-wise schemes often generalize across tasks and domains, and can be composed with QAT or LoRA/post-training finetuning for further accuracy boosts (Ke et al., 12 Apr 2025, Nie et al., 2022).
Group-wise and block quantization continue to evolve as the empirical and algorithmic backbone for ultra-low bitwidth neural network deployment across modalities and scales, underlying both academic innovation and industry platforms (Yu et al., 2019, Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026, Dadgarnia et al., 20 Apr 2026).