Multi-Group Quantization (MGVQ)
- Multi-Group Quantization is a method that partitions tensor values into small groups, each with its own quantization parameters to reduce error and optimize rate–distortion trade-offs.
- It generalizes traditional layerwise and channelwise quantization by using both static and dynamic grouping techniques, significantly improving model calibration in LLMs and vision transformers.
- MGVQ enhances hardware efficiency and inference performance by achieving near full-precision accuracy at low bitwidths and scalable capacity in generative models.
Multi-Group Quantization (MGVQ) is a set of quantization paradigms in which tensor values—weights, activations, or latents—are partitioned into small "groups," with each group assigned its own quantization parameters, codebook, or numeric type. This approach generalizes the classic layerwise/channelwise quantization, enabling fine-grained adaptation to local statistics, and is now foundational across efficient LLM inference, vision transformer calibration, and vector quantized generative models. MGVQ can employ uniform or non-uniform grids, instance-dependent grouping, and dynamic parameterization to optimize rate–distortion trade-offs and hardware efficiency.
1. Fundamental Concepts and Mathematical Framework
MGVQ partitions a vector or tensor $\mathbf{w} \in \mathbb{R}^N$ into disjoint groups $\mathcal{G}_1, \dots, \mathcal{G}_G$, each of size $n_g$ ($\sum_g n_g = N$), with independent quantization parameters per group. The general groupwise quantization mapping, as used in LLM quantization, is

$$\hat{\mathbf{w}}_g = s_g \cdot \Pi\!\left(\mathbf{w}_g / s_g\right),$$

where $s_g$ is a trainable group scale and $\Pi(\cdot)$ denotes rounding and clamping to the target bitwidth. In generative modeling contexts, groupwise quantization may instead use sub-codebooks per group, with each input vector split into sub-vectors, each quantized independently via assignment to the closest codebook vector (Jia et al., 10 Jul 2025, Zheng et al., 15 Oct 2025).
The efficacy of MGVQ relies on the observation that local statistics—such as dynamic range or variance—can vary significantly even across small groups. Adapting parameters to each group reduces quantization error and prevents information loss that would otherwise be incurred by forced global quantization.
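The groupwise mapping above can be sketched in a few lines of NumPy. This is an illustrative sketch using symmetric max-abs scaling; the function name and the comparison against a single global scale are ours, not taken from any cited work:

```python
import numpy as np

def groupwise_quantize(w, group_size=64, bits=4):
    """Symmetric groupwise quantization: one scale per contiguous group.

    Illustrative sketch only; real implementations also handle zero-points,
    padding, and per-channel group layouts.
    """
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit symmetric
    w = w.reshape(-1, group_size)                # one row per group
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return q * scales                            # dequantized reconstruction

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w_hat = groupwise_quantize(w, group_size=64, bits=4).ravel()
err_group = np.mean((w - w_hat) ** 2)
# A single scale spanning the whole tensor (one giant "group") must use the
# global max-abs range, giving a coarser grid and larger error:
w_hat_global = groupwise_quantize(w, group_size=4096, bits=4).ravel()
err_global = np.mean((w - w_hat_global) ** 2)
```

Because each group's scale tracks its local range, the per-group grid is finer wherever the values are small, which is exactly the adaptation to local statistics described above.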
2. MGVQ in Parameter and Activation Quantization
Fine-grained groupwise quantization is a de facto standard in low-bit LLM inference (Hu et al., 26 Feb 2025, Kim et al., 2 Feb 2026), as well as in post-training calibration for vision transformers (Moon et al., 2024). The core motivations and methodologies include:
- LLM Groupwise Schemes: LLMs commonly use small group sizes (e.g., 32 or 64 weights) per output channel. Each group is assigned a scale and potentially a non-uniform grid to match its range and distribution. Recent LLM accelerators also quantize KV caches groupwise, with real-time group parameter selection (Hu et al., 26 Feb 2025).
- Vision Transformer Activation Quantization: IGQ-ViT introduces instance-aware dynamic group splitting, where both activations and attention maps are divided into groups for each input instance. Group assignments are updated with an EM-style algorithm to minimize distributional discrepancies and optimize uniform quantizer fit (Moon et al., 2024).
- Groupwise Grid Optimization: Two-stage methods further minimize reconstruction loss by first initializing group scales with an input-aware local objective, then refining all scales via coordinate descent on the global layerwise loss, using block-partitioned input Hessians (Kim et al., 2 Feb 2026).
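The per-group parameter search common to these methods can be sketched as follows. This is a hypothetical helper in the spirit of MSE-based grid search over candidate scales, not the exact MANT or two-stage procedure:

```python
import numpy as np

def search_group_scale(w_g, bits=4, n_candidates=20):
    """Pick the scale minimizing reconstruction MSE for one group via
    grid search over shrunken max-abs candidates (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(w_g).max() / qmax              # naive max-abs scale
    best_s, best_err = base, np.inf
    for r in np.linspace(0.5, 1.0, n_candidates):
        s = base * r                             # candidate (shrunken) scale
        q = np.clip(np.round(w_g / s), -qmax - 1, qmax)
        err = np.mean((w_g - q * s) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

rng = np.random.default_rng(1)
g = rng.standard_t(df=4, size=64)    # heavy-tailed group, like LLM weights
s, err = search_group_scale(g)
```

Shrinking the scale below the max-abs value trades a little clipping of outliers for a finer grid on the bulk of the distribution, which is why the search often beats the naive scale on heavy-tailed groups.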
Table 1: Representative MGVQ Methods in Parameter/Activation Quantization
| Method | Grouping | Parameter Selection |
|---|---|---|
| MANT (Hu et al., 26 Feb 2025) | Static, groups of 64 weights | Grid search on MSE |
| Two-Stage (Kim et al., 2 Feb 2026) | Static, group size 32/64 | Global loss minimization |
| IGQ-ViT (Moon et al., 2024) | Dynamic, per-instance | EM-based, BOP-constrained |
3. Multi-Group Vector Quantization in Representation Learning
In vector-quantized generative models such as VQ-VAEs and VQGANs, MGVQ replaces a single monolithic codebook with $G$ sub-codebooks ("multi-group vector quantization"), each operating on a sub-vector of the encoder outputs (Jia et al., 10 Jul 2025, Zheng et al., 15 Oct 2025). The encoder output $\mathbf{z} \in \mathbb{R}^d$ is split into $G$ chunks of size $d/G$, and each chunk is quantized independently:

$$\hat{\mathbf{z}}_g = \arg\min_{\mathbf{c} \in \mathcal{C}_g} \lVert \mathbf{z}_g - \mathbf{c} \rVert_2, \qquad g = 1, \dots, G,$$

with each sub-codebook $\mathcal{C}_g$ of size $K$. The quantized codes are concatenated. This increases representational capacity ($K^G$ combinations) without incurring codebook collapse or excessive per-codebook dimensionality.
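The split-and-assign step can be sketched as follows. This is an illustrative NumPy version of multi-group nearest-neighbor quantization; the function name, shapes, and codebook sizes are ours, not the exact MGVQ or Group-VQ implementation:

```python
import numpy as np

def multi_group_quantize(z, codebooks):
    """Quantize a batch of latents with G independent sub-codebooks.

    z:         (B, d) encoder outputs
    codebooks: list of G arrays, each of shape (K, d // G)
    Returns concatenated quantized latents and per-group code indices.
    """
    G = len(codebooks)
    chunks = np.split(z, G, axis=1)          # G chunks of shape (B, d/G)
    outs, idxs = [], []
    for c, cb in zip(chunks, codebooks):
        # squared Euclidean distance to every codebook vector: (B, K)
        d2 = ((c[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        i = d2.argmin(axis=1)                # nearest codebook entry per row
        outs.append(cb[i])
        idxs.append(i)
    return np.concatenate(outs, axis=1), np.stack(idxs, axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
codebooks = [rng.normal(size=(32, 4)) for _ in range(4)]  # K=32, G=4
z_q, idx = multi_group_quantize(z, codebooks)
# Joint capacity is 32**4 codes from four 32-entry sub-codebooks.
```

Each sub-codebook only has to cover a low-dimensional sub-space, while the concatenation yields $K^G$ distinct joint codes.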
Group-VQ (Zheng et al., 15 Oct 2025) generalizes this to allow groupwise optimization of codebook segments, enabling higher code utilization and post-hoc codebook resampling or self-extension for capacity tuning.
Table 2: MGVQ in VQ-VAEs
| System | No. Groups | Codebook Size per Group | Total Capacity |
|---|---|---|---|
| MGVQ-G4 (Jia et al., 10 Jul 2025) | 4 | 8192 | $8192^4$ |
| MGVQ-G8 (Jia et al., 10 Jul 2025) | 8 | 2048 | $2048^8$ |
| Group-VQ (Zheng et al., 15 Oct 2025) | Configurable | Configurable | Self-extensible post hoc |
4. Algorithmic Techniques for Group Assignment and Parameterization
MGVQ covers a spectrum of assignment mechanisms:
- Static Partitioning: Fixed group boundaries (by position, channel count, or latent dimension).
- Dynamic Grouping: Assignment at runtime based on per-instance statistics, using distance measures (e.g., min/max spread or KL divergence in output space) and iterative EM steps (Moon et al., 2024).
- Groupwise Grid/Codebook Optimization: Search or descent algorithms per group for optimal quantizer parameters (scales, grid coefficients, codebooks), with group-specific calibration loss, often involving grid search or closed-form updates (Kim et al., 2 Feb 2026, Hu et al., 26 Feb 2025).
- Within-Group Joint Adaptation: Learned affine projectors shared within groupwise codebooks to capture heterogeneous data subdistributions (Zheng et al., 15 Oct 2025).
Dynamically assigned groups can be optimized subject to computational or hardware constraints, such as limiting extra Bit-Operations (BOP) via integer programming (Moon et al., 2024).
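Range-based dynamic grouping can be illustrated with a simplified sketch: channels are sorted by per-instance dynamic range and split evenly, so channels sharing a quantizer have similar ranges. IGQ-ViT's actual EM refinement and BOP-constrained group budget are omitted here; this stand-in only captures the assignment idea:

```python
import numpy as np

def dynamic_group_assignment(x, n_groups=4):
    """Assign activation channels to groups by per-instance dynamic range.

    x: (C, N) activations of one instance, C channels.
    Returns an integer group id per channel. Channels with similar ranges
    land in the same group, so one uniform scale fits each group well.
    """
    ranges = np.abs(x).max(axis=1)           # per-channel dynamic range
    order = np.argsort(ranges)               # similar ranges become adjacent
    group_id = np.empty(len(ranges), dtype=int)
    for g, chunk in enumerate(np.array_split(order, n_groups)):
        group_id[chunk] = g
    return group_id

rng = np.random.default_rng(0)
# 64 channels x 197 tokens, with widely varying per-channel scales
x = rng.normal(size=(64, 197)) * rng.uniform(0.1, 10.0, size=(64, 1))
gid = dynamic_group_assignment(x, n_groups=4)
```

Because the assignment depends on the current instance's statistics, it must be recomputed at runtime, which is what motivates the BOP constraint in the integer-programming formulation above.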
5. Hardware and Inference Considerations
MGVQ is central in hardware-efficient model deployment:
- Low-cost Metadata Storage: Scales or group identifiers are small, often amortized over large tensor tiles.
- Fused Dequantization: Hardware pipelines (e.g., systolic arrays) can interleave decode and compute operations where group parameters are locally buffered, as in MANT (Hu et al., 26 Feb 2025).
- No Inference Cost Overhead: Approaches such as GDRQ (Yu et al., 2019) merge per-group scales into BatchNorm during inference, retaining hardware simplicity.
- Dynamic Real-Time Support: For LLMs, real-time groupwise quantization is deployed for KV cache updates, requiring lightweight metadata computation and buffering (Hu et al., 26 Feb 2025).
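The metadata cost in the first bullet is easy to quantify: with one 16-bit scale per group of 64 weights, the scales add 16/64 = 0.25 bits per weight, about 6% relative overhead on top of 4-bit weights (a back-of-the-envelope calculation, not a figure from the cited works):

```python
def metadata_overhead(bits_per_weight=4, scale_bits=16, group_size=64):
    """Extra bits per weight contributed by per-group scale storage."""
    extra = scale_bits / group_size          # amortized scale bits per weight
    return extra, extra / bits_per_weight    # absolute and relative overhead

extra, rel = metadata_overhead()
# 16-bit scale over 64 weights: 0.25 extra bits/weight, 6.25% relative.
```

Halving the group size doubles this overhead, which is one reason group sizes below 32 are rarely used for weight storage.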
6. Empirical Performance and Trade-offs
MGVQ methods consistently report accuracy close to full-precision baselines at aggressive bitwidths (4 bits), often with minimal overhead:
- LLMs: W4A8 groupwise quantization with adaptive numeric types (MANT) yields only a 0.12 PPL loss versus float on LLaMA and OPT, alongside energy reductions over state-of-the-art accelerators (Hu et al., 26 Feb 2025). Two-stage grid optimization narrows the gap to full precision even at INT3 (Kim et al., 2 Feb 2026).
- Vision Transformers: IGQ-ViT achieves consistent top-1 accuracy gains over prior PTQ methods at 4 bits while keeping the additional BOP overhead bounded (Moon et al., 2024).
- Generative Models: MGVQ-G8 reaches PSNR = 24.70 on ImageNet reconstruction, outperforming both classic VQ-GANs and continuous-latent VAEs. Ablations show near-100% code utilization and stable training even at large capacity (Jia et al., 10 Jul 2025, Zheng et al., 15 Oct 2025).
Smaller group size typically improves fidelity at the cost of additional metadata and potential statistical instability. The number of groups is commonly tuned to maximize utilization and minimize distortion (Zheng et al., 15 Oct 2025).
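The fidelity side of this trade-off can be measured directly. Below is a synthetic experiment with heavy-tailed weights and symmetric max-abs groupwise quantization; the exact numbers depend on the weight distribution, but the trend (smaller groups, lower error, more metadata) is general:

```python
import numpy as np

def group_mse(w, group_size, bits=4):
    """Reconstruction MSE of symmetric max-abs groupwise quantization."""
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    s = np.abs(g).max(axis=1, keepdims=True) / qmax
    s = np.where(s == 0, 1.0, s)
    q = np.clip(np.round(g / s), -qmax - 1, qmax)
    return float(np.mean((g - q * s) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=1 << 14)   # heavy-tailed, like LLM weights
errs = {gs: group_mse(w, gs) for gs in (16, 64, 256, 1024)}
# Smaller groups track local ranges better, so MSE shrinks as group size
# drops, while metadata (one scale per group) grows proportionally.
```

Heavy-tailed distributions amplify the effect: a single outlier inflates the scale of only its own small group rather than the whole tensor.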
7. Limitations, Variants, and Future Directions
MGVQ frameworks often assume that groupwise parameters can be folded efficiently into post-processing (such as BatchNorm), and they require access to per-group statistics during training or calibration. Very small groups may induce noisy statistical estimates or excessive metadata overhead (Yu et al., 2019). Supporting dynamic assignment on-device introduces marginal hardware complexity (Moon et al., 2024).
Variants include instance-aware grouping, hierarchical grouping strategies, adaptive selection of group count per layer, and integration of MGVQ with codebook resampling or self-extension for post-hoc capacity scaling (Zheng et al., 15 Oct 2025). Extending MGVQ concepts to other model components (e.g., LayerNorm, MLP outputs), and tightly coupling group assignment with quantization-aware training, represent ongoing areas of research (Moon et al., 2024, Jia et al., 10 Jul 2025).
References:
- (Hu et al., 26 Feb 2025) "M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type"
- (Kim et al., 2 Feb 2026) "Two-Stage Grid Optimization for Group-wise Quantization of LLMs"
- (Moon et al., 2024) "Instance-Aware Group Quantization for Vision Transformers"
- (Yu et al., 2019) "GDRQ: Group-based Distribution Reshaping for Quantization"
- (Jia et al., 10 Jul 2025) "MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization"
- (Zheng et al., 15 Oct 2025) "Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models"