
Hybrid-Granularity Quantization (InfiR2)

Updated 17 April 2026
  • Hybrid-granularity quantization (InfiR2) is a method that combines block-wise quantization for weights and token-wise quantization for activations to optimize FP8 training.
  • It computes a scale factor per quantization unit, rounded to a power of two (UE8M0), to maintain numerical fidelity while keeping runtime overhead low in transformer models.
  • Empirical results demonstrate near-lossless performance with improved throughput and reduced memory usage compared to traditional quantization schemes.

Hybrid-granularity quantization (“InfiR2”) refers to quantization schemes that apply different quantization granularities within deep neural networks, most prominently using block-wise quantization for weights and token- or row-wise quantization for activations. Such mixed-granularity strategies are designed to maximize the efficiency gains of aggressive quantization (e.g., in FP8 training) while minimizing the associated degradation of numerical fidelity and downstream model quality. The “InfiR2” methodology is characterized by its systematic integration of per-block and per-token quantization in the context of transformer-based LLMs, and is generalizable across both training and inference regimes for a variety of network architectures (Wang et al., 26 Sep 2025).

1. Motivation and Rationale for Hybrid-Granularity Quantization

Uniform granularity quantization (per-tensor) is known for architectural simplicity and implementation efficiency but fails to mitigate the adverse effects of large outlier magnitudes or highly variable activation ranges, which are especially pronounced in modern architectures such as transformers. Per-element or per-row quantization is optimal from a fidelity perspective but introduces substantial runtime and memory overheads due to scale management and extra storage.

The InfiR2 strategy addresses these limitations by employing block-wise quantization for “static” parameters such as network weights (partitioned into small fixed-size blocks, each assigned its own scale), and token-wise (per-row) quantization for “dynamic” activations (assigning unique scales to each token’s feature vector in a minibatch). This hybrid allocation of quantization granularity is grounded in the empirical distribution properties of neural network data and leverages hardware capabilities, for example, aligning with NVIDIA’s DeepGEMM FP8 matmul kernels (Wang et al., 26 Sep 2025).

2. Quantization Granularity and Scaling Procedures

The InfiR2 quantization procedure is distinguished by two main axes of granularity:

  • Block-wise (per-block) quantization for weights: The weight matrix $W \in \mathbb{R}^{m \times n}$ is divided into contiguous blocks $B_j$ of size $bs \times bs$. Each block is quantized independently via a symmetric, zero-point-free scheme.
  • Token-wise (per-row) quantization for activations: Activations $A \in \mathbb{R}^{\text{seq\_len} \times d}$ are treated as a sequence of row vectors, each corresponding to a token, with separate scale computations per row.

The scaling factor for a quantization unit (block or row) is determined as

$$S = \frac{\max_i |X_i|}{V_\mathrm{max}},$$

where $V_\mathrm{max}$ is the largest representable value in the FP8 target format. Quantization is performed as

$$Q(x) = \mathrm{clamp}\big(\mathrm{round}(x / S),\, -Q_\mathrm{max},\, Q_\mathrm{max}\big),$$

with dequantization by multiplication with $S$.
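
The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an FP8 E4M3 target (largest representable magnitude 448) and an illustrative block size of 128; the helper names are hypothetical rather than the paper’s API, and integer rounding stands in for the hardware round-to-nearest-FP8 cast:

```python
import numpy as np

V_MAX = 448.0   # largest magnitude representable in FP8 E4M3 (assumed target)
Q_MAX = V_MAX   # clamp bound from the formula above

def quantize_unit(x):
    """Symmetric, zero-point-free quantization of one unit (block or row)."""
    scale = max(np.abs(x).max() / V_MAX, 1e-12)      # guard all-zero units
    q = np.clip(np.round(x / scale), -Q_MAX, Q_MAX)  # round stands in for FP8 cast
    return q, scale

def block_quantize_weights(w, bs=128):
    """Per-block scales: each bs x bs tile of W gets its own scale.

    Assumes both dimensions of w are divisible by bs.
    """
    m, n = w.shape
    q = np.empty_like(w)
    scales = np.empty((m // bs, n // bs))
    for i in range(0, m, bs):
        for j in range(0, n, bs):
            q[i:i+bs, j:j+bs], scales[i // bs, j // bs] = quantize_unit(w[i:i+bs, j:j+bs])
    return q, scales

def token_quantize_activations(a):
    """Per-token (per-row) scales: one scale per row of A."""
    scales = np.maximum(np.abs(a).max(axis=1, keepdims=True) / V_MAX, 1e-12)
    q = np.clip(np.round(a / scales), -Q_MAX, Q_MAX)
    return q, scales
```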

For stability and efficiency, InfiR2 implements “UE8M0” scaling-factor rounding, ensuring that all computed scales are powers of two. This is achieved by computing the exponent of the floating-point scale, rounding it up, clamping it within allowable bounds, and reconstructing $S$ as $2^{\text{exponent}}$. Rounding the exponent up prevents under-scaling, and hence clipping, while introducing negligible quantization noise.

Scale factors are stored in an 8-bit exponent-only format (UE8M0) and can be broadcast efficiently during dequantization within hardware-accelerated kernels. Weight gradients and optimizer states remain in FP32. Activations and activation gradients are quantized as per-token with the same recipe used in the forward pass (Wang et al., 26 Sep 2025).
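
The power-of-two rounding itself can be sketched as follows; the exponent clamp bounds are assumptions for an 8-bit exponent field, not values stated in the paper:

```python
import math

def ue8m0_round(scale, min_exp=-127, max_exp=127):
    """Round a positive scale up to the nearest power of two (UE8M0 style).

    The exponent is rounded up (ceil) so the stored scale never
    under-estimates the true one, which would otherwise cause clipping.
    Exponent bounds are illustrative for an 8-bit exponent field.
    """
    exp = min(max(math.ceil(math.log2(scale)), min_exp), max_exp)
    return 2.0 ** exp, exp   # reconstructed scale and the stored exponent
```

For example, ue8m0_round(0.3) returns (0.5, -1): the scale is rounded up to $2^{-1}$, so quantized values can only shrink in magnitude and never clip.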

3. End-to-End Workflow and Algorithmic Integration

The hybrid-granularity quantization pipeline as instantiated in InfiR2 consists of:

Forward Pass

  • For each transformer layer:
    • Quantize FP32 master weights to FP8 using per-block scales.
    • Quantize FP32 activations to FP8 using per-token scales.
    • Perform GEMM in FP8 (using hardware kernels such as MatMul_FP8).
    • Dequantize outputs by multiplying with the corresponding block-wise and token-wise scales (a reference sketch follows this list).
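
A reference dequantize-then-matmul sketch of this forward pass, reusing the hypothetical helpers from the Section 2 snippet; a production kernel such as DeepGEMM keeps the operands in FP8 and broadcasts the scales during accumulation rather than materializing dequantized tensors:

```python
def fp8_linear_forward(a, w, bs=128):
    """Reference forward pass for Y = A @ W^T with hybrid-granularity FP8.

    Dequantizing first makes the arithmetic explicit; real FP8 kernels
    apply the block and token scales inside the GEMM accumulation.
    """
    qa, sa = token_quantize_activations(a)   # per-token scales, shape (seq, 1)
    qw, sw = block_quantize_weights(w, bs)   # per-block scales, shape (m/bs, n/bs)
    a_hat = qa * sa                                # per-token dequantization
    w_hat = qw * np.kron(sw, np.ones((bs, bs)))    # broadcast each block's scale
    return a_hat @ w_hat.T
```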

Backward Pass

  • Quantize activation gradients per-token, with the same recipe used in the forward pass.
  • Input gradients and weight gradients are similarly obtained by GEMM in FP8, with accumulation into FP32 master buffers (a reference sketch follows this list).
  • Optimizer updates and momenta are maintained in FP32, consistent with standard mixed-precision training practices.
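
Under the same assumptions as the forward sketch, the backward pass for $Y = AW^\top$ can be illustrated as below, with the weight gradient accumulated into an FP32 master buffer as the text describes:

```python
def fp8_linear_backward(grad_y, a, w, master_grad_w, bs=128):
    """Reference backward pass: grad_A = dY @ W, grad_W = dY^T @ A.

    grad_y is quantized per-token with the same recipe as activations;
    the weight gradient is accumulated into an FP32 master buffer.
    """
    qg, sg = token_quantize_activations(grad_y)
    qa, sa = token_quantize_activations(a)
    qw, sw = block_quantize_weights(w, bs)
    g_hat, a_hat = qg * sg, qa * sa
    w_hat = qw * np.kron(sw, np.ones((bs, bs)))
    master_grad_w += (g_hat.T @ a_hat).astype(np.float32)  # FP32 accumulation
    return g_hat @ w_hat                                   # input gradient
```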

This process is applied without fallback to higher-precision formats at any stage, covering both continual pre-training (e.g., over 160B tokens) and multi-stage supervised fine-tuning (e.g., InfiAlign-SFT) (Wang et al., 26 Sep 2025).

4. Empirical Performance: Fidelity and Computational Efficiency

Extensive validation demonstrates that InfiR2’s hybrid-granularity FP8 quantization achieves near-lossless parity with BF16 baselines across multiple tasks and architectures:

  • Numerical fidelity: Training and validation loss curves for FP8 are visually indistinguishable from BF16 for pre-training and SFT runs up to 160B tokens on 1.5B–7B parameter models.
  • Downstream evaluation: Across math reasoning and QA benchmarks (AIME24, GPQA), InfiR2 FP8 models match or outperform their BF16 counterparts, with differences of 1–2 accuracy points that typically fall within metric noise.
  • Efficiency: Wall-clock time reductions of 7–22%, peak memory usage reduced by 5–14%, and throughput increases of up to 19% are realized, attributed to the lower memory footprint and improved utilization of FP8 TensorCores.

Empirical results support the claim that block-wise and token-wise quantization, as opposed to coarser (per-tensor) or costlier (per-element) alternatives, provides an optimal trade-off between computational efficiency and preservation of model fidelity (Wang et al., 26 Sep 2025).

5. Comparative Analysis with Alternative Quantization Schemes

The InfiR2 hybrid-granularity approach occupies a strategic point between pure per-tensor quantization, which is fast but sacrifices accuracy due to outliers, and full per-element quantization, which is precise but resource-intensive. Key contrasts include:

  • Uniform per-tensor FP8 quantization: Faster to compute, but susceptible to quantization noise in the presence of large-magnitude outliers in weights or activations, resulting in accuracy loss.
  • Full per-element FP8 quantization for all tensors: Highest fidelity, but impractical for large models due to the explosion in scale metadata and computation.
  • BF16 and mixed-precision: Double the memory cost per tensor and roughly half the throughput compared to FP8 on compatible hardware.

A plausible implication is that hybrid-granularity quantization will remain preferable as long as hardware supports efficient scale broadcasting and as long as block sizes are chosen to balance hardware kernel alignment and numerical sensitivity.

6. Practical Considerations and Limitations

  • Overhead: The necessity of two sets of scale computations (block for weights, token for activations) introduces minimal but nonzero runtime overhead (~5% as reported).
  • Hardware dependency: Effective utilization depends on GPU architectures supporting block-wise and token-wise FP8 (e.g., NVIDIA Blackwell/Hopper families) with efficient kernel implementations.
  • Hyperparameter sensitivity: For very large models (e.g., 100B+ parameters), further tuning of schedules or clamping ranges may be required, although the published results indicate strong stability in the 1.5B–7B regime.

Hybrid-granularity quantization as formalized in the InfiR2 recipe generalizes beyond LLMs. Analogous mixed-granularity schemes are prominent in network compression pipelines using meta-learning (e.g., layer-wise hybrid bitwidth via MetaQuantNet and genetic search), as well as in post-training quantization of vision transformers where mixed loss granularities are fused for robustness (e.g., MGRQ) (Wang et al., 26 Sep 2025, Wang et al., 2020, Yang et al., 2024). The concept is also evident in contrastive retrieval systems, where hybrid codebooks at multiple levels improve both recall and efficiency (Wang et al., 2022).

The approach’s success in sustaining accuracy and accelerating large-scale neural network training at reduced cost suggests a likely trajectory toward standardization in FP8-centric model development pipelines, contingent on continued hardware evolution. The absence of formal optimality guarantees, and the method’s untested sensitivity at extremely large scales, mark areas of ongoing research (Wang et al., 26 Sep 2025).
