Hierarchical Group Quantization (HGQ)

Updated 22 December 2025
  • HGQ is a quantization framework that uses coarse-grained FP16 scaling and fine-grained integer shifts to optimize low-bit transformer inference.
  • It minimizes costly FP operations by applying a hierarchical dual-level scaling, resulting in significant energy and area reductions.
  • Empirical results on transformer workloads demonstrate that HGQ retains high accuracy while reducing dequantization energy by over 36% and hardware area by 20%.

Hierarchical Group Quantization (HGQ) is a quantization framework designed to optimize low-bit (e.g., INT4) inference in transformer-based models. By introducing a dual-level scaling hierarchy—combining coarse-grained floating-point scaling and fine-grained shift-only corrections—HGQ maintains the accuracy of small-group quantization while significantly reducing computational and memory overhead. The methodology is formalized and evaluated in the context of the SeVeDo heterogeneous accelerator, where HGQ drives notable energy and area reductions without incurring substantial accuracy loss (Choi et al., 15 Dec 2025).

1. Motivation and Problem Setting

Aggressive quantization of transformer activations to low bitwidths (such as INT4) is highly effective for reducing the memory footprint and arithmetic workload associated with large models. However, this approach is sensitive to activation outliers, which can cause severe accuracy degradation when a single global quantization scale is applied naïvely. Conventional group quantization mitigates this by dividing activations into smaller groups, each with its own floating-point (FP) scale, thus providing tighter dynamic-range bounds per group. This improves accuracy but multiplies the number of INT-to-FP dequantization operations, each of which is far more expensive in energy and area than its integer counterpart (up to 67.5× the power and 49.3× the area of an INT4 MAC).

HGQ addresses this conflict by combining two levels of scaling:

  • Coarse-grained Base Scaling Factor (BSF): FP scaling at a large group size $G_{base}$ (e.g., 128 elements).
  • Fine-grained Exponent-Shifted Scaling Factor (ESSF): Per-subgroup integer bit-shifts at $G_{sub}$ (e.g., 32 elements), implemented as shifts relative to the BSF.

This hierarchical approach is designed to minimize the frequency and cost of FP operations while retaining the dynamic-range adaptivity of small-group quantization (Choi et al., 15 Dec 2025). A rough metadata-overhead comparison under the example group sizes is sketched below.
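As a back-of-the-envelope illustration (not a figure reported in the paper), assume the example group sizes above and the 2-bit ESSF field described in Section 2. Conventional per-group FP16 scaling at $G = 32$ stores one FP16 scale per 32 elements, whereas HGQ stores one FP16 BSF per 128 elements plus one 2-bit ESSF per 32 elements, so the scale-metadata overhead drops from

$$\frac{16}{32} = 0.5 \ \text{bits/element} \qquad \text{to} \qquad \frac{16}{128} + \frac{2}{32} \approx 0.19 \ \text{bits/element}.$$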

2. Hierarchical Group Quantization: Algorithm and Formalization

Let $a \in \mathbb{R}^n$ denote the activation vector to be quantized. The formal steps of HGQ are as follows (a NumPy sketch implementing these steps appears after the list):

  1. Partitioning:
    • $a$ is divided into $\lfloor n/G_{base} \rfloor$ base-groups of size $G_{base}$.
    • Each base-group is further split into $k = G_{base}/G_{sub}$ sub-groups of size $G_{sub}$.
  2. Base Scaling Factor (BSF): For each base-group $g$,

$$S_{base} = \frac{\max_{i \in g} |a_i|}{2^{b-1}-1}$$

where $b$ is the quantization bitwidth (e.g., $b=4$, $Q_{max} = 7$). $S_{base}$ is stored in FP16.

  3. Exponent-Shifted Scaling Factor (ESSF): For each sub-group $j$ within base-group $g$,

$$m_j = \max_{i \in \text{sub}_j} |a_i|, \qquad E_j = \mathrm{clip}\!\left( \left\lfloor \log_2\!\left( m_j / S_{base} \right) \right\rceil,\ -E_{max},\ E_{max} \right)$$

$E_j$ (stored in 2 bits) yields an integer in $[-2, 2]$.

  4. Quantization: The effective sub-group scale is $S_j = S_{base} \cdot 2^{E_j}$.

$$q_{j,i} = \mathrm{clip}\!\left( \left\lfloor \frac{a_{j,i}}{S_j} \right\rceil,\ -Q_{max},\ Q_{max} \right)$$

  5. Two-Phase Dequantization and Accumulation:
    • Phase 1 (Integer domain):

    $$\alpha_j = \sum_{i \in \text{sub}_j} \left( q_{j,i} \ll E_j \right)$$

    where "Ej\ll E_j" denotes bitwise left-shifting by EjE_j (efficient barrel shift). - Phase 2 (FP domain):

    $$y_g = S_{base} \cdot \sum_{j=1}^{k} \alpha_j$$

    Only one FP16×INT multiplication per base-group is required, in contrast to $k$ such operations under traditional group quantization.
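The following NumPy sketch implements these formulas for a single base-group. It is an illustrative reference, not the SeVeDo implementation: the group sizes, bitwidth, and exponent range are the example values quoted above ($G_{base}=128$, $G_{sub}=32$, $b=4$, $E_{max}=2$), and all function and variable names are ours. Negative $E_j$ values are handled here as arithmetic right shifts, which approximate multiplication by $2^{E_j}$ in the integer domain.

```python
import numpy as np

G_BASE, G_SUB = 128, 32          # example group sizes from the text
K = G_BASE // G_SUB              # k sub-groups per base-group
B, E_MAX = 4, 2                  # INT4 quantization, 2-bit ESSF range
Q_MAX = 2 ** (B - 1) - 1         # 7

def hgq_quantize(a):
    """Quantize one base-group `a` (length G_BASE) -> (q, E, S_base)."""
    s_base = np.float16(max(np.abs(a).max() / Q_MAX, 1e-8))       # BSF, kept in FP16
    sub = a.reshape(K, G_SUB)
    m = np.abs(sub).max(axis=1)                                    # per-sub-group maxima
    e = np.clip(np.rint(np.log2(np.maximum(m, 1e-8) / float(s_base))),
                -E_MAX, E_MAX).astype(np.int32)                    # ESSF exponents
    s_sub = float(s_base) * np.exp2(e)                             # S_j = S_base * 2^{E_j}
    q = np.clip(np.rint(sub / s_sub[:, None]), -Q_MAX, Q_MAX).astype(np.int8)
    return q, e, s_base

def hgq_dequant_accumulate(q, e, s_base):
    """Two-phase reduction of one base-group (Section 2, step 5)."""
    total = 0
    for j in range(K):
        qj, ej = q[j].astype(np.int32), int(e[j])
        # Phase 1 (integer domain): barrel shift by E_j, then accumulate.
        shifted = qj << ej if ej >= 0 else qj >> -ej
        total += int(shifted.sum())
    # Phase 2 (FP domain): the single FP16 x INT multiply per base-group.
    return float(s_base) * total

# Check: the two-phase result matches a conventional per-sub-group
# dequantization (k FP multiplies) up to truncation from negative shifts.
rng = np.random.default_rng(0)
a = rng.standard_normal(G_BASE).astype(np.float32)
q, e, s_base = hgq_quantize(a)
two_phase = hgq_dequant_accumulate(q, e, s_base)
per_group = sum(float(s_base) * 2.0 ** int(e[j]) * int(q[j].astype(np.int64).sum())
                for j in range(K))
print(two_phase, per_group)
```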

3. Hardware Integration and Residual Matrix Core (RMC)

HGQ operates synergistically with the Residual Matrix Core (RMC) architecture, an INT4 tensor processing array extended with a Hierarchical Quant Unit:

  • Processing Elements (PEs):

    • Each PE accepts INT4 $q_{j,i}$ inputs and applies programmable barrel shifts (according to $E_j$).
    • Accumulation is performed in 8–10 bit integer registers.
    • Once sub-groups are processed, a single FP16 multiplication implements the base-group scale.
  • Control Logic:
    • At group start, $S_{base}$ is loaded into the local FP16 multiplier.
    • For each sub-group, $E_j$ is loaded into the shifter, and quantized values are accumulated.
    • After processing all sub-groups in a base-group, accumulation transitions to the FP16 × INT operation.

The net effect is that only one FP16×INT dequantization per $G_{base}$ activations is required, as opposed to one per $G_{sub}$ in standard group quantization. Shift logic is substantially cheaper in power and area than FP multiplication. Minor approximation error from ESSF is tightly bounded and rendered negligible after upstream SVD outlier suppression (Choi et al., 15 Dec 2025).
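To make the sequencing concrete, the control flow described above can be modeled in a few lines of Python, reusing the `q`, `e`, `s_base` arrays produced by the sketch in Section 2. This is a toy software model; accumulator widths, pipelining, and other RTL details of the actual RMC are not represented.

```python
def rmc_base_group(q, e, s_base):
    """Toy model of the per-PE control sequence for one base-group: the BSF is
    latched once, each sub-group is shift-accumulated in the integer domain,
    and a single FP16 x INT multiply closes out the group."""
    acc = 0                                      # integer accumulator
    for e_j, q_j in zip(e, q):                   # one pass per sub-group
        e_j = int(e_j)                           # E_j loaded into the barrel shifter
        for q_i in q_j.tolist():                 # INT4 values streamed through the PE
            acc += (q_i << e_j) if e_j >= 0 else (q_i >> -e_j)
    return float(s_base) * acc                   # the only FP operation in the group
```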

4. Quantitative Evaluation: Performance, Energy, and Area

Empirical results on transformer workloads demonstrate:

| Method (INT4, FP scaling) | Energy (rel.) | Area (rel.) | ViT Top-1 (%) / Llama PPL |
|---|---|---|---|
| G=32, per-group FP16 | 100% | 100% | 84.27 / 6.03 |
| HGQ (G_sub=32, G_base=128, E2/FP16) | 63.9% | 80.0% | 84.18 / 6.14 |

HGQ reduces dequantization energy by 36.1% and hardware area by 20.0% relative to conventional per-group FP16 scaling, at an accuracy cost of less than 0.12 perplexity (Llama, Wikitext-2) and less than 0.1% Top-1 (ViT, ImageNet-21k). Across multiple model sizes, HGQ matches or outperforms other multi-scale quantization schemes (e.g., MXFP4, MXINT4, NVFP4) at lower resource cost (Choi et al., 15 Dec 2025).

5. Analysis of Algorithmic and Hardware Trade-Offs

HGQ structurally decouples precision management via a hybrid scaling approach:

  • Accuracy Retention: BSF delivers exact scaling for the dominant dynamic range per base-group, while ESSF introduces only minor, bounded approximation through inexpensive shifts.
  • Dequantization Cost Reduction: Compared to fine-grained per-group FP16 scaling, the vast majority of floating-point multiplies are replaced by shifts and integer operations; only a $1/k$ fraction of multiplications remains (see the counting argument after this list).
  • Impact of SVD Preprocessing: SVD upstream mitigates large activation outliers, ensuring that $m_j$ within any sub-group remains close to $S_{base}$, thereby minimizing the quantization error introduced by shift-only ESSF.
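Concretely, for a tensor of $n$ activations and the example group sizes, counting only dequantization multiplies (a simple bookkeeping argument; the reported energy and area figures also reflect other datapath costs):

$$\frac{\text{FP16}\times\text{INT multiplies under HGQ}}{\text{FP16}\times\text{INT multiplies under per-group FP16}} = \frac{n/G_{base}}{n/G_{sub}} = \frac{G_{sub}}{G_{base}} = \frac{1}{k} = \frac{1}{4},$$

with the remaining $n/G_{sub}$ per-sub-group corrections carried out as barrel shifts in the integer domain.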

The overall result is that INT4-level arithmetic throughput is achieved at much lower energy and area cost, with inference accuracy remaining virtually at the level of fine-grained FP-scaling group quantization.

6. Context, Comparisons, and Implications

HGQ achieves its results within SeVeDo, a heterogeneous SVD-based transformer accelerator, but the quantization regime is not architecture-specific. The hierarchical quantization process is compatible with a wide range of quantized transformer backends, provided upstream outlier suppression (e.g., via SVD) ensures well-bounded subgroup maxima. This suggests applicability in both language (Llama) and vision (ViT) models at low bitwidths.

A key insight is that hardware–algorithm co-design is central: the two-phase dequantization flow in RMC aligns exactly with the algorithmic structure of HGQ, maximizing resource utilization and energy efficiency. A plausible implication is that similar hierarchical quantization–shifter schemes can generalize to other linear algebra workloads where outlier-sensitive quantization is needed, given appropriate dynamic range suppression (Choi et al., 15 Dec 2025).

7. Summary and Prospective Considerations

Hierarchical Group Quantization is a principled grouping and scaling method for low-bit deep learning inference. It combines exact per-base-group FP16 scaling with ultra-cheap exponent shifts over small subgroups, reducing dequantization overhead by over a third in energy and a fifth in area compared to conventional per-group FP scaling, all while preserving the high accuracy associated with fine-grained group quantization (Choi et al., 15 Dec 2025).
