
FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

Published 19 Apr 2025 in cs.AR and cs.LG | (2504.14152v1)

Abstract: Quantization is a powerful tool to improve LLM inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision quantization hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. Our work makes the following contributions: 1) We develop a policy that uses the perturbation in each value, weighted by the Fisher information, to select which weight and activation blocks to keep in higher precision. This approach preserves accuracy by identifying which weight and activation blocks need to be retained in higher precision to minimize the perturbation in the model loss. 2) We also propose a sensitivity-weighted clipping approach for fine-grained quantization which helps retain accuracy for blocks that are quantized to low precision. 3) We then propose hardware augmentations to leverage the efficiency benefits of FGMP quantization. Our hardware implementation encompasses i) datapath support for FGMP at block granularity, and ii) a mixed-precision activation quantization unit to assign activation blocks to high or low precision on the fly with minimal runtime and energy overhead. Our design, prototyped using NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype, facilitates efficient FGMP quantization, attaining <1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference and requiring 30% less weight memory.

Summary

  • The paper introduces a fine-grained mixed-precision quantization method that leverages Fisher information to selectively retain critical weights and activations in high precision (FP8), achieving less than 1% perplexity degradation.
  • It proposes a sensitivity-weighted clipping approach and a dynamic precision assignment via a specialized post-processing unit to minimize quantization errors.
  • The custom hardware design, featuring a block-granularity VMAC datapath and a mixed-precision activation quantization unit, delivers 14% energy savings and a 30% reduction in weight memory relative to an all-FP8 baseline while maintaining near-baseline accuracy.


Introduction

The efficient inference of LLMs is increasingly critical due to their rapid growth in size and computational demands. Quantization offers a promising approach to enhance inference efficiency by leveraging low-precision datapaths, resulting in reduced energy consumption and memory footprint. However, standard low-precision quantization can degrade model accuracy, particularly in scenarios requiring post-training quantization (PTQ) where retraining is impractical. The paper "FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference" (2504.14152) addresses these challenges by introducing fine-grained mixed-precision (FGMP) quantization along with custom hardware support.

FGMP Quantization Methodology

The FGMP approach targets both weight and activation quantization at a fine-grained level, optimizing LLM inference without sacrificing model accuracy. Central to FGMP is its precision-assignment policy, which uses Fisher-information-weighted sensitivity to identify the blocks of weights and activations that should be retained in a higher-precision format (FP8), while the remaining blocks are quantized to a lower-precision format (NVFP4) (Figure 1).

Figure 1: Perplexity degradation versus compression rate for Llama-2-7B with 4-bit quantization, demonstrating FGMP's superior performance over existing methods.
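The precision-assignment policy can be illustrated with a short sketch. The NumPy code below is a minimal model, not the paper's implementation: it assumes block sensitivity is approximated as the Fisher-weighted squared quantization perturbation summed over each block, uses a crude FP4 (E2M1) magnitude grid as a stand-in for NVFP4, and the function names (`nvfp4_quantize`, `block_sensitivity`, `assign_precision`) and the `hi_fraction` budget are illustrative.

```python
import numpy as np

# Hypothetical FP4 (E2M1) magnitude grid; a crude stand-in for NVFP4 block quantization.
_FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize(block, clip_scale=1.0):
    """Scale a block onto the FP4-like grid and back. Only meant to produce a
    realistic quantization perturbation, not the exact NVFP4 encoding."""
    amax = np.abs(block).max() * clip_scale + 1e-12
    scaled = block / amax * _FP4_GRID[-1]
    mags = _FP4_GRID[np.argmin(np.abs(np.abs(scaled)[:, None] - _FP4_GRID), axis=1)]
    return np.sign(scaled) * mags * amax / _FP4_GRID[-1]

def block_sensitivity(block, fisher_block, clip_scale=1.0):
    """Fisher-weighted squared quantization perturbation of one block
    (one plausible form of the paper's sensitivity metric)."""
    perturbation = block - nvfp4_quantize(block, clip_scale)
    return float(np.sum(fisher_block * perturbation ** 2))

def assign_precision(tensor, fisher, block_size=16, hi_fraction=0.1):
    """Keep the most sensitive `hi_fraction` of blocks in FP8; quantize the rest to NVFP4."""
    blocks = tensor.reshape(-1, block_size)
    fish = fisher.reshape(-1, block_size)
    scores = np.array([block_sensitivity(b, f) for b, f in zip(blocks, fish)])
    n_hi = max(1, int(hi_fraction * len(scores)))
    tags = np.zeros(len(scores), dtype=np.uint8)   # 0 = NVFP4, 1 = FP8
    tags[np.argsort(scores)[-n_hi:]] = 1           # blocks with largest loss impact stay in FP8
    return tags
```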

This sensitivity-driven policy prioritizes blocks according to their impact on the model's output loss, ensuring that only essential blocks are kept in high precision. Additionally, FGMP incorporates a sensitivity-weighted clipping approach, further enhancing the representation accuracy of low-precision blocks by adjusting scaling factors to minimize quantization errors (Figure 2).

Figure 2: FGMP quantization strategy shown at block granularity, highlighting mixed-precision assignment for Layer 7 FC1 in the Llama-2-7B model.
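Sensitivity-weighted clipping can be sketched in the same spirit, reusing the helpers above: the clipping factor for each low-precision block is chosen to minimize the Fisher-weighted quantization error rather than the plain MSE. The candidate grid and the function name are assumptions, not the paper's exact search procedure.

```python
def sensitivity_weighted_clip(block, fisher_block, candidates=np.linspace(0.5, 1.0, 11)):
    """Pick the clipping factor (scaling of the block's absolute maximum) that
    minimizes the Fisher-weighted quantization error of the NVFP4 block."""
    best_scale, best_err = 1.0, float("inf")
    for c in candidates:
        err = block_sensitivity(block, fisher_block, clip_scale=c)
        if err < best_err:
            best_scale, best_err = c, err
    return best_scale
```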

Hardware Support for FGMP

The realization of FGMP's efficiency gains necessitates complementary hardware enhancements. The proposed hardware design includes a Vector Multiply-Accumulate (VMAC) datapath with support for executing mixed-precision operations at block granularity. This facilitates the use of low-precision computation for the majority of operations, with energy-efficient datapath choices based on the precision distribution (Figure 3).

Figure 3: High-level architecture featuring a PE array and a post-processing activation quantization unit for FGMP.
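A functional software model of block-granularity mixed-precision accumulation might look like the following. It assumes a block pair is routed to the FP8 path whenever either operand block is tagged high precision, and it uses a rough mantissa-rounding stand-in for FP8; neither detail is taken from the paper's datapath specification, and the function names (`fp8_quantize`, `mixed_precision_dot`) are illustrative.

```python
def fp8_quantize(block):
    """Very rough E4M3-style stand-in: keep roughly 3-4 mantissa bits per value."""
    mant, exp = np.frexp(block)
    return np.ldexp(np.round(mant * 16) / 16, exp)

def mixed_precision_dot(w_blocks, a_blocks, w_tags, a_tags):
    """Functional model of VMAC-style accumulation at block granularity:
    a block pair takes the FP8 path if either operand is tagged high
    precision, and the NVFP4 path otherwise."""
    acc = 0.0
    for w, a, wt, at in zip(w_blocks, a_blocks, w_tags, a_tags):
        if wt or at:
            acc += float(np.dot(fp8_quantize(w), fp8_quantize(a)))
        else:
            acc += float(np.dot(nvfp4_quantize(w), nvfp4_quantize(a)))
    return acc
```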

A specialized post-processing unit (PPU) assigns each activation block to high or low precision on the fly, based on a runtime sensitivity evaluation, enabling mixed-precision activation quantization during inference without substantial runtime overhead (Figure 4).

Figure 4: Post-processing unit diagram for dynamic mixed-precision quantization of activations.
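The PPU's on-the-fly assignment can likewise be modeled in software. The sketch below assumes a per-channel sensitivity proxy calibrated offline and a fixed threshold on the Fisher-weighted quantization error; the paper's exact runtime criterion and threshold selection may differ, and `ppu_assign_activations` is an illustrative name.

```python
def ppu_assign_activations(act, channel_sensitivity, block_size=16, threshold=1e-3):
    """Functional model of the post-processing unit: tag each activation block
    as FP8 or NVFP4 on the fly using a sensitivity-weighted error estimate."""
    blocks = act.reshape(-1, block_size)
    sens = channel_sensitivity.reshape(-1, block_size)
    quantized = np.empty_like(blocks)
    tags = np.empty(len(blocks), dtype=np.uint8)
    for i, (b, s) in enumerate(zip(blocks, sens)):
        q4 = nvfp4_quantize(b)
        err = float(np.sum(s * (b - q4) ** 2))
        if err > threshold:
            tags[i], quantized[i] = 1, fp8_quantize(b)   # sensitive block -> FP8
        else:
            tags[i], quantized[i] = 0, q4                # insensitive block -> NVFP4
    return quantized.reshape(act.shape), tags
```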

Experimental Results

Comprehensive evaluations demonstrate FGMP's effectiveness in maintaining model accuracy (perplexity) across several LLM architectures, such as Llama-2 and GPT-3, while significantly reducing energy consumption and memory usage. Applied to the Llama-2-7B model, FGMP incurs less than 1% perplexity degradation on Wikitext-103 relative to an all-FP8 baseline, while achieving 14% energy savings and a 30% reduction in weight memory (Figure 5).

Figure 5: Wikitext-103 perplexity results for various models and precision settings under FGMP quantization.

Conclusion

FGMP represents a significant advancement in LLM quantization, striking a balance between accuracy and computational efficiency. The synergy of sensitivity-aware quantization and hardware-level support allows LLMs to retain near-baseline accuracy with markedly reduced energy and memory footprints. As LLMs become more pervasive, the implications of such advancements cannot be overstated, paving the way for more sustainable and scalable AI deployments. Future work may explore extensions to other model architectures and further optimize hardware designs for even more nuanced precision adjustments.
