- The paper introduces a co-designed mixed-precision quantization algorithm with channel-wise adaptive precision assignment to minimize quantization error.
- It fuses reorder-and-quantize within GEMM kernels, leveraging FP4 Tensor Cores on NVIDIA Blackwell GPUs for significant acceleration.
- The method achieves competitive accuracy (over 95% of FP16 on zero-shot tasks) and scalability across diverse LLM applications and hardware platforms.
Introduction
MicroMix introduces a co-designed mixed-precision quantization algorithm and matrix multiplication kernel tailored for NVIDIA's Blackwell architecture, leveraging Microscaling (MX) data formats (MXFP4, MXFP6, MXFP8). The framework addresses the limitations of existing INT4-based quantization methods, which are unable to fully exploit the throughput of FP4 Tensor Cores due to format incompatibility. MicroMix achieves adaptive precision allocation at the channel level, guided by quantization error thresholds, and integrates a fused reorder-and-quantize operation to maximize kernel efficiency. The method demonstrates competitive or superior accuracy and significant speedups across diverse LLM tasks and hardware platforms.
Figure 1: MicroMix assigns bit-widths to channels based on absolute mean values, with online channel reordering for activations and offline reordering for weights.
Motivation and Design Principles
Adaptive Mixed-Precision Allocation
Existing mixed-precision quantization methods (e.g., Atom, QuaRot) employ a fixed number of high-precision channels per layer, failing to account for the heterogeneity in activation distributions across layers. MicroMix introduces a layer-wise adaptive allocation strategy, assigning higher precision to channels with larger absolute mean values, thereby minimizing quantization error and preserving model accuracy.
Figure 2: Channel-wise mean values of Llama3.1-8B activations, with outlier channels reordered; MicroMix adaptively assigns higher precision to outlier channels.
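The ranking statistic is straightforward to reproduce. Below is a minimal sketch, assuming calibration activations are collected as a (tokens × channels) matrix; the function names are illustrative rather than taken from the MicroMix codebase.

```python
import numpy as np

def channel_abs_means(calib_acts: np.ndarray) -> np.ndarray:
    """Per-channel mean absolute activation value.
    calib_acts: (num_tokens, num_channels) matrix from a calibration set."""
    return np.abs(calib_acts).mean(axis=0)

def rank_channels(calib_acts: np.ndarray) -> np.ndarray:
    """Channel indices sorted from smallest to largest |mean|; the channels
    at the end of this order are the outliers that receive higher precision."""
    return np.argsort(channel_abs_means(calib_acts))
```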
Hardware-Aware Kernel Efficiency
INT-based quantization kernels are bottlenecked by dequantization operations on CUDA Cores, which consume up to 85% of execution time on Blackwell GPUs. MicroMix instead leverages FP4 Tensor Cores and fuses dequantization directly into the GEMM, yielding substantial computational efficiency gains.
Figure 3: Runtime breakdown of matrix multiplication kernels on RTX 5090, A100, and RTX 4090; MicroMix fuses dequantization within GEMM, outperforming INT-based kernels.
Quantization Error Management
MicroMix defines explicit quantization thresholds for MXFP4 and MXFP6 formats, ensuring that channels assigned to lower precision do not exceed the quantization error upper bound of INT8. Channels violating these thresholds are reassigned to higher precision formats, effectively suppressing outlier-induced degradation.
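The promotion rule can be phrased as a per-channel comparison against an empirical INT8 error. The sketch below uses simplifying assumptions (symmetric 127-level INT8, maximum absolute error as the metric); `quantize_lowbit` stands in for an MXFP4 or MXFP6 quantizer supplied by the caller, since the exact MX quantizer is not reproduced here.

```python
import numpy as np

def int8_error(x: np.ndarray) -> float:
    """Empirical max error of symmetric per-channel INT8 quantization
    (an illustrative bound, not the paper's exact derivation)."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    xq = np.clip(np.round(x / scale), -127, 127) * scale
    return float(np.abs(x - xq).max())

def needs_promotion(x: np.ndarray, quantize_lowbit) -> bool:
    """True if the channel's low-precision (MXFP4/MXFP6) error exceeds the
    INT8 bound, in which case it is reassigned to a higher-precision format."""
    return float(np.abs(x - quantize_lowbit(x)).max()) > int8_error(x)
```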
Algorithmic Framework
Channel Partitioning and Precision Assignment
Activations are partitioned into three groups ($P_4$, $P_6$, $P_8$), quantized to MXFP4, MXFP6, and MXFP8, respectively. The assignment is determined by sorting channels according to their absolute mean values, computed from a calibration dataset. The quantization error for each channel is constrained by:

$$E(X_j)_{\{\mathrm{MXFP4},\,\mathrm{MXFP6}\}} \le E(X_j)_{\mathrm{INT8}},$$

where $E(X_j)_{\mathrm{INT8}}$ is the INT8 quantization error upper bound. The quantization threshold $T(n)$ for $n$-bit formats is derived to ensure this constraint.
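A minimal sketch of the resulting partition, assuming the thresholds $T(4)$ and $T(6)$ have already been derived and treating them as plain inputs; the names and the searchsorted-based split are illustrative.

```python
import numpy as np

def partition_channels(abs_means: np.ndarray, t4: float, t6: float):
    """Split channel indices into (P4, P6, P8) using calibration |mean| values.

    Channels below t4 are expected to satisfy the INT8 error bound in MXFP4,
    channels below t6 in MXFP6, and the remainder fall back to MXFP8.
    """
    order = np.argsort(abs_means)                 # smallest |mean| first
    sorted_means = abs_means[order]
    n4 = int(np.searchsorted(sorted_means, t4))   # channels quantized to MXFP4
    n6 = int(np.searchsorted(sorted_means, t6))   # channels in MXFP4 or MXFP6
    return order[:n4], order[n4:n6], order[n6:]
```

Concatenating the three index groups gives exactly the channel permutation that is applied online to activations and offline to weights (Figure 1).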
Offline Calibration and Online Reordering
Channel indices for each precision group are determined offline using calibration data, exploiting the observed stability of channel statistics across datasets and batch sizes. During inference, activations are reordered online to match the precomputed indices, while weights are reordered and quantized offline.
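Because the permutation is fixed offline, the online step reduces to a gather along the channel dimension; a sketch under that assumption:

```python
import numpy as np

def reorder_activations(x: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Online: gather activation channels into the precomputed order
    perm = concat(P4, P6, P8) just before per-group quantization.
    x: (tokens, in_channels)."""
    return x[:, perm]

def reorder_weights(w: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Offline: permute the matching input dimension of the weight matrix,
    then quantize it ahead of time.  w: (in_channels, out_features)."""
    return w[perm, :]
```

Applying the same permutation to the activation columns and the weight rows leaves the product `x @ w` unchanged, which is why reordering does not alter a layer's output.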
Fused Kernel Implementation
MicroMix integrates a fused reorder-and-quantize operation within the GEMM kernel, minimizing memory access irregularities and overhead. The kernel suite, built on CUTLASS, supports arbitrary mixing ratios of MXFP4, MXFP6, and MXFP8, with dequantization deeply fused into MMA instructions for maximal throughput.
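The fused kernel itself is CUTLASS-level code, but its arithmetic can be emulated unfused for clarity. The sketch below uses a toy per-32-element block quantizer as a stand-in for the MX element formats (real MXFP4/6/8 use floating-point elements with a shared E8M0 block scale handled inside the MMA pipeline); only activations are quantized here, and the group boundaries n4 and n6 are assumed given.

```python
import numpy as np

def fake_block_quant(x: np.ndarray, bits: int, block: int = 32) -> np.ndarray:
    """Toy stand-in for MX quantization: a symmetric integer grid with one
    scale per 32 contiguous elements along the channel (K) dimension."""
    levels = 2 ** (bits - 1) - 1
    out = x.astype(np.float64)
    for s in range(0, out.shape[-1], block):
        blk = out[..., s:s + block]
        scale = np.abs(blk).max(axis=-1, keepdims=True) / levels + 1e-12
        out[..., s:s + block] = np.round(blk / scale) * scale
    return out

def reference_mixed_gemm(x, w, perm, n4, n6):
    """Unfused reference of the fused kernel's arithmetic: reorder, quantize
    each channel group to its precision, then one GEMM over all groups.
    n4, n6: cumulative channel counts ending the P4 and P6 groups."""
    xr, wr = x[:, perm], w[perm, :]                  # reorder activations and weights
    xq = np.concatenate([
        fake_block_quant(xr[:, :n4], bits=4),        # P4 group -> MXFP4
        fake_block_quant(xr[:, n4:n6], bits=6),      # P6 group -> MXFP6
        fake_block_quant(xr[:, n6:], bits=8),        # P8 group -> MXFP8
    ], axis=1)
    return xq @ wr    # executed on FP4/FP6/FP8 Tensor Cores in the real kernel
```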
Experimental Results
Accuracy and Robustness
MicroMix retains over 95% of FP16 accuracy on zero-shot tasks and over 90% on five-shot MMLU, with perplexity degradation within 20%. On code generation (Human-Eval) and mathematical reasoning (GSM8K, MMLU-STEM, CMATH), MicroMix incurs less than 3% accuracy drop compared to FP16, and in some cases surpasses FP16 and FP8 baselines.
Efficiency and Scalability
MicroMix achieves 8–46% kernel-level acceleration over TensorRT-FP8 on RTX 5070Ti and 16–46% on RTX 5090. Transformer block execution speed increases by 6–29%, and end-to-end throughput improves by 3.5–9.7%, with peak memory usage reduced by ~20%. The fused reorder-and-quantize operation accounts for 7–17% of runtime, with FP4 GEMM dominating the compute profile.
Ablation Studies
Variants of MXFP6 (E2M3, E3M2) and MXFP8 (E4M3, E5M2) were evaluated; for MXFP6, the E3M2 layout yielded up to 5% higher accuracy than E2M3. The choice of calibration dataset (WikiText2, Pile, Human-Eval) had negligible impact on performance, with fluctuations within 1%.
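The gap between the two FP6 layouts is easier to see from their value grids: E3M2 trades mantissa precision for dynamic range. The sketch below enumerates the representable magnitudes of a small float format with no Inf/NaN codes; the exponent biases used (1 for E2M3, 3 for E3M2) follow the OCP MX element conventions as understood here and should be treated as assumptions.

```python
import numpy as np

def minifloat_grid(exp_bits: int, man_bits: int, bias: int) -> np.ndarray:
    """Non-negative magnitudes representable by a tiny float format
    (exponent code 0 encodes subnormals; no Inf/NaN, as in MX element types)."""
    vals = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:   # subnormal: no implicit leading 1
                vals.append((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:        # normal: implicit leading 1
                vals.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    return np.unique(np.array(vals))

e2m3 = minifloat_grid(exp_bits=2, man_bits=3, bias=1)  # finer steps, max 7.5
e3m2 = minifloat_grid(exp_bits=3, man_bits=2, bias=3)  # wider range, max 28.0
print(e2m3.max(), e3m2.max())
```

The wider range of E3M2 leaves more headroom within each 32-element block before values saturate, which is one plausible reason it fares better here.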
Implementation Considerations
- Hardware Requirements: MicroMix is optimized for NVIDIA Blackwell architecture, requiring FP4 Tensor Core support. The kernel suite is built on CUTLASS and TensorRT-LLM.
- Memory Efficiency: Average bit-width per element is 5.2–5.5 across tested models, with significant memory savings over FP16 (a back-of-the-envelope calculation follows this list).
- Deployment: Offline calibration and weight quantization are one-time costs; online activation reordering is efficiently fused with quantization.
- Scalability: The method generalizes across Llama and Qwen model families, batch sizes, and sequence lengths.
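As a back-of-the-envelope check on the average bit-width cited in the Memory Efficiency bullet, the snippet below uses a hypothetical mixing ratio, not one reported in the paper, and adds the overhead of the shared 8-bit scale that MX formats attach to every 32-element block.

```python
# Hypothetical mixing ratio: 45% MXFP4, 35% MXFP6, 20% MXFP8 channels.
frac4, frac6, frac8 = 0.45, 0.35, 0.20
element_bits = frac4 * 4 + frac6 * 6 + frac8 * 8    # = 5.5 bits per element
scale_overhead = 8 / 32                             # E8M0 scale shared by 32 elements
print(element_bits, element_bits + scale_overhead)  # 5.5 and 5.75 bits/element
```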
Implications and Future Directions
MicroMix demonstrates that hardware-algorithm co-design, leveraging MX data formats and adaptive precision allocation, can substantially improve both accuracy and efficiency of LLM inference. The explicit management of quantization error via thresholding and channel-wise adaptivity sets a new standard for mixed-precision quantization. Future work may explore dynamic online calibration, integration with training pipelines, and extension to other hardware platforms supporting flexible precision formats. The approach is also amenable to further granularity in precision assignment and could be combined with pruning or low-rank compression for additional gains.
Conclusion
MicroMix provides an efficient, adaptive mixed-precision quantization framework for LLMs, co-designed with hardware capabilities of modern GPUs. By integrating quantization error management, offline calibration, and fused kernel operations, MicroMix achieves state-of-the-art accuracy and significant speedups over existing baselines. The method is robust across tasks, models, and datasets, and sets a foundation for future research in hardware-aware quantization and efficient LLM deployment.