MXINT8: Block-Based Integer Format
- MXINT8 is a fine-grained, block-based integer format that employs per-block scaling with a shared exponent and 8-bit per-element integers to optimize memory and computation.
- The method achieves near-lossless inference and training accuracy with less than 0.3% deviation from FP32 while reducing memory usage by approximately 3.9×.
- MXINT8 integrates seamlessly into integer dataflow pipelines, enabling efficient hardware implementations with up to 3.7× speedup and significant energy savings on edge devices.
MXINT8 is a fine-grained, block-based integer data format designed for high-efficiency deep learning inference and training. It belongs to the Microscaling (MX) family, characterized by per-block scaling, integer mantissas, and exponent sharing, providing a superior trade-off between algorithmic accuracy, hardware simplicity, dynamic range, and memory efficiency compared to conventional per-tensor quantization and narrow floating-point (FP) alternatives. MXINT8 underpins both state-of-the-art hardware implementations and algorithmic methods for low-bitwidth neural network representation on resource-constrained and accelerator platforms.
1. Definition and Data Representation
MXINT8 encodes blocks (typically $k = 32$ elements) of real values as pairs $(s, \{q_i\})$:
- $s$: a shared block scale, typically stored as an 8-bit exponent $e$ in E8M0 format (i.e., $s = 2^e$, an exact power of two with zero mantissa).
- $q_i$: a signed 8-bit integer per element, $q_i \in [-128, 127]$ (using the symmetric range $[-127, 127]$ for training).
- Reconstruction: $\hat{x}_i = s \cdot q_i$.
Formally: $\hat{x}_i = 2^e q_i$ with $q_i = \mathrm{round}(x_i / 2^e)$, $e$ chosen so that every $q_i$ fits in $[-127, 127]$ over the block (Rouhani et al., 2023, Chen et al., 29 Oct 2025).
Each block thus requires $8 + 32 \times 8 = 264$ bits for 32 values, an average of $8.25$ bits/value, achieving a $\approx 3.9\times$ reduction in memory compared to FP32. The per-block scaling enables the full use of the INT8 dynamic range for every block, minimizing representational error for high-variance data.
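The storage arithmetic above can be checked directly; a minimal sketch using the block size and bit widths as defined:

```python
# Per-block storage cost of MXINT8 (block size 32): one shared 8-bit
# E8M0 exponent plus one signed 8-bit integer per element.
BLOCK_SIZE = 32
SCALE_BITS = 8
ELEM_BITS = 8

bits_per_block = SCALE_BITS + BLOCK_SIZE * ELEM_BITS  # 8 + 256 = 264
bits_per_value = bits_per_block / BLOCK_SIZE          # 8.25
reduction_vs_fp32 = 32 / bits_per_value               # ~3.88x
```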
2. Quantization, Dequantization, and Conversion Pipelines
Analytic Conversion (Direct-Cast)
The canonical quantization pipeline for float-to-MXINT8 conversion is as follows (Rouhani et al., 2023, Gorodecky et al., 2024, Chen et al., 29 Oct 2025):
- Shared Scale Selection: Compute $a_{\max} = \max_i |x_i|$ over each block of 32, round $a_{\max}/127$ up to the nearest power of two $2^e$, and use $s = 2^e$ (E8M0 encoding).
- Quantization: $q_i = \mathrm{clip}(\mathrm{round}(x_i / s), -127, 127)$.
- Block Packing: Store the 8-bit exponent $e$, then each 8-bit $q_i$.
Dequantization is simply: $\hat{x}_i = s \cdot q_i$.
This approach requires no calibration or quantization-aware retraining and enables near-lossless inference for standard tasks (Rouhani et al., 2023, Chen et al., 29 Oct 2025).
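As an illustration, the direct-cast pipeline can be sketched in a few lines of Python. This is an unoptimized reference model, not the cited CUDA MX library kernels, and the choice of $e$ below is one reasonable way to satisfy the fit constraint:

```python
import math

def quantize_mxint8(block):
    """Direct-cast a float block to MXINT8: one shared power-of-two
    scale 2**e plus a signed 8-bit integer per element. Reference
    model only; production converters operate on raw FP bit fields."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest e with amax / 2**e <= 127, i.e. amax/127 rounded up
    # to the next power of two.
    e = math.ceil(math.log2(amax / 127.0))
    q = [max(-127, min(127, round(x / 2.0 ** e))) for x in block]
    return e, q

def dequantize_mxint8(e, q):
    return [math.ldexp(qi, e) for qi in q]  # 2**e * q_i

block = [0.011, -0.25, 0.5, 1.9]
e, q = quantize_mxint8(block)
xhat = dequantize_mxint8(e, q)
```

With the 1.9 outlier in the block, $e = -6$ and the reconstruction error stays below half a quantization step for every element.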
Hardware Pipelines
Efficient hardware realizations use combinational datapaths:
- Max-exponent finder: A comparator tree computes $e_{\max} = \max_i e_i$ over the per-element exponents.
- Scale generation: Computes the shared E8M0 exponent $e$ from $e_{\max}$ and handles NaN/Inf.
- Per-lane quantization: Each lane forms from sign, local exponent, and mantissa bits with round-to-nearest-even (Gorodecky et al., 2024).
FPGAs implement this flow at 19.8M vectors/sec with pure combinational logic (no BRAM/DSP), requiring 1,614 LUTs for 32-lane conversion (Gorodecky et al., 2024).
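A software model of this combinational flow is sketched below. Real hardware operates on raw sign/exponent/mantissa bit fields (`math.frexp` is a stand-in here), and the shared-exponent convention $e = e_{\max} - 6$ is a hypothetical choice that keeps the largest element inside $[-127, 127]$:

```python
import math

def mxint8_convert_hw_style(block):
    """Model of the combinational FP->MXINT8 datapath: comparator tree
    over element exponents, shared-scale generation, per-lane
    shift-and-round. Assumption: e = e_max - 6 as the scale convention."""
    # Max-exponent finder: frexp(x) = (m, p) with 0.5 <= |m| < 1,
    # so the unbiased exponent of x is p - 1.
    exps = [math.frexp(x)[1] - 1 for x in block if x != 0.0]
    if not exps:
        return 0, [0] * len(block)
    e = max(exps) - 6                     # scale generation (E8M0 exponent)
    q = []
    for x in block:
        r = round(math.ldexp(x, -e))      # shift-and-round per lane (RNE)
        q.append(max(-127, min(127, r)))  # clamp covers the +128 corner
    return e, q

e, q = mxint8_convert_hw_style([1.0, -0.5, 0.03125, 0.75])
```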
3. Training and Symmetric Clipping
MXINT8 supports both static and quantization-aware training (QAT), provided symmetric quantization is enforced (Chen et al., 29 Oct 2025):
- Two's-complement asymmetry (INT8's $[-128, 127]$) introduces gradient bias; a symmetric clamp to $[-127, 127]$ eliminates this and ensures unbiased updates.
- Six quantizations per linear layer are typical in training, two operands for each of the three GEMMs: weights and activations (forward), weights and output-gradients (input-gradient GEMM), and activations and output-gradients (weight-gradient GEMM).
- The straight-through estimator is used for gradients; accumulations remain in FP32 during training.
Near-lossless accuracy (within $0.3\%$ of FP32) is achievable for both inference and training across a range of model scales and tasks, e.g., LLMs and vision models (Chen et al., 29 Oct 2025, Wu, 2020).
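The symmetric-clamp point can be seen in a toy fake-quantizer (illustrative only, not the cited training recipe):

```python
def fake_quant(x, e, lo=-127, hi=127):
    """Quantize-dequantize with power-of-two scale 2**e and a clamp
    range; MXINT8 training uses the symmetric range [-127, 127]."""
    q = max(lo, min(hi, round(x / 2.0 ** e)))
    return 2.0 ** e * q

# With the full two's-complement range [-128, 127] the grid is
# asymmetric: -128 is representable but +128 is not, so +x and -x
# are treated differently at the edge, which biases gradients.
asym_neg = fake_quant(-2.0, -6, lo=-128)  # reaches -128
asym_pos = fake_quant(2.0, -6, lo=-128)   # clips at +127
# Symmetric clamp restores fake_quant(-x) == -fake_quant(x):
sym_neg = fake_quant(-2.0, -6)
sym_pos = fake_quant(2.0, -6)
```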
4. Algorithmic Accuracy, Task Performance, and Comparisons
Empirical studies on >20 benchmarks (ImageNet, LLaMA, GPT-3, transformer tasks) show:
- MXINT8 matches FP32 in direct-cast inference within a $0.1$–$0.3\%$ accuracy margin (Rouhani et al., 2023, Chen et al., 29 Oct 2025).
- Wins over blockwise FP8 (E4M3, E5M2) for 8-bit block-32 configurations; at 4-bit, FP4 can be more robust without Hadamard rotation (Chen et al., 29 Oct 2025).
- Training over Llama-style models: MXINT8, BF16, and MXFP8 track closely in loss and accuracy, with MXINT8 slightly outperforming on most tasks (Chen et al., 29 Oct 2025).
- On edge hardware, INT8 pipelines (e.g., IntAttention) achieve up to $3.7\times$ speedup and significant energy savings with negligible accuracy drop, due to complete avoidance of dequantize–requantize overheads (Zhong et al., 26 Nov 2025).
Key comparative points:
| Format | Block Size | Accuracy (Δ FP32) | Area/Energy Rel. FP8 | Best Use Case |
|---|---|---|---|---|
| MXINT8 | 32 | <0.1–0.3% | 0.79×/0.63× (Chen et al., 29 Oct 2025) | General, LLMs |
| MXFP8 | 32 | <0.3% | 1.0×/1.0× | FP-dominated |
| NVINT4 | 16 | <0.4% (with Hadamard rotation) | n/a | Extreme low-bit |
MXINT8 has a fundamental advantage for distributions with moderate crest factor, typical of deep learning layers at block size 32, and suffers less from outlier-induced overflow than per-tensor INT8 (Chen et al., 29 Oct 2025).
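The outlier effect can be reproduced with a toy comparison of per-tensor vs. per-block scaling (illustrative numbers, not drawn from the cited paper):

```python
import math

def block_scale(block):
    """Power-of-two scale so max |q| <= 127 (MX-style, per block)."""
    amax = max(abs(x) for x in block)
    return 2.0 ** math.ceil(math.log2(amax / 127.0))

def mean_abs_error(values, scale):
    """Mean |x - x_hat| for INT8 quantization at a fixed scale."""
    err = 0.0
    for x in values:
        q = max(-127, min(127, round(x / scale)))
        err += abs(x - scale * q)
    return err / len(values)

# A tensor of two 32-element blocks: one ordinary small-magnitude
# block and one block containing a 10.0 outlier.
small = [0.01 * ((i % 7) - 3) for i in range(32)]
outliers = [0.01] * 31 + [10.0]

# Per-tensor INT8: a single scale must cover the outlier, so the
# small block loses nearly all resolution.
err_per_tensor = mean_abs_error(small, block_scale(small + outliers))
# Per-block MXINT8: the small block gets its own scale.
err_per_block = mean_abs_error(small, block_scale(small))
```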
5. Hardware Implementation and Architectural Integration
MXINT8's structure is well-suited for accelerator integration:
- Shared exponent per block allows integer block-wise MAC without frequent normalization or shift-align logic (Cheng et al., 2023, Cuyckens et al., 9 Nov 2025).
- 8×8 hybrid MAC arrays exploit integer multiplication and accumulation for each block, with post-accumulation scaling.
- Integer accumulation eliminates FP alignment, cuts area and dynamic power—energy efficiency for MXINT8 measured at 657 GOPS/W at 64 GOPS throughput in SNAX NPU (22FDX, 500 MHz) (Cuyckens et al., 9 Nov 2025).
- Conversion on FPGAs and NPUs can be pipelined, operating at vector-level throughput with minimal control overhead (Gorodecky et al., 2024, Cuyckens et al., 9 Nov 2025).
Compared to fixed-point and per-tensor INT8, MXINT8 avoids precision loss under high local dynamic range, with only $1.2\times$ area overhead vs. INT8 and negligible energy increase (Cheng et al., 2023).
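The block-wise MAC structure reduces to an integer dot product plus a single power-of-two post-scale; a reference sketch:

```python
import math

def mxint8_dot(e_a, qa, e_b, qb):
    """Dot product of two MXINT8 blocks: pure INT8 x INT8 -> INT32
    accumulation, then one power-of-two post-scale. No per-element FP
    alignment is needed, which is what hybrid MAC arrays exploit.
    For k = 32, |acc| <= 32 * 127 * 127 < 2**31, so INT32 suffices."""
    acc = 0
    for a, b in zip(qa, qb):
        acc += a * b                   # integer multiply-accumulate
    return math.ldexp(acc, e_a + e_b)  # acc * 2**(e_a + e_b)

# Two tiny blocks: (e=-6, [64, -32]) represents [1.0, -0.5];
# (e=-7, [10, 3]) represents [0.078125, 0.0234375].
out = mxint8_dot(-6, [64, -32], -7, [10, 3])
```

Because both scales are powers of two, the result is bit-exact with the FP dot product of the reconstructed values.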
6. Integration with Integer Dataflow and Transformations
MXINT8 enables end-to-end integer execution. Recent developments extend integer dataflow throughout the transformer block:
- Attention pipelines (such as IntAttention) keep all major matrix-multiplies, softmax surrogates (IndexSoftmax), and normalization in integer, with integer LUTs and normalization, supporting full plug-and-play deployment without retraining (Zhong et al., 26 Nov 2025).
- Integer transformers employ integer-friendly nonlinearities (e.g., polynomial attention, L1 norm layernorm) and propagate scales directly, entirely within INT8/INT32 domains except on rare overflow (Lin et al., 2020).
This enables deployment of large models on edge devices, yielding $3$–$4\times$ speedups and $4\times$ model compression while retaining essentially all baseline accuracy.
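To illustrate the integer-LUT style such pipelines use (this is not the published IndexSoftmax, just a generic fixed-point sketch), a softmax surrogate can shift integer scores so the maximum is zero, look up a power-of-two "exp" in a small table, and normalize by integer division:

```python
SHIFT = 16  # fraction bits of the fixed-point probabilities
LUT = [round((1 << SHIFT) * 2.0 ** (-i)) for i in range(32)]  # 2^-i table

def int_softmax(scores):
    """scores: integer logits interpreted on a log2 scale (a base-2
    surrogate keeps the LUT tiny). Returns fixed-point probabilities
    summing to ~2**SHIFT; everything stays in integer arithmetic."""
    m = max(scores)
    w = [LUT[min(31, m - s)] for s in scores]  # ~2^(s - m), fixed point
    total = sum(w)
    return [(wi << SHIFT) // total for wi in w]

p = int_softmax([3, 1, 1])
```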
7. Practical Programming, Inference Engines, and Compiler Orchestration
- Programming mixed-precision MXINT8 inference on RISC-V or ARM CPUs leverages status-based SIMD instructions, allowing per-layer bitwidth selection from a status register, supporting run-time reconfigurability without ISA expansion (Ottavi et al., 2020).
- Dataflow compilers (e.g., MASE) optimize per-tensor mantissa widths and MXINT8 block shapes to maximize accuracy and minimize area/throughput at the compiler IR level, with automated hardware RTL emission for MXINT8 operations (Cheng et al., 2023).
- For inference libraries (e.g., CUDA MX library), quantization, block-wise dot-product, and dequantization are handled in optimized kernels, easily integrated as direct drop-ins to standard inference stacks (Rouhani et al., 2023).
References
- (Rouhani et al., 2023), "Microscaling Data Formats for Deep Learning"
- (Gorodecky et al., 2024), "Hardware for converting floating-point to the microscaling (MX) format"
- (Chen et al., 29 Oct 2025), "INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats"
- (Zhong et al., 26 Nov 2025), "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference"
- (Cuyckens et al., 9 Nov 2025), "Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration"
- (Cheng et al., 2023), "A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats"
- (Bruschi et al., 2020), "Enabling Mixed-Precision Quantized Neural Networks in Extreme-Edge Devices"
- (Lin et al., 2020), "Towards Fully 8-bit Integer Inference for the Transformer Model"
- (Wu, 2020), "Learning Accurate Integer Transformer Machine-Translation Models"
- (Ottavi et al., 2020), "A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference"
- (Zhu et al., 2019), "Towards Unified INT8 Training for Convolutional Neural Network"