
Fine-Grained Mixed Precision Assignment

Updated 20 October 2025
  • Fine-Grained Mixed Precision Assignment is a strategy that allocates varied numerical precisions at granular levels to balance accuracy, memory usage, and hardware efficiency.
  • It employs profiling, sensitivity metrics, and automated optimization methods such as NAS and bi-level optimization to guide per-layer, per-channel, or per-operation precision choices.
  • This approach underpins state-of-the-art solutions in deep learning quantization, HPC simulations, and accelerator co-design, achieving significant energy and performance gains.

Fine-Grained Mixed Precision (FGMP) Assignment refers to the strategy of assigning different numerical precisions to different parts of a computational workload—from neural network weights and activations, to floating-point operations in scientific code, to individual tiles in a matrix multiplication—so as to balance trade-offs among accuracy, memory usage, performance, and hardware efficiency. Rather than applying a uniform precision globally, FGMP assigns precision levels at the granularity of layers, channels, blocks, clusters, or even individual operations, often guided by sensitivity metrics, profiling, or automated optimization frameworks.

1. Definitions, Motivations, and Core Principles

FGMP aims to exploit the non-uniform sensitivity of computations to numerical precision by adapting bit-width or number format at a granular level. The rationale is that not all data or computations within a model or algorithm are equally critical to the final outcome, so less sensitive elements can be reduced to lower precision—saving memory, power, and computation—while more sensitive elements retain higher precision to preserve overall accuracy and stability. Forms of granularity in FGMP include per-layer, per-channel, per-block (e.g., small tensor tiles), per-cluster (e.g., 3 weights per group), or per-operation precision choices (Risso et al., 2022, Xie et al., 28 Apr 2025, Shen et al., 26 Jul 2025, Zhang et al., 20 Aug 2025).
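To make the granularity distinction concrete, the NumPy sketch below contrasts a single per-layer quantization scale with per-channel scales (the layer shape, bit-width, and channel ranges are illustrative assumptions): finer granularity lets each channel exploit its own dynamic range.

```python
import numpy as np

rng = np.random.default_rng(0)
# 8 output channels whose dynamic ranges differ by up to 30x.
W = rng.standard_normal((8, 64)) * np.linspace(0.1, 3.0, 8)[:, None]

def quantize(w, max_abs, bits=4):
    # Uniform symmetric quantization to the given bit-width.
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

per_layer = quantize(W, np.abs(W).max())                       # one shared scale
per_chan = quantize(W, np.abs(W).max(axis=1, keepdims=True))   # one scale per channel

for name, wq in [("per-layer  ", per_layer), ("per-channel", per_chan)]:
    print(name, "MSE:", float(np.mean((W - wq) ** 2)))
```

Per-channel scales avoid letting the widest channel dominate the shared quantization range, which is precisely the effect FGMP exploits at every granularity level.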

Historically, mixed-precision was first applied in high-performance scientific computing to avoid excessive roundoff error, but with the rise of deep learning and custom hardware, FGMP is now pervasive in neural network quantization, inference acceleration on FPGAs and RISC-V CPUs, distributed and federated learning, and HPC matrix kernels.

2. Profiling-Driven and Automated Precision Assignment

Early methodologies for FGMP, such as Profile-Driven Automated Mixed Precision (AMP), instrument the program at the intermediate representation (IR) level to profile every floating-point operation's numerical characteristics, including dynamic round-off error, cancellation, and overflow/underflow events (Nathan et al., 2016). Each operation is monitored and assigned an "error profile", which then determines whether it should be promoted to higher precision or left at the original precision. Instructions are partitioned into cancellation, promotion, benign, and other bins, with IR rewrites cascading upgrades when cancellation faults occur.
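As a rough illustration of this profiling idea (not the AMP tool's actual IR-level instrumentation), the Python sketch below runs each operation in low and high precision, derives a per-operation error profile, and bins the operation as cancellation, promotion, or benign; the tolerance and the cancellation test are assumptions.

```python
import numpy as np

def profile_op(op, args, tol=1e-9):
    # Execute in float32 and in a float64 "shadow" to obtain an error profile.
    lo = op(*[np.float32(a) for a in args])
    hi = op(*[np.float64(a) for a in args])
    rel_err = abs(float(lo) - float(hi)) / max(abs(float(hi)), 1e-300)
    # Crude cancellation test: the result is far smaller than the operands.
    if abs(float(hi)) < 1e-6 * max(abs(float(a)) for a in args):
        return "cancellation"   # candidate for cascading upgrades
    return "promotion" if rel_err > tol else "benign"

ops = [
    ("absorbing sum  ", lambda a, b: a + b, (1.0e8, 1.0)),
    ("cancelling diff", lambda a, b: a - b, (1.0000001, 1.0)),
    ("benign product ", lambda a, b: a * b, (3.0, 7.0)),
]
for name, op, args in ops:
    print(name, "->", profile_op(op, args))
```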

For neural networks, differentiable optimization-driven methods replace discrete bit-widths with soft assignments over candidate formats, enabling bi-level optimization (over network weights and precision parameters) subject to accuracy constraints (Cheng et al., 2018). Neural Architecture Search (NAS) can extend this idea, with gradients controlling per-channel or per-block precision (Risso et al., 2022, Luo et al., 2023). Sensitivity or loss impact (e.g., Fisher Information weighted error) further guides assignment granularity, especially in recent LLM quantization (Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025).

| Assignment Scheme | Granularity | Decision Mechanism |
|---|---|---|
| Profile-driven (AMP) | Per-operation | Dynamic error profiling |
| Layer/channel NAS | Per-layer/channel | Gradient-based NAS with softmax |
| Block/cluster sensitivity | Per-block/cluster | Fisher-weighted impact score |
| Tile-centric HPC | Per-tile | User-specified or determined at runtime |

3. Mathematical Formulation and Optimization Techniques

FGMP assignment strategies are mathematically formalized using error models, loss-aware optimization, or hardware-aware constraints:

  • Error Quantification: Rounding error is measured relative to the ULP or a reference solution; sensitivity metrics such as $\mathbb{E}[g^2 (\Delta v)^2]$, where $g$ is the gradient or Fisher Information for a parameter $v$, inform the precision assignment (Hooper et al., 19 Apr 2025).
  • Bi-level Optimization: Continuous relaxations allow the assignment to be optimized jointly with the weights. For layer-wise quantization, the softmax mixing $q^{(i)} = \frac{\sum_j e^{\alpha_{ij}}\,\mathcal{B}(q_{ij})}{\sum_j e^{\alpha_{ij}}}$ renders the choice differentiable, enabling backpropagation (Cheng et al., 2018); a runnable sketch follows this list.
  • Regularization and Constraints: Hybrid loss functions combining accuracy, compression, and energy or resource usage are used. Complexity regularizers such as $L_R$ penalize total bit usage or expected energy (Risso et al., 2022, Luo et al., 2023).
  • Dynamic and Adaptive Assignment: FineQ’s per-cluster scheme (Xie et al., 28 Apr 2025) and FAST’s adaptive BFP mantissa-width assignment (Zhang et al., 2021) rely on local statistics and thresholds that decay over time or iterations, allowing mixed precisions to track convergence or outlier behavior dynamically.
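The following PyTorch sketch illustrates the softmax relaxation above. The straight-through quantizer, candidate bit-widths, toy regression task, learning rate, and bit-usage regularizer weight are all illustrative assumptions, not the exact formulation of the cited work.

```python
import torch

def fake_quant(w, bits):
    # Uniform symmetric quantization with a straight-through estimator:
    # the forward pass sees quantized weights, the backward pass sees identity.
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return w + (q * scale - w).detach()

bit_choices = [2, 4, 8]
w = torch.randn(64, 64, requires_grad=True)                 # layer weights
alpha = torch.zeros(len(bit_choices), requires_grad=True)   # precision logits

opt = torch.optim.Adam([w, alpha], lr=1e-2)
x, target = torch.randn(16, 64), torch.randn(16, 64)        # toy regression task

for step in range(200):
    mix = torch.softmax(alpha, dim=0)
    # q^(i) = sum_j softmax(alpha)_j * B(q_ij): soft mixture over bit-widths.
    wq = sum(m * fake_quant(w, b) for m, b in zip(mix, bit_choices))
    loss = torch.nn.functional.mse_loss(x @ wq, target)
    # Expected-bits regularizer pushes the assignment toward lower precision.
    loss = loss + 1e-3 * sum(m * b for m, b in zip(mix, bit_choices))
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned precision distribution:", torch.softmax(alpha, 0).tolist())
```

In practice the softmax temperature is annealed so the mixture hardens into a single bit-width per layer; as noted in Section 7, such relaxations are sensitive to the temperature schedule.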

4. Hardware-Aware and Co-Designed Implementations

FGMP is tightly coupled with hardware capabilities:

  • FPGA/ASIC Accelerator Co-Design: Flexible datapaths, pipeline stages, or DSP packing enable the accelerator to efficiently support irregular precision at fine granularity. Algorithms such as DeepBurning-MixQ’s DSP packing (Luo et al., 2023) or M4BRAM’s compute-in-BRAM (Chen et al., 2023) dynamically “pack” low-precision computations to maximize resource utilization and throughput (a toy packing example follows this list). Kratos parameterizes MAC units for specified precisions and sparsity, hardwiring weights to remove zero/low-precision operators (Dai et al., 8 Jul 2024).
  • Custom ISA Extensions: For RISC-V CPUs, mode-specific MAC instructions (e.g., nn_mac_8b, nn_mac_4b, nn_mac_2b) boost bit-serial and SIMD parallelism for low-bit-width operations, requiring only minor area/power overhead while delivering substantial energy savings (Armeniakos et al., 19 Jul 2024).
  • Tile-Centric GEMM and Runtime Scheduling: In HPC, adaptive frameworks (e.g., built on PaRSEC) decompose GEMM into tile tasks, each with a precision label, enabling efficient scheduling and resource allocation in heterogeneous systems (Zhang et al., 20 Aug 2025); a tiled-GEMM sketch follows the summary table below.
  • In-Memory and Heterogeneous Accelerators: Emerging concepts such as phase change memory (PCM)-based in-memory compute deploy mixed or low precision for coarse computation and digital post-processing for refinement (Gallouédec, 2021).
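As a minimal arithmetic illustration of such packing (sign handling and guard bits, which real DSP48-style packing must address, are omitted; all widths are assumptions), two unsigned 4-bit weights below share a single wide integer multiply:

```python
# Pack two unsigned 4-bit weights into one operand so a single integer
# multiply produces both products: a*(w1 + w2*2^S) = a*w1 + (a*w2)*2^S.
a, w1, w2 = 11, 7, 13          # 4-bit activation and weights (unsigned)
SHIFT = 12                     # field wide enough that a*w1 < 2**SHIFT
packed = w1 + (w2 << SHIFT)    # packed operand
wide = a * packed              # one multiply instead of two
p1 = wide & ((1 << SHIFT) - 1) # low field:  a * w1
p2 = wide >> SHIFT             # high field: a * w2
assert (p1, p2) == (a * w1, a * w2)
print(p1, p2)                  # 77 143
```
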
| Hardware Platform | FGMP Technique | Efficiency Benefits |
|---|---|---|
| FPGA (BRAM, DSP) | Packing, block/cluster assignment | Higher MAC throughput, area/power reduction |
| RISC-V core | Custom MAC ISA, multi-pumping | 10–15× energy savings, ~1% accuracy loss |
| GPU/AI accelerators | Tile-centric SIMD, TF32, FP32 | 2–370× speedup, negligible accuracy loss |
| In-memory/PCM | Analog partial compute + digital refinement | Energy reduction, scalable precision |
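The toy sketch below mimics the tile-centric scheme referenced in the list above: the product is decomposed into tiles, each carrying a precision label, and every tile runs in its assigned precision with a higher-precision accumulator. Tile size, the precision policy, and the NumPy dtypes are illustrative assumptions; a real runtime such as PaRSEC schedules these tile tasks across heterogeneous devices.

```python
import numpy as np

def tiled_gemm(A, B, tile=64, prec=lambda i, j: np.float32):
    # C = A @ B with a per-tile precision label prec(i, j) and fp64 accumulation.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            dt = prec(i // tile, j // tile)          # precision label for this tile
            acc = np.zeros((min(tile, m - i), min(tile, n - j)))
            for p in range(0, k, tile):
                a = A[i:i + tile, p:p + tile].astype(dt)
                b = B[p:p + tile, j:j + tile].astype(dt)
                acc += (a @ b).astype(np.float64)    # accumulate in fp64
            C[i:i + tile, j:j + tile] = acc
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
# Example policy: keep diagonal tiles in fp64, demote off-diagonal tiles to fp32.
policy = lambda i, j: np.float64 if i == j else np.float32
err = np.linalg.norm(tiled_gemm(A, B, prec=policy) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error vs. full-precision GEMM: {err:.2e}")
```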

5. Error Analysis, Robustness, and Accuracy Considerations

Quantitative error analysis underpins FGMP assignment. For floating-point operations, deterministic backward error analysis yields conservative error bounds scaling with the number of operations and the unit roundoff (e.g., $|\theta_n| \leq nu/(1-nu)$), while probabilistic (variance-informed) analysis gives tighter $\tilde{\gamma}_n(\lambda)$ bounds by leveraging random error cancellation (Bhola et al., 27 Nov 2024). In finite element assembly, the error due to mixed precision can be shown to depend only on the precision of storage when heavier computations (e.g., matrix-matrix products) are performed at higher precision, resulting in robustness to polynomial degree, mesh quality, or node count (Croci et al., 16 Oct 2024).
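To make the gap concrete, the snippet below evaluates the deterministic bound $nu/(1-nu)$ against a $\sqrt{n}$-scaled probabilistic-style bound for fp16 and fp32 (the exact form of $\tilde{\gamma}_n(\lambda)$ in the cited work differs; the confidence factor $\lambda = 1$ here is an assumption):

```python
import math

u = {"fp16": 2.0 ** -11, "fp32": 2.0 ** -24}   # unit roundoff per format
lam = 1.0                                      # illustrative confidence factor
for fmt, unit in u.items():
    for n in (10 ** 3, 10 ** 6):
        # The deterministic bound blows up once n*u approaches 1.
        det = n * unit / (1 - n * unit) if n * unit < 1 else math.inf
        prob = lam * math.sqrt(n) * unit       # sqrt(n) scaling from cancellation
        print(f"{fmt}, n={n:>7}: deterministic={det:.3e}, probabilistic~{prob:.3e}")
```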

In DNNs and LLMs, block/cluster-wise sensitivity metrics precisely identify the minimal set of high-precision assignments needed for <1% degradation in perplexity or classification accuracy—even with 2–4 bit quantization for the majority of weights and activations (Hooper et al., 19 Apr 2025, Xie et al., 28 Apr 2025, Jang et al., 2 Jan 2025). Empirical results consistently indicate that naive, uniform reduction to low precision (e.g., quantizing everything to FP16 or INT2) causes unacceptable errors, whereas FGMP guided by profiling or sensitivity achieves near-baseline accuracy.
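A minimal sketch of such block-wise selection, assuming a Fisher-style impact score of the form $\sum g^2 (\Delta v)^2$ per block and a fixed 10% high-precision budget (block size, budget, and the gradient proxy are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(4096)   # flattened weights
g = rng.standard_normal(4096)   # per-weight gradients (proxy for Fisher information)

def quant_err(x, bits=4):
    # Perturbation Δv introduced by uniform symmetric quantization.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale - x

block = 128
dv = quant_err(w)
scores = (g ** 2 * dv ** 2).reshape(-1, block).sum(axis=1)  # per-block impact
budget = int(0.10 * len(scores))         # 10% of blocks stay in high precision
keep_hp = np.sort(np.argsort(scores)[-budget:])
print("blocks kept in high precision:", keep_hp)
```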

6. Real-World Applications and Contemporary Impact

FGMP assignment is integral to resource-constrained inference (edge DNNs, on-device LLMs), high-throughput scientific computing, distributed/federated learning, and accelerator-rich systems:

  • Neural Networks and LLMs: FGMP with hardware-aware policies (e.g., sensitivity-weighted block assignment, block-dialect quantization) achieves 10–30% memory and energy savings at minimal accuracy loss (Hooper et al., 19 Apr 2025, Jang et al., 2 Jan 2025). FineQ’s cluster-level outlier protection at 2.33 average bits per weight outperforms SOTA schemes (Xie et al., 28 Apr 2025).
  • Edge and IoT Devices: TinyML deployments benefit from channel-wise precision assignments, providing 63% model size and 27% energy reduction compared to layer-wise (Risso et al., 2022). Co-designed RISC-V cores and mixed-precision datapaths yield 10–15× energy improvement (Armeniakos et al., 19 Jul 2024).
  • Federated Learning and Heterogeneous Systems: FGMP enables distributed training with mixed client precisions in over-the-air aggregation, accelerating convergence and yielding up to 65% energy savings in ultra-low precision clients (Yuan et al., 4 Jun 2024).
  • HPC and Scientific Simulation: Turbulent flow solvers using FGMP improve computational efficiency on multi-CPU/multi-GPU systems, with mixed precision reducing memory and communication requirements significantly while ensuring accuracy via careful assignment (Siklósi et al., 27 May 2025, Shen et al., 26 Jul 2025).

7. Limitations and Future Directions

Current challenges for FGMP assignment include scaling the search/assignment process to even finer granularity (e.g., per-weight or per-output-channel), extending hardware support for irregular formats, enabling real-time dynamic reassignment, and integrating with techniques such as pruning or in-memory computing (Risso et al., 2022, Jang et al., 2 Jan 2025, Zhang et al., 20 Aug 2025). Gradient-based methods rely on suitable relaxation/temperature schedules and may be sensitive to local minima in large search spaces. Application-specific error analysis—especially leveraging probabilistic bounds—remains important to ensure that aggressive quantization does not compromise model or simulation reliability (Bhola et al., 27 Nov 2024). Integration with future hardware-supported number formats, new memory hierarchies, and real-time adaptive strategies is expected to drive further progress in FGMP methodologies.


In summary, Fine-Grained Mixed Precision Assignment leverages detailed profiling, error sensitivity, and hardware co-design to deliver high performance, memory efficiency, and energy savings while rigorously controlling accuracy loss across a wide spectrum of computational domains. The state-of-the-art encompasses automated profiling-driven IR rewriting, block- and cluster-wise quantization guided by sensitivity metrics, hardware-adaptive mapping for accelerators, and robust error analysis frameworks—all of which support the ongoing expansion of FGMP assignment in both AI and computational science.
