
Mixed-Precision Training & Compilation

Updated 6 February 2026
  • Mixed-precision training and compilation frameworks are methods that leverage heterogeneous data precisions to balance computational efficiency and predictive accuracy.
  • They combine quantization-aware training with adaptive precision assignment via techniques like reinforcement learning, genetic sampling, and ILP optimization.
  • Real-world deployments on RRAM accelerators, edge AI, and GPU-based LLM training achieve significant speedups with minimal accuracy degradation.

Mixed-precision training and compilation frameworks enable neural network models to exploit heterogeneous data precisions (e.g., FP8, FP4, and 8-bit or sub-8-bit integer/floating-point formats) at fine granularity to balance computational efficiency and predictive accuracy. These frameworks combine quantization-aware training, adaptive precision assignment, hardware-constrained optimization, and compilation pipelines targeting both custom accelerators and general-purpose compute cores. State-of-the-art systems such as "Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators" (Pelke et al., 29 Jan 2026), MiCo (Jiang et al., 13 Aug 2025), and SNIP (Pan et al., 1 Feb 2026) demonstrate mixed-precision techniques in settings ranging from inference on analog RRAM crossbars to full training of LLMs on modern GPUs, achieving substantial runtime reductions with minimal accuracy degradation.

1. Principles of Mixed-Precision Quantization

Mixed-precision quantization (MPQ) leverages the observation that different neural network layers and tensors can tolerate varying degrees of precision loss without equivalently affecting output quality. Instead of enforcing a uniform bit-width for all weights and activations, MPQ allows the per-layer selection of precision $\left[(b_{w_1}, b_{a_1}), \ldots, (b_{w_L}, b_{a_L})\right]$, where $b_{w_l}$ and $b_{a_l}$ are the bit-widths for weights and activations in layer $l$ (Jiang et al., 13 Aug 2025).
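As a concrete illustration of this notation, a mixed-precision configuration can be represented as a per-layer list of bit-width pairs. The values below are hypothetical and only show the data structure, not an assignment from any of the cited papers.

```python
# Hypothetical per-layer precision assignment for a 4-layer network:
# each entry is (weight_bits, activation_bits), i.e., (b_{w_l}, b_{a_l}).
mpq_config = [(8, 8), (4, 8), (4, 4), (2, 8)]

num_layers = len(mpq_config)            # L = 4
weight_bits, act_bits = mpq_config[1]   # layer 2: 4-bit weights, 8-bit activations
```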

Key methodologies include:

  • Symmetric (linear) quantization: Conversion of real-valued tensors $x_f$ to quantized integers $x_q$ via $x_q = \mathrm{clip}\!\left(\left\lfloor x_f \cdot s \right\rceil, -q_{\max}, q_{\max}\right)$ with $s = \frac{2^{B-1}-1}{\max|x_f|}$ for $B$-bit quantization (Pelke et al., 29 Jan 2026); a PyTorch sketch of this quantizer and the STE-based fake quantization follows this list.
  • Quantization-aware training (QAT): Fake-quantization nodes simulate quantization during forward passes. The straight-through estimator allows gradients to propagate unaffected, training the network to recover robustness to quantization artifacts (Pelke et al., 29 Jan 2026, Jiang et al., 13 Aug 2025).
  • Per-layer bit-width allocation: Algorithms seek optimal assignments of $\{b_{w_l}, b_{a_l}\}$ per layer, guided either by explicit constraint optimization or by learned policies. Constraints can include I/O, minimum quality, and hardware-aligned granularity (e.g., weight bit-widths as multiples of the RRAM crossbar cell resolution) (Pelke et al., 29 Jan 2026, Jiang et al., 13 Aug 2025).
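The following is a minimal PyTorch sketch of the symmetric quantizer and STE-based fake quantization described above. The per-tensor scale, the function names, and the 4-bit usage example are illustrative assumptions rather than the cited frameworks' actual implementations.

```python
import torch

def symmetric_quantize(x_f: torch.Tensor, bits: int):
    """Symmetric linear quantization: s = (2^(B-1) - 1) / max|x_f|."""
    q_max = 2 ** (bits - 1) - 1
    s = q_max / x_f.abs().max().clamp_min(1e-8)
    x_q = torch.clamp(torch.round(x_f * s), -q_max, q_max)
    return x_q, s

class FakeQuant(torch.autograd.Function):
    """Fake quantization for QAT: quantize-dequantize in the forward pass,
    straight-through estimator (identity gradient) in the backward pass."""

    @staticmethod
    def forward(ctx, x_f, bits):
        x_q, s = symmetric_quantize(x_f, bits)
        return x_q / s  # dequantize so downstream ops keep operating on real values

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # STE: gradients pass through unchanged

# Usage: simulate 4-bit activations during training.
x = torch.randn(16, 64, requires_grad=True)
y = FakeQuant.apply(x, 4)
y.sum().backward()  # x.grad equals the upstream gradient (all ones here)
```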

2. Optimization Strategies for Precision Assignment

The search for optimal mixed-precision assignments addresses a high-dimensional discrete optimization under complex accuracy–latency tradeoffs:

  • Reinforcement learning (RL): In RRAM CIM targets, a DDPG RL agent observes layer metadata and resource usage to select per-layer bit-width pairs. The reward is based on model accuracy and normalized latency, using a piecewise-defined function that penalizes accuracy loss below a threshold and rewards latency improvement otherwise (Pelke et al., 29 Jan 2026).
  • Ensemble-based predictors and genetic sampling: MiCo models $\mathrm{Acc}(M_Q)$ using a Random Forest trained on sampled MPQ evaluations and initial orthogonal exploration, followed by population-based search near the latency/CBOP constraint boundary that avoids the combinatorial explosion of exhaustive search (Jiang et al., 13 Aug 2025).
  • Integer Linear Programming (ILP): SNIP formalizes mixed-precision assignment as a layerwise ILP, optimizing for minimal estimated loss/weight divergence subject to per-layer options and a global or pipeline stage–balanced efficiency constraint (fraction of FLOPs in lowest precision) (Pan et al., 1 Feb 2026).
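The layerwise ILP formulation can be sketched with the open-source PuLP solver, as below. The binary FP8-vs-FP4 choice, the per-layer divergence estimates, and the FLOP shares are hypothetical placeholders for the statistics such a scheduler would collect during training, not values or code from the cited paper.

```python
import pulp

# Hypothetical per-layer statistics (placeholders, not measured values):
# estimated loss divergence if the layer runs in FP4, and its share of total FLOPs.
layers = ["attn_0", "mlp_0", "attn_1", "mlp_1"]
delta_loss = {"attn_0": 0.020, "mlp_0": 0.004, "attn_1": 0.015, "mlp_1": 0.003}
flop_share = {"attn_0": 0.20, "mlp_0": 0.30, "attn_1": 0.20, "mlp_1": 0.30}
min_low_precision_fraction = 0.6  # efficiency constraint: fraction of FLOPs in FP4

prob = pulp.LpProblem("mixed_precision_assignment", pulp.LpMinimize)
use_fp4 = {l: pulp.LpVariable(f"fp4_{l}", cat="Binary") for l in layers}

# Objective: minimize the total estimated loss divergence incurred by FP4 layers.
prob += pulp.lpSum(delta_loss[l] * use_fp4[l] for l in layers)

# Constraint: at least the target fraction of FLOPs must run in the lowest precision.
prob += pulp.lpSum(flop_share[l] * use_fp4[l] for l in layers) >= min_low_precision_fraction

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {l: ("fp4" if use_fp4[l].value() == 1 else "fp8") for l in layers}
print(assignment)  # {'attn_0': 'fp8', 'mlp_0': 'fp4', 'attn_1': 'fp8', 'mlp_1': 'fp4'}
```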

3. Mixed-Precision Training for Hardware Specialization

Frameworks deliver transformation pipelines tuned to hardware-specific constraints and execution models:

  • RRAM-based CIM accelerators: MPQ is crucial for high-efficiency in-memory matrix-vector multiplications, but device physics restrict cell and DAC resolutions. Layer weights and activations with higher bit-widths are mapped via bit-slicing across multiple crossbar cells and time-multiplexed input cycles (see the bit-slicing sketch after this list). The compilation algorithm calculates crossbar write and MVM counts per layer and schedules loops for a weight-stationary dataflow (Pelke et al., 29 Jan 2026).
  • Edge AI accelerators: MiCo calibrates its cost model with composite BOPs (CBOP) fitted to target hardware (e.g., BitFusion, SIMD-RISC-V). Kernel libraries cover the full Cartesian product of $\{8, 4, 2, 1\}$ weight and activation bit-widths, allowing C-code generation from PyTorch graphs annotated with bitwidth metadata (Jiang et al., 13 Aug 2025).
  • LLM training on GPUs: SNIP exploits subbyte (FP4, FP8) support, with precision switching mediated solely by insertions of quantize/dequantize around GEMM ops. Adaptive assignment leverages periodic statistics collection during training to optimize per-layer scheduling (Pan et al., 1 Feb 2026).
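The crossbar bit-slicing referenced in the first item can be illustrated with a small NumPy sketch. The 2-bit cell resolution, the two's-complement slicing scheme, and the helper name are assumptions for illustration, not the cited compiler's actual implementation.

```python
import numpy as np

def bit_slice_weights(w_q: np.ndarray, weight_bits: int, cell_bits: int):
    """Split quantized integer weights into cell-resolution slices (LSB first).

    Each slice maps to its own crossbar column; e.g., 8-bit weights on
    2-bit RRAM cells occupy ceil(8 / 2) = 4 adjacent cells per weight.
    """
    n_slices = -(-weight_bits // cell_bits)                      # ceil division
    mask = (1 << cell_bits) - 1
    unsigned = w_q.astype(np.int64) & ((1 << weight_bits) - 1)   # two's-complement view
    return [(unsigned >> (i * cell_bits)) & mask for i in range(n_slices)]

# Example: 8-bit weights sliced onto 2-bit cells.
w_q = np.array([-3, 7, 120, -128], dtype=np.int8)
slices = bit_slice_weights(w_q, weight_bits=8, cell_bits=2)
print(len(slices))  # 4 slices -> 4 crossbar cells per weight

# Recombining the slices (with sign correction) recovers the original weights.
recombined = sum(s << (i * 2) for i, s in enumerate(slices))
recombined = np.where(recombined >= 128, recombined - 256, recombined)
print(np.array_equal(recombined, w_q.astype(np.int64)))  # True
```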

4. Precision-Loss Metrics and Quality Guarantees

To quantify and control the impact of low-precision quantization during training, formal error metrics are employed:

  • Loss divergence ($\Delta_{\mathrm{loss}}$): The normalized difference in training loss between full- and low-precision execution, derived via Taylor expansion and accounting for both activation and weight quantization errors (Pan et al., 1 Feb 2026).
  • Weight divergence ($\Delta_{\mathrm{weight}}$): Aggregated relative parameter perturbation after a quantized SGD/AdamW step, capturing error-propagation mechanisms in the backward pass (Pan et al., 1 Feb 2026); a minimal sketch of both metrics follows this list.
  • Soft-constrained optimization: Some frameworks allow soft penalties for budget violations, offering tunable tradeoffs between performance and model quality (Jiang et al., 13 Aug 2025).
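One plausible reading of the two divergence metrics is sketched below as small helper functions. The exact normalizations and estimators in the cited work may differ (SNIP derives $\Delta_{\mathrm{loss}}$ analytically via a Taylor expansion), so treat this as an illustrative sketch rather than the paper's definitions.

```python
import torch

def loss_divergence(loss_full: torch.Tensor, loss_low: torch.Tensor) -> float:
    """Normalized difference in training loss between full- and low-precision runs
    (illustrative reading of Delta_loss)."""
    return (torch.abs(loss_low - loss_full) / torch.abs(loss_full).clamp_min(1e-12)).item()

def weight_divergence(params_full, params_low) -> float:
    """Aggregated relative parameter perturbation after one quantized optimizer step
    (illustrative reading of Delta_weight)."""
    num = torch.sqrt(sum(torch.sum((a - b) ** 2) for a, b in zip(params_full, params_low)))
    den = torch.sqrt(sum(torch.sum(a ** 2) for a in params_full)).clamp_min(1e-12)
    return (num / den).item()

# Usage: compare parameters after one reference (high-precision) step and one
# quantized step on the same batch.
w_ref = [torch.randn(4, 4), torch.randn(4)]
w_q = [w + 1e-3 * torch.randn_like(w) for w in w_ref]
print(weight_divergence(w_ref, w_q))  # small relative perturbation, roughly 1e-3
```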

5. Compilation and Deployment Pipelines

Frameworks integrate frontends, optimization, and deployment layers:

  • Model import and annotation: Quantized models can be exported (e.g., ONNX from Brevitas QAT) and enriched with QuantizeLinear/DeQuantizeLinear metadata indicating layer precisions (Pelke et al., 29 Jan 2026). PyTorch tracing/transformation (FX graph) is used for operator-level MPQ labeling (Jiang et al., 13 Aug 2025); a minimal FX-annotation sketch follows this list.
  • Graph-level transformations: Removal and fusion of quantization operators, conversion to integer ops, and scheduling with explicit bit-slicing and loop tagging for accelerator mapping are applied (Pelke et al., 29 Jan 2026).
  • Lowering and kernel selection: Schedulers insert hardware-aligned buffer allocations and replace core computational loops with architecture-specific calls (e.g., runMVM for CIM API, DotP.4x2 for SIMD-RISC-V cores) (Pelke et al., 29 Jan 2026, Jiang et al., 13 Aug 2025).
  • Bare-metal code generation: For edge AI, Python orchestration scripts can walk annotated graphs and emit C code with appropriate kernel library calls, requiring only native toolchain compilation (Jiang et al., 13 Aug 2025).
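A minimal sketch of operator-level precision annotation on a traced PyTorch FX graph, as in the first item above. The model, the bit-width plan, and the meta keys are hypothetical; a real pipeline would hand this annotated graph to the kernel-selection and code-generation stages.

```python
import torch
import torch.fx as fx

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)
        self.fc = torch.nn.Linear(8, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.fc(x)

# Hypothetical per-layer plan produced by an MPQ search: (weight_bits, activation_bits).
plan = {"conv": (4, 8), "fc": (8, 8)}

gm = fx.symbolic_trace(TinyNet())
for node in gm.graph.nodes:
    if node.op == "call_module" and node.target in plan:
        w_bits, a_bits = plan[node.target]
        # Attach bit-width metadata for downstream kernel selection / code generation.
        node.meta["weight_bits"] = w_bits
        node.meta["act_bits"] = a_bits

annotated = [(n.target, n.meta) for n in gm.graph.nodes if n.op == "call_module"]
print(annotated)  # [('conv', {'weight_bits': 4, 'act_bits': 8}), ('fc', {...})]
```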

6. Experimental Results and Hardware Efficiency

Key empirical results demonstrate the impact of MPQ frameworks:

| Framework | Model | Speedup / Efficiency | Accuracy Loss | Target Hardware |
|---|---|---|---|---|
| (Pelke et al., 29 Jan 2026) | VGG-16 | 2.48× (vs. 8b) | 0.086% | RRAM CIM (256×256) |
| (Jiang et al., 13 Aug 2025) | VGG7 | 0.77× FP32 latency | 0.01% | BitFusion systolic array |
| (Jiang et al., 13 Aug 2025) | LeNet5 | 0.82× FP8 latency | ~2.8% | VexiiMiCo SIMD-RISC-V |
| (Pan et al., 1 Feb 2026) | TinyLlama 1B-70B | up to 80% of FLOPs in FP4 | ≤0.5% degradation | Hopper/Blackwell GPUs |

The best-case speedups are achieved with negligible accuracy degradation: for example, VGG-16 achieves a 2.48× speedup over the 8-bit RRAM CIM baseline with only 0.086% accuracy loss (Pelke et al., 29 Jan 2026). On LLMs, SNIP delivers up to 80% reductions in FP8/FP16 FLOPs while keeping downstream accuracy within 0.2–0.5 points of full precision (Pan et al., 1 Feb 2026).

7. Limitations and Prospects

Current mixed-precision frameworks are limited by:

  • Discrete search space size: Combinatorial explosion in per-layer bitwidth selection motivates surrogate modeling and constraint-aligned sampling; RL, ILP, and regression-based predictors partially address this.
  • Scheduler/solver overhead: Periodic ILP solve time (≤30 s) is amortized across large training windows in SNIP, but more frequent schedule updates would make the solver a bottleneck (Pan et al., 1 Feb 2026).
  • Hardware/Compiler integration: True subbyte FP4/FP8 support in mainstream toolchains (e.g., TVM, XLA) is nascent, and custom kernel/primitive support is often necessary (Pan et al., 1 Feb 2026, Pelke et al., 29 Jan 2026).
  • Static vs. dynamic scheduling: Most frameworks update precision schedules infrequently; per-batch, fully dynamic switching would require deeper integration with runtime and hardware APIs (Pan et al., 1 Feb 2026).

A plausible implication is that, as subbyte compute and more heterogeneous compute substrates emerge, frameworks will further integrate hardware feedback into both precision selection and operator mapping, blurring the boundaries between training, compilation, and runtime scheduling.

