Low-Precision Arithmetic

Updated 12 February 2026
  • Low-precision arithmetic is a numerical method using restricted bit-width (≤16 bits) formats to trade off some accuracy for efficiency.
  • It encompasses diverse representations such as fixed-point, reduced floating-point, block floating-point, posits, and logarithmic number systems to balance performance and error.
  • Innovative strategies like stochastic rounding, blockwise accumulation, and hardware optimizations enable robust implementations in machine learning, signal processing, and scientific computing.

Low-precision arithmetic refers to numerical computations executed using formats with restricted bit widths for number representation. Unlike classical IEEE single-precision (32-bit) or double-precision (64-bit) floating point, low-precision formats use at most 16 bits per element, commonly 8-, 10-, 12-, or 16-bit fixed-point, floating-point, block floating-point, posit, or other custom formats. The primary motivations are to lower memory and energy costs, accelerate processing, and increase resource utilization, especially in applications such as machine learning, signal processing, scientific computing, and embedded systems, where a tolerable loss in numerical fidelity enables substantial computational savings.

1. Number Representations and Rounding Modes

Low-precision arithmetic encompasses a spectrum of number formats:

  • Fixed-point (FxP): A value is encoded as a signed or unsigned integer of $w$ bits, with $n$ bits for the fractional part, following the $Q_{m.n}$ notation. Dynamic range is determined by $m$ (integer bits); the quantization step is $q = 2^{-n}$; absolute error is at most $q/2$ for round-to-nearest (Sentieys et al., 2022); see the sketch after this list.
  • Floating-point (FlP): Reduced-precision versions of IEEE 754 use fewer exponent and mantissa bits (e.g., bfloat16: 8 exponent, 7 mantissa), limiting precision and range. Error is bounded by $|\epsilon_\text{rel}| \leq 2^{-(M+1)}$ for an $M$-bit mantissa (Sentieys et al., 2022, Zhang et al., 2019).
  • Block Floating-point (BFP): Tensors are partitioned so blocks share an exponent; each element carries its own mantissa, yielding a trade-off between local dynamic range and hardware simplicity (Zhang et al., 2019).
  • Posit: Combines sign, regime (variable run-length coding), exponent (parameter $es$), and fraction fields; dynamic allocation of regime and exponent enables tapered precision and a dramatically extended dynamic range at low bitwidth $n$. For example, posit16 provides up to 12 bits of precision and a range of $2^{56}$ vs. $2^{15}$ for IEEE binary16 (Mallasén et al., 30 Jan 2025).
  • Logarithmic Number System (LNS): Represents data as quantized log-magnitudes and sign; multiplication becomes addition, and addition is replaced by a log-domain function, which is approximated for hardware efficiency in low-precision settings (Hamad et al., 20 Oct 2025).
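To make the fixed-point parameters concrete, the following minimal sketch (not taken from the cited papers; the function name and the saturation policy are illustrative choices) quantizes values to a signed $Q_{m.n}$ grid with round-to-nearest:

```python
import numpy as np

def fixed_point_quantize(x, m, n):
    """Quantize x to a signed Q(m.n) fixed-point grid with round-to-nearest.

    m integer bits and n fractional bits (plus a sign bit): the quantization
    step is q = 2**-n, the representable range is [-2**m, 2**m - q], and the
    absolute rounding error for in-range values is at most q/2.
    Out-of-range values are saturated (one possible overflow policy).
    """
    q = 2.0 ** -n
    lo, hi = -(2.0 ** m), 2.0 ** m - q
    return np.clip(np.round(x / q) * q, lo, hi)

# Q3.4: 3 integer bits, 4 fractional bits, step q = 1/16
x = np.array([0.30, -2.71828, 3.14159, 9.0])
print(fixed_point_quantize(x, m=3, n=4))  # [ 0.3125 -2.6875  3.125   7.9375]
```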

Rounding:

  • Deterministic modes: round-to-nearest, round-to-zero, round-away-from-zero, and nearest-even.
  • Stochastic rounding (SR): For a real value $x$ between adjacent representable numbers $x_1 < x < x_2$, SR chooses $x_2$ with probability $(x - x_1)/(x_2 - x_1)$ and $x_1$ otherwise, yielding an unbiased expectation and improved preservation of small-magnitude signals in ML optimization and dynamical simulations (Zhang et al., 2019, Paxton et al., 2021); a toy implementation is sketched below.
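The sketch below (an illustrative toy on a fixed-point grid, not code from the cited papers) shows the unbiasedness property: a signal far below the round-to-nearest threshold still survives in expectation.

```python
import numpy as np

def stochastic_round(x, n, rng):
    """Stochastically round x to an n-fractional-bit fixed-point grid.

    For x between grid points x1 < x < x2 with spacing q = 2**-n, round up
    with probability (x - x1) / q and down otherwise, so E[SR(x)] = x.
    """
    q = 2.0 ** -n
    scaled = np.asarray(x, dtype=np.float64) / q
    lower = np.floor(scaled)
    p_up = scaled - lower                     # distance above the lower grid point, in units of q
    round_up = rng.random(scaled.shape) < p_up
    return (lower + round_up) * q

rng = np.random.default_rng(0)
tiny = np.full(100_000, 1e-4)                 # far below q/2 = 2**-9 for n = 8
print(stochastic_round(tiny, n=8, rng=rng).mean())  # ~1e-4; round-to-nearest would give 0.0
```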

2. Hardware, Emulation Frameworks, and Implementation Strategies

Low-precision arithmetic is realized both on specialized hardware and in software emulators for development and analysis:

  • QPyTorch simulates arbitrary fixed-point, floating-point, and block floating-point arithmetic with configurable quantization and rounding via Python APIs. It employs a "two-kernel" strategy (a full-precision PyTorch op followed by a custom CUDA quantization kernel), enabling simulation of large models with low computational overhead (Zhang et al., 2019); the pattern is sketched after this list.
  • PHEE (wearables): Hardware realization of posit arithmetic via a RISC-V coprocessor ("Coprosit"). Its PRAU supports various posit bitwidths, achieving up to 38% area and 54% power reduction compared to 32-bit FPUs (Mallasén et al., 30 Jan 2025).
  • Bitslice vector types: Bitslice libraries implement arbitrary-precision, custom floating-point arithmetic as bitplane-transposed arrays, vectorizing operations across all lanes in a word using bitwise logic, yielding substantial speedup and bandwidth savings for FP8/FP10/FP16 (Xu et al., 2016).
  • FPGA optimizations: Low-precision floating-point and INT packing maximize DSP utilization, e.g., four 8-bit LPFP multiplies per DSP vs. two for 8-bit INT, and "overpacking" (with partial overlap) for further gains with controlled error (Wu et al., 2020, Sommer et al., 2022).
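The "two-kernel" simulation pattern can be sketched in a few lines of PyTorch: run the operator at full precision, then quantize the result. The quantizer below is a hand-rolled, bfloat16-style mantissa rounding used as a stand-in for QPyTorch's CUDA kernels; the names and the handling of subnormals and overflow are simplifying assumptions, not the library's API.

```python
import torch

def quantize_float(x, man_bits=7):
    """Simulate a reduced floating-point significand (default 7 stored mantissa
    bits, i.e. bfloat16-like precision) by rounding the mantissa of an FP32
    tensor. Subnormals and exponent-range clipping are ignored for brevity."""
    mant, expo = torch.frexp(x)               # mant in [0.5, 1), x = mant * 2**expo
    scale = 2.0 ** (man_bits + 1)             # keep 1 implicit + man_bits explicit bits
    return torch.ldexp(torch.round(mant * scale) / scale, expo)

def low_precision_linear(x, w):
    """Two-kernel pattern: full-precision PyTorch op, then a quantization pass."""
    y = x @ w.t()                             # kernel 1: standard full-precision matmul
    return quantize_float(y)                  # kernel 2: simulated low-precision rounding

x = torch.randn(4, 16)
w = torch.randn(8, 16)
out = low_precision_linear(x, w)
print(out.shape, (out - x @ w.t()).abs().max())   # small quantization-induced difference
```

The appeal of this design is that the expensive operator runs on existing full-precision kernels, while the quantization pass is a cheap elementwise step that can be swapped out per format and rounding mode.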

3. Performance, Accuracy, and Error Analysis

Reducing precision offers pronounced performance and memory advantages but requires careful error analysis:

  • Speed and energy: On hardware with low-precision accelerators (e.g., FPGAs, NVIDIA tensor cores), speedups of $2\times$–$8\times$ (vs. FP32/FP64) are typical for matrix operations and ML inference/training (Abdelfattah et al., 2020, Wu et al., 2020). In vision CNNs, 8-bit floating-point LPFP formats yield $3$–$11\times$ higher throughput per DSP and up to $64\times$ over CPUs, with negligible ($<0.5\%$) accuracy loss (Wu et al., 2020).
  • Accuracy and application tolerance: Average relative errors increase with lower bitwidth; e.g., FP8 ($\sim$5.4% mean, 25% max) vs. FP16 ($\sim$0.1% mean, 0.5% max) in bitslice arithmetic (Xu et al., 2016). In biomedical pipelines, posit16 matches float32 accuracy, whereas float16 is insufficient (Mallasén et al., 30 Jan 2025).
  • Probabilistic error modeling: Probabilistic analysis (VIBEA) provides tighter bounds than deterministic worst-case error, particularly for large $n$ (block-wise sums), scaling as $O(\sqrt{n}\,u)$ rather than $O(nu)$ (Bhola et al., 2024, Dahlqvist et al., 2019); the two growth rates are compared in the sketch below. In low precision, error distributions may become sharply peaked; careful quantization-aware calibration, context scaling, and blockwise accumulation mitigate overflow and bias (Ortiz et al., 2018, Chen et al., 2022).
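To see how quickly the deterministic worst-case bound becomes uninformative, the short script below (illustrative only, with the leading constants of both bounds dropped) evaluates $n\,u$ and $\sqrt{n}\,u$ for binary16 at a few sum lengths.

```python
import numpy as np

u_fp16 = 2.0 ** -11                  # binary16 unit roundoff (round-to-nearest)
for n in (10**2, 10**4, 10**6):
    det = n * u_fp16                 # deterministic worst-case growth ~ n*u
    prob = np.sqrt(n) * u_fp16       # probabilistic estimate ~ sqrt(n)*u
    print(f"n = {n:>7}:  n*u = {det:9.2e}   sqrt(n)*u = {prob:9.2e}")
```

The relative deterministic bound already exceeds 1 (i.e., is vacuous) near $n \approx 2{,}000$, whereas the $\sqrt{n}\,u$ estimate remains informative out to millions of terms.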

4. Applications: Machine Learning, Signal Processing, Scientific Computing

  • Neural network training: 12–14-bit floating/fixed-point with stochastic rounding can nearly match 32-bit baseline accuracy in CNNs; context scaling and power-of-two quantization schemes further reduce hardware cost (Ortiz et al., 2018). Low-precision batch-normalized activations (2–8 bits) in deep networks enable up to $12\times$ memory reduction, with $<1.5\%$ accuracy degradation (Graham, 2017).
  • Gaussian processes: Accurate GP training on $10^6$–$10^7$ points is achieved by fp16 MVMs with fp32 accumulations, preconditioning, and orthogonalization; pure fp16 fails without such measures (Maddox et al., 2022). A toy comparison of fp16 vs. fp32 accumulation follows this list.
  • Signal processing and fast convolution: Symbolic Fourier Convolution (SFC) reveals addition-only transforms and correction terms, yielding a $3.7\times$ reduction in multiplies for $3 \times 3$ convolutions at sub-percent accuracy loss under int4 quantization, outperforming Winograd in both numerical conditioning and hardware efficiency (He et al., 2024).
  • Scientific computing and climate modeling: For linear solvers, mixed precision (e.g., fp16 factorizations, fp32/fp64 residuals) offers robustness and near-double-precision accuracy for well-conditioned or mildly ill-conditioned problems ($\kappa \lesssim 10^6$). In dynamical systems and climate codes, single or even half precision suffices for the majority of variables, with stochastic rounding being essential below 12 bits (Paxton et al., 2021, Scott et al., 2024).
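The value of keeping accumulations in higher precision can be seen in a toy matrix-vector product. The sketch below (a simplified stand-in for the MVM-based pipelines cited above; sizes and data are arbitrary) stores the data in fp16 and compares an fp16 accumulator against an fp32 accumulator, both measured against a float64 reference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2048
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
ref = A @ x                                        # float64 reference

A16, x16 = A.astype(np.float16), x.astype(np.float16)

y_fp16 = np.zeros(n, dtype=np.float16)             # fp16 products, fp16 accumulator
y_mixed = np.zeros(n, dtype=np.float32)            # fp16 products, fp32 accumulator
for j in range(n):
    prod = A16[:, j] * x16[j]                      # low-precision column update
    y_fp16 += prod
    y_mixed += prod.astype(np.float32)

for name, y in (("fp16 accumulation", y_fp16), ("fp32 accumulation", y_mixed)):
    err = np.linalg.norm(y.astype(np.float64) - ref) / np.linalg.norm(ref)
    print(name, err)
```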

5. Methodologies for Robust Low-Precision Computation

Low-precision brings new algorithmic requirements for stability and quality:

  • Iterative refinement and mixed precision: LU/Cholesky factorization in low precision, combined with high-precision solution refinement and blockwise accumulation, yields backward and forward errors commensurate with a high-precision solve when $\kappa(A) < 1/u_\text{low}$ (Abdelfattah et al., 2020, Scott et al., 2024); see the sketch after this list.
  • Blockwise accumulation: Long sums in low precision are partitioned into small blocks whose partial sums are accumulated in higher precision; this block-based approach bounds error growth and avoids catastrophic overflow/underflow (Bhola et al., 2024, Chen et al., 2022).
  • Resilience enhancements: For deep learning and inverse problems, stochastic rounding (especially in low bitwidth backpropagation) preserves gradient flow and reduces systematic bias, improving convergence in regimes where deterministic rounding is inadequate (Zhang et al., 2019, Paxton et al., 2021).
  • Hardware-friendly approximations: Piece-wise linear log-domain adders in LNS, with bitwidth-specific optimization via simulated annealing, can deliver sub-percent degradation in ML training performance at 12–14 bits while reducing MAC area and energy by >25% and 45%, respectively, compared to integer or FP MACs (Hamad et al., 20 Oct 2025).
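A minimal sketch of mixed-precision iterative refinement is given below, assuming SciPy's LU routines and using float32 as the "low" precision stand-in (the cited work uses fp16 factorizations); the test matrix, iteration count, and error metric are illustrative choices.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, n_iters=5):
    """Iterative refinement: factorize once in low precision (float32 here),
    then repeatedly compute the residual in high precision (float64) and
    correct the solution using the low-precision factors. Converges when
    kappa(A) is safely below 1/u_low."""
    lu = lu_factor(A.astype(np.float32))                       # low-precision LU, reused every step
    x = lu_solve(lu, b.astype(np.float32)).astype(np.float64)  # initial low-precision solve
    for _ in range(n_iters):
        r = b - A @ x                                          # residual in float64
        x = x + lu_solve(lu, r.astype(np.float32)).astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)                # well-conditioned test matrix
b = rng.standard_normal(n)

x_ref = np.linalg.solve(A, b)
x_low = lu_solve(lu_factor(A.astype(np.float32)), b.astype(np.float32)).astype(np.float64)
x_ir = mixed_precision_solve(A, b)
print("plain float32 solve :", np.linalg.norm(x_low - x_ref) / np.linalg.norm(x_ref))
print("iterative refinement:", np.linalg.norm(x_ir - x_ref) / np.linalg.norm(x_ref))
```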

6. Practical Guidelines and Trade-offs

Domain-dependent optimization is crucial:

  • Format selection: For limited dynamic range, fixed-point is preferred, offering more energy-efficient addition and simpler hardware. When moderate dynamic range or relative error tolerance is needed, low-bit floating-point or posit formats are effective—posit in particular for biomedical and ML workloads requiring high dynamic range at minimal bitwidth (Sentieys et al., 2022, Mallasén et al., 30 Jan 2025).
  • Bitwidth tuning: Simulation and word-length optimization should be performed per application to balance the quality metric (CMSE, MSE, classification error) against a hardware cost model, leveraging existing flowcharts for selecting integer/exponent and fraction/mantissa bits (Sentieys et al., 2022); a toy word-length sweep is sketched after this list.
  • FPGA/ASIC packing: Packing multiple low-precision multiplies/adds into one DSP maximizes throughput, with careful engineering to avoid field overlaps; when overlap is tolerated, correct for resulting errors with local post-processing logic (Sommer et al., 2022).
  • Numerical analysis: Always apply blockwise accumulation, probabilistic error bounds, and context scaling, especially in large-scale or ill-conditioned computations. Regularly monitor for overflows, infs/nans, and adverse error propagation, especially with half-precision (Bhola et al., 2024, Chen et al., 2022, Scott et al., 2024).
  • Hybrid/mixed-precision flows: Low-precision should be used wherever data-uncertainty and application metrics allow, but always in conjunction with high-precision accumulation and solution refinement for critical or ill-conditioned kernels (Abdelfattah et al., 2020, Scott et al., 2024).
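As a toy version of such a word-length sweep (purely illustrative; the signal, the $Q_{1.n}$ format, and the use of total width as a cost proxy are arbitrary choices, not the cited tool flow), one can quantize a test signal at several fractional bit widths and track the MSE:

```python
import numpy as np

def quantize_fixed(x, m, n):
    """Round-to-nearest quantization to a signed Q(m.n) fixed-point grid."""
    q = 2.0 ** -n
    return np.clip(np.round(x / q) * q, -(2.0 ** m), 2.0 ** m - q)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 8.0 * np.pi, 4096)
signal = np.sin(t) + 0.1 * rng.standard_normal(t.size)   # toy test signal, roughly in [-1.4, 1.4]

# Sweep the fractional word length for a fixed 1-bit integer part and report MSE;
# the total width w = sign + integer + fractional bits stands in for hardware cost.
for n_frac in (2, 4, 6, 8, 10, 12):
    mse = np.mean((signal - quantize_fixed(signal, m=1, n=n_frac)) ** 2)
    print(f"Q1.{n_frac:<2d} (w = {2 + n_frac:2d} bits): MSE = {mse:.2e}")
```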

7. Outlook and Limitations

Low-precision arithmetic continually advances with hardware and algorithmic innovation. Its impact is greatest in application domains where computational bottlenecks, energy consumption, and storage/memory movement dominate and modest controlled accuracy loss is permissible. However, correct implementation requires quantization-aware design, new rounding and accumulation strategies, and rigorous error modeling. Continuing efforts address probabilistic error analysis, optimal hardware support for stochastic rounding, adaptive and context-aware precision assignment, and robust mixed-precision schemes for high-stakes domains such as climate modeling, quantum simulation, and scientific computing (Sentieys et al., 2022, Dahlqvist et al., 2019, Bhola et al., 2024).
