Fixed-Point Data Storage Format
- Fixed-point data storage is a numeric representation that encodes real numbers as integer words using an implicit scaling factor, ensuring uniform quantization and deterministic hardware cost.
- It employs techniques like round-to-nearest quantization and controlled bit allocation to manage errors and prevent overflow in digital signal processing and neural network applications.
- Practical optimization methods, including integer linear programming and simulated annealing, balance hardware efficiency and computational accuracy for resource-constrained systems.
A fixed-point data storage format encodes real numbers as integer words with an implicit scaling factor, providing uniform quantization and deterministic hardware cost. Fixed-point (FxP) arithmetic features prominently in digital signal processing, embedded inference, and resource-constrained systems due to its predictable mapping to integer adders and multipliers, and reduced hardware requirements relative to floating-point representations (Sentieys et al., 2022, Benmaghnia et al., 2022, Herrou et al., 2024, Langroudi et al., 2018). This article develops the structural principles, mathematical properties, optimization techniques, and practical impact of fixed-point formats, referencing contemporary research and industrial application contexts.
1. Mathematical Definition and Notation
Fixed-point encoding represents a real number $x$ as a signed integer $X$ within a fixed word length $W = m + n$, where $m$ denotes the number of integer bits (including the sign) and $n$ the number of fractional bits. The Q-notation $Q_{m.n}$ is standard, describing a signed $(m+n)$-bit type where the binary point lies after the $m$ most significant bits. Mathematically, the decoded value is
$$x = X \cdot 2^{-n}.$$
Each bit $b_i$ of $X$ carries weight $2^{i-n}$ (the most significant bit carrying negative weight in two's complement), with $i$ spanning $0$ (LSB) through $m+n-1$ (MSB). The representable range for signed $Q_{m.n}$ is
$$\left[-2^{m-1},\; 2^{m-1} - 2^{-n}\right],$$
with a quantization step ("resolution") $q = 2^{-n}$ (Sentieys et al., 2022). This directly determines the granularity and overflow behavior.
For certain application domains (e.g., neural network weights in $[-1, 1)$), the format reduces to $Q_{0.n}$, i.e. only a sign bit and $n$ fractional bits, where the code is
$$x = -b_{n} + \sum_{i=0}^{n-1} b_i\, 2^{i-n}.$$
In alternative notations, such as those used in (Herrou et al., 2024), a bit vector $(b_{\mathrm{msb}}, \ldots, b_{\mathrm{lsb}})$ with indices $\mathrm{lsb} \le i \le \mathrm{msb}$ encodes
$$x = -b_{\mathrm{msb}}\, 2^{\mathrm{msb}} + \sum_{i=\mathrm{lsb}}^{\mathrm{msb}-1} b_i\, 2^{i},$$
with word width $W = \mathrm{msb} - \mathrm{lsb} + 1$ and scaling $2^{\mathrm{lsb}}$, equivalent to the Q-format under a suitable reinterpretation of indices.
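A minimal encode/decode sketch under the conventions above, assuming round-to-nearest and saturation at the range limits (the helper names `q_encode`/`q_decode` are illustrative):

```python
# Minimal Q(m.n) encode/decode sketch (illustrative; assumes two's-complement
# storage, round-to-nearest, and saturation at the representable range limits).

def q_encode(x: float, m: int, n: int) -> int:
    """Encode real x into a signed integer code for the Q(m.n) format."""
    lo, hi = -(1 << (m - 1 + n)), (1 << (m - 1 + n)) - 1   # integer code range
    code = round(x * (1 << n))                              # scale by 2^n, round to nearest
    return max(lo, min(hi, code))                           # saturate instead of wrapping

def q_decode(code: int, n: int) -> float:
    """Decode an integer code back to its real value: code * 2^-n."""
    return code / (1 << n)

# Example: pi in Q4.12 (16-bit word: 4 integer bits incl. sign, 12 fractional bits)
c = q_encode(3.14159265, m=4, n=12)
print(c, q_decode(c, 12))          # 12868  3.1416015625  (|error| <= 2^-13)
```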
2. Quantization, Conversion, and Error Analysis
Conversion from a real value $x$ to the fixed-point quantized integer $X$ is via rounding:
$$X = \operatorname{round}\!\left(x \cdot 2^{n}\right).$$
To decode, the integer is rescaled:
$$\hat{x} = X \cdot 2^{-n}.$$
The quantization error, for round-to-nearest, is bounded by $|x - \hat{x}| \le 2^{-(n+1)} = q/2$. Truncation introduces a negative bias, while round-to-even (convergent rounding) further ensures zero-mean error for uniformly distributed inputs (Sentieys et al., 2022).
Quantization noise for many DSP systems is modeled as white and uniform, with variance $\sigma^2 = q^2/12 = 2^{-2n}/12$. In cascaded arithmetic ($L$ stages), the worst-case error grows linearly (bounded by $L \cdot q/2$), and the RMS error scales as $\sqrt{L}\, q/\sqrt{12}$.
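The noise model is straightforward to check empirically; the sketch below (NumPy, illustrative parameters) compares the measured error variance against $q^2/12$ and the accumulated RMS error over $L$ independent rounding stages against $\sqrt{L}\, q/\sqrt{12}$:

```python
# Empirical check of the uniform quantization-noise model (illustrative).
import numpy as np

rng = np.random.default_rng(0)
n = 8                      # fractional bits
q = 2.0 ** -n              # quantization step
x = rng.uniform(-1, 1, 100_000)

err = x - np.round(x / q) * q          # round-to-nearest quantization error
print(err.var(), q**2 / 12)            # measured variance vs. theoretical q^2/12

# RMS error after L independent rounding stages grows roughly as sqrt(L)*q/sqrt(12)
for L in (1, 4, 16):
    acc_err = sum(rng.uniform(-q / 2, q / 2, 100_000) for _ in range(L))
    print(L, acc_err.std(), np.sqrt(L) * q / np.sqrt(12))
```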
Overflow is precluded by selecting $m$ so that $2^{m-1}$ exceeds the maximal absolute value of the signal, usually via
$$m = \left\lceil \log_2\!\Big(\max_t |x(t)|\Big) \right\rceil + 1$$
(Sentieys et al., 2022, Herrou et al., 2024).
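A small helper implementing this rule, assuming a strictly positive bound on $|x|$ obtained from, e.g., interval analysis (the name `integer_bits` is illustrative):

```python
# Smallest integer-bit count m so that the signed Q(m.n) range covers the
# worst-case magnitude (illustrative; assumes max_abs > 0 is a strict bound).
import math

def integer_bits(max_abs: float) -> int:
    return math.ceil(math.log2(max_abs)) + 1

print(integer_bits(3.2))    # 3 -> Q3.n spans [-4, 4 - 2^-n]
print(integer_bits(100.0))  # 8 -> Q8.n spans [-128, 128 - 2^-n]
```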
3. Propagation of Format Under Arithmetic Operations
Addition and multiplication of fixed-point numbers cause word lengths and binary points to evolve predictably:
- Addition ($Q_{m_1.n_1} + Q_{m_2.n_2}$): $m = \max(m_1, m_2) + 1$, $n = \max(n_1, n_2)$.
- Multiplication: $m = m_1 + m_2$, $n = n_1 + n_2$, total width $W = W_1 + W_2$ (Sentieys et al., 2022, Benmaghnia et al., 2022).
Truncation and rescaling (bit shifts) are required to control unbounded word growth, especially for multiplications, where the number of fractional bits doubles; each rounding or truncation stage introduces at most one extra LSB of error (half an LSB for round-to-nearest).
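The propagation rules can be captured in two small helpers (a sketch; the function names are illustrative):

```python
# Resulting Q(m.n) formats for exact (un-truncated) add and multiply (illustrative).

def add_format(m1, n1, m2, n2):
    """Q(m1.n1) + Q(m2.n2): one extra integer bit absorbs the carry."""
    return max(m1, m2) + 1, max(n1, n2)

def mul_format(m1, n1, m2, n2):
    """Q(m1.n1) * Q(m2.n2): integer and fractional widths both add up."""
    return m1 + m2, n1 + n2

print(add_format(4, 12, 2, 14))  # (5, 14) -> 19-bit result
print(mul_format(4, 12, 2, 14))  # (6, 26) -> 32-bit result
```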
Formal error bounds for composed addition and multiplication follow, for operands $x_1, x_2$ with respective formats $Q_{m_1.n_1}$, $Q_{m_2.n_2}$ and quantization errors $\varepsilon_1, \varepsilon_2$ (a numerical sanity check appears after the list):
- Addition: $|\varepsilon_{+}| \le |\varepsilon_1| + |\varepsilon_2|$, plus at most one rounding error of $2^{-(n+1)}$ if the sum is rounded back to $n$ fractional bits.
- Multiplication: $|\varepsilon_{\times}| \le |x_2|\,|\varepsilon_1| + |x_1|\,|\varepsilon_2| + |\varepsilon_1|\,|\varepsilon_2|$, plus the rounding error of any final truncation.
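A numerical sanity check of the multiplication bound (illustrative; the exact double-width product is kept, so no final rounding term is needed):

```python
# Check the propagated error bound for a quantized product (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 10, 12
x1 = rng.uniform(-2, 2, 100_000)
x2 = rng.uniform(-2, 2, 100_000)

q1 = np.round(x1 * 2**n1) / 2**n1          # quantized to 10 fractional bits
q2 = np.round(x2 * 2**n2) / 2**n2          # quantized to 12 fractional bits
e1, e2 = np.abs(x1 - q1), np.abs(x2 - q2)

err = np.abs(x1 * x2 - q1 * q2)
bound = np.abs(x2) * e1 + np.abs(x1) * e2 + e1 * e2
print(np.all(err <= bound + 1e-12))        # True: bound holds elementwise
```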
4. Word-Length Optimization and Format Synthesis
Selecting the optimal $(m, n)$ per datum is a constrained optimization problem. The two-step pipeline in (Sentieys et al., 2022) proceeds as follows:
- Integer Word-Length (IWL): Determine $m$ for each variable to guarantee no overflow, typically through static interval or affine arithmetic to bound the dynamic range.
- Fractional Word-Length (FWL): Choose $n$ to meet an upper bound on application-level error metrics (SNR, MSE, classification score) while minimizing overall implementation cost. The objective is
$$\min_{n_1,\ldots,n_N}\; \sum_i C_i(n_i) \quad \text{subject to} \quad \lambda(n_1,\ldots,n_N) \ge \lambda_{\min},$$
where $C_i$ is the implementation cost of operator $i$ and $\lambda$ the application-level accuracy metric.
This is a discrete combinatorial problem, addressed by greedy search, Lagrangian relaxation, or simulated annealing for near-Pareto-optimal (cost, error) tradeoffs.
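The greedy variant can be sketched as follows; the cost model, error evaluator, and variable names are stand-ins, not the procedure of any cited work:

```python
# Greedy fractional word-length (FWL) search sketch: start from a generous
# uniform precision, then repeatedly trim one bit from the variable whose
# removal is cheapest while the error metric stays within budget.
# `evaluate_error` and `cost` are application-specific stand-ins.

def greedy_fwl(variables, evaluate_error, cost, error_budget, n_start=16, n_min=1):
    n = {v: n_start for v in variables}              # fractional bits per variable
    improved = True
    while improved:
        improved = False
        candidates = []
        for v in variables:
            if n[v] <= n_min:
                continue
            trial = {**n, v: n[v] - 1}               # try removing one bit from v
            if evaluate_error(trial) <= error_budget:
                candidates.append((cost(trial), trial))
        if candidates:
            _, n = min(candidates, key=lambda t: t[0])
            improved = True
    return n

# Toy usage: cost = total bits, error = sum of per-variable quantization steps.
vars_ = ["acc", "coef", "x"]
sol = greedy_fwl(
    vars_,
    evaluate_error=lambda n: sum(2.0 ** -n[v] for v in vars_),
    cost=lambda n: sum(n.values()),
    error_budget=1e-3,
)
print(sol)
```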
In neural network deployment, (Benmaghnia et al., 2022) frames format assignment as an integer linear program, encoding constraints to prevent overflow, propagate bit precision across layers, and keep the output error below a user-defined threshold. Integer variables track per-layer fractional bits, constraints ensure consistency, and the system is solved (e.g., via standard LP solvers). The solution enables code synthesis: quantizing weights and inputs, performing pure-integer arithmetic, and meeting the prescribed accuracy for all permissible inputs.
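The end product of such synthesis can be illustrated with a single quantized dense layer evaluated in pure-integer arithmetic with a final rescaling shift (a hedged sketch, not the code generated by the cited tool; all names and formats are illustrative):

```python
# Integer-only dense layer after fixed-point format synthesis (illustrative).
# Inputs carry n_in fractional bits, weights n_w; the integer product then has
# n_in + n_w fractional bits and is shifted back to the output format n_out.
import numpy as np

def quantize(x, n_frac):
    return np.round(np.asarray(x) * (1 << n_frac)).astype(np.int64)

def int_dense(x_q, w_q, b_q, n_in, n_w, n_out):
    acc = w_q @ x_q                       # integer accumulate, n_in + n_w frac bits
    acc += b_q << (n_in + n_w - n_out)    # align bias (stored with n_out frac bits)
    return acc >> (n_in + n_w - n_out)    # rescale to n_out fractional bits

n_in, n_w, n_out = 8, 8, 8
x = [0.5, -0.25, 0.75]
W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b = [0.05, -0.05]

y_q = int_dense(quantize(x, n_in), quantize(W, n_w), quantize(b, n_out),
                n_in, n_w, n_out)
print(y_q / (1 << n_out))                 # close to the float result W @ x + b
```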
A similar pipeline is applied to Faust DSP code (Herrou et al., 2024): range analysis determines the integer part (MSB position) of each signal, and a pseudo-injectivity criterion determines the fractional precision (LSB position), ensuring distinct quantized inputs yield distinct outputs unless they are functionally equal. A forward pass through the signal graph infers a format per signal, with recommendations for further backward passes and loop handling.
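A toy forward range pass over a two-operation signal graph, using interval arithmetic (a hedged sketch with hypothetical helpers; the actual Faust analysis is considerably richer):

```python
# Forward MSB inference over a tiny signal graph via interval arithmetic
# (illustrative sketch; the pass described in Herrou et al., 2024 is richer).
import math

def msb(interval):
    """Integer bits needed to cover a signed interval [lo, hi]."""
    bound = max(abs(interval[0]), abs(interval[1]))
    return math.ceil(math.log2(bound)) + 1 if bound > 0 else 1

def add_range(a, b):
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    prods = [x * y for x in a for y in b]
    return (min(prods), max(prods))

x = (-1.0, 1.0)          # e.g., an audio input in [-1, 1]
gain = (0.0, 4.0)        # a user-controlled gain slider
y = add_range(mul_range(x, gain), x)   # y = x*gain + x

print(msb(x), msb(gain), msb(y))       # per-signal integer-bit requirements
```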
5. Hardware Cost, Efficiency, and Performance
The hardware impact of word length is direct and predictable:
- Adder: Area $\propto W$, Delay $\propto \log W$ (prefix) or $\propto W$ (ripple), Energy $\propto W$.
- Multiplier: Area $\propto W^2$, Delay $\propto \log W$, Energy $\propto W^2$.
Empirical measurements from 28 nm ASIC libraries confirm this scaling: a 32-bit adder completes in $1.06$ ns, whereas a 32-bit multiplier is an order of magnitude larger and slower (Sentieys et al., 2022). In FPGAs, reducing word length by one bit reduces area and energy by 10–15% for adders and by 5–10% for multipliers (Sentieys et al., 2022).
On embedded multipliers and DSP blocks, fixed-point add/multiply consistently shows lower resource and latency cost than floating point (e.g., on AMD Zynq FPGAs, a fixed-point add uses 32 LUTs / 1.8 ns versus a floating-point add at 313 LUTs / 11.4 ns; see the table in (Herrou et al., 2024)).
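As a rough illustration of how these scaling trends reward word-length reduction (ratios only; no library constants are implied):

```python
# Relative cost of reduced word lengths, following the scaling above
# (adder area/energy ~ W, multiplier area/energy ~ W^2; illustrative ratios only).
for w in (8, 16, 24, 32):
    print(f"W={w:2d}: adder ~{w/32:.2f}x, multiplier ~{(w/32)**2:.2f}x of a 32-bit unit")
```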
6. Application Areas and Trade-Off Analysis
Fixed-point formats are key in embedded inference engines, safety-critical neural network deployments, and digital signal processing. The central trade-off is between hardware efficiency (area, power, cost, latency, code size) and output quality (accuracy, SNR, MSE). Figure 1 in (Sentieys et al., 2022) displays the typical Pareto frontier: at small bit-widths, accuracy is low but energy is minimized; up to roughly $12$ bits (the "knee point"), each added bit sharply improves accuracy; beyond that, additional bits give diminishing returns.
Neural network inference experiments demonstrate that for LeNet, a 7-bit format (including sign) yields sub-1% accuracy loss versus full floating point; similarly, 9–11 bits suffice for AlexNet and CIFAR-10 ConvNets before a sharp accuracy cliff (Langroudi et al., 2018). The principal limitation of fixed-point is the uniform nature of its quantization, which can be suboptimal for nonuniform parameter distributions (addressed in other number systems such as posit).
Automatic per-signal fixed-point format inference is supported in tools for DSP languages (Faust), including forward range and precision passes, and continues to evolve to incorporate backward propagation and tighter error models (Herrou et al., 2024).
7. Practical Guidelines, Limitations, and Future Directions
Recommended practice is to:
- Use power-of-two step sizes in configuration inputs to minimize unnecessary proliferation of fractional bits (Herrou et al., 2024).
- Apply combined static analysis: interval for range, pseudo-injectivity or linear error modeling for precision.
- Prefer integer-only code paths once format synthesis is complete to maximize hardware efficiency (Benmaghnia et al., 2022).
- Validate operators and signal chains against worst-case signal excursions to prevent unanticipated overflow (a minimal check is sketched after this list).
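A minimal sketch of such a check, assuming worst-case bounds obtained from interval analysis (the example bounds are illustrative):

```python
# Worst-case overflow check for a chosen format (illustrative): verify that the
# signed Q(m.n) range covers the worst-case excursion reported by interval analysis.

def covers(m, n, lo, hi):
    q_lo, q_hi = -2.0 ** (m - 1), 2.0 ** (m - 1) - 2.0 ** -n
    return q_lo <= lo and hi <= q_hi

# A filter accumulator bounded in [-3.7, 3.9] by interval analysis:
print(covers(3, 13, -3.7, 3.9))   # True:  Q3.13 spans [-4, 4 - 2^-13]
print(covers(2, 14, -3.7, 3.9))   # False: Q2.14 spans [-2, 2 - 2^-14] -> overflow
```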
Current limitations include the tendency of purely static/worst-case analyses to overallocate precision, challenges in capturing non-linear error propagation, and conservative defaults for recursions or feedback loops. Ongoing research seeks backward precision propagation, interval-based output error bounding, and probabilistic criteria to safely tolerate rare distinguishability loss for further bit-width reduction (Herrou et al., 2024).
Comparison to alternative number systems (e.g., posit, float16) highlights that while fixed-point is fundamentally uniform and application-constrained, it remains a first-choice for designs prioritizing quantifiable hardware efficiency and determinism.
References:
- (Sentieys et al., 2022) Customizing Number Representation and Precision.
- (Benmaghnia et al., 2022) Fixed-Point Code Synthesis for Neural Networks.
- (Herrou et al., 2024) Towards Fixed-Point Formats Determination for Faust Programs.
- (Langroudi et al., 2018) Deep Learning Inference on Embedded Devices: Fixed-Point vs Posit.