Int/Float 4 (IF4) Format
- The name IF4 covers two unrelated compact numerical formats: a 32-bit scheme for lossless compression of IEEE doubles and a 4-bit adaptive quantization format for deep learning.
- The 32-bit scheme uses table lookups to reconstruct doubles exactly; the 4-bit scheme selects FP4 or INT4 per block to minimize quantization error.
- Reported results include 50% memory savings for the 32-bit scheme and a ≈31% reduction in mean squared error relative to NVFP4 for the 4-bit scheme.
The “Int/Float 4” (IF4) format refers to modern, highly compact numerical representations that aim to combine the strengths of integer and floating-point approaches for efficient storage and computation. Contemporary use of “IF4” bifurcates into two research lineages: (1) a 4-byte (32-bit) scheme for lossless IEEE double compression (Neal, 2015) and (2) a 4-bit quantization block format for deep learning hardware and inference (Cook et al., 30 Mar 2026, Chen et al., 29 Oct 2025). Both focus on maximizing representational fidelity and efficiency within minimal bit-widths.
1. Two Main Definitions of IF4
In the literature, “IF4” is used for distinct, technically unrelated schemes:
| IF4 Variant | Bitwidth | Domain | Core Objective |
|---|---|---|---|
| 32-bit IF4 (Neal, 2015) | 32 | IEEE doubles/R | Losslessly encode decimal-like doubles in 4 bytes |
| 4-bit Adaptive IF4 (2026) | 4 | LLM quantization | Minimize quantization error by adaptively mixing FP4/INT4 |
32-bit IF4 (Lossless Compact Double Compression)
This approach encodes subsets of binary64 (IEEE “double”) in 32 bits by storing the upper half and using a small table to recover the exact mantissa (Neal, 2015).
4-bit Block-Scaled IF4 (Adaptive Quantization)
This newer variant adaptively selects, per group of 16 elements, whether to encode as FP4 or scaled INT4, optimizing for mean squared error and enabling efficient hardware support (Cook et al., 30 Mar 2026, Chen et al., 29 Oct 2025).
2. Technical Description: 32-bit IF4 (Neal, 2015)
The 32-bit IF4 format stores the high 32 bits of a binary64 float (field order: sign, exponent, high mantissa) and reconstructs the original 64-bit value using a lookup table of per-pattern low-mantissa bits.
- Bit Layout:
$[\,\text{sign (1 bit)} \mid \text{exponent (11 bits)} \mid \text{high mantissa (20 bits)}\,]$
- Decoding:
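The table is indexed by (a) the lowest m bits of the high mantissa and (b) e bits of the exponent at bit offset f. Each entry stores the 32 low mantissa bits.
The 64-bit double is reconstructed as $(\text{hi} \ll 32) \mid T[\text{idx}]$, where $\text{hi}$ is the stored 32-bit word and $T[\text{idx}]$ is the table entry selected by the index.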
- Encoding Algorithm:
Copy the high 32 bits; for general sets, verify that re-decoding yields the same double.
- Decoding Algorithm:
Extract indices, perform table lookup, and reconstruct the double.
- Table Sizing:
Table size depends on the number of index bits, governed by the target decimal patterns: $N = 2^{m+e}$ entries of 4 bytes each. Example: for 6-digit decimals in any of 7 dot positions, m=14, e=5 ⇒ N=524288 (2 MiB direct table).
- Supported Subsets:
Decimal data with 4, 5, or 6 significant digits, in tunable formats (e.g., dddd.dd, dd.dddd) or mixed scale/position, with lossless recovery unless two doubles would share upper-32 bits.
- Performance:
For small table sizes (≤128 KiB), decoding is within 10–80% of double arithmetic speed; for very cache-local cases, can surpass doubles in memory-bandwidth-limited settings. Outperforms decimal-float by up to 7× for typical compact-table cases. Performance degrades gracefully with larger tables or low-power architectures (Neal, 2015).
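The encode, decode, and table-sizing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from Neal (2015): the function names, the test set of two-decimal values, the fixed choice e=2, f=0, and the search for the smallest workable m are all hypothetical choices made for the example.

```python
import struct

def split(x):
    """Return (high 32 bits, low 32 bits) of the binary64 encoding of x."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 32, bits & 0xFFFFFFFF

def join(hi, lo):
    """Rebuild a double from its high and low 32-bit words."""
    return struct.unpack(">d", struct.pack(">Q", (hi << 32) | lo))[0]

def table_index(hi, m, e, f):
    mant_lo = hi & ((1 << m) - 1)            # lowest m bits of the 20-bit high mantissa
    exp = (hi >> 20) & 0x7FF                 # 11-bit biased exponent
    exp_part = (exp >> f) & ((1 << e) - 1)   # e exponent bits at offset f
    return (exp_part << m) | mant_lo

def build_table(values, m, e, f):
    """Direct table of low-mantissa words, or None if two values clash on an index."""
    entries = {}
    for v in values:
        hi, lo = split(v)
        i = table_index(hi, m, e, f)
        if entries.setdefault(i, lo) != lo:
            return None
    table = [0] * (1 << (m + e))
    for i, lo in entries.items():
        table[i] = lo
    return table

def decode(hi, table, m, e, f):
    return join(hi, table[table_index(hi, m, e, f)])

def encode(x, table, m, e, f):
    """Keep the high word only if decoding reproduces x exactly; else signal fallback."""
    hi, _ = split(x)
    return hi if decode(hi, table, m, e, f) == x else None

def table_bytes(m, e):
    return (1 << (m + e)) * 4                # N = 2^(m+e) entries of 4 bytes

# Hypothetical target set: decimals with two fractional digits in [0, 10).
values = [k / 100 for k in range(1000)]
E, F = 2, 0                                  # assumed exponent-bit choice
M = next(m for m in range(21) if build_table(values, m, E, F) is not None)
TABLE = build_table(values, M, E, F)
```

A value outside the supported subset, such as `1.0 + 2**-52`, makes `encode` return `None`, which corresponds to the fallback to a full 64-bit double. `table_bytes(14, 5)` reproduces the 2 MiB figure quoted under Table Sizing.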
3. Technical Description: 4-bit Adaptive IF4 (Block-Scaled Quantization)
Adaptive IF4 is designed for block-wise quantization in deep learning, particularly in the context of LLMs and hardware accelerators. Each block is stored as either FP4 or INT4; both representations share the per-block E4M3 scale, and the sign bit of that scale encodes which data type was selected (Cook et al., 30 Mar 2026).
- Element Format:
- 4 bits per data element:
- If FP4: E2M1 (1 sign, 2 exponent, 1 mantissa)
- If INT4: 4-bit 2’s-complement, range [–7, +7]
- Block Organization:
- 16 values per block.
- Shared 8-bit E4M3 block scale:
- Bit 7: 0 (FP4 block) or 1 (INT4 block)
- Lower 7 bits: 4 exponent, 3 mantissa.
- Quantization Process:
- Compute a tensor-wide FP32 scale.
- For each block, compute the block scale in E4M3.
- Quantize the block both as FP4 and as (pre-scaled) INT4, and compute the MSE of each: \begin{align*} E^{(\mathrm{FP})} &= \frac{1}{16}\sum_j \left(\hat X_j^{(\mathrm{FP})} - X_j\right)^2 \\ E^{(\mathrm{INT})} &= \frac{1}{16}\sum_j \left(\hat X_j^{(\mathrm{INT})} - X_j\right)^2 \end{align*}
- Select the format with the lower MSE and record the choice in the sign bit of the block scale.
- Dequantization:
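- FP4: decode each element as E2M1 and multiply by the block scale and the tensor-wide scale.
- INT4: decode each element as a two's-complement integer, undo the INT4 pre-scaling, and multiply by the block scale and the tensor-wide scale.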
- Storage Overhead:
Identical to NVFP4: 4 bits per value + 8 bits per block (no additional metadata) (Cook et al., 30 Mar 2026).
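The per-block selection described above can be illustrated with a simplified Python sketch. It is not the procedure from Cook et al. in full: the block scale is kept as a real number (no E4M3 rounding), the tensor-wide scale is omitted, and the 6/7 factor that maps the INT4 codes onto the FP4 range is an assumption of this example. All function names are hypothetical.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
INT4_GRID = [k * 6.0 / 7.0 for k in range(8)]          # |k| <= 7, assumed pre-scaling onto [0, 6]

def nearest(grid, a):
    """Grid point closest to a (first one wins on ties)."""
    return min(grid, key=lambda g: abs(g - a))

def quantize_block(block, grid, scale):
    """Round each value to the nearest signed grid point, then rescale."""
    out = []
    for x in block:
        q = nearest(grid, abs(x) / scale)
        out.append(-q * scale if x < 0 else q * scale)
    return out

def mse(xs, ys):
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)

def quantize_if4_block(block):
    """Return (mode, dequantized values, MSE) for the better of FP4 and INT4."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0   # shared block scale (not rounded to E4M3 here)
    fp = quantize_block(block, FP4_GRID, scale)
    it = quantize_block(block, INT4_GRID, scale)
    e_fp, e_int = mse(block, fp), mse(block, it)
    if e_int < e_fp:
        return "INT4", it, e_int
    return "FP4", fp, e_fp

# Evenly spaced values favour INT4; values on the E2M1 grid favour FP4.
uniform_block = [float(k) for k in range(-7, 8)] + [0.0]
fp_like_block = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0] * 2
```

In this sketch, `quantize_if4_block(uniform_block)` selects INT4 and `quantize_if4_block(fp_like_block)` selects FP4; in the actual format the returned mode would be stored in the sign bit of the block scale.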
4. Hardware and Algorithmic Recommendations
Experimental and architecture explorations recommend:
- Unified datapaths using 4-bit two’s-complement INT4 as the baseline codebook.
- Per-block E4M3 scale as primary, with an optional global FP32 scale.
- Mixed-mode multiply-accumulate (MAC) units: decode each operand as FP4 or INT4 depending on the block’s encoded mode. Addition of a single FP multiplier and control logic is sufficient, with latency overhead ≈5%.
- Hardware primitives for Hadamard block-rotation further improve INT4 quantization fidelity for outlier-laden blocks (Chen et al., 29 Oct 2025).
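The mixed-mode decode step recommended above can be mimicked in software. The following Python sketch decodes a scale byte and 4-bit codes and accumulates a dot product; it assumes the standard E4M3 exponent bias of 7 and E2M1 exponent bias of 1, ignores the E4M3 NaN pattern, and leaves out the tensor-wide scale and any INT4 pre-scaling factor. The function names are hypothetical.

```python
def decode_fp4(nibble):
    """Decode a 4-bit E2M1 code: 1 sign, 2 exponent (bias 1), 1 mantissa bit."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    mag = 0.5 * man if exp == 0 else (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

def decode_int4(nibble):
    """Decode a 4-bit two's-complement code."""
    return float(nibble - 16 if nibble & 0x8 else nibble)

def decode_scale(byte):
    """Split a scale byte into (is_int4, scale): bit 7 is the mode, bits 6..0 are E4M3."""
    is_int4 = bool(byte & 0x80)
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        scale = (man / 8.0) * 2.0 ** -6            # subnormal
    else:
        scale = (1.0 + man / 8.0) * 2.0 ** (exp - 7)
    return is_int4, scale

def decode_block(scale_byte, nibbles):
    """Decode every element with the decoder selected by the block's mode bit."""
    is_int4, scale = decode_scale(scale_byte)
    dec = decode_int4 if is_int4 else decode_fp4
    return [scale * dec(n) for n in nibbles]

def mixed_dot(a_scale, a_codes, b_scale, b_codes):
    """Multiply-accumulate over two blocks that may use different modes."""
    a = decode_block(a_scale, a_codes)
    b = decode_block(b_scale, b_codes)
    return sum(x * y for x, y in zip(a, b))
```

For example, `mixed_dot(0x38, [2, 7], 0xB8, [3, 0x9])` multiplies an FP4 block holding 1.0 and 6.0 with an INT4 block holding 3 and -7, both with scale 1.0.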
5. Empirical Performance and Accuracy Trade-Offs
Quantitative findings for blockwise quantization in neural models:
| Format | Group | Scale | MSE | Dynamic Range |
|---|---|---|---|---|
| NVFP4 | 16 | E4M3 | 9.0 | |
| NVINT4 | 16 | E4M3 | 7.4 | 0 |
| IF4 | 16 | UE4M3 | 6.2 | 1 |
Adaptive IF4 yields a ≈31% reduction in MSE relative to NVFP4, and outperforms both NVFP4 and NVINT4 in quantized training loss, WikiText/C4 perplexity, and downstream classification accuracy (e.g., ARC, PIQA, LAMBADA) (Cook et al., 30 Mar 2026).
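The quoted reduction can be checked against the MSE column of the table; the same arithmetic gives the corresponding gain over NVINT4, and any common unit of the MSE column cancels in the ratio:

\begin{align*} 1 - \frac{6.2}{9.0} &\approx 0.311 \quad (\text{IF4 vs. NVFP4}) \\ 1 - \frac{6.2}{7.4} &\approx 0.162 \quad (\text{IF4 vs. NVINT4}) \end{align*}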
Applying block-wise Hadamard rotation increases INT4 quantization SNR above FP4 by lowering the crest factor; with rotation, INT4 matches or exceeds FP4 fidelity as measured by KL divergence and quantization SNR (Chen et al., 29 Oct 2025).
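The effect of rotation on the crest factor (peak magnitude divided by RMS) can be seen on a toy block. The sketch below uses an orthonormal fast Walsh-Hadamard transform in pure Python; the function names and the example blocks are hypothetical, and the sketch only illustrates the mechanism rather than reproducing the experiments of Chen et al.

```python
import math

def hadamard(block):
    """Orthonormal fast Walsh-Hadamard transform; len(block) must be a power of two."""
    x = list(block)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]   # butterfly
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in x]

def crest_factor(block):
    """Peak magnitude divided by root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in block) / len(block))
    return max(abs(v) for v in block) / rms

# A block dominated by a single outlier has a high crest factor.
spike = [8.0] + [0.0] * 15
rotated = hadamard(spike)
```

Here `crest_factor(spike)` is 4.0 while `crest_factor(rotated)` is 1.0: the rotation spreads the outlier's energy evenly over the block, which is the situation in which a uniform INT4 grid wastes the fewest levels. Because the transform is orthonormal, the block's energy is unchanged and applying `hadamard` again recovers the original values.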
Energy and silicon area for blockwise INT4 quantization are 34% and 38% of those of NVFP4, respectively. When 8-bit and 4-bit formats are combined, energy and area drop by a further ≈25% (Chen et al., 29 Oct 2025).
6. Applications and Integration
- 32-bit IF4 (lossless double schemes): Used in statistical computing (e.g., the R interpreter pqR) for transparent data compression with fallback to full doubles. It halves memory use (50% savings) at a CPU overhead ranging from 0% to 200%, depending on table size and hardware (Neal, 2015).
- 4-bit IF4 (quantization): Deployed in LLM quantization pipelines for both inference and training. IF4 can be integrated into AI hardware accelerators (MAC units), and is compatible with post-training quantization and mixed-precision training (Cook et al., 30 Mar 2026, Chen et al., 29 Oct 2025).
7. Limitations and Future Directions
Both IF4 schemes are inherently subset-oriented. The 32-bit IF4 only encodes numbers with a small number of decimal digits or simple rational structure; arbitrary IEEE doubles must fall back to 64-bit. Scalability in blockwise IF4 is governed by representable block distributions and available dynamic range. Table size and hardware area must be proactively managed, with the full accuracy benefit only realized for data that fits IF4’s mixed distributional assumptions.
A plausible implication is that further research into per-block or per-tensor adaptation—beyond just the FP4-INT4 dichotomy—may yield additional accuracy and efficiency gains, especially in the presence of activation outliers or nonstandard value distributions. Ongoing hardware co-design efforts seek to provide instruction-level IF4 support and seamless fallback to higher precisions as needed (Chen et al., 29 Oct 2025).