- The paper presents a comprehensive empirical analysis of fault vulnerabilities in BFP computation, emphasizing exponent and high-significance mantissa faults.
- The paper introduces a specialized microarchitecture featuring ABFT and temporal redundancy, achieving near-DMR reliability with minimal overhead.
- The paper demonstrates that BFP-specific hardware protection strategies significantly outperform traditional FP methods in safety-critical NPU deployments.
Reliable Microarchitecture for Block Floating-Point NPUs: A Technical Analysis
Introduction
The Block Floating-Point (BFP) format is gaining prominence in edge Neural Processing Units (NPUs) because it combines hardware efficiency with support for a wide dynamic range. Unlike standard floating-point (FP) formats, BFP assigns a single shared exponent to a block of elements, decoupling exponent and mantissa handling so that the mantissas can be processed with high-throughput fixed-point arithmetic. Despite its increasing adoption in modern accelerators, such as NVIDIA Blackwell GPUs and Tenstorrent AI products, the reliability of BFP-based NPUs under hardware faults, which is essential for safety-critical deployments, has remained underexplored. This paper provides the first in-depth empirical study of BFP reliability and introduces a microarchitecture and hardware protection methodology tailored to BFP computation semantics, demonstrating near-dual modular redundancy (DMR) reliability at a fraction of the cost.

Figure 1: The failure of conventional ABFT-style end-to-end protection methods for BFP workloads as opposed to INT8, highlighting increased false positives and ineffectiveness as block-based scaling increases.
The BFP format shares a single exponent ($e_{sh}$) among all block elements, while each element maintains its own mantissa. Computation decomposes naturally: exponent addition for the block and a fixed-point dot product for mantissas. This structure, highly compatible with systolic arrays, enables significant hardware simplifications. However, the block-level exponent sharing creates distinctive failure modes:
- Exponent bit flips: A single fault affects all block elements, massively perturbing block values.
- Mantissa bit flips: Alignment shifts create extended leading zeros, making high-significance mantissa bits acutely vulnerable due to their impact on block normalization and error propagation.
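The decomposition and the exponent failure mode described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the 8-bit signed mantissa width, the choice of shared exponent (taken from the largest magnitude in the block), and the names `bfp_encode`, `bfp_decode`, and `bfp_dot` are all assumptions made for the example.

```python
import math

MANT_BITS = 8  # assumed signed mantissa width, chosen only for illustration


def bfp_encode(block, mant_bits=MANT_BITS):
    """Quantize a block of floats to (shared_exponent, integer mantissas)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1
    _, e_sh = math.frexp(max_abs)
    scale = 2.0 ** (mant_bits - 1 - e_sh)
    lim = 2 ** (mant_bits - 1) - 1
    # Smaller elements are aligned to e_sh, which gives them leading zeros
    mants = [max(-lim, min(lim, round(x * scale))) for x in block]
    return e_sh, mants


def bfp_decode(e_sh, mants, mant_bits=MANT_BITS):
    """Reconstruct floats from a shared exponent and integer mantissas."""
    scale = 2.0 ** (e_sh - (mant_bits - 1))
    return [m * scale for m in mants]


def bfp_dot(a, b, mant_bits=MANT_BITS):
    """Dot product of two BFP blocks: one exponent add, one integer dot product."""
    (e_a, m_a), (e_b, m_b) = a, b
    acc = sum(x * y for x, y in zip(m_a, m_b))  # fixed-point multiply-accumulate
    return acc * 2.0 ** (e_a + e_b - 2 * (mant_bits - 1))


block = [0.5, -0.25, 0.125, 0.75]
e_sh, mants = bfp_encode(block)
clean = bfp_decode(e_sh, mants)
# A single flipped exponent bit (bit 2 here) rescales every element at once
faulty = bfp_decode(e_sh ^ 0b100, mants)
```

With this block the shared exponent is 0, so flipping bit 2 of the exponent multiplies all four decoded values by 16, whereas a mantissa fault would disturb only one element.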
Conventional end-to-end fault tolerance approaches (e.g., ABFT designed for INT8/FP) are inadequate for BFP: their false-positive rate rises rapidly, and they cannot address BFP-specific error amplification (Figure 1).
Empirical Fault Characterization in BFP-Based NPUs
The paper systematically evaluates fault resilience using RTL-level fault injection across DNN and LLM workloads. Key findings:
- At low fault rates (below $10^{-9}$), model performance remains unaffected, but performance degrades sharply above this threshold, with LLMs failing catastrophically beyond $10^{-8}$ (Figure 2).
- In non-compute modules (e.g., SRAMs, buffers), exponent bit faults dominate error propagation; mantissa faults rarely lead to catastrophic failures.
- In compute modules (MAC pipelines), both exponent and high-order mantissa faults in BFP cause substantially larger accuracy loss and output deviations than in traditional FP, due to shared exponent error amplification and leading-zero normalization effects (Figure 3).
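The bit-level sensitivity in these findings can be illustrated with a small single-bit-flip sweep. This is a software sketch under stated assumptions, not the RTL-level injection used in the paper: mantissas are taken to be 8-bit two's-complement integers, the example block is hand-picked, and `decode`, `flip_mantissa_bit`, and `total_abs_error` are hypothetical helper names.

```python
def decode(e_sh, mants, mant_bits=8):
    """Reconstruct floats from a shared exponent and integer mantissas."""
    scale = 2.0 ** (e_sh - (mant_bits - 1))
    return [m * scale for m in mants]


def flip_mantissa_bit(mants, index, bit, mant_bits=8):
    """Flip one bit of a two's-complement mantissa; returns a new list."""
    mask = (1 << mant_bits) - 1
    raw = (mants[index] & mask) ^ (1 << bit)
    if raw >= 1 << (mant_bits - 1):  # reinterpret the bit pattern as signed
        raw -= 1 << mant_bits
    out = list(mants)
    out[index] = raw
    return out


def total_abs_error(ref, faulty):
    """Sum of absolute deviations across the whole block."""
    return sum(abs(r - f) for r, f in zip(ref, faulty))


e_sh, mants = 0, [64, -32, 16, 96]  # assumed example block
ref = decode(e_sh, mants)

# Error caused by flipping each bit position of the first mantissa
mantissa_error = {
    bit: total_abs_error(ref, decode(e_sh, flip_mantissa_bit(mants, 0, bit)))
    for bit in range(8)
}
# Error caused by flipping only the least significant exponent bit
exponent_error = total_abs_error(ref, decode(e_sh ^ 1, mants))
```

In this toy block the mantissa error doubles with each higher bit position, and even the lowest exponent bit produces a larger block-wide error than any single mantissa bit, because it rescales every element.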
Figure 2: DNNs under different fault rates reveal sharp degradation in model accuracy at fault rates beyond $10^{-9}$.
Figure 3: The leading-zero distribution in BFP amplifies bit flip-induced errors in the mantissa compared to FP formats, with normalization exacerbating error propagation.
These observations lead to three critical insights:
- BFP offers no intrinsic fault resilience—dedicated protection is mandatory.
- Module- and bit-level vulnerabilities in BFP require differentiated hardening strategies (exponent bits in storage, exponent and high-order mantissa in compute paths).
- Hardware mapping for BFP must explicitly align with reliability constraints, combining fine-grained protection and computational efficiency.
Microarchitecture and Protection Co-design
The proposed microarchitecture leverages BFP's natural alignment with row/column-wise blocking. Mantissa computations are mapped onto the systolic array, which enables efficient fixed-point ABFT checking. Exponent processing is mapped onto a decoupled, low-area adder pipeline, which exploits timing slack for recompute-and-compare checking. The data format converters (FP-to-BFP and BFP-to-FP) are lightweight but critical, so they are protected with DMR, which maximizes reliability at negligible overhead.
Figure 4: The proposed top-level BFP microarchitecture, showing specialization and dedicated protection for mantissa computation, exponent computation, and format conversion modules, each tailored to the module's dominant fault vulnerabilities.
Mantissa Compute Module
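The overview above assigns mantissa computation to the systolic array with fixed-point ABFT checking. A minimal software sketch of checksum-based ABFT on integer mantissa matrices follows. It assumes the classic row/column checksum formulation and plain Python integers; the paper's actual checker design is not reproduced here, and `matmul` and `abft_check` are hypothetical names.

```python
def matmul(a, b):
    """Integer matrix product, standing in for the systolic array output."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def abft_check(a, b, c):
    """Return True when C = A x B passes the row and column checksum tests."""
    n, kdim, m = len(a), len(b), len(b[0])
    col_sum_a = [sum(a[i][k] for i in range(n)) for k in range(kdim)]  # e^T A
    row_sum_b = [sum(b[k][j] for j in range(m)) for k in range(kdim)]  # B e
    # Expected column sums of C: (e^T A) B
    exp_cols = [sum(col_sum_a[k] * b[k][j] for k in range(kdim))
                for j in range(m)]
    # Expected row sums of C: A (B e)
    exp_rows = [sum(a[i][k] * row_sum_b[k] for k in range(kdim))
                for i in range(n)]
    got_cols = [sum(c[i][j] for i in range(n)) for j in range(m)]
    got_rows = [sum(c[i][j] for j in range(m)) for i in range(n)]
    # Integer arithmetic is exact, so equality needs no tolerance threshold
    return got_cols == exp_cols and got_rows == exp_rows


a = [[64, -32], [16, 96]]   # assumed mantissa tiles
b = [[64, 32], [16, -32]]
c = matmul(a, b)
clean_ok = abft_check(a, b, c)

c_faulty = [row[:] for row in c]
c_faulty[0][1] ^= 1 << 10   # model a single bit flip in one accumulator
faulty_ok = abft_check(a, b, c_faulty)
```

Because the mantissas are integers, the checksums match exactly in the fault-free case, which is one plausible reason a fixed-point check avoids the rounding-induced false positives that a tolerance-based floating-point comparison can produce.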
Exponent Compute Module
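The overview above describes exponent processing as a small adder pipeline checked by recompute-and-compare. A minimal sketch of that temporal-redundancy idea follows, with a transient fault modeled as an XOR mask on the adder output. The fault model and the names `add_exponents` and `checked_exponent_add` are assumptions for illustration only.

```python
def add_exponents(e_a, e_b, fault_mask=0):
    """Exponent adder; fault_mask XORs bits into the sum to model a transient fault."""
    return (e_a + e_b) ^ fault_mask


def checked_exponent_add(e_a, e_b, first_fault=0, second_fault=0):
    """Run the add twice (temporal redundancy) and compare the two results.

    Returns (result, ok); ok is False when the two passes disagree.
    """
    first = add_exponents(e_a, e_b, first_fault)
    second = add_exponents(e_a, e_b, second_fault)
    return first, first == second


clean = checked_exponent_add(3, -5)
# A transient fault in the first pass only is caught by the comparison
detected = checked_exponent_add(3, -5, first_fault=0b100)
# The same corruption in both passes escapes this simple check
escaped = checked_exponent_add(3, -5, first_fault=0b100, second_fault=0b100)
```

In this simple form the check targets transient faults. A fault that corrupts both passes identically goes undetected, so a real design would need some additional measure for persistent faults.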
Evaluation
Experiments are conducted on a Gemmini-based NPU with an FPGA prototype and on software-simulated LLMs. Benchmarks across DNNs (ResNet, MobileNet, AlexNet) and LLMs (Llama3, OPT) establish the approach's robustness under industry-representative workloads.
Major results:
- Performance overhead is minimal: the proposed method imposes only 3.55% (geomean) overhead versus 20–132% for DMR and 10–70% for IR (Instruction Redundancy) (Figure 8).
- Detection coverage remains above 98% in all tested conditions, including high fault rates (Figure 9).
- Model accuracy is comparable to DMR, even at high fault rates, and consistently outperforms IR and unprotected baselines in both DNN and LLM domains (Figures 10 and 11).
- All error detection latencies are within the sub-microsecond range, and hardware overhead is kept below 2% area and about 3% power for large arrays.
Figure 8: Minor performance overhead induced by the proposed strategies compared to heavy redundancy schemes.
Figure 9: Error detection coverage exceeds 98% and remains steady across scales and fault rates.
Figure 10: DNN model accuracy profile under different protection mechanisms, with the proposed method closely tracking DMR performance.
Figure 11: LLM perplexity under fault injection and various protection strategies, illustrating the method's efficacy.
Practical and Theoretical Implications
This research demonstrates that BFP-specific hardware mapping and fault tolerance enable the deployment of BFP-based NPUs in reliability-constrained environments (e.g., automotive or industrial control). By leveraging computational semantics, the approach minimizes area, power, and latency costs while maintaining near-DMR reliability. Theoretically, the results challenge assumptions regarding the straightforward transplantability of FP/INT fault-tolerance techniques to block-based formats and indicate that co-design at the data format, microarchitectural, and protection levels is mandatory.
Future developments could include:
- Refinement of adaptive blocking strategies based on fault profiles and model requirements.
- Integration with process- and system-level resilience management (e.g., cross-layer reliability).
- Automated sensitivity analysis for dynamic hardening at runtime in variable fault environments.
Conclusion
The paper provides a rigorous reliability characterization of BFP-based NPU architectures, revealing dominant failure modes arising from BFP's block-centric arithmetic and shared exponent semantics. It introduces a cost-effective hardware protection methodology, grounded in fine-grained vulnerability analysis and BFP-aware co-design. The proposed microarchitecture achieves near-DMR error detection and correction efficacy with minimal compute, area, and power overhead, rendering BFP-based NPUs suitable for safety-critical AI workloads (2604.10494).