From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU

Published 12 Apr 2026 in cs.AR | (2604.10494v1)

Abstract: Block Floating-Point (BFP) is emerging as an attractive data format for edge Neural Processing Units (NPUs), combining wide dynamic range with high hardware efficiency. However, its behavior under hardware faults and suitability for safety-critical deployments remain underexplored. Here, we present the first in-depth empirical reliability study of BFP-based NPUs. Using RTL-level fault injection on NPUs, our bit- and path-level analysis reveals pronounced heterogeneous vulnerabilities and shows conventional end-to-end check becomes ineffective under nonlinear block scaling. Guided by these insights, we design a fault-tolerant BFP-based NPU microarchitecture that aligns the BFP computational semantics with reliability constraints. The design uses a row/column-wise blocking strategy to decouple the fixed-point mantissa computations from the scalar exponent path, and introduces ultra-lightweight protection mechanisms for each. Experimental results demonstrate our design achieves near-dual modular redundancy reliability with only $3.55\%$ geometric mean performance overhead and less than $2\%$ hardware cost.


Authors (7)

Summary

The paper presents a comprehensive empirical analysis of fault vulnerabilities in BFP computation, emphasizing exponent and high-significance mantissa faults.
The paper introduces a specialized microarchitecture featuring ABFT and temporal redundancy, achieving near-DMR reliability with minimal overhead.
The paper demonstrates that BFP-specific hardware protection strategies significantly outperform traditional FP methods in safety-critical NPU deployments.

Reliable Microarchitecture for Block Floating-Point NPUs: A Technical Analysis

Introduction

The Block Floating-Point (BFP) format is gaining prominence in edge Neural Processing Units (NPUs) because it pairs hardware efficiency with a wide dynamic range. Unlike standard floating-point (FP) formats, BFP assigns a single shared exponent to a block of elements, decoupling exponent and mantissa handling so that the mantissas can be processed with high-throughput fixed-point arithmetic. Despite its increasing adoption in modern accelerators, such as NVIDIA Blackwell GPUs and Tenstorrent AI products, the reliability of BFP-based NPUs under hardware faults, which is essential for safety-critical deployments, has remained underexplored. This paper provides the first in-depth empirical study of BFP reliability and introduces a microarchitecture and hardware protection methodology tailored to BFP computation semantics, demonstrating near-dual modular redundancy (DMR) reliability at a fraction of the cost.

Figure 1: The failure of conventional ABFT-style end-to-end protection methods for BFP workloads as opposed to INT8, highlighting increased false positives and ineffectiveness as block-based scaling increases.

BFP Format and Reliability Challenges

The BFP format shares a single exponent ($e_{sh}$) among all block elements, while each element maintains its own mantissa. Computation decomposes naturally into an exponent addition per block and a fixed-point dot product over the mantissas. This structure maps well onto systolic arrays and enables significant hardware simplifications. However, block-level exponent sharing also creates failure modes specific to BFP:

Exponent bit flips: A single fault affects all block elements, massively perturbing block values.
Mantissa bit flips: Alignment shifts create extended leading zeros, making high-significance mantissa bits acutely vulnerable due to their impact on block normalization and error propagation.
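To make the two failure modes concrete, here is a minimal sketch of BFP encoding with a single injected bit flip. The 8-bit mantissa width, the helper names (`bfp_encode`, `bfp_decode`, `flip_bit`) and the rounding and clamping rules are illustrative assumptions, not the paper's format definition.

```python
import math

MANT_BITS = 8  # assumed mantissa width (1 sign + 7 magnitude bits), illustrative only


def bfp_encode(block, mant_bits=MANT_BITS):
    """Quantize a block of floats to (shared_exponent, integer mantissas)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns max_abs = m * 2**e with 0.5 <= m < 1
    _, e = math.frexp(max_abs)
    # Pick the shared exponent so the largest element uses the top magnitude bit
    shared_exp = e - (mant_bits - 1)
    limit = (1 << (mant_bits - 1)) - 1
    mants = [max(-limit, min(limit, round(x / 2.0 ** shared_exp))) for x in block]
    return shared_exp, mants


def bfp_decode(shared_exp, mants):
    """Reconstruct floats: every element is scaled by the same 2**shared_exp."""
    return [m * 2.0 ** shared_exp for m in mants]


def flip_bit(value, bit):
    """Model a single-bit upset on an integer field."""
    return value ^ (1 << bit)


block = [1.0, 0.5, -0.25, 0.03125]
exp, mants = bfp_encode(block)

# Exponent fault: one flipped bit rescales every element in the block
exp_fault = bfp_decode(flip_bit(exp, 2), mants)

# Mantissa fault: only one element changes, but a small element is hit hard
mant_fault = list(mants)
mant_fault[3] = flip_bit(mant_fault[3], 6)
mant_fault = bfp_decode(exp, mant_fault)
```

In this sketch the exponent flip scales all four values by 16, while the high-order mantissa flip turns the smallest element (0.03125) into 1.03125 and leaves the others intact.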

Conventional end-to-end fault tolerance approaches (e.g., ABFT designed for INT8/FP) are inadequate for BFP: as Figure 1 illustrates, their false-positive rate rises rapidly, and they cannot address BFP-specific error amplification.

Empirical Fault Characterization in BFP-Based NPUs

The paper systematically evaluates fault resilience using RTL-level fault injection across DNN and LLM workloads. Key findings:

At low fault rates ($<10^{-9}$), model performance remains unaffected, but performance degrades sharply above this threshold, with LLMs failing catastrophically beyond $10^{-8}$ (Figure 2).
In non-compute modules (e.g., SRAMs, buffers), exponent bit faults dominate error propagation; mantissa faults rarely lead to catastrophic failures.
In compute modules (MAC pipelines), both exponent and high-order mantissa faults in BFP cause substantially larger accuracy loss and output deviations than in traditional FP, due to shared exponent error amplification and leading-zero normalization effects (Figure 3).

Figure 2: DNNs under different fault rates reveal sharp degradation in model accuracy at fault rates beyond $10^{-9}$.

Figure 3: The leading-zero distribution in BFP amplifies bit flip-induced errors in the mantissa compared to FP formats, with normalization exacerbating error propagation.
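The leading-zero effect can be illustrated with a small calculation. The sketch below assumes a 7-bit mantissa magnitude and uses hypothetical helper names. It compares a flip in a mantissa whose top bit is set, as in a normalized FP significand, with a flip in a BFP mantissa that alignment has left with several leading zeros.

```python
MAG_BITS = 7  # assumed mantissa magnitude width, illustrative only


def leading_zeros(mant, mag_bits=MAG_BITS):
    """Number of unused high-order magnitude bits in an aligned mantissa."""
    return mag_bits - abs(mant).bit_length()


def rel_error_after_flip(mant, bit):
    """Relative magnitude error caused by flipping one bit of a nonzero mantissa."""
    original = abs(mant)
    faulty = original ^ (1 << bit)
    return abs(faulty - original) / original


# Mantissa with its top bit set (no leading zeros): a flip just below the
# leading one changes the value by at most 50%.
normalized = rel_error_after_flip(64, 5)

# Small element in a BFP block (five leading zeros): a flip in the unused
# high-order region changes the value by 32x.
aligned = rel_error_after_flip(2, 6)
```

The more leading zeros an element carries, the larger the relative error a high-order flip can introduce, which is consistent with the summary's observation that high-significance mantissa bits are the vulnerable ones.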

These observations lead to three critical insights:

BFP offers no intrinsic fault resilience—dedicated protection is mandatory.
Module- and bit-level vulnerabilities in BFP require differentiated hardening strategies (exponent bits in storage, exponent and high-order mantissa in compute paths).
Hardware mapping for BFP must explicitly align with reliability constraints, combining fine-grained protection and computational efficiency.

Microarchitecture and Protection Co-design

The proposed microarchitecture leverages BFP's natural alignment with row/column-wise blocking, mapping mantissa computations onto the systolic array (enabling efficient fixed-point ABFT checking) and exponent processing onto a decoupled, low-area adder pipeline, which exploits timing slack for recompute-and-compare checking. The data format converters (FP-to-BFP and BFP-to-FP) are lightweight but critical; DMR is employed for maximal reliability with negligible overhead.
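As a rough illustration of the decoupling, suppose each row of $A$ and each column of $B$ forms one block, so that $C_{ij} = 2^{e^A_i + e^B_j} \sum_k m^A_{ik} m^B_{kj}$. This block granularity and the function names below are assumptions made for the sketch; the source only states that blocking is row/column-wise.

```python
def mantissa_matmul(ma, mb):
    """Integer matrix product: the part that would run on the systolic array."""
    inner, cols = len(mb), len(mb[0])
    return [[sum(row[k] * mb[k][j] for k in range(inner)) for j in range(cols)]
            for row in ma]


def exponent_add(ea, eb):
    """One scalar addition per output element: the decoupled exponent path."""
    return [[a + b for b in eb] for a in ea]


def bfp_matmul(ma, ea, mb, eb):
    """Combine both paths; ea[i] is shared by row i of A, eb[j] by column j of B."""
    mant = mantissa_matmul(ma, mb)
    exp = exponent_add(ea, eb)
    return [[m * 2.0 ** e for m, e in zip(mrow, erow)]
            for mrow, erow in zip(mant, exp)]


ma, ea = [[1, 2], [3, 4]], [0, -1]   # row-wise blocks of A
mb, eb = [[5, 6], [7, 8]], [1, 0]    # column-wise blocks of B
result = bfp_matmul(ma, ea, mb, eb)
```

Because the exponent never enters the inner accumulation, the mantissa path stays purely fixed-point and can be checked with exact integer arithmetic, while the exponent path reduces to a handful of small additions.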

Figure 4: The proposed top-level BFP microarchitecture, showing specialization and dedicated protection for mantissa computation, exponent computation, and format conversion modules, each tailored to the module's dominant fault vulnerabilities.

Mantissa Compute Module

Protection: Hardware ABFT is tailored for fixed-point operations within systolic arrays. The solution introduces negligible pipeline delay (two cycles) and focuses error detection on the high-significance mantissa regions most critical in BFP (Figure 5).

Figure 5: Mathematical and microarchitectural integration of ABFT for mantissa protection in WS-M and OS-M regimes.
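A minimal sketch of fixed-point ABFT, using the classic column-checksum identity $e^{T}(AB) = (e^{T}A)B$. It checks the full-width integer result rather than only the high-significance bits, and it does not model the WS-M or OS-M dataflows; the function names are hypothetical.

```python
def matmul(a, b):
    """Plain integer matrix product."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def column_sums(m):
    """Sum each column of a matrix (the e^T M checksum row)."""
    return [sum(col) for col in zip(*m)]


def abft_check(a, b, c):
    """Return True if c is consistent with a @ b under the column checksum."""
    checksum_a = column_sums(a)  # e^T A, one value per inner index
    expected = [sum(checksum_a[k] * b[k][j] for k in range(len(b)))
                for j in range(len(b[0]))]
    return expected == column_sums(c)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
clean_ok = abft_check(a, b, c)

# Inject a single-bit fault into one output mantissa
c_faulty = [row[:] for row in c]
c_faulty[0][1] ^= 1 << 6
faulty_ok = abft_check(a, b, c_faulty)
```

With integer mantissas the checksum comparison is exact, so no tolerance threshold is needed; this is one reason a fixed-point mantissa path lends itself to ABFT.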

Exponent Compute Module

Protection: Temporal redundancy leverages available timing slack to perform exponent computation twice per result, using operand ring-buffer manipulation to maximize coverage and detect both transient and permanent faults (Figure 6).

Figure 6: Exponent computation unit with specialized registers and recompute-and-compare logic utilizing time-slack.
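The recompute-and-compare idea can be sketched as follows. The lane model, the one-position rotation, and the stuck-at fault are all assumptions chosen for illustration; the source does not describe the ring-buffer scheme in this detail.

```python
def make_lanes(n, stuck_lane=None, stuck_bit=0):
    """Build n adder lanes; optionally one lane has an output bit stuck at 1."""
    def good(a, b):
        return a + b

    def bad(a, b):
        return (a + b) | (1 << stuck_bit)

    return [bad if i == stuck_lane else good for i in range(n)]


def checked_exponent_add(ea, eb, lanes):
    """Compute each exponent sum twice on different lanes and compare."""
    n = len(lanes)
    # First pass: operand pair i runs on lane i
    first = [lanes[i](ea[i], eb[i]) for i in range(n)]
    # Second pass: operands are rotated so pair i runs on lane (i + 1) % n,
    # which lets a permanent fault in one lane show up as a mismatch
    second = [lanes[(i + 1) % n](ea[i], eb[i]) for i in range(n)]
    mismatches = [i for i in range(n) if first[i] != second[i]]
    return first, mismatches


ea, eb = [1, 2, 3, 4], [4, 4, 4, 4]
_, healthy = checked_exponent_add(ea, eb, make_lanes(4))
sums, faulty = checked_exponent_add(ea, eb, make_lanes(4, stuck_lane=2, stuck_bit=3))
```

Recomputing on the same lane would catch only transient faults. Moving the operands to a neighbouring lane is one simple way to expose permanent ones as well, at the cost of flagging both pairs that touched the faulty lane.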

Data Format Converters

Protection: DMR-based consistency checks are preferred due to ultra-small area, with targeted bit-level checking on mantissa normalization windows (Figure 7).

Figure 7: Microarchitecture for resilient FP-to-BFP and BFP-to-FP conversion employing DMR and focused mantissa checking.

Evaluation

Experiments are conducted on a Gemmini-based NPU with an FPGA prototype and on software-simulated LLMs. Benchmarks across DNNs (ResNet, MobileNet, AlexNet) and LLMs (Llama3, OPT) establish the approach's robustness under industry-representative workloads.

Major results:

Performance overhead is minimal: the proposed method imposes only 3.55% (geomean) overhead versus 20–132% for DMR and 10–70% for IR (Instruction Redundancy) (Figure 8).
Detection coverage remains above 98% in all tested conditions, including high fault rates (Figure 9).
Model accuracy is comparable to DMR, even at high fault rates, and consistently outperforms IR and unprotected baselines in both DNN and LLM domains (Figures 10 and 11).
All error-detection latencies are sub-microsecond, and hardware overhead stays below 2% area and around 3% power for large arrays.

Figure 8: Minor performance overhead induced by the proposed strategies compared to heavy redundancy schemes.

Figure 9: Error detection coverage exceeds 98% and remains steady across scales and fault rates.

Figure 10: DNN model accuracy profile under different protection mechanisms, with the proposed method closely tracking DMR performance.

Figure 11: LLM perplexity under fault injection and various protection strategies, illustrating the method's efficacy.

Practical and Theoretical Implications

This research demonstrates that BFP-specific hardware mapping and fault tolerance enable the deployment of BFP-based NPUs in reliability-constrained environments (e.g., automotive or industrial control). By leveraging computational semantics, the approach minimizes area, power, and latency costs while maintaining near-DMR reliability. Theoretically, the results challenge assumptions regarding the straightforward transplantability of FP/INT fault-tolerance techniques to block-based formats and indicate that co-design at the data format, microarchitectural, and protection levels is mandatory.

Future developments could include:

Refinement of adaptive blocking strategies based on fault profiles and model requirements.
Integration with process- and system-level resilience management (e.g., cross-layer reliability).
Automated sensitivity analysis for dynamic hardening at runtime in variable fault environments.

Conclusion

The paper provides a rigorous reliability characterization of BFP-based NPU architectures, revealing dominant failure modes arising from BFP's block-centric arithmetic and shared exponent semantics. It introduces a cost-effective hardware protection methodology, grounded in fine-grained vulnerability analysis and BFP-aware co-design. The proposed microarchitecture achieves near-DMR reliability with minimal performance, area, and power overhead, making BFP-based NPUs suitable for safety-critical AI workloads (2604.10494).
