Who Checks the Checker? Enhancing Component-level Architectural SEU Fault Tolerance for End-to-End SoC Protection

Published 27 Mar 2026 in cs.AR | (2603.26637v1)

Abstract: Single-event upset (SEU) fault tolerance for systems-on-chip (SoCs) in radiation-heavy environments is often addressed by architectural fault-tolerance approaches protecting individual SoC components (e.g., cores, memories) in isolation. However, the protection of voting logic and interconnections among components is also critical, as these become single points of failure in the design. We investigate combining multiple fault-tolerance approaches targeting individual SoC components, including interconnect and voting logic to ensure end-to-end SoC-level architectural SEU fault tolerance, while minimizing implementation area overheads. Enforcing an overlap between the protection methods ensures hardening of the whole design without gaps, while curtailing overheads. We demonstrate our approach on a RISC-V microcontroller SoC. SEU fault-tolerance is assessed with simulation-based fault injection. Overheads are assessed with full physical implementation. Tolerance to over 99.9% of faults in both RTL and implemented netlist is demonstrated. Furthermore, the design exhibits 22% lower implementation overhead compared to a single global fault-tolerance method, such as fine-grained triplication.


Authors (5)

Summary

The paper proposes a novel, overlapping protection methodology that integrates ECC, TMR, and lockstep techniques to secure both component and checker logic.
It achieves over 99.9% SEU fault coverage and a 99.7% correction rate for voter errors, reducing area costs by 22% compared to full global TMR.
Comprehensive fault injection experiments validate that overlapping protection domains mitigate vulnerabilities in critical SoC areas for safety-intensive applications.

Enhancing Component-Level Architectural SEU Fault Tolerance for End-to-End SoC Protection

This essay provides a technical summary and analysis of "Who Checks the Checker? Enhancing Component-level Architectural SEU Fault Tolerance for End-to-End SoC Protection" (2603.26637), focusing on the architectural innovations, experimental evaluation, and implications of component-level protection mechanisms for SEU resilience in system-on-chip (SoC) designs.

Motivation and Problem Statement

SEU resilience is paramount for SoCs in radiation-heavy or safety-critical environments such as space, avionics, and high-energy physics. Traditional architectural approaches often protect individual SoC components (e.g., cores, SRAM, interconnect) in isolation. However, such approaches neglect critical vulnerabilities in inter-component logic (bus interconnects, voters, and encoders/decoders), which becomes a single point of failure. Fine-grained triplication (e.g., TMR on the entire RTL with global voting) provides high reliability but incurs substantial area, timing, and power overheads. The central question is how to architect SoC-level SEU protection that closes all vulnerability gaps without incurring the prohibitive costs of global TMR.

Novel Overlapping Fault-Tolerance Architecture

The proposed solution is a systematic, component-specific protection architecture with enforced overlap at protection domain boundaries, demonstrated on "croc," an open-source RISC-V microcontroller SoC.

Component-wise Protection:

SRAM Memories: SECDED ECC (Hsiao code) with scrubbing.
Processor Core: Triple-core lockstep (TCLS) with majority-voted outputs and software resynchronization routines.
Bus Interconnect: relOBI protocol combining TMR on handshake/control signals and ECC on data/address lines.
Peripherals and Registers: Full TMR using Triple Modular Redundancy Generators (TMRG).
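
To make the SRAM entry in the list above more concrete, the following is a minimal sketch of single-error-correcting, double-error-detecting (SECDED) behaviour. It is an assumption-laden illustration: it uses an extended Hamming code rather than the Hsiao code named in the summary, an 8-bit data word chosen only for brevity, and hypothetical function names (`secded_encode`, `secded_decode`) that do not come from the paper.

```python
from typing import List, Tuple


def _is_data_position(pos: int) -> bool:
    """Positions that are not powers of two carry data bits."""
    return (pos & (pos - 1)) != 0


def _syndrome(code: List[int]) -> int:
    """XOR of the (1-based) positions of all set bits; 0 for a valid word."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def secded_encode(data: int, k: int = 8) -> List[int]:
    """Encode k data bits. Index 0 holds the overall parity bit,
    indices 1..n hold the Hamming codeword (parity at powers of two)."""
    r = 0
    while (1 << r) < k + r + 1:  # smallest r with 2**r >= k + r + 1
        r += 1
    n = k + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            code[pos] = (data >> bit) & 1
            bit += 1
    s = _syndrome(code)
    for i in range(r):  # choose parity bits so the syndrome becomes zero
        code[1 << i] = (s >> i) & 1
    code[0] = sum(code) & 1  # make the total number of ones even
    return code


def secded_decode(code: List[int]) -> Tuple[int, str]:
    """Return (data, status), status in {"ok", "corrected", "uncorrectable"}."""
    code = list(code)
    n = len(code) - 1
    s = _syndrome(code)
    odd = sum(code) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s <= n:
        code[s] ^= 1  # single error; s == 0 means the overall parity bit flipped
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. double error: nonzero syndrome, even parity
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            data |= code[pos] << bit
            bit += 1
    return data, status


if __name__ == "__main__":
    word = secded_encode(0xA5)
    word[6] ^= 1  # emulate a single upset in one stored bit
    print(secded_decode(word))
```

A scrubber, in this simplified picture, would periodically read each word, decode it, and write back the re-encoded data whenever the status is "corrected", so that single upsets do not accumulate into double errors.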

The architectural novelty lies in overlapping protection domains (see Figure 1 and Figure 2). For example, replicating bus protocol encoders/decoders within the lockstepped region ensures that the voter/encoder/decoder logic is not a single point of failure. Voters and ECC decoders are incorporated within adjacent protection domains and triplicated, ensuring that failures in one protection domain or its checker are covered by redundancy in the adjoining domain.

Figure 1: Block diagram of the croc architecture, with protected regions highlighted.

Figure 2: Illustration of overlapping protection methods at the SoC architecture boundaries, preventing gaps in coverage.

Implementation and Experimental Methodology

A full RTL-to-layout flow using Yosys and OpenROAD is employed, targeting an IHP 130 nm PDK at 60 MHz, with modular triplication enforced at the hierarchy level. The approach is validated with concurrent fault injection (Synopsys VC Z01X), targeting:

Single flip-flop (FF) upsets (to simulate SEUs)
Single-event transients in combinational logic (gate-level, mapped netlist)
Faults in SRAM cells

Protection is added incrementally across configurations. For each configuration, 100,000 random faults are injected per scenario, and the functional consequences (e.g., masked, corrected, uncorrectable, latent, system failure) are tracked to determine fault coverage.
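
The shape of such a campaign can be sketched with a toy model. The code below is not the Synopsys VC Z01X flow and does not model the croc SoC; it assumes a single hypothetical 8-bit register of which only the low nibble is ever observed, injects random single-bit flips, and classifies each outcome for an unprotected and a triplicated variant. All names (`campaign`, `coverage`, `OBSERVED_MASK`) are illustrative assumptions.

```python
import random
from collections import Counter

WIDTH = 8
OBSERVED_MASK = 0x0F  # hypothetical: only the low nibble is ever read downstream


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def classify_unprotected(value: int, bit: int) -> str:
    """Flip one bit of a plain register and check the observed part."""
    faulty = value ^ (1 << bit)
    if (faulty & OBSERVED_MASK) == (value & OBSERVED_MASK):
        return "masked"  # the upset hit a bit that is never observed
    return "failure"


def classify_tmr(value: int, replica: int, bit: int) -> str:
    """Flip one bit in one of three replicas, then vote."""
    copies = [value, value, value]
    copies[replica] ^= 1 << bit
    voted = majority(*copies)
    return "corrected" if voted == value else "failure"


def campaign(n_faults: int, seed: int = 0) -> dict:
    """Inject n_faults random single-bit upsets and tally the outcomes."""
    rng = random.Random(seed)
    results = {"unprotected": Counter(), "tmr": Counter()}
    for _ in range(n_faults):
        value = rng.randrange(1 << WIDTH)
        bit = rng.randrange(WIDTH)
        replica = rng.randrange(3)
        results["unprotected"][classify_unprotected(value, bit)] += 1
        results["tmr"][classify_tmr(value, replica, bit)] += 1
    return results


def coverage(counter: Counter) -> float:
    """Fraction of injected faults that did not end in a failure."""
    total = sum(counter.values())
    return 1.0 - counter["failure"] / total


if __name__ == "__main__":
    res = campaign(100_000)
    for name, tally in res.items():
        print(name, dict(tally), f"coverage={coverage(tally):.4f}")
```

In this simplified setting a single upset can never defeat the triplicated register, so any residual failures in a real design must come from logic the model omits, such as voters, interconnect, or unprotected blocks.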

Quantitative Results

Area and Performance:

Reference design: Baseline, 1.06 mm² core area.
Full protection (overlapping approach): 2.87 mm² (+171% over baseline), supporting a maximum frequency of 62 MHz.
Global full-TMR (using TMRG): 3.68 mm² (+248% over baseline, or +28% over overlapping approach), with worse routability and timing.
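
The relationship between the "+28%" and "22% lower" figures can be checked with a short calculation. The sketch below uses only the rounded core areas quoted above, so its results may differ from the quoted percentages by about one point; the helper names are hypothetical.

```python
# Rounded core areas in mm^2, as quoted in the list above.
BASELINE = 1.06
OVERLAPPING = 2.87
GLOBAL_TMR = 3.68


def overhead_pct(area: float, reference: float) -> float:
    """Area increase of `area` relative to `reference`, in percent."""
    return 100.0 * (area / reference - 1.0)


def saving_pct(area: float, reference: float) -> float:
    """Area reduction of `area` relative to `reference`, in percent."""
    return 100.0 * (1.0 - area / reference)


if __name__ == "__main__":
    print(f"overlapping vs baseline:   +{overhead_pct(OVERLAPPING, BASELINE):.1f}%")
    print(f"global TMR vs baseline:    +{overhead_pct(GLOBAL_TMR, BASELINE):.1f}%")
    print(f"global TMR vs overlapping: +{overhead_pct(GLOBAL_TMR, OVERLAPPING):.1f}%")
    print(f"overlapping vs global TMR: -{saving_pct(OVERLAPPING, GLOBAL_TMR):.1f}%")
```

The last two lines describe the same gap from opposite reference points: global TMR is about 28% larger than the overlapping design, which is equivalent to the overlapping design being about 22% smaller than global TMR.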

Figure 3: Annotated layout renders for all configurations, showing increasing area overhead with incremental protection.

Fault Tolerance Metrics:

ECC-protected SRAM: Achieves >99% recovery rate for SRAM upsets, eliminates single-word SEU failures; area cost is modest.
Adding TCLS cores: Reduces non-memory SEU/system failure rates by 6.2× (FF faults) and 5.7× (gate-level netlist).
relOBI interconnect: Further reduction in residual system failure rates; increases correction rate, handles faults in handshake and address/data/control logic.
Peripheral TMR: Final configuration achieves >99.9% fault coverage for all fault models; residual failures only occur in non-critical, unprotected debug modules or at combinational output drivers.
Compared to fine-grained global TMR: Nearly identical SEU coverage, but with 22% lower area cost and improved design scalability.

Figure 4: Fault injection results confirming high correction rates in protected domains, even for faults injected directly into voter circuits.

Who Checks the Checker?

Fault injection into the outputs of TCLS majority voters or ECC decoders (the classic "checker vulnerability") reveals that, without protection overlap, the voters themselves become an SEU failure point (16.3% failure rate). With overlap, injected errors are masked or corrected in 99.7% of cases, demonstrating that overlapping domains provide end-to-end protection, including for the voters themselves.
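
The difference between the two cases can be illustrated with a deliberately simplified model. The sketch below assumes 8-bit signals and a single bit flip at a voter output; the second `majority` call in `overlapped_path` stands in for whatever redundancy the adjoining protection domain provides (for example, a triplicated consumer). It is not the paper's circuit and does not attempt to reproduce the 16.3% or 99.7% figures; all function names are hypothetical.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def single_voter_path(core_outputs, fault_mask: int) -> int:
    """One voter at the domain boundary: a flip at its output goes straight
    to the consumer, because nothing downstream can tell it is wrong."""
    voted = majority(*core_outputs)
    return voted ^ fault_mask


def overlapped_path(core_outputs, fault_mask: int, faulty_voter: int = 0) -> int:
    """Three voters whose outputs are consumed by a redundant neighbouring
    domain (modelled here, as an assumption, by a second majority vote)."""
    voted = [majority(*core_outputs) for _ in range(3)]
    voted[faulty_voter] ^= fault_mask  # upset at the output of one voter
    return majority(*voted)


if __name__ == "__main__":
    cores = (0x5A, 0x5A, 0x5A)  # three lockstepped cores agree
    flip = 0x08                 # single-bit upset at a voter output
    print(hex(single_voter_path(cores, flip)))
    print(hex(overlapped_path(cores, flip)))
```

In this toy model the single-voter path always delivers the corrupted value, while the overlapped path always recovers the correct one for a single upset. The real design has additional effects (timing, resynchronization, unprotected outputs) that this sketch leaves out.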

Implications and Comparisons

Architectural Implications:

Overlapping domain protection changes the Pareto frontier for SoC SEU resilience: individually tailored, overlapping methods yield higher fault coverage per area than monolithic TMR.
Modular application of ECC, TMR, and lockstep appropriately matches the fault model and criticality per subcomponent, reducing redundant logic.
System-level SEU vulnerability can be further reduced by enforcing overlap in all checker/voter logic at domain boundaries.

Comparison to State-of-the-Art:

Fine-grained global TMR achieves near-identical SEU coverage but with significantly worse area, routability, and timing overheads.
Previous non-overlapping solutions are demonstrated to leave significant system-level vulnerability—most notably, at fault-tolerance "gaps" (e.g., lockstep voters or ECC decoders).

Limitations and Open Questions

Remaining uncorrected failures are confined to unprotected/isolated logic or output paths beyond TMR voters; this marks the practical limit for logic-level fault tolerance without hardening at the I/O/pad circuitry.
The approach relies on architectural redundancy and is thus agnostic of underlying process hardening, but physical design and implementation constraints become more severe as redundancy and overlapping boundaries increase complexity and reduce optimization flexibility.
The scheme targets single faults and their accumulation, but multiple simultaneous upsets, such as multiple-bit upsets (MBUs), and common-mode failures remain a challenge and may warrant further architectural diversity [Mitra/Common-Mode, 2000].

Impact and Directions for AI/SoC Development

The work concretely advances the field of dependable embedded SoCs, particularly for edge, automotive, and space applications, where area and reliability trade-offs are paramount.
For AI accelerators and heterogeneous SoCs, this approach enables practical, scalable design of multicore or accelerator tiles with tunable resilience and selective hardening, minimizing performance/area impact.
As the complexity of SoC interconnects and accelerator fabrics (e.g., AI cluster tiles) increases, the methodology of protection domain overlap—coupled with cross-hierarchical monitoring—will be increasingly critical.
Future research may explore automated synthesis tooling for recognition and overlapping of protection domains, more granular coverage metric synthesis, and extending this methodology to emerging 3D and chiplet-based architectures, as well as synergy with hardware-level diversity for common-mode failure mitigation.

Conclusion

The "Who Checks the Checker?" paper (2603.26637) establishes an efficient, modular, and highly effective methodology for SoC-level SEU fault tolerance, closing architectural protection gaps through enforced overlap of component-appropriate protection domains. Quantitative results demonstrate over 99.9% fault coverage with 22% less area overhead than global TMR approaches. These findings provide a strong architectural foundation for reliable SoC design in safety- and mission-critical applications, inviting further exploration into automated domain-overlap generation and integration with evolving AI and heterogeneous architectures.
