- The paper proposes a novel, overlapping protection methodology that integrates ECC, TMR, and lockstep techniques to secure both component and checker logic.
- It achieves over 99.9% SEU fault coverage, masks or corrects 99.7% of faults injected at voter and decoder outputs, and uses 22% less area than full global TMR.
- Comprehensive fault injection experiments validate that overlapping protection domains mitigate vulnerabilities in critical SoC areas for safety-intensive applications.
Enhancing Component-Level Architectural SEU Fault Tolerance for End-to-End SoC Protection
This essay provides a technical summary and analysis of "Who Checks the Checker? Enhancing Component-level Architectural SEU Fault Tolerance for End-to-End SoC Protection" (2603.26637), focusing on the architectural innovations, experimental evaluation, and implications of component-level protection mechanisms for SEU resilience in system-on-chip (SoC) designs.
Motivation and Problem Statement
SEU resilience is paramount for SoCs in radiation-heavy or safety-critical environments such as space, avionics, and high-energy physics. Traditional architectural approaches often protect individual SoC components (e.g., cores, SRAM, interconnect) in isolation. Such approaches neglect the logic between components, including bus interconnects, voters, and encoders/decoders, each of which can become a single point of failure. Fine-grained triplication (e.g., TMR on the entire RTL with global voting) provides high reliability but incurs high area, timing, and power overheads. The central question addressed is how to architect SoC-level SEU protection that efficiently closes all vulnerability gaps without incurring the prohibitive costs of global TMR.
Novel Overlapping Fault-Tolerance Architecture
The proposed solution is a systematic, component-specific protection architecture with enforced overlap at protection domain boundaries, demonstrated on "croc," an open-source RISC-V microcontroller SoC.
Component-wise Protection:
- SRAM Memories: SECDED ECC (Hsiao code) with scrubbing.
- Processor Core: Triple-core lockstep (TCLS) with voter outputs and software resynchronization routines.
- Bus Interconnect: relOBI protocol combining TMR on handshake/control signals and ECC on data/address lines.
- Peripherals and Registers: Full TMR generated with the Triple Modular Redundancy Generator (TMRG).
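To make the SECDED behaviour of the memory protection concrete, here is a minimal Python sketch. It is an illustration only: it uses a 4-bit extended Hamming code rather than the Hsiao code named above, and the function names `encode` and `decode` are hypothetical, not taken from the paper or from croc. Both codes share the property that matters here: any single-bit error is corrected and any double-bit error is detected.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit SECDED word (extended Hamming, not Hsiao)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                       # index 0 holds the overall parity bit
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # parity over positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # parity over positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # parity over positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1             # makes the total parity even
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    overall = sum(bits) & 1              # 0 for a valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1              # single error; syndrome 0 means the parity bit
        status = "corrected"
    else:
        status = "uncorrectable"         # even number of flips with nonzero syndrome
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status
```

A scrubber in this picture would periodically read each word, run `decode`, and write back the re-encoded data whenever the status is `corrected`, so that single upsets do not accumulate into uncorrectable double errors.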
The architectural novelty is in overlapping protection domains (see Figure 1 and Figure 2): e.g., replicating bus protocol encoders/decoders within the lockstepped region ensures that the voter/encoder/decoder logic is not a single point of failure. Voters and ECC decoders are incorporated within adjacent protection domains and triplicated, ensuring failures in one protection domain or its checker are covered by redundancy in the adjoining domain.

Figure 1: Block diagram of the croc architecture, with protected regions highlighted.
Figure 2: Illustration of overlapping protection methods at the SoC architecture boundaries, preventing gaps in coverage.
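The effect of overlapping a voter into the adjoining domain can be sketched in a few lines of Python. This is a behavioural toy model under stated assumptions, not the paper's RTL: the function names and the fault-mask parameters are hypothetical, and the final `majority` call stands in for one copy of the triplicated downstream logic voting on the three voter outputs.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def single_voter_stage(replicas, voter_fault_mask=0):
    """Three replicas feed ONE voter; a flip on its output goes straight through."""
    return majority(*replicas) ^ voter_fault_mask


def overlapped_voter_stage(replicas, voter_fault_masks=(0, 0, 0)):
    """Three voters each see all replicas; the adjoining domain votes again."""
    voted = [majority(*replicas) ^ mask for mask in voter_fault_masks]
    # Hypothetical model of one downstream copy resolving the three voter outputs.
    return majority(*voted)


golden = 0xA5
replicas = (golden, golden, golden)

# An upset on the lone voter output corrupts the result.
print(single_voter_stage(replicas, voter_fault_mask=0x01) == golden)             # False
# The same upset on one of three voters is outvoted by the other two.
print(overlapped_voter_stage(replicas, voter_fault_masks=(0x01, 0, 0)) == golden)  # True
```

In both variants a fault in one replica is masked; only the overlapped variant also masks a fault in the checker itself, which is the gap the architecture is designed to close.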
Implementation and Experimental Methodology
A full RTL-to-layout flow using Yosys and OpenROAD is employed, targeting an IHP 130 nm PDK at 60 MHz, with modular triplication enforced at the hierarchy level. The approach is validated with concurrent fault injection (Synopsys VC Z01X), targeting:
- Single flip-flop (FF) upsets (to simulate SEUs)
- Single-event transients in combinational logic (gate-level, mapped netlist)
- Faults in SRAM cells
Protection is added incrementally across a series of configurations. For each configuration, 100,000 random faults are injected per scenario, and the functional consequences (e.g., masked, corrected, uncorrectable, latent, system failure) are tracked to determine fault coverage.
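The structure of such a campaign (pick a random fault site and time, rerun the design, compare against a fault-free golden run, tally the outcome) can be sketched in Python. Everything here is a hypothetical toy: the two-register accumulator, the stimulus `DATA`, and the two-way masked/failure classification are assumptions for illustration and are far simpler than the gate-level flow and outcome categories described above.

```python
import random
from collections import Counter

DATA = [3, 7, 11, 19]      # hypothetical stimulus
REGS = ("acc", "tmp")      # the toy design's two 8-bit flip-flop banks


def run(fault=None):
    """Simulate the toy accumulator; fault is (cycle, reg, bit) or None."""
    state = {"acc": 0, "tmp": 0}
    for cycle, value in enumerate(DATA):
        state["tmp"] = value
        state["acc"] = (state["acc"] + state["tmp"]) & 0xFF
        if fault is not None and fault[0] == cycle:
            _, reg, bit = fault
            state[reg] ^= 1 << bit   # single flip-flop upset at the end of this cycle
    return state["acc"]


def campaign(n_faults, seed=0):
    """Inject n_faults random single upsets and classify each against the golden run."""
    rng = random.Random(seed)
    golden = run()
    outcomes = Counter()
    for _ in range(n_faults):
        fault = (rng.randrange(len(DATA)), rng.choice(REGS), rng.randrange(8))
        outcomes["masked" if run(fault) == golden else "failure"] += 1
    return outcomes


outcomes = campaign(1000)
coverage = outcomes["masked"] / sum(outcomes.values())
print(outcomes, f"coverage = {coverage:.1%}")
```

In this unprotected toy, upsets in `tmp` are always masked because the register is overwritten or never observed, while upsets in `acc` always reach the output. A protected configuration would be evaluated by swapping in a different `run` function and comparing the resulting coverage figures.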
Quantitative Results
Area and Performance:
- Reference design: Baseline, 1.06 mm² core area.
- Full protection (overlapping approach): 2.87 mm² (+171% over baseline), supporting a maximum frequency of 62 MHz.
- Global full-TMR (using TMRG): 3.68 mm² (+248% over baseline, or +28% over overlapping approach), with worse routability and timing.
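The relative figures follow directly from the three core areas. The short Python check below recomputes them from the rounded areas quoted above; the variable and function names are illustrative. Because the inputs are rounded to two decimals, a recomputed percentage can differ by a point from a value derived from unrounded areas.

```python
# Core areas in mm^2, as quoted in the list above.
baseline = 1.06
overlapping = 2.87
global_tmr = 3.68


def pct_over(area: float, reference: float) -> float:
    """Percentage by which `area` exceeds `reference`."""
    return 100.0 * (area - reference) / reference


overhead_overlapping = pct_over(overlapping, baseline)   # overlapping vs. baseline
overhead_global = pct_over(global_tmr, baseline)         # global TMR vs. baseline
tmr_vs_overlapping = pct_over(global_tmr, overlapping)   # global TMR vs. overlapping
saving = 100.0 * (global_tmr - overlapping) / global_tmr # area saved vs. global TMR

print(f"overlapping over baseline: +{overhead_overlapping:.1f}%")
print(f"global TMR over baseline:  +{overhead_global:.1f}%")
print(f"global TMR over overlap:   +{tmr_vs_overlapping:.1f}%")
print(f"area saved vs. global TMR: {saving:.1f}%")
```

Note that the 22% saving is measured on total area relative to global TMR, while the 28% figure expresses the same gap relative to the overlapping design.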
Figure 3: Annotated layout renders for all configurations, showing increasing area overhead with incremental protection.
Fault Tolerance Metrics:
Who Checks the Checker?
- Fault injection into the output of TCLS majority voters or ECC decoders (the classic "checker vulnerability") reveals that, without protection overlap, voters become an SEU failure point, with a 16.3% failure rate. With overlap, injected errors are masked or corrected in 99.7% of cases, demonstrating that overlapping domains provide end-to-end protection, including for the voters themselves.
Implications and Comparisons
Architectural Implications:
- Overlapping domain protection changes the Pareto frontier for SoC SEU resilience: individually tailored, overlapping methods yield higher fault coverage per area than monolithic TMR.
- Modular application of ECC, TMR, and lockstep appropriately matches the fault model and criticality per subcomponent, reducing redundant logic.
- System-level SEU vulnerability can be further reduced by enforcing overlap in all checker/voter logic at domain boundaries.
Comparison to State-of-the-Art:
- Fine-grained global TMR achieves near-identical SEU coverage but with significantly worse area, routability, and timing overheads.
- Previous non-overlapping solutions are shown to leave significant system-level vulnerability, most notably at the "gaps" between protection domains (e.g., lockstep voters or ECC decoders).
Limitations and Open Questions
- Remaining uncorrected failures are confined to unprotected or isolated logic and to output paths beyond the TMR voters; this marks the practical limit of logic-level fault tolerance without hardening the I/O and pad circuitry.
- The approach relies on architectural redundancy and is thus agnostic to underlying process hardening, but physical design and implementation constraints become more severe as redundancy and overlapping boundaries increase complexity and reduce optimization flexibility.
- The scheme targets single faults and their accumulation over time, but multiple simultaneous upsets (multi-bit upsets, MBUs) and common-mode failures remain a challenge and may warrant further architectural diversity [Mitra/Common-Mode, 2000].
Impact and Directions for AI/SoC Development
- The work concretely advances the field of dependable embedded SoCs, particularly for edge, automotive, and space applications, where area and reliability trade-offs are paramount.
- For AI accelerators and heterogeneous SoCs, this approach enables practical, scalable design of multicore or accelerator tiles with tunable resilience and selective hardening, minimizing performance/area impact.
- As the complexity of SoC interconnects and accelerator fabrics (e.g., AI cluster tiles) increases, protection domain overlap, coupled with cross-hierarchical monitoring, is likely to become increasingly important.
- Future research may explore automated synthesis tooling for recognition and overlapping of protection domains, more granular coverage metric synthesis, and extending this methodology to emerging 3D and chiplet-based architectures, as well as synergy with hardware-level diversity for common-mode failure mitigation.
Conclusion
The "Who Checks the Checker?" paper (2603.26637) establishes an efficient, modular, and highly effective methodology for SoC-level SEU fault tolerance, closing architectural protection gaps through enforced overlap of component-appropriate protection domains. Quantitative results demonstrate over 99.9% fault coverage with 22% less area than a global TMR approach (2.87 mm² versus 3.68 mm²). These findings provide a strong architectural foundation for reliable SoC design in safety- and mission-critical applications, inviting further exploration into automated domain-overlap generation and integration with evolving AI and heterogeneous architectures.