Design and Experimental Investigation of Trikarenos: A Fault-Tolerant 28nm RISC-V-based SoC (2407.05938v1)
Abstract: We present a fault-tolerant by-design RISC-V SoC and experimentally assess it under atmospheric neutrons and 200 MeV protons. The dedicated ECC and Triple-Core Lockstep countermeasures correct most errors, guaranteeing a device cross-section lower than $5.36 \times 10^{-12}$ cm$^2$.
Summary
- The paper introduces Trikarenos, a 28nm RISC-V SoC that integrates TCLS and ECC to mitigate radiation-induced errors.
- It employs extensive neutron and proton radiation testing at ChipIR and HollandPTC to validate its reliability under harsh conditions.
- The experimental results indicate a high Mean Time To Failure and minimal system crashes, confirming its suitability for critical applications.
Design and Experimental Investigation of Trikarenos: A Fault-Tolerant 28nm RISC-V-based SoC
The paper under review presents a fault-tolerant system-on-chip (SoC) named Trikarenos based on the RISC-V architecture. The significance of this work lies in addressing the reliability requirements for automotive and space applications, focusing on mitigation techniques against ionizing radiation, a persistent concern for electronic components in these critical domains.
RISC-V Architecture and Fault Tolerance Mechanisms
Trikarenos builds on the RISC-V Instruction Set Architecture (ISA), known for its flexibility and open ecosystem. The SoC integrates two fault-tolerance mechanisms, Triple-Core Lockstep (TCLS) and Error Correction Codes (ECC), to mitigate Single Event Effects (SEEs) such as Single Event Upsets (SEUs) and Single Event Transients (SETs). TCLS is central to the design: it runs three physically separated cores in lockstep and corrects any discrepancy between them caused by a radiation strike, so the system keeps operating correctly.
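The lockstep idea can be illustrated with a small software model of a 2-of-3 majority voter. This is a minimal sketch, not the voter used in Trikarenos: the source does not describe the hardware implementation, and the function names `majority_vote` and `vote_and_flag` are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(outputs):
    """Return the voted word and the indices of cores that disagree with it.

    `outputs` is a tuple of three integers, one per core (hypothetical interface).
    """
    a, b, c = outputs
    voted = majority_vote(a, b, c)
    mismatched = [i for i, value in enumerate(outputs) if value != voted]
    return voted, mismatched


# Core 1 suffers a single bit flip; the other two cores outvote it.
golden = 0xDEADBEEF
voted, faulty_cores = vote_and_flag((golden, golden ^ (1 << 7), golden))
```

In this model a non-empty `faulty_cores` list would be the trigger for resynchronizing the affected core, which is the "correctable TCLS error" case counted in the experiments.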
Experimental Evaluation
The paper's primary contributions rest on the experimental evaluation of Trikarenos under neutron and proton radiation environments. The evaluations were conducted at the ChipIR and HollandPTC facilities to simulate atmospheric and orbital conditions, respectively.
SRAM Vulnerability and ECC Effectiveness
Trikarenos incorporates 256 KiB of static RAM (SRAM) protected by Single Error Correction, Double Error Detection (SECDED) ECC. The experimental results indicated a bit error rate of $1.92 \times 10^{-4}$ errors/bit/hour for neutrons and 9180 errors/hour for protons, with respective cross-sections per bit of $(1.08 \pm 0.01) \times 10^{-14}$ cm$^2$/bit and $(1.12 \pm 0.01) \times 10^{-15}$ cm$^2$/bit. These findings validate the ECC's efficacy in correcting SRAM errors induced by radiation.
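To make the SECDED behaviour concrete, the following sketch implements an extended Hamming (8,4) code, the smallest code with single-error-correcting, double-error-detecting properties. It is an illustration only: the source does not state the code or word width used in the Trikarenos SRAM (memories typically protect much wider words), and `encode` and `decode` are hypothetical names.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4) code
    with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(codeword):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double_error'."""
    code = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2          # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double_error"  # parity holds but syndrome is non-zero
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


# A single upset is corrected; two upsets in the same word are only detected.
word = encode(0b1011)
single = list(word)
single[6] ^= 1
double = list(single)
double[2] ^= 1
```

Here `decode(single)` recovers the original nibble, while `decode(double)` reports an uncorrectable word. This is why SECDED protection copes well with isolated upsets but relies on upsets rarely accumulating in the same word.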
TCLS Mechanism Performance
The TCLS mechanism's effectiveness was demonstrated through the negligible number of system crashes observed during the tests. The recorded cross-section for correctable TCLS errors was $(2.55 \pm 0.68) \times 10^{-11}$ cm$^2$ for neutrons and $(5.25 \pm 1.51) \times 10^{-12}$ cm$^2$ for protons. This signifies that the TCLS mechanism can significantly enhance system reliability in radiation-prone environments. Notably, the minimum Mean Time To Failure (MTTF) was estimated to exceed 1.06 million years in a terrestrial setting, highlighting the robustness of the fault-tolerant design.
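The link between a cross-section and an MTTF figure can be sketched with the standard relations: cross-section = events / fluence, and failure rate = cross-section × flux. The example below is illustrative arithmetic, not the paper's calculation. The flux of 20 neutrons/cm$^2$/hour is an assumption chosen for this sketch (the summary does not state the flux the authors used), and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def cross_section(events: int, fluence_per_cm2: float) -> float:
    """Cross-section in cm^2: observed events divided by particle fluence (particles/cm^2)."""
    return events / fluence_per_cm2


def mttf_years(sigma_cm2: float, flux_per_cm2_per_hour: float) -> float:
    """MTTF in years, assuming a constant failure rate of sigma * flux per hour."""
    failure_rate_per_hour = sigma_cm2 * flux_per_cm2_per_hour
    return 1.0 / failure_rate_per_hour / HOURS_PER_YEAR


# Upper bound on the device cross-section, taken from the abstract.
SIGMA_UPPER_BOUND_CM2 = 5.36e-12
# ASSUMPTION: terrestrial neutron flux, not taken from the source.
ASSUMED_FLUX_PER_CM2_PER_HOUR = 20.0

# An upper bound on the cross-section yields a lower bound on the MTTF.
mttf_lower_bound = mttf_years(SIGMA_UPPER_BOUND_CM2, ASSUMED_FLUX_PER_CM2_PER_HOUR)
print(f"MTTF lower bound: {mttf_lower_bound:.3e} years")
```

Under this assumed flux the bound evaluates to a little over one million years, the same order as the figure quoted above. A different flux (for example at altitude) scales the result inversely.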
Methodological Rigor
The experimental setup included a comprehensive monitoring and data acquisition system. The device under test ran the CoreMark benchmark, augmented with custom routines to maximize fault observability within the core complex and memory. Tests were conducted under realistic conditions, replicating the energy spectra of atmospheric neutrons and orbital protons, so that the findings remain relevant to real-world applications.
Theoretical and Practical Implications
From a theoretical perspective, the results underscore the potential of RISC-V-based SoCs for high-reliability applications by leveraging customizable architectural enhancements. The successful integration of TCLS and ECC mechanisms within a RISC-V framework sets a precedent for future SoC designs targeting radiation resilience.
Practically, the demonstrated reliability of Trikarenos positions it as a suitable candidate for deployment in automotive and space applications where system failures can have catastrophic consequences. The research indicates that ongoing efforts in hardware fault tolerance can achieve substantial reductions in error rates, which is crucial for the continuous operation of critical systems exposed to radiation.
Future Developments
Future research could focus on further optimizing the fault-tolerant mechanisms to reduce power consumption and implementation overhead without compromising reliability. Additionally, expanding the scope of testing to include other radiation types and energy levels could provide a more comprehensive validation of the design. Investigating the integration of additional fault detection schemes, such as watchdog timers for independent core operations, may offer enhanced system resilience.
Conclusion
The evaluation of the Trikarenos SoC highlights its potential as an effective fault-tolerant solution for environments with stringent reliability requirements. The combination of TCLS and ECC within a RISC-V framework has been shown to significantly mitigate the adverse effects of radiation, underscoring the viability of adopting open ISA architectures for critical application domains. The research presented in this paper offers valuable insights and serves as a reference point for future developments in fault-tolerant system design.