- The paper introduces LLM-PRISM, a comprehensive hardware-software fault injection methodology that characterizes silent data corruption from permanent GPU defects during LLM training.
- It employs RTL simulation and a seven-dimensional fault site tuple to map error signatures, revealing distinct failure modes such as spike-and-recover and silent degradation.
- Empirical analysis across FP16, BF16, and FP8 formats demonstrates critical trade-offs between crash risk and silent errors, influencing model convergence and reliability.
Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training: The LLM-PRISM Methodology
Motivation and Problem Statement
The LLM-PRISM study systematically investigates the resilience of LLM training to silent data corruption (SDC) from permanent and intermittent GPU hardware faults. Unlike transient faults caused by single bit-flips, permanent hardware defects, which stem from test escapes, early-life failures, or silicon aging, can corrupt values persistently or intermittently over long training runs, introducing numerical errors that range from subtle to catastrophic. In large-scale LLM training infrastructure, where clusters employ tens of thousands of GPUs for weeks or months, the aggregate probability of encountering such faults is significant.
Permanent and intermittent SDC present unique threats: they may not crash the training run or introduce NaNs, and their impact can propagate silently, affecting not just convergence dynamics but downstream LLM behavior and overall reliability. This necessitates a detailed, hardware-grounded investigation of how these faults manifest, propagate, and interact with numerical format and system-level mitigations.
Hardware-Grounded Fault Characterization and Methodology
LLM-PRISM introduces an integrated hardware-software approach, coupling RTL-level GPU fault characterization with a software-level, parameterized fault injection engine deployed in Megatron-LM. The flow, depicted below, samples permanent faults from plausible physical mechanisms (stuck-at and timing-dependent defects) in RTL simulation and maps them to observable error signatures.
Figure 1: RTL characterization flow for error signature extraction.
Empirical error signatures, collected from FMA datapaths, tensor cores, register files, and on-chip caches, show characteristic corruption patterns; a minimal software model of one such signature is sketched below.
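A minimal sketch, assuming a stuck-at defect manifests as one output bit being forced in every affected FP16 value; the bit position and polarity below are illustrative, not signatures measured by the paper:

```python
import numpy as np

def apply_stuck_at(x: np.ndarray, bit: int, stuck_to: int) -> np.ndarray:
    """Force one bit of every FP16 element to 0 or 1, mimicking a
    permanent stuck-at defect on an FMA output lane."""
    raw = x.astype(np.float16).view(np.uint16)
    if stuck_to:
        raw = raw | np.uint16(1 << bit)              # stuck-at-1: OR the bit in
    else:
        raw = raw & np.uint16(~(1 << bit) & 0xFFFF)  # stuck-at-0: mask the bit out
    return raw.view(np.float16)

# Stuck-at-1 on the exponent MSB (bit 14): some values explode, some
# overflow to Inf, and values with the bit already set pass through.
vals = np.array([0.5, 1.0, -2.0], dtype=np.float16)
print(apply_stuck_at(vals, bit=14, stuck_to=1))  # e.g. [32768., inf, -2.]
```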
These signatures inform a rigorous seven-dimensional software-level “fault site tuple” parameterization, spanning spatial (device rank, layer, sub-module), temporal (injection onset, activation rate), and bit-level (fault density, bit-flip profile) aspects.
Figure 3: Software level fault site tuple characterized with seven parameters, categorizing temporal, spatial/architectural, and intensity/bit-level properties.
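As a sketch, the seven dimensions could be captured in a record like the following; the field names paraphrase the categories above rather than the paper's actual code:

```python
from dataclasses import dataclass

@dataclass
class FaultSite:
    # Spatial / architectural
    device_rank: int        # which GPU in the cluster
    layer: int              # which transformer layer
    sub_module: str         # e.g. "attention", "mlp"
    # Temporal
    onset_step: int         # training iteration at which the defect appears
    activation_rate: float  # Bernoulli r: 1.0 = permanent, r << 1 = intermittent
    # Intensity / bit-level
    fault_density: float    # fraction of tensor elements affected
    bit_profile: tuple      # e.g. ("stuck_at_1", 14): polarity and bit position
```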
A Bernoulli parameter governs intermittent activation, unifying permanent (always-on, r=1) and intermittent (r≪1) defects in a common framework.
Figure 4: Stochastic Intermittent Fault Activation Model.
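A minimal sketch of the activation model, assuming one Bernoulli(r) draw per injection opportunity:

```python
import random

def fault_is_active(r: float) -> bool:
    """One Bernoulli(r) draw per injection opportunity: r = 1.0 reproduces
    a permanent (always-on) defect; r << 1 an intermittent one."""
    return random.random() < r
```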
Experimental Design
The study executes 7,664 complete fault-injected training runs (4,681 on GPT-2 Small, 2,983 on GPT-2 Medium) on WikiText. The injection campaigns sweep fault types, numerical formats (FP16, BF16, FP8), activation rates, and target sites. Injection begins at variable checkpoints and targets forward activations or backward gradients, enabling fine-grained analysis of perturbation locality and timing.
The campaign rigorously isolates numerical format effects and employs both the model’s perplexity and downstream task scores to assess end-to-end impact.
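One plausible way to wire such an injection engine into a PyTorch/Megatron-LM-style training loop is via module hooks, reusing the FaultSite and fault_is_active sketches above; the module paths and hook placement here are assumptions, not the paper's implementation:

```python
def make_injector(site, fault_fn):
    """Build forward/backward hooks that corrupt a module's output
    activations (forward) or input gradients (backward) once training
    reaches the configured onset step. Forward calls are counted as a
    proxy for training steps."""
    state = {"step": 0}

    def fwd_hook(module, inputs, output):
        state["step"] += 1
        if state["step"] >= site.onset_step and fault_is_active(site.activation_rate):
            return fault_fn(output)   # corrupt forward activations
        return output

    def bwd_hook(module, grad_input, grad_output):
        if state["step"] >= site.onset_step and fault_is_active(site.activation_rate):
            return tuple(fault_fn(g) if g is not None else None for g in grad_input)
        return grad_input

    return fwd_hook, bwd_hook

# Hypothetical wiring against a GPT-2-style module tree:
# fwd, bwd = make_injector(site, my_stuck_at_fn)
# target = model.transformer.h[site.layer].mlp
# target.register_forward_hook(fwd)
# target.register_full_backward_hook(bwd)
```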
Failure Modes Induced by Permanent and Intermittent Faults
Full pre-training under permanent fault injection uncovers four qualitatively distinct failure modes, each with characteristic loss trajectories and parameter-drift behavior:
- Spike-and-recover: a transient loss excursion followed by a return to the baseline trajectory.
- Gradual drift: parameters steadily diverge from the fault-free run while the loss stays near baseline.
- Silent degradation: near-baseline loss that masks substantial capability loss.
- Crash: NaN/Inf propagation irrecoverably diverges the run.
FP8 exhibits strictly binary outcomes (recovery or crash), while FP16 and BF16 display more diverse modes, including silent degradation.
Analysis of final model quality, training loss anomalies, and NaN propagation reveals nuanced interactions between format, injection phase, and activation rate.
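By way of illustration, triage along these lines might look like the sketch below; the thresholds and labels are hypothetical, and the silent modes can, by definition, only be suspected from the trace:

```python
import math
import torch

def triage_run(losses, final_params, ref_params, drift_tol=0.05):
    """Illustrative triage only; thresholds are not the paper's. Crashes
    and spikes are visible in the loss trace, but the silent modes must be
    flagged via parameter drift against a fault-free reference run and
    confirmed with downstream evaluation."""
    if any(not math.isfinite(x) for x in losses):
        return "crash"                                # NaN/Inf divergence
    spiked = max(losses) > 2.0 * losses[-1]           # transient excursion?
    drift = (torch.norm(final_params - ref_params)
             / torch.norm(ref_params)).item()         # relative parameter drift
    if drift > drift_tol:
        return "suspect_drift_or_silent_degradation"  # needs downstream eval
    return "spike_and_recover" if spiked else "apparently_clean"
```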
Downstream task evaluation of BF16 GPT-2 Medium runs demonstrates that models classified as silent degradation or gradual drift, despite near-baseline loss, can manifest substantial capability loss relative to fault-free baselines.
Figure 7: Downstream performance of faulty GPT2-Medium (BF16) models on the Children's Book Test and Winograd Schema.
Perplexity (PPL) sensitivity analysis shows that higher fault rates, earlier injection points, and targeting of the forward or input-gradient phase all increase the likelihood of degraded convergence.
Figure 8: Variation in PPL with fault rate, injection checkpoint, and injection phase.
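For reference, PPL here is the exponential of the mean token-level cross-entropy over held-out text:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """PPL = exp(mean token-level cross-entropy) on held-out text."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return torch.exp(nll).item()
```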
Effectiveness and Limitations of Runtime Safeguards
The efficacy of runtime NaN checks is format-dependent. For FP16 and BF16, the check converts a substantial portion of crashes into spike-and-recover events but does not eliminate silent degradation. For FP8, the check changes little: the format's narrow dynamic range forces runs to either saturate and recover or crash outright.
Figure 9: Impact of loss NaN-checks on outcome distributions and silent degradation across formats.
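The guard itself is simple. A generic sketch of the technique (not Megatron-LM's own skip logic; production frameworks typically also scan gradients and manage loss scaling):

```python
import math

def guarded_step(loss, optimizer):
    """Skip the optimizer update when the loss is non-finite, turning a
    would-be crash into a transient spike."""
    if not math.isfinite(loss.item()):
        optimizer.zero_grad()  # drop the corrupted step entirely
        return False           # caller may log the event or reload a checkpoint
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return True
```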
Limitations and Future Directions
LLM-PRISM's evaluation leverages hardware-aware software injection but is constrained to GPT-2 scale to keep the campaign feasible. Transferability to larger foundation models hinges on hardware-model realism: the injected error signatures must faithfully match fault modes observed in silicon. Alternate parallelism strategies and multi-defect interactions also remain open questions for resilient system-level design.
Conclusion
LLM-PRISM establishes the first comprehensive, hardware-grounded empirical baseline for silent data corruption resilience during LLM pre-training (2604.10390). Strong claims substantiated by this work include:
- Standard metrics and loss traces are unreliable SDC detectors: Substantial parameter and capability divergence can occur with no NaN/Inf or loss anomaly.
- Numerical format selection imposes critical trade-offs: FP16 is susceptible to persistent silent degradation; BF16 and FP8 trade higher outright crash risk for reduced silent error rates.
- System and algorithm co-design is imperative: Effective SDC detection must go beyond NaN guards, integrating hardware-aware monitoring and possibly fault-tolerant optimization routines.
Practically, this work motivates deployment of hardware-health monitoring, SDC-aware redundancy, and adaptive runtime error handling. Theoretically, it frames the need for statistical and numerical analysis of optimizer dynamics under persistent, structured perturbations. Future developments should address real-time detection of silent SDCs, resilience scaling with massive transformer sizes, and co-designed mitigation across the hardware-software stack.