Efficacy of Momentum Reset as a Proactive Mitigation for Silent Data Corruption

Determine whether resetting the momentum of parameters that exhibit gradient spikes can serve as a proactive mitigation for silent data corruption (SDC) during large language model pretraining, given that SDC can corrupt gradients across many parameters at once and may therefore trigger numerous momentum resets.

Background

The paper observes that faults injected during the backward pass can cause gradient spikes and loss bumps even when gradient-norm clipping is enabled. Prior work proposes resetting the optimizer momentum of parameters that exhibit spike behavior in order to stabilize training. However, because SDC can corrupt gradients across many parameters simultaneously, momentum resets might be triggered extensively, raising uncertainty about their effectiveness as a proactive mitigation strategy.
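The mechanism under question can be sketched as follows. This is a minimal, illustrative implementation, not the paper's method: the spike rule (gradient norm exceeding a multiple of its running average), the threshold factor, and the SGD-with-momentum optimizer are all assumptions chosen for clarity; the prior work targets adaptive optimizers with richer state.

```python
# Illustrative sketch of per-parameter momentum reset on gradient spikes.
# The spike heuristic and all constants are assumptions for this example.

def detect_spike(grad_norm, running_avg, factor=10.0):
    """Flag a spike when the gradient norm exceeds `factor` times the
    running average of past norms (hypothetical heuristic)."""
    return running_avg > 0 and grad_norm > factor * running_avg

class MomentumSGD:
    """Toy SGD-with-momentum optimizer with spike-triggered momentum reset."""

    def __init__(self, lr=0.01, beta=0.9):
        self.lr, self.beta = lr, beta
        self.m = {}         # per-parameter momentum buffer
        self.avg_norm = {}  # running average of |grad| per parameter

    def step(self, params, grads):
        for k in params:
            g = grads[k]
            avg = self.avg_norm.get(k, 0.0)
            if detect_spike(abs(g), avg):
                # Proactive mitigation: discard the momentum accumulated so
                # far so a corrupted gradient does not keep propagating
                # through future updates via the momentum buffer.
                self.m[k] = 0.0
            self.avg_norm[k] = 0.9 * avg + 0.1 * abs(g)
            self.m[k] = self.beta * self.m.get(k, 0.0) + g
            params[k] -= self.lr * self.m[k]
        return params
```

The open question maps directly onto this sketch: an SDC event in the backward pass can make `detect_spike` fire for a large fraction of parameters in the same step, zeroing momentum buffers en masse, and it is unclear whether training remains stable under such widespread resets.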

The authors explicitly flag the uncertainty about whether momentum resets effectively mitigate SDC-induced corruption, motivating investigation into their utility and impact during LLM training.

References

It remains an open question whether momentum reset can serve as a proactive mitigation for SDC, as SDC can cause widespread gradient corruption and may trigger numerous momentum resets.

Exploring Silent Data Corruption as a Reliability Challenge in LLM Training (2604.00726 - Altenbernd et al., 1 Apr 2026) in Subsection 4.4, Severe Effects in the Backward Pass (Gradient corruption persists even with clipping)