Risk of Gradient Norm Clipping Interaction Leading to Momentum-Only Updates in Sensitive Regions
Determine whether the interaction in which an infinite pre-clipping gradient norm drives the global clipping coefficient to zero, so that all gradients are zeroed out and the optimizer falls back to a momentum-only update, poses an even greater risk to training stability in highly sensitive regions of the loss surface during large language model pretraining.
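The mechanism behind the question can be reproduced in a few lines. The sketch below is illustrative, not the paper's setup: it uses PyTorch with plain SGD-plus-momentum (LLM pretraining typically uses Adam-family optimizers, where the first-moment estimate plays the momentum role), simulates the corrupted infinite norm directly rather than injecting a fault, and assumes the common global-norm clipping rule g * min(1, max_norm / (total_norm + eps)), as in torch.nn.utils.clip_grad_norm_.

```python
import torch

torch.manual_seed(0)
param = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.SGD([param], lr=0.1, momentum=0.9)

# Step 1: an ordinary update populates the momentum buffer.
param.grad = torch.randn(4)
opt.step()
buf = opt.state[param]["momentum_buffer"].clone()

# Step 2: suppose silent corruption yields an infinite pre-clipping
# gradient norm (simulated here). Global-norm clipping of the form
#   g <- g * min(1, max_norm / (total_norm + eps))
# then scales every gradient entry by exactly zero.
param.grad = torch.randn(4)
max_norm = 1.0
total_norm = torch.tensor(float("inf"))  # corrupted norm (simulated)
clip_coef = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
param.grad.mul_(clip_coef)
print(param.grad)  # tensor([0., 0., 0., 0.])

# Step 3: with an all-zero gradient, SGD-with-momentum reduces to
# buf <- mu * buf, param <- param - lr * buf: a momentum-only update.
before = param.detach().clone()
opt.step()
print(torch.allclose(param.detach(), before - 0.1 * 0.9 * buf))  # True
```

The final step moves the parameters along the stale momentum direction with no correction from the current gradient, which is exactly why the concern sharpens in highly sensitive regions of the loss surface: there, even one uncorrected step along an outdated direction can be destabilizing.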
References
It remains an open question whether this interaction with gradient norm clipping poses an even greater risk to training stability in highly sensitive regions of the loss surface.
— Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
(arXiv:2604.00726, Altenbernd et al., 1 Apr 2026), Subsection 4.4, "Severe Effects in the Backward Pass (Momentum-only updates)"