Risk of Gradient Norm Clipping Interaction Leading to Momentum-Only Updates in Sensitive Regions

Determine whether the failure mode in which an infinite pre-clipping gradient norm causes gradient norm clipping to zero out all gradients, so that the optimizer applies a momentum-only update, poses an even greater risk to training stability in highly sensitive regions of the loss surface during large language model pretraining.

Background

The paper shows that if the global gradient norm before clipping becomes infinite, gradient norm clipping scales all gradients to zero, effectively ignoring the mini-batch and causing the optimizer to apply a momentum-only update. In data-parallel settings, a single replica with extreme gradients can induce this effect globally after aggregation, discarding all replicas' work for that step.
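The mechanism can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: `clip_by_global_norm` mirrors PyTorch-style global-norm clipping (scale = max_norm / total_norm, clamped to 1), and `momentum_step` is ordinary SGD with momentum; both function names are hypothetical. Note that the squared norm can overflow to infinity even when every individual gradient is still finite.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their global L2 norm is at most max_norm.

    Sketch of PyTorch-style clipping: scale = max_norm / (total_norm + eps),
    clamped to 1. If total_norm overflows to inf, scale becomes 0.0 and
    every finite gradient is zeroed out.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))  # inf if any g*g overflows
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]

def momentum_step(params, grads, velocity, lr=0.1, beta=0.9):
    """One SGD-with-momentum update: v <- beta*v + g; p <- p - lr*v."""
    for i, g in enumerate(grads):
        velocity[i] = beta * velocity[i] + g
        params[i] -= lr * velocity[i]
    return params, velocity

# One extreme gradient (e.g. from a single corrupted replica, after
# aggregation) makes the squared norm overflow: (1e200)**2 -> inf,
# so the global norm is inf and the clipping scale is 0.0.
grads = [0.5, -0.2, 1e200]
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)  # all zeros: the mini-batch is effectively discarded

# The optimizer step then moves only along the stale velocity term.
params, velocity = momentum_step([1.0, 2.0, 3.0], clipped, [0.5, 0.5, 0.5])
```

With all clipped gradients at zero, the update reduces to `p -= lr * beta * v`, i.e. a pure momentum step along the direction accumulated from earlier batches, which is exactly the stale-direction update the question is concerned with.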

Because such momentum-only updates may push the model along stale directions, the authors question whether this interaction with gradient norm clipping further destabilizes training, especially in regions of the loss surface that are highly sensitive to parameter changes.

References

It remains an open question whether this interaction with gradient norm clipping poses an even greater risk to training stability in highly sensitive regions of the loss surface.

Exploring Silent Data Corruption as a Reliability Challenge in LLM Training (2604.00726 - Altenbernd et al., 1 Apr 2026) in Subsection 4.4, Severe Effects in the Backward Pass (Momentum-only updates)