Recovering merge-based gains at long learning-rate decay horizons

Determine whether alternative weight-space checkpoint merging configurations can recover, at long learning-rate decay horizons, the merge-based performance gains observed at shorter horizons during pretraining of large language models such as Nemotron 3 Super 120B‑A12B Base. Candidate configurations include coefficient schedules other than minus‑sqrt, finer-grained sliding-window checkpoint averaging, and merging strategies designed specifically for long learning-rate decay horizons.

Background

During the stable learning-rate phase of pretraining, the authors use offline checkpoint merging (sliding-window weighted averaging) to reduce performance noise and emulate LR decay, consistently observing 2–4 point average accuracy improvements over individual trained checkpoints across 12 benchmarks.

However, during the final long learning‑rate decay phase (5T tokens), the advantage of merged checkpoints narrows and largely disappears relative to decay‑trained checkpoints, contrasting with prior smaller‑scale reports where merging retained gains. The experiments here only considered a single coefficient schedule (minus‑sqrt) and fixed checkpoint granularity (checkpoints every ~25B tokens with windows of 125B/250B/500B).
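The sliding-window merging described above can be sketched in a few lines. The excerpt does not specify the exact form of the minus-sqrt coefficient schedule, so the weighting below (newest checkpoint weighted most, older ones decaying as `1 - sqrt(i / window)`, then normalized) is an illustrative assumption, not the paper's exact scheme; the function names and the toy parameter dicts are likewise hypothetical.

```python
import math

def minus_sqrt_weights(window: int) -> list[float]:
    """Hypothetical 'minus-sqrt' coefficient schedule for a merge window.

    Checkpoints are assumed ordered newest-first, so index 0 (the latest
    checkpoint) receives the largest coefficient and older checkpoints
    decay as 1 - sqrt(i / window). Weights are normalized to sum to 1.
    """
    raw = [1.0 - math.sqrt(i / window) for i in range(window)]
    total = sum(raw)
    return [w / total for w in raw]

def merge_window(checkpoints: list[dict], weights: list[float]) -> dict:
    """Weight-space average of a window of checkpoints.

    Each checkpoint is a dict mapping parameter name -> list of floats
    (standing in for real weight tensors). Returns the per-parameter
    weighted average, i.e. one merged checkpoint for this window.
    """
    merged = {}
    for name in checkpoints[0]:
        merged[name] = [
            sum(w * ckpt[name][j] for w, ckpt in zip(weights, checkpoints))
            for j in range(len(checkpoints[0][name]))
        ]
    return merged

# Toy usage: a window of 3 checkpoints (newest first), merged with the
# assumed minus-sqrt schedule. With ~25B tokens between checkpoints, a
# window of 5 would correspond to the 125B-token setting in the excerpt.
ckpts = [{"w": [1.0, 0.0]}, {"w": [0.5, 0.5]}, {"w": [0.0, 1.0]}]
merged = merge_window(ckpts, minus_sqrt_weights(len(ckpts)))
```

In this framing, the open question amounts to varying `minus_sqrt_weights` (other coefficient schedules), the spacing and size of the window (checkpoint granularity), or the merge rule itself, and testing whether any variant retains its advantage once the long decay phase is included.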

The authors note that alternative coefficient schemes, finer checkpoint windows, or strategies designed specifically for long decay horizons might recover the gains seen at shorter horizons, but this remains unverified.

References

That said, our experiments explored only a single merge schedule (minus-sqrt) and a fixed checkpoint granularity; it remains plausible that alternative coefficient schemes, finer-grained checkpoint windows, or merging strategies tailored to longer decay horizons could recover the gains observed at shorter scales.

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning  (2604.12374 - NVIDIA et al., 14 Apr 2026) in Subsection "Tracking Merge Evaluation" (Pretraining)