Recovering merge-based gains at long learning-rate decay horizons
Determine whether alternative weight-space checkpoint merging configurations (e.g., coefficient schedules other than minus-sqrt, finer-grained sliding-window checkpoint averaging, or merging strategies tailored to long learning-rate decay horizons) can recover the merge-based performance gains observed at shorter horizons during pretraining of large language models such as Nemotron 3 Super 120B-A12B Base.
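The configurations in question all reduce to one operation: a convex combination of checkpoint weights under some coefficient schedule, possibly restricted to a sliding window of recent checkpoints. A minimal sketch of that operation follows, using toy scalar parameters in place of tensors. The `minus_sqrt` schedule shown is a hypothetical reading of the "minus-sqrt" name (successive square-root differences, so earlier checkpoints get more weight), not the paper's exact formula; the function and variable names are illustrative.

```python
from typing import Callable, Dict, List

# Toy stand-in for a model checkpoint: parameter name -> scalar value.
Checkpoint = Dict[str, float]


def merge_checkpoints(
    checkpoints: List[Checkpoint],
    schedule: Callable[[int, int], float],
) -> Checkpoint:
    """Weight-space average of checkpoints under a coefficient schedule.

    schedule(i, n) returns the unnormalized coefficient for the i-th of n
    checkpoints (i = 0 is the oldest); coefficients are normalized to sum
    to 1, so the merge is a convex combination of the checkpoints.
    """
    n = len(checkpoints)
    raw = [schedule(i, n) for i in range(n)]
    total = sum(raw)
    coeffs = [w / total for w in raw]
    merged = {name: 0.0 for name in checkpoints[0]}
    for c, ckpt in zip(coeffs, checkpoints):
        for name, value in ckpt.items():
            merged[name] += c * value
    return merged


def minus_sqrt(i: int, n: int) -> float:
    """Hypothetical 'minus-sqrt' schedule: sqrt(i+1) - sqrt(i), which
    decays with i. This is an assumed form, not the paper's definition."""
    return (i + 1) ** 0.5 - i ** 0.5


def sliding_window_merge(checkpoints: List[Checkpoint], window: int) -> Checkpoint:
    """Uniformly merge only the most recent `window` checkpoints, the
    finer-grained sliding-window variant the question proposes."""
    return merge_checkpoints(checkpoints[-window:], lambda i, n: 1.0)
```

Exploring the question then amounts to sweeping `schedule` (uniform, assumed minus-sqrt, linear ramps, etc.) and `window` against downstream evaluations at each decay horizon.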
References
That said, our experiments explored only a single merge schedule (minus-sqrt) and a fixed checkpoint granularity; it remains plausible that alternative coefficient schemes, finer-grained checkpoint windows, or merging strategies tailored to longer decay horizons could recover the gains observed at shorter scales.
— Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
(2604.12374 - NVIDIA et al., 14 Apr 2026) in Subsection "Tracking Merge Evaluation" (Pretraining)