Explain optimizer performance gaps in diffusion model training
Explain the performance differences observed between AdamW and SGD, and between Muon or SOAP and AdamW, when training a U-Net DDPM on Navier–Stokes Kolmogorov-flow trajectories. Identify the architectural or optimization factors that cause these gaps in training/validation loss and in generative quality, given that class-imbalance explanations do not apply in this setting.
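To make the comparison concrete, the following is a minimal, hypothetical sketch of the two update rules whose gap the question asks about: plain SGD versus AdamW (with its decoupled weight decay). It deliberately uses a toy ill-conditioned quadratic rather than the paper's actual benchmark (a U-Net DDPM on Kolmogorov-flow data); all learning rates and the test problem are illustrative assumptions, not values from the paper.

```python
import numpy as np

def loss_and_grad(w, H):
    # Quadratic loss 0.5 * w^T H w with gradient H @ w
    # (a stand-in objective, NOT the diffusion training loss).
    return 0.5 * w @ H @ w, H @ w

def run_sgd(w0, H, lr=0.01, steps=200):
    # Plain gradient descent: w <- w - lr * grad
    w = w0.copy()
    for _ in range(steps):
        _, g = loss_and_grad(w, H)
        w = w - lr * g
    return loss_and_grad(w, H)[0]

def run_adamw(w0, H, lr=0.05, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, steps=200):
    # AdamW: Adam's bias-corrected moment estimates plus
    # weight decay applied directly to the weights (decoupled).
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w, H)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return loss_and_grad(w, H)[0]

rng = np.random.default_rng(0)
H = np.diag([100.0, 1.0])   # ill-conditioned curvature directions
w0 = rng.normal(size=2)

print("SGD final loss:  ", run_sgd(w0, H))
print("AdamW final loss:", run_adamw(w0, H))
```

AdamW's per-coordinate scaling by the second-moment estimate makes it far less sensitive to curvature anisotropy than SGD, which is one of the candidate mechanisms the open question asks to confirm or rule out for the diffusion setting.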
References
Another open question that remains is to explain the performance difference between AdamW and SGD and between Muon/SOAP and AdamW, as this benchmark problem lies outside of the domain of previously offered explanations.
— Optimization Benchmark for Diffusion Models on Dynamical Systems
(2510.19376 - Schaipp, 22 Oct 2025) in Conclusion (Section 4)