Explain optimizer performance gaps in diffusion model training
Explain the performance differences observed between AdamW and SGD, and between Muon or SOAP and AdamW, when training a U-Net DDPM on Navier–Stokes Kolmogorov-flow trajectories. Identify the architectural or optimization factors that cause these gaps in training/validation loss and in generative quality, given that class-imbalance explanations do not apply in this setting.
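To make the comparison concrete, the following is a minimal, hypothetical sketch of the two update rules whose gap the question asks about: plain SGD versus AdamW (with its decoupled weight decay). It deliberately uses a toy ill-conditioned quadratic rather than the paper's actual benchmark (a U-Net DDPM on Kolmogorov-flow data); all learning rates and the test problem are illustrative assumptions, not values from the paper.

```python
import numpy as np

def loss_and_grad(w, H):
    # Quadratic loss 0.5 * w^T H w with gradient H @ w
    # (a stand-in objective, NOT the diffusion training loss).
    return 0.5 * w @ H @ w, H @ w

def run_sgd(w0, H, lr=0.01, steps=200):
    # Plain gradient descent: w <- w - lr * grad
    w = w0.copy()
    for _ in range(steps):
        _, g = loss_and_grad(w, H)
        w = w - lr * g
    return loss_and_grad(w, H)[0]

def run_adamw(w0, H, lr=0.05, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, steps=200):
    # AdamW: Adam's bias-corrected moment estimates plus
    # weight decay applied directly to the weights (decoupled).
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w, H)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return loss_and_grad(w, H)[0]

rng = np.random.default_rng(0)
H = np.diag([100.0, 1.0])   # ill-conditioned curvature directions
w0 = rng.normal(size=2)

print("SGD final loss:  ", run_sgd(w0, H))
print("AdamW final loss:", run_adamw(w0, H))
```

AdamW's per-coordinate scaling by the second-moment estimate makes it far less sensitive to curvature anisotropy than SGD, which is one of the candidate mechanisms the open question asks to confirm or rule out for the diffusion setting.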
References
Another open question that remains is to explain the performance difference between AdamW and SGD and between Muon/SOAP and AdamW, as this benchmark problem lies outside of the domain of previously offered explanations.
— Optimization Benchmark for Diffusion Models on Dynamical Systems
(2510.19376 - Schaipp, 22 Oct 2025) in Conclusion (Section 4)