Explain why SIMS benefits from small amounts of synthetic data pollution

Investigate and explain why Self-IMproving diffusion models with Synthetic data (SIMS) can improve Fréchet Inception Distance (FID) when the base diffusion model is trained on a dataset polluted with a small amount of synthetic data from a previous-generation model in a synthetic augmentation loop. The improvement was observed on CIFAR‑10 for |D_p| < 30k and on FFHQ‑64 for |D_p| < 15k, where |D_p| is the number of polluting synthetic samples. Determine the mechanism by which negative guidance exploits this polluting synthetic data, and identify the conditions under which the improvement occurs.

Background

In realistic experiments on CIFAR‑10 and FFHQ‑64, the authors compare standard training with SIMS when the real training data is polluted with synthetic samples generated by a prior model. Standard training degrades as the amount of synthetic pollution grows, a failure mode known as Model Autophagy Disorder (MAD). SIMS is largely immune to this degradation and even achieves an improved FID when the synthetic proportion is modest.

This counterintuitive improvement suggests that SIMS may exploit the polluting synthetic data through its negative‑guidance mechanism. The underlying reason is not yet understood, which motivates this focused open question.
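
To make the negative-guidance mechanism concrete, here is a minimal toy sketch. It assumes the guided score takes the extrapolation form s = s_base - ω (s_aux - s_base), where s_base is the score of the base model and s_aux is the score of an auxiliary model fine-tuned on the base model's own synthetic samples. The 1-D Gaussian setting, the function names, and all numeric values are hypothetical and chosen only for illustration; none of them come from the paper's experiments.

```python
import numpy as np


def gaussian_score(x, mean, std=1.0):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, std^2)."""
    return -(x - mean) / std**2


def negative_guidance_score(score_base, score_aux, omega):
    """Extrapolate away from the auxiliary (synthetic-data) score.

    Assumed form: s = s_base - omega * (s_aux - s_base).
    """
    return score_base - omega * (score_aux - score_base)


# Hypothetical toy numbers, for illustration only.
REAL_MEAN = 0.0   # mean of the real-data distribution
BASE_MEAN = 0.1   # base model, biased by its (polluted) training data
AUX_MEAN = 0.2    # auxiliary model, drifted further in the same direction
OMEGA = 1.0       # guidance strength

x = np.linspace(-3.0, 3.0, 7)
guided = negative_guidance_score(
    gaussian_score(x, BASE_MEAN), gaussian_score(x, AUX_MEAN), OMEGA
)

# For equal-variance Gaussians the guided score is again a Gaussian score,
# with mean BASE_MEAN + OMEGA * (BASE_MEAN - AUX_MEAN).
guided_mean = BASE_MEAN + OMEGA * (BASE_MEAN - AUX_MEAN)
print(guided_mean)
```

In this toy the bias cancels only because the auxiliary model's error is assumed to continue in the same direction as the base model's error. Whether polluting synthetic data in the base training set produces an analogous alignment in real diffusion models is exactly what the open question asks.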

References

More precisely, the plots indicate that, for $|D_{\rm p}| < 30$k with CIFAR-10 (60\% of $|D_{\rm r}|$) and $|D_{\rm p}| < 15$k for FFHQ-64 (20\% of $|D_{\rm r}|$), SIMS not only prevents MADness in the second generation models but also achieves a self-improved FID by somehow exploiting the polluting synthetic data from the previous generation in its training set. The reason for this behavior remains an interesting open research question.

Self-Improving Diffusion Models with Synthetic Data (2408.16333 - Alemohammad et al., 2024) in Section 4.2.2 (Realistic Data in a Synthetic Augmentation Loop), end of Results