Explain why SIMS benefits from small amounts of synthetic data pollution
Investigate and explain why Self-IMproving diffusion models with Synthetic data (SIMS) can improve the Fréchet Inception Distance (FID) when the base diffusion model is trained on a dataset polluted with a small amount of synthetic data from a previous-generation model in a synthetic augmentation loop. This effect is observed for CIFAR-10 when |D_p| < 30k and for FFHQ-64 when |D_p| < 15k, where |D_p| is the size of the polluting synthetic set. Determine the mechanism by which negative guidance exploits such polluting synthetic data, and identify the conditions under which the improvement occurs.
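For orientation, the negative-guidance step referred to in the problem statement can be sketched on a toy example. This is a minimal illustration, not the SIMS implementation. It assumes the guided score takes the extrapolation form s_base - omega * (s_aux - s_base), where s_aux is the score of an auxiliary model fitted to the base model's own synthetic samples. The function names, the 1-D Gaussian scores, and all numeric values are hypothetical.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log density) of a 1-D Gaussian N(mean, std^2)."""
    return -(x - mean) / std**2


def negative_guidance(score_base, score_aux, omega):
    """Extrapolate the base score away from the auxiliary score.

    Assumed form (not taken from the text above):
        s = s_base - omega * (s_aux - s_base)
    omega = 0 recovers the base score unchanged.
    """
    return score_base - omega * (score_aux - score_base)


# Hypothetical toy values, chosen only for illustration.
x = np.linspace(-3.0, 3.0, 7)
s_base = gaussian_score(x, mean=0.2, std=1.0)  # stand-in for the base model
s_aux = gaussian_score(x, mean=0.5, std=1.0)   # stand-in for the auxiliary model
s_guided = negative_guidance(s_base, s_aux, omega=1.0)
```

In this equal-variance toy case the guided score is again a Gaussian score, with its mean moved from 0.2 to 0.2 + omega * (0.2 - 0.5), that is, away from the auxiliary model's mean. The sketch shows only the direction of the extrapolation; it does not explain why a polluted base model would benefit from it, which is the open question here.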
References
More precisely, the plots indicate that, for $|D_{\rm p}| < 30$k with CIFAR-10 (60\% of $|D_{\rm r}|$) and $|D_{\rm p}| < 15$k for FFHQ-64 (20\% of $|D_{\rm r}|$), SIMS not only prevents MADness in the second generation models but also achieves a self-improved FID by somehow exploiting the polluting synthetic data from the previous generation in its training set. The reason for this behavior remains an interesting open research question.