Analyzing Self-Consuming Generative Models: A Study on Model Autophagy Disorder (MAD)
The paper "Self-Consuming Generative Models Go MAD" investigates the dynamics and potential consequences of training generative models using synthetic data from prior generations. This process is termed as autophagous training, akin to a "self-consuming" cycle. The central focus is on understanding how such recursive processes can degrade model quality over iterations, leading to what the authors describe as Model Autophagy Disorder (MAD).
Autophagous Loop Variants
The paper identifies three variants of the autophagous training loop, each representing a different level of reliance on synthetic data:
- Full Synthetic Loop: Models are trained solely on synthetic data generated by previous models. The analysis indicates that in this scenario either the quality (measured as precision) or the diversity (measured as recall) of the models decreases monotonically over generations.
- Mixed with Fixed Real Data: Each generation trains on synthetic data together with a fixed set of real data. This combination delays deterioration, but degradation in model performance remains inevitable, as evidenced by rising Fréchet Inception Distance (FID) and falling precision and recall.
- Mixed with Fresh Real Data: Each iteration incorporates newly collected real data in addition to synthetic data. This approach prevents the generative process from degrading, provided the influx of fresh real data is sufficient. A minimal toy simulation contrasting the three loops follows this list.
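To make the three loops concrete, here is a minimal toy sketch (not the paper's experimental code): each generation refits a one-dimensional Gaussian to samples drawn from the previous fit, optionally mixed with a fixed or freshly drawn real dataset. Parameter names such as n_fixed_real and n_fresh_real are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0  # the "real" data distribution

def autophagous_loop(n_gens=50, n_synth=1000, n_fixed_real=0, n_fresh_real=0):
    """Toy 1-D Gaussian self-consuming loop.

    Each generation draws synthetic samples from the previous fit, optionally
    adds a fixed real dataset and/or freshly drawn real data, then refits the
    Gaussian via the sample mean and standard deviation.
    """
    fixed_real = rng.normal(TRUE_MU, TRUE_SIGMA, n_fixed_real)
    mu, sigma = TRUE_MU, TRUE_SIGMA  # generation 0 starts as a perfect fit
    for _ in range(n_gens):
        synthetic = rng.normal(mu, sigma, n_synth)
        fresh_real = rng.normal(TRUE_MU, TRUE_SIGMA, n_fresh_real)
        data = np.concatenate([synthetic, fixed_real, fresh_real])
        mu, sigma = data.mean(), data.std()
    return mu, sigma

print("fully synthetic:", autophagous_loop())                  # no real data at all
print("fixed real mix: ", autophagous_loop(n_fixed_real=500))  # same real set reused
print("fresh real mix: ", autophagous_loop(n_fresh_real=500))  # new real data each generation
```

In this toy, the fully synthetic loop lets the mean drift and the standard deviation shrink, the fresh-data loop stays anchored near the true distribution, and the fixed-real variant settles near the statistics of the reused real set (which carry their own sampling error), loosely mirroring the paper's finding that a fixed real dataset only delays degradation.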
Impact of Sampling Bias
A significant portion of the paper examines the effects of sampling bias: users and practitioners tend to favor high-quality synthetic samples, and this preference biases the sampling process toward quality at the expense of diversity. The trade-off has notable consequences:
- Without bias, the model parameters behave like a random walk: the modes drift away from the true distribution, causing quality loss over time.
- With bias, precision can be maintained or even improved, but recall declines more sharply, reflecting reduced diversity. A small sketch contrasting the unbiased and biased regimes follows this list.
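One simple way to mimic this cherry-picking in the same toy setting (a proxy for the paper's bias mechanism, not its exact formulation) is to keep only the synthetic samples closest to the current mode before refitting; keep_frac below is an illustrative knob.

```python
import numpy as np

rng = np.random.default_rng(1)

def biased_loop(n_gens=50, n_synth=1000, keep_frac=1.0):
    """Fully synthetic 1-D Gaussian loop with optional quality-biased selection.

    "Quality" is proxied by closeness to the current mean (the high-density
    region); keep_frac=1.0 recovers the unbiased loop.
    """
    mu, sigma = 0.0, 1.0
    for _ in range(n_gens):
        x = rng.normal(mu, sigma, n_synth)
        if keep_frac < 1.0:
            k = int(keep_frac * n_synth)
            x = x[np.argsort(np.abs(x - mu))[:k]]  # cherry-pick the "best" samples
        mu, sigma = x.mean(), x.std()
    return mu, sigma

print("unbiased (keep all):   ", biased_loop(keep_frac=1.0))
print("biased (keep top 50%): ", biased_loop(keep_frac=0.5))
```

In the biased run the standard deviation collapses within a handful of generations while the mean barely moves; in the unbiased run the variance shrinks only slowly while the mean wanders, echoing the precision-versus-recall trade-off described above.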
Empirical and Theoretical Investigations
Both theoretical analyses and empirical studies are presented across a range of generative models, including Gaussian and Gaussian mixture models, StyleGAN, and denoising diffusion probabilistic models (DDPMs). The results consistently demonstrate the onset of MAD when models are trained cyclically on synthetic data without adequate supplementation of real data.
- Gaussian Models: Provided foundational insights, illustrating variance collapse caused by recursive estimation errors; a worked illustration of this collapse follows the list.
- Complex Models: StyleGAN and DDPM experiments reinforced the prediction of artifact proliferation and diversity loss, especially under biased sampling conditions.
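As a worked illustration of the variance collapse (a standard argument for the fully synthetic Gaussian case, not necessarily the paper's exact derivation): if generation t refits a Gaussian by maximum likelihood to n samples drawn from generation t-1, then

```latex
\mathbb{E}\!\left[\hat{\sigma}_t^{2} \mid \sigma_{t-1}^{2}\right]
  = \frac{n-1}{n}\,\sigma_{t-1}^{2}
\;\Longrightarrow\;
\mathbb{E}\!\left[\sigma_t^{2}\right] = \left(\frac{n-1}{n}\right)^{t}\sigma_0^{2},
\qquad
\operatorname{Var}\!\left(\hat{\mu}_t - \mu_0\right)
  = \frac{1}{n}\sum_{s=0}^{t-1}\mathbb{E}\!\left[\sigma_s^{2}\right].
```

So the fitted variance contracts in expectation at every generation while the fitted mean accumulates random-walk error, which matches the qualitative picture of shrinking diversity and drifting modes.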
Implications and Future Considerations
The findings emphasize the necessity of incorporating fresh real data into training loops to avert MAD. Because fresh real data can be scarce, strategies for identifying real data and ensuring its presence in training sets become crucial.
Furthermore, the work points to directions for future research, including techniques for recognizing synthetic data and for keeping synthetic-to-real data ratios in check. The broader ramifications extend to other domains, such as text generation with large language models (LLMs), underscoring the paper's relevance beyond imagery.
In conclusion, "Self-Consuming Generative Models Go MAD" makes a substantial contribution to our understanding of the constraints and requirements for sustaining generative model performance in an increasingly synthetic-data-driven landscape. The paper offers crucial insights and practical guidance for researchers and practitioners engaged in model development and deployment, emphasizing the pitfalls of over-reliance on synthetic data and the importance of strategic data sourcing.