The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better (2406.05184v4)

Published 7 Jun 2024 in cs.CV

Abstract: Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. Does the intermediate generator provide additional information over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from the simple retrieval baseline. Our analysis suggests that this underperformance is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images. Overall, we argue that targeted retrieval is a critical baseline to consider when training with synthetic data -- a baseline that current methods do not yet surpass. We release code, data, and models at https://github.com/scottgeng00/unmet-promise.

The Unmet Promise of Synthetic Training Images: An Analysis

The paper "The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better" explores the comparative efficacy of synthetic versus real images in the context of training vision models. This investigation is grounded in the specific setting of image classification, focusing particularly on the utility of synthetic images generated by state-of-the-art text-to-image models like Stable Diffusion compared to real images retrieved from the training data of these generative models.

Core Findings and Numerical Results

The research systematically compares two methods for curating training datasets: generating synthetic images with a model trained on the LAION-2B dataset and retrieving real images from that same dataset. Empirical results show that real images match or outperform their synthetic counterparts across benchmarks including ImageNet-1K, the Describable Textures Dataset (DTD), FGVC-Aircraft, StanfordCars, and Oxford Flowers102. Notably, on FGVC-Aircraft, synthetic data improved linear probing accuracy by 3.8 percentage points, while retrieved real data yielded a substantially larger gain of 17.8 percentage points. This advantage persisted even when the amount of synthetic data was scaled well beyond the available real data.
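
As context for these numbers, linear probing fits a linear classifier on features from a frozen backbone. The sketch below shows one standard way to compute such an accuracy for each training set; the backbone and data loaders are placeholders rather than the paper's exact configuration.

```python
# Hedged sketch of linear-probe evaluation: extract features from a frozen
# backbone, then fit and score a linear classifier. Backbone and loaders are
# placeholders, not the paper's exact setup.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    backbone.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images.to(device)).cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe_accuracy(backbone, train_loader, test_loader):
    X_tr, y_tr = extract_features(backbone, train_loader)
    X_te, y_te = extract_features(backbone, test_loader)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# acc_synthetic = linear_probe_accuracy(backbone, synthetic_train_loader, test_loader)
# acc_retrieved = linear_probe_accuracy(backbone, retrieved_train_loader, test_loader)
```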

Analytical Insights

The paper attributes the superior performance of real data to several factors, chiefly generator artifacts and inaccuracies in task-relevant visual details within synthetic images. Both low-level artifacts, such as blur, and distortions of high-level, class-specific details were observed in synthetic images, undermining their effectiveness. Qualitative analysis supported these observations, showing that while synthetic images may visually resemble their real counterparts, they frequently diverge in critical details and composition.

Methodological Robustness

The paper’s methodological approach is robust, encompassing multiple versions of the Stable Diffusion model and extensive ablation experiments to assess consistency. Across different iterations of the generative model, the results consistently favored real images, reinforcing the reliability of the findings. The paper also explores hybrid datasets that mix synthetic and real images, finding that the combination does not significantly exceed the performance of real data alone.
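
For illustration, a minimal version of such a mixing ablation might look like the following; the dataset objects and the way the synthetic fraction is controlled are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of a real/synthetic mixing ablation (assumed interface,
# not the paper's code): subsample the synthetic pool so it makes up a chosen
# fraction of the combined training set.
import torch
from torch.utils.data import ConcatDataset, Subset

def mixed_dataset(real_ds, synth_ds, synth_fraction=0.5, seed=0):
    """Return a dataset whose synthetic share is roughly `synth_fraction`."""
    assert 0.0 <= synth_fraction < 1.0, "synthetic fraction must be in [0, 1)"
    g = torch.Generator().manual_seed(seed)
    n_synth = int(len(real_ds) * synth_fraction / (1.0 - synth_fraction))
    n_synth = min(n_synth, len(synth_ds))
    idx = torch.randperm(len(synth_ds), generator=g)[:n_synth].tolist()
    return ConcatDataset([real_ds, Subset(synth_ds, idx)])

# Example: a 50/50 mix, then train exactly as with the real-only dataset.
# train_ds = mixed_dataset(retrieved_real_ds, stable_diffusion_ds, 0.5)
```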

Implications and Future Research Directions

This research emphasizes the importance of treating targeted retrieval as a baseline when evaluating the utility of synthetic training data. The implications are clear: synthetic data offers a viable alternative when access to real data is restricted, but practical applications require further refinement to bridge the observed performance gap.

Future research could explore generating images that explicitly account for compositions absent in the generator's training set, potentially enhancing the value of synthetic data. Additionally, given the results, there is a compelling need to improve the fidelity and semantic accuracy of synthetic images to make them comparable to real images in training efficacy.

Conclusion

"The Unmet Promise of Synthetic Training Images" provides a well-substantiated examination of the comparative value of synthetic versus retrieved real images in model training. By highlighting the limitations of synthetic data and promoting real data retrieval as a benchmark, the paper sets a clear research trajectory towards optimizing the use of synthetic data in training vision models. This work inspires future efforts to enhance generative model capabilities and data utilization strategies in artificial intelligence.

Authors (7)
  1. Scott Geng (6 papers)
  2. Cheng-Yu Hsieh (23 papers)
  3. Vivek Ramanujan (17 papers)
  4. Matthew Wallingford (13 papers)
  5. Chun-Liang Li (60 papers)
  6. Pang Wei Koh (64 papers)
  7. Ranjay Krishna (116 papers)