
Scaling Laws of Synthetic Images for Model Training ... for Now (2312.04567v1)

Published 7 Dec 2023 in cs.CV

Abstract: Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.

The Impact of Synthetic Data on Machine Learning Models

Understanding Synthetic Data in Model Training

In machine learning, the availability and quality of training data are cornerstones of building robust models. Synthetic data generation has come to the forefront as a means of augmenting the limited supply of curated datasets, and researchers have been exploring images created by text-to-image models as training material. A recent examination of this approach provides new insights into the effectiveness of synthetic data for training both supervised classifiers and CLIP (Contrastive Language-Image Pretraining) models.

Key Findings from Recent Studies

Effectiveness in Supervised Models

For image classifiers trained with label supervision, synthetic data does scale, but notably less effectively than real images. The familiar power-law relationship between training-set size and validation loss still holds, although the loss ratio between synthetic and real data shifts once the synthetic dataset grows very large. The inability of off-the-shelf text-to-image models to render certain concepts appears to be a pivotal cause of this scaling gap.
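The power-law relationship mentioned above can be sketched in a few lines. The dataset sizes and validation losses below are made-up illustrative numbers, not measurements from the paper:

```python
import numpy as np

# Illustrative (made-up) dataset sizes and validation losses; not the paper's data.
sizes = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
losses = np.array([2.10, 1.78, 1.51, 1.28, 1.08])

# A pure power law L(n) = a * n^(-b) is linear in log-log space,
# so the scaling exponent b falls out of an ordinary least-squares fit.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
b = -slope           # scaling exponent: larger b means faster improvement with data
a = np.exp(intercept)
print(f"scaling exponent b = {b:.3f}")
```

The paper's observation that synthetic data scales "similarly but less effectively" corresponds, in this picture, to a fitted curve that sits above the real-data curve, with a smaller exponent or a higher loss floor.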

Advantages in Special Scenarios

Despite its general limitations, synthetic data demonstrates particular advantages in specific scenarios:

  • When real data for a supervised problem is scarce (the paper cites fewer than roughly 0.5 million ImageNet images), scaling synthetic data is especially effective.
  • Synthetic data can outperform real data on out-of-distribution tests, suggesting it may help models generalize beyond the original data distribution.
  • In CLIP training, combining synthetic and real data can significantly boost model performance, particularly where available training data is scarce.
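As a rough sketch of the third scenario, a training pool that mixes real and synthetic image-caption pairs at a chosen synthetic fraction might look like the following. The `mix_pool` helper, the file names, and the 50/50 ratio are all hypothetical illustrations, not the paper's recipe:

```python
import random

def mix_pool(real, synthetic, synth_frac=0.5, seed=0):
    """Build a combined training pool with a target synthetic fraction.

    `real` and `synthetic` are lists of (image, caption) pairs.
    The synthetic count is chosen so that synthetic examples make up
    roughly `synth_frac` of the final pool.
    """
    rng = random.Random(seed)
    n_synth = int(len(real) * synth_frac / (1 - synth_frac))
    pool = list(real) + rng.sample(synthetic, min(n_synth, len(synthetic)))
    rng.shuffle(pool)
    return pool

real = [(f"real_{i}.jpg", f"caption {i}") for i in range(6)]
synth = [(f"synth_{i}.png", f"prompt {i}") for i in range(6)]
print(len(mix_pool(real, synth, synth_frac=0.5)))  # 12
```

In practice the mixing would happen at the dataloader or sampler level rather than by materializing one list, but the ratio arithmetic is the same.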

Influence of Model Choices and Prompts

Furthermore, the paper finds that the choice of text-to-image model, the classifier-free guidance scale, and the wording of the text prompts all significantly affect the scaling efficiency of synthetic data. After tuning these variables, synthetic data exhibits a scaling trend similar to that of real data, especially for CLIP training, though it remains slightly less effective.
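Classifier-free guidance, one of the tuned factors, combines a text-conditional and an unconditional diffusion prediction by extrapolating along their difference. The sketch below uses toy arrays rather than real model outputs:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the text-conditional one. A scale of 1 recovers
    # the purely conditional prediction; larger values push samples
    # closer to the prompt, typically at some cost in diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.2, -0.1])   # toy unconditional noise prediction
eps_c = np.array([0.5, 0.3])    # toy conditional noise prediction
print(cfg_combine(eps_u, eps_c, 1.0))   # equals eps_c: [0.5 0.3]
print(cfg_combine(eps_u, eps_c, 7.5))   # amplified conditional direction
```

The paper's point is that this scale is not a free lunch: the value that makes individual samples look best is not necessarily the one that yields the best scaling behavior for downstream training.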

Implications for the Future

The insights from this research imply that synthetic data has the potential to be particularly effective in conditions where there is a substantial domain shift or when real images are not abundant. This is an encouraging development for scenarios that demand extensive data diversification or where data curation is challenging. Looking ahead, the results stress the need to refine the existing generative models to overcome their current limitations, which could eventually enable synthetic data to rival or even outperform real data in a wide range of training situations.

The contribution of this paper enriches our understanding of the role synthetic data can play as we continue to push the boundaries of machine learning capabilities and seek new solutions to data limitations.

Authors (6)
  1. Lijie Fan (19 papers)
  2. Kaifeng Chen (18 papers)
  3. Dilip Krishnan (36 papers)
  4. Dina Katabi (37 papers)
  5. Phillip Isola (84 papers)
  6. Yonglong Tian (32 papers)
Citations (45)