Real-Fake: Effective Training Data Synthesis Through Distribution Matching (2310.10402v2)

Published 16 Oct 2023 in cs.LG and cs.AI

Abstract: Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and an augmentation to real datasets, while also providing benefits such as out-of-distribution generalization, privacy preservation, and scalability. Specifically, we achieve 70.9% top-1 classification accuracy on ImageNet1K when training solely with synthetic data equivalent to 1× the original real data size, which increases to 76.0% when scaling up to 10× synthetic data.
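The abstract frames training data synthesis as a distribution-matching problem: synthetic data is useful to the extent that its distribution (in some feature space) matches the real data distribution. One standard way to quantify such a match is the kernel two-sample (MMD) statistic; the sketch below is a minimal illustration of that perspective, not the authors' actual pipeline, and all names (`rbf_kernel`, `mmd2`, the feature batches) are hypothetical stand-ins.

```python
# A minimal sketch of a distribution-matching criterion, assuming an MMD-style
# kernel two-sample statistic. Illustrative only; not the paper's exact method.
import torch


def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """RBF (Gaussian) kernel matrix between two batches of feature vectors."""
    sq_dists = torch.cdist(x, y).pow(2)  # pairwise squared Euclidean distances
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd2(real_feats: torch.Tensor, synth_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared maximum mean discrepancy (MMD^2).

    Values near zero indicate the synthetic feature distribution is close
    to the real one under the chosen kernel.
    """
    k_rr = rbf_kernel(real_feats, real_feats, sigma).mean()
    k_ss = rbf_kernel(synth_feats, synth_feats, sigma).mean()
    k_rs = rbf_kernel(real_feats, synth_feats, sigma).mean()
    return k_rr + k_ss - 2.0 * k_rs


# Example: random batches standing in for encoder features of real vs. synthetic images.
real = torch.randn(128, 512)
synth = torch.randn(128, 512)
print(mmd2(real, synth).item())  # small when the two distributions match
```

In a setup like the paper describes, a statistic of this kind would be driven down by steering the generative model, so that training on the synthesized samples approximates training on the real distribution.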

Authors (5)
  1. Jianhao Yuan (10 papers)
  2. Jie Zhang (846 papers)
  3. Shuyang Sun (25 papers)
  4. Philip Torr (172 papers)
  5. Bo Zhao (242 papers)
Citations (15)