Learning Vision from Models Rivals Learning Vision from Data (2312.17742v1)

Published 28 Dec 2023 in cs.CV

Abstract: We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.

Introduction

In artificial intelligence, and in computer vision in particular, representation learning is the foundational process of transforming raw data into a form that machines can use for tasks such as recognizing objects and understanding scenes. The effectiveness of this process depends heavily on the diversity and quality of the underlying data.

Researchers have historically relied on large real-world image datasets to train these algorithms, but collecting such data is costly, complex, and difficult to scale. An emerging alternative is synthetic image data produced by generative models, which are trained to create new content resembling their training data. This strategy is explored through SynCLR, a system that uses generative models to create vast arrays of synthetic images paired with textual descriptions.
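
To make the two-stage pipeline concrete, here is a minimal Python sketch of the caption-then-image synthesis. The LLM client, the prompt wording, and the Stable Diffusion checkpoint are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the caption -> image synthesis pipeline.
# `llm` is assumed to be a callable that returns completion text; the
# prompt template and the diffusion checkpoint are illustrative stand-ins.
import torch
from diffusers import StableDiffusionPipeline

def synthesize_captions(llm, concepts, per_concept=10):
    """Ask an LLM to write varied captions around each visual concept."""
    captions = []
    for concept in concepts:
        prompt = (f"Write {per_concept} short, diverse image captions "
                  f"about: {concept}. One caption per line.")
        reply = llm(prompt)
        captions += [line.strip() for line in reply.splitlines() if line.strip()]
    return captions

def synthesize_images(captions, images_per_caption=4):
    """Render several images per caption with an off-the-shelf T2I model."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")
    dataset = []
    for cap_id, caption in enumerate(captions):
        out = pipe(caption, num_images_per_prompt=images_per_caption,
                   guidance_scale=2.5)  # modest guidance keeps samples diverse
        dataset += [(img, cap_id) for img in out.images]
    return dataset  # (image, caption_id); equal ids form positive pairs
```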

SynCLR: Learning from Synthetic Data

SynCLR ties visual class definitions to textual captions. Captions are generated with LLMs and then converted into images by text-to-image models, yielding a substantial training corpus. The key idea is that all images generated from the same caption are treated as belonging to the same visual class. Grouping images by shared concept in this way provides richer supervision than instance-level self-supervision, where each image forms its own class.
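
This grouping rule suggests a multi-positive contrastive objective: within a batch, every pair of images that shares a caption is a positive pair. The sketch below follows the general StableRep-style recipe of taking a cross-entropy between the contrastive similarity distribution and a uniform distribution over the positives; the function name and exact formulation are illustrative, not the paper's implementation.

```python
# Hedged sketch of a multi-positive contrastive loss: images sharing a
# caption id are positives for one another. Names are illustrative.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """embeddings: (N, D) features; caption_ids: (N,) integer tensor."""
    z = F.normalize(embeddings, dim=1)
    logits = z @ z.t() / temperature                  # pairwise similarities
    # Exclude self-similarity with a large negative (finite, to avoid NaNs).
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)
    # Target: a uniform distribution over same-caption images.
    match = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    target = match.float() / match.sum(dim=1, keepdim=True).clamp(min=1)
    # Cross-entropy between target and contrastive distributions.
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

In practice each batch holds several images per caption, so every row of the target matrix has at least one positive.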

Impact on Visual Tasks

The SynCLR-trained models demonstrate impressive performance across various visual tasks. They achieve linear classification accuracies on par with those of other leading visual representation learners such as CLIP, and even outperform some self-supervised approaches pre-trained on real data. Beyond image classification, SynCLR transfers well to dense prediction tasks such as semantic segmentation on ADE20k, rivaling methods that rely on higher-resolution training phases or intermediate fine-tuning stages.
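
For reference, linear classification accuracy is typically measured with a linear probe: the pretrained encoder is frozen and only a linear classifier is trained on its features. The sketch below illustrates that protocol with placeholder names and hyperparameters; it is not the paper's evaluation setup.

```python
# Illustrative linear-probe protocol: freeze the pretrained encoder, train
# only a linear head. Encoder, loaders, and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, train_loader, num_classes, feat_dim, epochs=10):
    encoder.eval()                                  # frozen backbone
    head = nn.Linear(feat_dim, num_classes).cuda()
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():                   # no gradients into encoder
                feats = encoder(images.cuda())      # assumed pooled (N, D)
            loss = F.cross_entropy(head(feats), labels.cuda())
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```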

Findings and Future Work

SynCLR's success highlights the potential of learning from synthetic data. Its parity with models trained on real images suggests that synthetic datasets can be a cost-effective and scalable resource for training visual representations. Looking ahead, refining how captions are synthesized, exploring different data sampling strategies, and adopting more advanced model architectures may unlock further gains.

The approach exemplified by SynCLR opens a promising direction for visual representation learning, where generative models not only reduce dependence on real-world data collection but also enable more flexible and scalable dataset curation. The exciting outcomes of this research invite continued exploration into the capabilities of synthetic data in the ever-evolving landscape of machine learning.

References (111)
  1. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018.
  2. Masked siamese networks for label-efficient learning. In ECCV, 2022.
  3. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023.
  4. Data2vec: A general framework for self-supervised learning in speech, vision and language. In ICML, 2022.
  5. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  6. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992.
  7. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
  8. Denoising pretraining for semantic segmentation. In CVPR, 2022.
  9. Language models are few-shot learners. NeurIPS, 2020.
  10. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
  11. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  12. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
  13. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  14. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
  15. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, 2019.
  16. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017.
  17. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.
  18. Describing textures in the wild. In CVPR, 2014.
  19. Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233, 2023.
  20. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
  21. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR workshops, 2020.
  22. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
  23. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
  24. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  25. Large scale adversarial representation learning. NeurIPS, 2019.
  26. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  27. Peco: Perceptual codebook for bert pre-training of vision transformers. In AAAI, 2023.
  28. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  29. The pascal visual object classes (voc) challenge. IJCV, 2010.
  30. Scaling laws of synthetic images for model training … for now. arXiv preprint arXiv:2312.04567, 2023a.
  31. Improving clip training with language rewrites. In NeurIPS, 2023b.
  32. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023a.
  33. Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023b.
  34. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR, 2004.
  35. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
  36. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  37. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  38. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
  39. Bootstrap your own latent-a new approach to self-supervised learning. In NeurIPS, 2020.
  40. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  41. Deep residual learning for image recognition. In CVPR, 2016.
  42. Rethinking imagenet pre-training. In ICCV, 2019.
  43. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  44. Masked autoencoders are scalable vision learners. In CVPR, 2022a.
  45. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022b.
  46. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
  47. Deep networks with stochastic depth. In ECCV, 2016.
  48. Generative models as a data source for multiview representation learning. arXiv preprint arXiv:2106.05258, 2021.
  49. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
  50. Supervised contrastive learning. In NeurIPS, 2020.
  51. Collecting a large-scale dataset of fine-grained cars. Tech report, 2013.
  52. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
  53. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.
  54. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  55. Your diffusion model is secretly a zero-shot classifier. arXiv preprint arXiv:2303.16203, 2023a.
  56. Mage: Masked generative encoder to unify representation learning and image synthesis. In CVPR, 2023b.
  57. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
  58. Palm up: Playing in the latent manifold for unsupervised pretraining. arXiv preprint arXiv:2210.10913, 2022.
  59. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  60. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  61. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
  62. Generating training data with language models: Towards zero-shot language understanding. arXiv preprint arXiv:2202.04538, 2022.
  63. Leveraging sequence-to-sequence speech synthesis for enhancing acoustic-to-word speech recognition. In SLT, 2018.
  64. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
  65. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
  66. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  67. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  68. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  69. Cats and dogs. In CVPR, 2012.
  70. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
  71. Learning transferable visual models from natural language supervision. In ICML, 2021.
  72. High-resolution image synthesis with latent diffusion models. In CVPR, 2022a.
  73. High-resolution image synthesis with latent diffusion models. In CVPR, 2022b.
  74. Speech recognition with augmented synthesized speech. In ASRU, 2019.
  75. Generating synthetic audio data for attention-based speech recognition systems. In ICASSP, 2020.
  76. Weighted ensemble self-supervised learning. arXiv preprint arXiv:2211.09981, 2022.
  77. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  78. Fake it till you make it: Learning transferable representations from synthetic imagenet clones. In CVPR, 2023.
  79. The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. arXiv preprint arXiv:2306.01923, 2023.
  80. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
  81. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR workshops, 2014.
  82. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  83. Mastering the game of go without human knowledge. Nature, 2017.
  84. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  85. The german traffic sign recognition benchmark: a multi-class classification competition. In IJCNN, 2011.
  86. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
  87. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 2023.
  88. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  89. What makes for good views for contrastive learning? In NeurIPS, 2020.
  90. Divide and contrast: Self-supervised learning from uncurated data. In ICCV, 2021.
  91. Stablerep: Synthetic images from text-to-image models make strong visual representation learners. In NeurIPS, 2023.
  92. Going deeper with image transformers. In ICCV, 2021.
  93. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  94. Learning from synthetic humans. In CVPR, 2017.
  95. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
  96. Masked feature prediction for self-supervised visual pre-training. In CVPR, 2022.
  97. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  98. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
  99. Unified perceptual parsing for scene understanding. In ECCV, 2018.
  100. Simmim: A simple framework for masked image modeling. In CVPR, 2022.
  101. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, 2023.
  102. Generative data augmentation for commonsense reasoning. arXiv preprint arXiv:2004.11546, 2020.
  103. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
  104. Scaling vision transformers. In CVPR, 2022.
  105. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  106. Colorful image colorization. In ECCV, 2016.
  107. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023.
  108. Learning deep features for discriminative localization. In CVPR, 2016.
  109. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019.
  110. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
  111. Training on thin air: Improve image classification with generated data. arXiv preprint arXiv:2305.15316, 2023.
Authors (6)
  1. Yonglong Tian
  2. Lijie Fan
  3. Kaifeng Chen
  4. Dina Katabi
  5. Dilip Krishnan
  6. Phillip Isola