GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning (2403.12003v2)
Abstract: Self-supervised learning has achieved remarkable success in acquiring high-quality representations from unlabeled data. The widely adopted contrastive learning framework aims to learn invariant representations by minimizing the distance between positive views originating from the same image. However, existing techniques for constructing positive views rely heavily on manual transformations, resulting in limited diversity and potential false-positive pairs. To tackle these challenges, we present GenView, a controllable framework that increases the diversity of positive views by leveraging pretrained generative models while preserving semantics. We develop an adaptive view generation method that dynamically adjusts the noise level during sampling, preserving essential semantic content while introducing variability. Additionally, we introduce a quality-driven contrastive loss, which assesses the quality of positive pairs by considering both foreground similarity and background diversity. This loss prioritizes the high-quality positive pairs we construct while reducing the influence of low-quality pairs, thereby mitigating potential semantic inconsistencies introduced by generative models and aggressive data augmentation. Thanks to the improved positive view quality and the quality-driven contrastive loss, GenView significantly improves self-supervised learning across various tasks. For instance, GenView improves MoCov2 performance by 2.5%/2.2% on ImageNet linear/semi-supervised classification. Moreover, GenView substantially outperforms naively augmenting the ImageNet dataset with Laion400M or ImageNet21K. Code: https://github.com/xiaojieli0903/genview.
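The abstract describes the quality-driven loss only at a high level: each positive pair gets a quality score from its foreground similarity and background diversity, and the score reweights the contrastive objective. Below is a minimal PyTorch sketch of how such a weighting could wrap an InfoNCE-style loss. The function name, the product-form quality score, and the mean normalization are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def quality_weighted_infonce(q, k, fg_sim, bg_div, temperature=0.2):
    """Hypothetical quality-driven contrastive loss (sketch).

    q, k:    L2-normalized embeddings of the two views, shape (N, D).
    fg_sim:  per-pair foreground similarity in [0, 1], shape (N,).
    bg_div:  per-pair background diversity in [0, 1], shape (N,).
    """
    # Illustrative quality score: high when foregrounds agree while
    # backgrounds differ; the paper's exact formulation may differ.
    w = fg_sim * bg_div
    w = w / (w.mean() + 1e-8)  # keep the overall loss scale stable

    logits = q @ k.t() / temperature            # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    per_pair = F.cross_entropy(logits, labels, reduction="none")
    return (w * per_pair).mean()                # down-weights low-quality pairs

# Example usage with random stand-in inputs:
N, D = 8, 128
q = F.normalize(torch.randn(N, D), dim=1)
k = F.normalize(torch.randn(N, D), dim=1)
fg_sim = torch.rand(N)  # stand-in foreground similarity scores
bg_div = torch.rand(N)  # stand-in background diversity scores
loss = quality_weighted_infonce(q, k, fg_sim, bg_div)
```

The design intent mirrors the abstract: pairs whose foregrounds match but whose backgrounds vary are treated as high-quality positives and dominate the gradient, while likely false positives (mismatched foregrounds) contribute less.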
Authors: Xiaojie Li, Yibo Yang, Xiangtai Li, Jianlong Wu, Yue Yu, Bernard Ghanem, Min Zhang