Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations (2312.02205v1)
Abstract: Self-Supervised Learning (SSL) enables training performant models using limited labeled data. One of the pillars underlying vision SSL is the use of data augmentations/perturbations of the input which do not significantly alter its semantic content. For audio and other temporal signals, augmentations are commonly used alongside format transforms such as Fourier transforms or wavelet transforms. Unlike augmentations, format transforms do not change the information contained in the data; rather, they express the same information in different coordinates. In this paper, we study the effects of format transforms and augmentations both separately and together on vision SSL. We define augmentations in frequency space called Fourier Domain Augmentations (FDA) and show that training SSL models on a combination of these and image augmentations can improve downstream classification accuracy by up to 1.3% on ImageNet-1K. We also show improvements over SSL baselines in few-shot and transfer learning setups using FDA. Surprisingly, we observe that format transforms can improve the quality of learned representations even without augmentations; however, combining the two techniques yields representations of even better quality.
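The abstract does not spell out the specific operations behind FDA, so the sketch below only illustrates what an augmentation applied in frequency space can look like: the image is moved to frequency coordinates with a 2D FFT, its amplitude and phase are lightly jittered, and the result is transformed back to pixel space. The function name and perturbation parameters are illustrative assumptions, not the paper's definition of FDA.

```python
import numpy as np

def fourier_domain_augment(image, amp_jitter=0.1, phase_noise_std=0.1, rng=None):
    """Illustrative Fourier-domain augmentation (not the paper's exact FDA).

    Moves the image to frequency space, jitters amplitude and phase,
    then inverse-transforms back to pixel space.
    """
    rng = np.random.default_rng() if rng is None else rng

    # 2D FFT per channel; image is H x W x C with values in [0, 1]
    spectrum = np.fft.fft2(image, axes=(0, 1))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Hypothetical perturbations: multiplicative amplitude jitter, additive phase noise
    amplitude *= 1.0 + rng.uniform(-amp_jitter, amp_jitter, size=amplitude.shape)
    phase += rng.normal(0.0, phase_noise_std, size=phase.shape)

    # Recombine and return to pixel space; discard the residual imaginary part
    augmented = np.fft.ifft2(amplitude * np.exp(1j * phase), axes=(0, 1)).real
    return np.clip(augmented, 0.0, 1.0)

# Usage: produce a frequency-space view alongside the usual image augmentations
image = np.random.rand(224, 224, 3)
fda_view = fourier_domain_augment(image)
```

In an SSL pipeline such a transform would typically be composed with standard image augmentations (cropping, color jitter) to form the training views, matching the paper's finding that the two kinds of perturbation are complementary.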