Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations (2312.02205v1)

Published 2 Dec 2023 in cs.CV and cs.LG

Abstract: Self-Supervised Learning (SSL) enables training performant models using limited labeled data. One of the pillars underlying vision SSL is the use of data augmentations/perturbations of the input which do not significantly alter its semantic content. For audio and other temporal signals, augmentations are commonly used alongside format transforms such as Fourier transforms or wavelet transforms. Unlike augmentations, format transforms do not change the information contained in the data; rather, they express the same information in different coordinates. In this paper, we study the effects of format transforms and augmentations both separately and together on vision SSL. We define augmentations in frequency space called Fourier Domain Augmentations (FDA) and show that training SSL models on a combination of these and image augmentations can improve the downstream classification accuracy by up to 1.3% on ImageNet-1K. We also show improvements against SSL baselines in few-shot and transfer learning setups using FDA. Surprisingly, we also observe that format transforms can improve the quality of learned representations even without augmentations; however, the combination of the two techniques yields better quality.


Summary

  • The paper introduces Fourier Domain Augmentations (FDA) as a novel method that complements traditional pixel-based augmentations in self-supervised learning.
  • The study demonstrates that integrating FDA with standard pixel-space augmentations improves ImageNet-1K classification accuracy by up to 1.3%.
  • The research reveals that combining dual-view training with frequency and image representations enhances model robustness for transfer and few-shot learning tasks.

In the field of AI, one of the significant challenges is teaching AI models to understand and interpret visual data. This is where Self-Supervised Learning (SSL) comes into play, especially when we have limited labeled data available. SSL largely relies on data augmentation techniques that introduce variability in training data, thereby helping models to generalize better to unseen data.

However, common practice has focused primarily on transformations applied directly to image pixels, such as random cropping, color adjustments, and flips. What about transformations applied not in the image domain but in the frequency domain, where an image is represented in terms of its constituent frequencies?
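To make the distinction concrete, a format transform only re-expresses the data in new coordinates, losing nothing. Below is a minimal numpy sketch (using a random array as a stand-in for a grayscale image) showing that an image's 2D Fourier transform decomposes into amplitude and phase and is exactly invertible:

```python
import numpy as np

# A format transform re-expresses the same information in new coordinates:
# the 2D FFT of an image is exactly invertible, so no information is lost.
rng = np.random.default_rng(0)
image = rng.random((32, 32))      # stand-in for a grayscale image

spectrum = np.fft.fft2(image)     # complex frequency-domain representation
amplitude = np.abs(spectrum)      # how strong each frequency is
phase = np.angle(spectrum)        # where each frequency component sits

# Recombining amplitude and phase recovers the original image
# (up to floating-point rounding).
reconstructed = np.fft.ifft2(amplitude * np.exp(1j * phase)).real
assert np.allclose(reconstructed, image)
```

This invertibility is what separates format transforms from augmentations, which deliberately perturb the content.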

A paper tackles this overlooked aspect head-on by dissecting the role of frequency-domain augmentations in SSL. The research introduces a set of methods known as Fourier Domain Augmentations (FDA) that apply data augmentation in frequency space. This approach complements classical image-space augmentations and has been shown to significantly enhance the performance of image representation learning.

FDA involves applying transformations such as amplitude scaling, phase shifting, and frequency masking directly to the image's frequency components. These augmentations introduce changes that are not easily replicated by standard image augmentations, such as unique texture modifications and shifts in color distribution, which help the model learn more robust image representations.
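The three operations named above can be sketched in a few lines of numpy. Note this is an illustrative sketch only: the function name, parameter ranges, and exact perturbation forms here are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def fourier_augment(image, rng, amp_scale_range=(0.9, 1.1),
                    phase_jitter=0.1, mask_prob=0.05):
    """Illustrative frequency-space augmentations: scale amplitudes,
    jitter phases, and randomly mask frequency components."""
    spectrum = np.fft.fft2(image)
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)

    # Amplitude scaling: multiply each magnitude by a random factor.
    amplitude *= rng.uniform(*amp_scale_range, size=amplitude.shape)
    # Phase shifting: add small random jitter to each phase.
    phase += rng.uniform(-phase_jitter, phase_jitter, size=phase.shape)
    # Frequency masking: zero out a random subset of frequencies.
    amplitude *= rng.random(amplitude.shape) >= mask_prob

    # Independent perturbations break Hermitian symmetry, so keep the
    # real part of the inverse transform and clip back to valid range.
    augmented = np.fft.ifft2(amplitude * np.exp(1j * phase)).real
    return np.clip(augmented, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32))      # stand-in for a grayscale image
view = fourier_augment(image, rng)
```

In an SSL pipeline, a view produced this way would be used alongside the usual crop/color/flip views when forming positive pairs.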

The research provides evidence that integrating FDA with existing augmentation techniques consistently improves performance across several SSL frameworks, including SimCLR, BYOL, MoCov2, and SimSiam. For instance, on the ImageNet-1K dataset, models pre-trained with FDA achieved up to a 1.3% increase in classification accuracy.

Interestingly, the paper also examines the influence of the format transform alone, using the frequency representation of an image alongside its raw pixel form. Presenting the model with these two views of the same data during pre-training leads to more informative representations, though the best results come from combining frequency-based and image-based augmentations.

On downstream tasks such as transfer learning and few-shot learning, where SSL pre-trained models are fine-tuned with minimal data, FDA again proves beneficial. It not only enhances model adaptability to new domains and tasks but also improves qualitative measures such as image retrieval performance.

The findings prompt further questions in the field, one of them being how to utilize the frequency domain more effectively without the need to transform back into the image space. Moreover, given that the research primarily deals with real images, it opens up avenues for exploring the applicability of FDA to other domains like medical imaging.

In conclusion, the paper suggests that considering both image-domain and frequency-domain augmentations could be a vital step in advancing the capabilities of AI in tasks involving visual perception. The findings underline the importance of diverse and comprehensive training signals so that self-supervised models gain the robustness and flexibility needed for practical deployment.
