Augmentations vs Algorithms: What Works in Self-Supervised Learning (2403.05726v1)
Abstract: We study the relative effects of data augmentations, pretraining algorithms, and model architectures in Self-Supervised Learning (SSL). While the recent literature in this space leaves the impression that the pretraining algorithm is of critical importance to performance, understanding its effect is complicated by the difficulty of making objective and direct comparisons between methods. We propose a new framework that unifies many seemingly disparate SSL methods into a single shared template. Using this framework, we identify aspects in which methods differ and observe that in addition to changing the pretraining algorithm, many works also use new data augmentations or more powerful model architectures. We compare several popular SSL methods using our framework and find that many algorithmic additions, such as prediction networks or new losses, have a minor impact on downstream task performance (often less than $1\%$), while enhanced augmentation techniques offer more significant performance improvements ($2$-$4\%$). Our findings challenge the premise that SSL is being driven primarily by algorithmic improvements, and suggest instead a bitter lesson for SSL: that augmentation diversity and data/model scale are more critical contributors to recent advances in self-supervised learning.
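To make the idea of a "single shared template" concrete, the sketch below shows, in loose PyTorch pseudocode, the common structure that joint-embedding SSL methods share: generate two augmented views, encode and project each, optionally apply a predictor head, and plug in a method-specific loss. This is only an illustration under my own assumptions, not the paper's actual framework; all names (`SSLModel`, `info_nce_loss`, `augment`) and the toy augmentation are hypothetical stand-ins.

```python
# Minimal sketch of a shared SSL template (illustrative only; the paper's
# framework differs in detail): views -> encoder -> projector -> (predictor) -> loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SSLModel(nn.Module):
    """Hypothetical encoder + projector + optional predictor stack."""

    def __init__(self, dim_in=3 * 32 * 32, dim_hidden=256, dim_out=128, use_predictor=False):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.projector = nn.Linear(dim_hidden, dim_out)
        self.predictor = nn.Linear(dim_out, dim_out) if use_predictor else nn.Identity()

    def forward(self, x):
        z = self.projector(self.encoder(x))        # projected embedding
        return z, self.predictor(z)                # (embedding, predicted embedding)


def info_nce_loss(p, z, temperature=0.1):
    """SimCLR-style contrastive loss; other methods would swap in their own loss."""
    p, z = F.normalize(p, dim=1), F.normalize(z, dim=1)
    logits = p @ z.t() / temperature               # (N, N) cosine-similarity matrix
    labels = torch.arange(p.size(0))               # matching views sit on the diagonal
    return F.cross_entropy(logits, labels)


def augment(x):
    """Stand-in view generation: additive noise. Real pipelines use crops,
    flips, color jitter, blur, etc., which is where augmentation diversity enters."""
    return x + 0.1 * torch.randn_like(x)


# One illustrative training step on random data.
model = SSLModel(use_predictor=True)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
images = torch.randn(64, 3, 32, 32)                # fake image batch
(z1, p1), (z2, p2) = model(augment(images)), model(augment(images))
loss = 0.5 * (info_nce_loss(p1, z2.detach()) + info_nce_loss(p2, z1.detach()))
opt.zero_grad(); loss.backward(); opt.step()
print(f"loss = {loss.item():.3f}")
```

Under this kind of template, swapping the loss, adding or removing the predictor, or changing the encoder corresponds to an "algorithmic" change, while the `augment` function is where the augmentation pipeline varies; the abstract's claim is that, empirically, changes to the latter matter more than changes to the former.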