A Cookbook of Self-Supervised Learning (2304.12210v2)
Published 24 Apr 2023 in cs.LG and cs.CV
Abstract: Self-supervised learning (SSL), dubbed the dark matter of intelligence, is a promising path to advancing machine learning. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry. While many components are familiar, successfully training an SSL method involves a dizzying set of choices, from the pretext tasks to training hyper-parameters. Our goal is to lower the barrier to entry into SSL research by laying out the foundations and latest SSL recipes in the style of a cookbook. We hope to empower the curious researcher to navigate the terrain of methods, understand the role of the various knobs, and gain the know-how required to explore how delicious SSL can be.