A Cookbook of Self-Supervised Learning (2304.12210v2)
Published 24 Apr 2023 in cs.LG and cs.CV
Abstract: Self-supervised learning (SSL), dubbed the dark matter of intelligence, is a promising path to advancing machine learning. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry. While many components are familiar, successfully training an SSL method involves a dizzying set of choices, from the pretext tasks to training hyper-parameters. Our goal is to lower the barrier to entry into SSL research by laying out the foundations and latest SSL recipes in the style of a cookbook. We hope to empower the curious researcher to navigate the terrain of methods, understand the role of the various knobs, and gain the know-how required to explore how delicious SSL can be.