This paper introduces Bootstrap Your Own Latent (BYOL), a self-supervised approach to learning image representations that achieves state-of-the-art results without relying on negative pairs, unlike contrastive methods such as SimCLR and MoCo.
Core Idea:
BYOL learns by predicting a target network's representation of an augmented view of an image using an online network that sees a different augmented view of the same image. The key components that prevent collapse (where the network outputs a constant representation for all inputs) are hypothesized to be:
- An asymmetric architecture with an additional predictor MLP in the online network.
- Using a slowly evolving target network, whose weights are an exponential moving average (EMA) of the online network's weights.
Methodology:
- Architecture: BYOL employs two networks: an online network and a target network. Both consist of an encoder (f) and a projector (g); the online network additionally has a predictor (q).
- Process:
- Given an image x, two augmented views v and v′ are created using augmentations t∼T and t′∼T′.
- The online network processes v: yθ=fθ(v), zθ=gθ(yθ), pθ=qθ(zθ).
- The target network processes v′: yξ′=fξ(v′), zξ′=gξ(yξ′).
- The loss maximizes the similarity between the online network's prediction pθ and the target network's projection zξ′ by minimizing the mean squared error between their L2-normalized versions: Lθ,ξ = ∥p̄θ − z̄ξ′∥₂² = 2 − 2·⟨p̄θ, z̄ξ′⟩, where ū = u/∥u∥₂ denotes L2 normalization (a PyTorch-style sketch of the full training step follows this list).
- The loss is symmetrized by also feeding v′ to the online network and v to the target network and adding the resulting loss term.
- Optimization: The online network parameters θ are updated via gradient descent on the symmetrized BYOL loss. Crucially, the gradient is not propagated through the target network (a stop-gradient is applied to zξ′).
- Target Network Update: The target network parameters ξ are not updated by the optimizer but are an EMA of the online parameters: ξ←τξ+(1−τ)θ. The target decay rate τ is typically scheduled from a value like 0.996 towards 1.0 during training.
- Representation: After training, only the online encoder fθ is kept and used as the image representation y=fθ(x).
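As a concrete reference for the steps above, here is a minimal PyTorch-style sketch of a single BYOL training step. The module and variable names (online_encoder, online_projector, predictor, target_encoder, target_projector, tau) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def regression_loss(p, z):
    # Mean squared error between L2-normalized vectors,
    # equivalent to 2 - 2 * cosine_similarity(p, z).
    p = F.normalize(p, dim=-1)
    z = F.normalize(z, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

def byol_step(v, v_prime, online_encoder, online_projector, predictor,
              target_encoder, target_projector, optimizer, tau):
    # Online network: encoder -> projector -> predictor, for both views.
    p1 = predictor(online_projector(online_encoder(v)))
    p2 = predictor(online_projector(online_encoder(v_prime)))

    # Target network: encoder -> projector; no gradient flows through it
    # (this is the stop-gradient on the target projections).
    with torch.no_grad():
        z1 = target_projector(target_encoder(v))
        z2 = target_projector(target_encoder(v_prime))

    # Symmetrized loss: predict v' targets from v, and v targets from v'.
    loss = regression_loss(p1, z2) + regression_loss(p2, z1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # EMA update of the target parameters: xi <- tau * xi + (1 - tau) * theta.
    with torch.no_grad():
        online_params = list(online_encoder.parameters()) + list(online_projector.parameters())
        target_params = list(target_encoder.parameters()) + list(target_projector.parameters())
        for theta, xi in zip(online_params, target_params):
            xi.mul_(tau).add_(theta, alpha=1 - tau)

    return loss.item()
```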
Implementation Details:
- Encoder: Standard ResNet architectures (ResNet-50 and deeper/wider variants).
- Projector/Predictor: MLPs (Linear -> BN -> ReLU -> Linear), outputting 256-dimensional vectors (sketched after this list).
- Augmentations: Strong augmentations similar to SimCLR's (random crop, horizontal flip, color jitter, grayscale, Gaussian blur), plus solarization, with some asymmetry between the two views (e.g., blur and solarization applied with different probabilities; also sketched after this list).
- Optimization: LARS optimizer, cosine decay learning rate schedule, large batch size (e.g., 4096), 1000 epochs.
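The projector and predictor share the same MLP shape; a minimal sketch, assuming PyTorch, is below. The 4096-dimensional hidden layer is the commonly reported BYOL setting and should be treated as a configurable assumption here.

```python
import torch.nn as nn

def mlp(in_dim, hidden_dim=4096, out_dim=256):
    # Linear -> BN -> ReLU -> Linear, outputting a 256-dimensional vector.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g. with a ResNet-50 encoder producing 2048-dim features:
# projector = mlp(2048)   # g: 2048 -> 256
# predictor = mlp(256)    # q: 256 -> 256
```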
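A sketch of the asymmetric augmentation pipelines, assuming torchvision; the specific jitter strengths and blur/solarization probabilities are commonly reported BYOL settings, included here as assumptions rather than verified values.

```python
from torchvision import transforms

def byol_augmentation(blur_p, solarize_p):
    return transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=blur_p),
        transforms.RandomSolarize(threshold=128, p=solarize_p),
        transforms.ToTensor(),
    ])

# Asymmetry between the views: blur is always applied to view v and rarely to
# view v'; solarization is only applied to view v'.
augment_t = byol_augmentation(blur_p=1.0, solarize_p=0.0)        # t ~ T
augment_t_prime = byol_augmentation(blur_p=0.1, solarize_p=0.2)  # t' ~ T'
```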
Experimental Results:
- ImageNet Linear Evaluation: Achieved SOTA with 74.3% top-1 accuracy using ResNet-50, and 79.6% with ResNet-200 (2x), significantly outperforming prior methods and closing the gap to supervised baselines.
- Semi-Supervised Learning: Outperformed previous methods on ImageNet when fine-tuning with 1% or 10% of labels.
- Transfer Learning: Demonstrated strong transfer performance, outperforming SimCLR on numerous classification benchmarks (CIFAR, Food101, etc.) and often matching or exceeding supervised pre-training. It also showed SOTA results on semantic segmentation (PASCAL VOC), object detection (PASCAL VOC), and depth estimation (NYUv2).
Ablation Studies & Insights:
- Negative Pairs: BYOL works effectively without negative pairs, unlike contrastive methods. Adding negative pairs back did not significantly improve (and could hurt) performance without careful tuning.
- Batch Size: Significantly more robust to smaller batch sizes than SimCLR. Performance remained high even when reducing the batch size from 4096 to 256.
- Augmentations: Less sensitive to the choice of augmentations than SimCLR. Performance drop was much smaller when removing color jitter or using only random crops.
- Target Network: The EMA update is crucial. Instantaneously copying the online weights (τ=0) destabilized training and led to collapse, while a fixed random target (τ=1) avoided collapse but performed poorly. Decay rates between 0.9 and 0.999 worked well (a sketch of the cosine schedule for τ follows this list).
- Predictor: The predictor network is essential. Removing it led to collapse, similar to an unsupervised version of Mean Teacher. Experiments suggest maintaining a near-optimal predictor (e.g., via higher learning rate) might be key to preventing collapse, potentially explaining the role of the slow-moving target.
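Below is a minimal sketch of the cosine schedule that increases τ from its base value toward 1 over training, as referenced in the methodology; the exact functional form shown here is an illustrative assumption consistent with a 0.996 → 1.0 schedule.

```python
import math

def target_decay_rate(step, total_steps, tau_base=0.996):
    # tau ramps from tau_base at the start of training to 1.0 at the end,
    # following a cosine curve; tau_base = 0.996 is the reported starting value.
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

# target_decay_rate(0, 1000)    -> 0.996
# target_decay_rate(500, 1000)  -> 0.998
# target_decay_rate(1000, 1000) -> 1.0
```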
Conclusion:
BYOL presents a novel and effective approach to self-supervised learning that achieves SOTA results without negative pairs. Its success relies on the interplay between an online network that predicts the output of a slowly evolving (moving-average) target network and an asymmetric architecture with an additional predictor. It demonstrates greater robustness to batch size and augmentation choices compared to contrastive methods.