This paper introduces Bootstrap Your Own Latent (BYOL), a self-supervised approach to learning image representations that achieves state-of-the-art results without relying on negative pairs, unlike contrastive methods such as SimCLR and MoCo.
Core Idea:
BYOL learns by predicting a target network's representation of an augmented view of an image using an online network that sees a different augmented view of the same image. The key components that prevent collapse (where the network outputs a constant representation for all inputs) are hypothesized to be:
- An asymmetric architecture with an additional predictor MLP in the online network.
- Using a slowly evolving target network, whose weights are an exponential moving average (EMA) of the online network's weights.
Methodology:
- Architecture: BYOL employs two networks: an online network and a target network. Both consist of an encoder (f) and a projector (g); the online network additionally has a predictor (q).
- Process:
- Given an image x, two augmented views v and v′ are created using augmentations t∼T and t′∼T′.
- The online network processes v: yθ=fθ(v), zθ=gθ(yθ), pθ=qθ(zθ).
- The target network processes v′: yξ′=fξ(v′), zξ′=gξ(yξ′).
- The loss maximizes the similarity between the online network's prediction pθ and the target network's projection zξ′ by minimizing the mean squared error between their L2-normalized versions: Lθ,ξ = ∥p̄θ − z̄ξ′∥₂² = 2 − 2·⟨p̄θ, z̄ξ′⟩, where ū = u/∥u∥₂ denotes L2 normalization (a PyTorch-style sketch of the full training step follows this list).
- The loss is symmetrized by also feeding v′ to the online network and v to the target network and adding the resulting loss term.
- Optimization: The online network parameters θ are updated via gradient descent on the symmetrized BYOL loss. Crucially, the gradient is not propagated through the target network (a stop-gradient is applied to zξ′).
- Target Network Update: The target network parameters ξ are not updated by the optimizer but are an EMA of the online parameters: ξ←τξ+(1−τ)θ. The target decay rate τ is typically scheduled from a value like 0.996 towards 1.0 during training.
- Representation: After training, only the online encoder fθ is kept and used as the image representation y=fθ(x).
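As a concrete reference for the steps above, here is a minimal PyTorch-style sketch of a single BYOL training step. The module and variable names (online_encoder, online_projector, predictor, target_encoder, target_projector, tau) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def regression_loss(p, z):
    # Mean squared error between L2-normalized vectors,
    # equivalent to 2 - 2 * cosine_similarity(p, z).
    p = F.normalize(p, dim=-1)
    z = F.normalize(z, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

def byol_step(v, v_prime, online_encoder, online_projector, predictor,
              target_encoder, target_projector, optimizer, tau):
    # Online network: encoder -> projector -> predictor, for both views.
    p1 = predictor(online_projector(online_encoder(v)))
    p2 = predictor(online_projector(online_encoder(v_prime)))

    # Target network: encoder -> projector; no gradient flows through it
    # (this is the stop-gradient on the target projections).
    with torch.no_grad():
        z1 = target_projector(target_encoder(v))
        z2 = target_projector(target_encoder(v_prime))

    # Symmetrized loss: predict v' targets from v, and v targets from v'.
    loss = regression_loss(p1, z2) + regression_loss(p2, z1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # EMA update of the target parameters: xi <- tau * xi + (1 - tau) * theta.
    with torch.no_grad():
        online_params = list(online_encoder.parameters()) + list(online_projector.parameters())
        target_params = list(target_encoder.parameters()) + list(target_projector.parameters())
        for theta, xi in zip(online_params, target_params):
            xi.mul_(tau).add_(theta, alpha=1 - tau)

    return loss.item()
```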
Implementation Details:
- Encoder: Standard ResNet architectures (ResNet-50 and deeper/wider variants).
- Projector/Predictor: MLPs (Linear -> BN -> ReLU -> Linear), outputting 256-dimensional vectors (sketched after this list).
- Augmentations: Strong augmentations similar to SimCLR's (random crop, horizontal flip, color jitter, grayscale, Gaussian blur), plus solarization, with some asymmetry between the two views (e.g., blur and solarization applied with different probabilities; also sketched after this list).
- Optimization: LARS optimizer, cosine decay learning rate schedule, large batch size (e.g., 4096), 1000 epochs.
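The projector and predictor share the same MLP shape; a minimal sketch, assuming PyTorch, is below. The 4096-dimensional hidden layer is the commonly reported BYOL setting and should be treated as a configurable assumption here.

```python
import torch.nn as nn

def mlp(in_dim, hidden_dim=4096, out_dim=256):
    # Linear -> BN -> ReLU -> Linear, outputting a 256-dimensional vector.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g. with a ResNet-50 encoder producing 2048-dim features:
# projector = mlp(2048)   # g: 2048 -> 256
# predictor = mlp(256)    # q: 256 -> 256
```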
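A sketch of the asymmetric augmentation pipelines, assuming torchvision; the specific jitter strengths and blur/solarization probabilities are commonly reported BYOL settings, included here as assumptions rather than verified values.

```python
from torchvision import transforms

def byol_augmentation(blur_p, solarize_p):
    return transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=blur_p),
        transforms.RandomSolarize(threshold=128, p=solarize_p),
        transforms.ToTensor(),
    ])

# Asymmetry between the views: blur is always applied to view v and rarely to
# view v'; solarization is only applied to view v'.
augment_t = byol_augmentation(blur_p=1.0, solarize_p=0.0)        # t ~ T
augment_t_prime = byol_augmentation(blur_p=0.1, solarize_p=0.2)  # t' ~ T'
```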
Experimental Results:
- ImageNet Linear Evaluation: Achieved SOTA with 74.3% top-1 accuracy using ResNet-50, and 79.6% with ResNet-200 (2x), significantly outperforming prior methods and closing the gap to supervised baselines.
- Semi-Supervised Learning: Outperformed previous methods on ImageNet when fine-tuning with 1% or 10% of labels.
- Transfer Learning: Demonstrated strong transfer performance, outperforming SimCLR on numerous classification benchmarks (CIFAR, Food101, etc.) and often matching or exceeding supervised pre-training. It also showed SOTA results on semantic segmentation (PASCAL VOC), object detection (PASCAL VOC), and depth estimation (NYUv2).
Ablation Studies & Insights:
- Negative Pairs: BYOL works effectively without negative pairs, unlike contrastive methods. Adding negative pairs back did not significantly improve (and could hurt) performance without careful tuning.
- Batch Size: Significantly more robust to smaller batch sizes than SimCLR. Performance remained high even when reducing the batch size from 4096 to 256.
- Augmentations: Less sensitive to the choice of augmentations than SimCLR. Performance drop was much smaller when removing color jitter or using only random crops.
- Target Network: The EMA update is crucial. Instantaneously copying the online weights (τ=0) destabilized training and led to collapse, while a fixed random target (τ=1) avoided collapse but performed poorly. Decay rates between 0.9 and 0.999 worked well (a sketch of the cosine schedule for τ follows this list).
- Predictor: The predictor network is essential. Removing it led to collapse, similar to an unsupervised version of Mean Teacher. Experiments suggest maintaining a near-optimal predictor (e.g., via higher learning rate) might be key to preventing collapse, potentially explaining the role of the slow-moving target.
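Below is a minimal sketch of the cosine schedule that increases τ from its base value toward 1 over training, as referenced in the methodology; the exact functional form shown here is an illustrative assumption consistent with a 0.996 → 1.0 schedule.

```python
import math

def target_decay_rate(step, total_steps, tau_base=0.996):
    # tau ramps from tau_base at the start of training to 1.0 at the end,
    # following a cosine curve; tau_base = 0.996 is the reported starting value.
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

# target_decay_rate(0, 1000)    -> 0.996
# target_decay_rate(500, 1000)  -> 0.998
# target_decay_rate(1000, 1000) -> 1.0
```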
Conclusion:
BYOL presents a novel and effective approach to self-supervised learning that achieves SOTA results without negative pairs. Its success relies on the interplay between an online network that predicts the output of a slowly evolving (moving-average) target network and an asymmetric architecture with an additional predictor. It demonstrates greater robustness to batch size and augmentation choices compared to contrastive methods.