Momentum Contrast for Unsupervised Visual Representation Learning (1911.05722v3)

Published 13 Nov 2019 in cs.CV

Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

Authors (5)
  1. Kaiming He (71 papers)
  2. Haoqi Fan (33 papers)
  3. Yuxin Wu (30 papers)
  4. Saining Xie (60 papers)
  5. Ross Girshick (75 papers)
Citations (11,062)

Summary

  • The paper introduces a novel contrastive learning framework that uses a large, dynamic dictionary and momentum-based updates for unsupervised visual representation learning.
  • It leverages a queue-based strategy to decouple dictionary size from mini-batch size, ensuring diverse negative sampling and consistent key representations through momentum updates and shuffling BN.
  • Extensive evaluations on ImageNet and larger datasets demonstrate strong transferability to downstream tasks such as object detection, segmentation, and pose estimation.

Momentum Contrast (MoCo) (He et al., 2019) addresses the challenge of unsupervised visual representation learning by formulating it as a dictionary look-up task. The core idea is to train an image encoder network so that an encoded query matches its corresponding key in a large dictionary of encoded image representations, while being distinct from all other keys. The paper posits that for this approach to be effective, the dictionary should be both large and consistent.

The dictionary is dynamic, composed of data samples encoded by a network. Training involves minimizing a contrastive loss function, specifically the InfoNCE loss:

\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}

where q is the encoded query, k_+ is the encoded positive key, {k_i}_{i=0}^{K} is the set of encoded keys (one positive and K negatives), and τ is a temperature hyper-parameter. The goal is to make the dot-product similarity between q and k_+ high, and the similarity between q and the negative keys k_i low.
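
The paper notes that this loss is simply the log loss of a (K+1)-way softmax classifier whose target is the positive key. A minimal PyTorch sketch of that equivalence (tensor shapes and the temperature value here are illustrative, not taken from the released code):

import torch
import torch.nn.functional as F

N, C, K, tau = 8, 128, 1024, 0.07                 # batch size, feature dim, #negatives, temperature (illustrative)
q     = F.normalize(torch.randn(N, C), dim=1)     # encoded queries
k_pos = F.normalize(torch.randn(N, C), dim=1)     # positive keys
queue = F.normalize(torch.randn(C, K), dim=0)     # K negative keys stored column-wise

l_pos  = (q * k_pos).sum(dim=1, keepdim=True)     # Nx1 positive logits q·k_+
l_neg  = q @ queue                                # NxK negative logits q·k_i
logits = torch.cat([l_pos, l_neg], dim=1) / tau   # Nx(1+K)

labels = torch.zeros(N, dtype=torch.long)         # the positive is class 0
loss   = F.cross_entropy(logits, labels)          # equals L_q averaged over the batch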

MoCo introduces two mechanisms to build a large and consistent dictionary:

  1. Dictionary as a Queue: Instead of using only samples from the current mini-batch as keys (which limits dictionary size by GPU memory), MoCo maintains the dictionary as a queue of encoded key representations from previous mini-batches. When a new mini-batch is processed, its encoded keys are enqueued, and the oldest keys are dequeued. This structure decouples the dictionary size (K) from the mini-batch size, allowing for a much larger and more diverse set of negative samples. The dictionary size can be set independently as a hyper-parameter (e.g., 65536 in experiments), which is significantly larger than typical mini-batch sizes (e.g., 256 or 1024).
  2. Momentum Update for Key Encoder: To maintain consistency among the keys in the queue (which were encoded by the key encoder f_k at different training steps), the parameters of the key encoder (θ_k) are updated using a momentum-based moving average of the query encoder's parameters (θ_q). The query encoder f_q is updated by standard back-propagation from the contrastive loss. The momentum update is:

    \theta_k \leftarrow m \theta_k + (1 - m) \theta_q

    where m ∈ [0, 1) is the momentum coefficient. A large momentum value (e.g., m = 0.999) ensures that the key encoder evolves slowly and smoothly, keeping the representations of keys in the queue relatively consistent even though they were encoded at different times. This was found to be crucial for good performance. A minimal sketch of both mechanisms follows this list.
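
A minimal sketch of these two mechanisms, assuming f_q and f_k are nn.Module encoders with identical architectures and the queue is stored as a CxK feature matrix with a write pointer (the names and the divisibility assumption K % N == 0 are illustrative, in the spirit of the paper's description rather than its released code):

import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # key-encoder parameters track the query encoder as a moving average
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # queue: CxK buffer of encoded keys; keys: NxC batch of freshly encoded keys
    N = keys.shape[0]
    ptr = int(queue_ptr)
    queue[:, ptr:ptr + N] = keys.T          # overwrite the oldest keys (assumes K % N == 0)
    queue_ptr[0] = (ptr + N) % queue.shape[1]

The momentum update touches every parameter once per iteration, while the queue update is a simple circular-buffer write, so both mechanisms add little overhead.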

The standard pretext task used with MoCo in the paper is instance discrimination. Given an image, two different augmented views are created. One view is encoded by the query encoder f_q to produce q, and the other view is encoded by the key encoder f_k to produce the positive key k_+. Negative keys k_i are sampled from the queue.
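
The two views are produced by applying the same stochastic augmentation pipeline to the image twice. A sketch roughly following the augmentations described in the paper (random resized crop, color jitter, random grayscale, horizontal flip); the exact parameter values here are illustrative:

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# x is a PIL image; two independent draws give the query and key views
# x_q, x_k = augment(x), augment(x)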

The pseudocode provided illustrates the implementation in a PyTorch-like style:

# f_q, f_k: encoder networks for query and key
# queue: dictionary as a queue of K keys (CxK)
# m: momentum
# t: temperature

f_k.params = f_q.params  # initialize
for x in loader:  # load a minibatch x with N samples
    x_q = aug(x)  # a randomly augmented version
    x_k = aug(x)  # another randomly augmented version

    q = f_q.forward(x_q)  # queries: NxC
    k = f_k.forward(x_k)  # keys: NxC
    k = k.detach()  # no gradient to keys

    # positive logits: Nx1
    l_pos = bmm(q.view(N,1,C), k.view(N,C,1))

    # negative logits: NxK
    l_neg = mm(q.view(N,C), queue.view(C,K))

    # logits: Nx(1+K)
    logits = cat([l_pos, l_neg], dim=1)

    # contrastive loss, Eqn.(1)
    labels = zeros(N)  # positives are the 0-th
    loss = CrossEntropyLoss(logits/t, labels)

    # SGD update: query network
    loss.backward()
    update(f_q.params)

    # momentum update: key network
    f_k.params = m*f_k.params+(1-m)*f_q.params

    # update dictionary
    enqueue(queue, k)  # enqueue the current minibatch
    dequeue(queue)  # dequeue the earliest minibatch

An important implementation detail is "Shuffling BN". Standard Batch Normalization layers inside the encoders can cause information leakage within a mini-batch, allowing the model to solve the pretext task by recognizing sub-batch statistics rather than learning meaningful visual representations. To prevent this, the samples for the key encoder are shuffled across GPUs before being processed by BN, ensuring that the batch statistics used for a query and its positive key come from different subsets of samples.
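
The released implementation performs this shuffle with distributed collectives: each GPU's BN layer computes statistics only over its own sub-batch, so permuting samples across GPUs changes which samples are normalized together. A sketch of the idea, assuming a standard torch.distributed data-parallel setup (the helper and variable names are illustrative, not the authors' code):

import torch
import torch.distributed as dist

@torch.no_grad()
def concat_all_gather(t):
    # gather a tensor from every rank and concatenate along the batch dimension
    out = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(out, t)
    return torch.cat(out, dim=0)

@torch.no_grad()
def batch_shuffle_across_gpus(x_k):
    # shuffle the key mini-batch across GPUs before the key-encoder forward pass
    x_all = concat_all_gather(x_k)                      # full batch from all ranks
    n_this, n_all = x_k.shape[0], x_all.shape[0]
    idx_shuffle = torch.randperm(n_all, device=x_k.device)
    dist.broadcast(idx_shuffle, src=0)                  # every rank uses the same permutation
    idx_unshuffle = torch.argsort(idx_shuffle)          # used to restore order afterwards
    idx_this = idx_shuffle.view(n_all // n_this, -1)[dist.get_rank()]
    return x_all[idx_this], idx_unshuffle

After the key encoder's forward pass, the keys are gathered and re-ordered with idx_unshuffle so that each query is paired with its own positive key.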

The paper evaluates MoCo extensively through unsupervised pre-training on ImageNet-1M and the larger Instagram-1B dataset, followed by either linear classification on frozen features or fine-tuning on various downstream tasks.

For linear classification on ImageNet, MoCo achieves competitive results, outperforming previous methods with similar model sizes. Ablations confirm that both a large dictionary size (enabled by the queue) and a high momentum (for consistency) are crucial for performance.
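
The linear protocol freezes the pre-trained backbone and trains only a linear classifier on the pooled features. A minimal sketch, assuming a ResNet-50 backbone into which the MoCo weights have already been loaded (weight loading and the training loop are omitted; hyper-parameters are illustrative):

import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50()
# load MoCo pre-trained weights into `backbone` here (omitted)
backbone.fc = nn.Identity()            # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False            # features stay frozen
backbone.eval()

classifier = nn.Linear(2048, 1000)     # 1000-way ImageNet linear classifier
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0,
                            momentum=0.9, weight_decay=0.0)  # illustrative settings
criterion = nn.CrossEntropyLoss()

# inside the training loop:
# with torch.no_grad():
#     feats = backbone(images)
# loss = criterion(classifier(feats), targets)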

More significantly, the paper demonstrates the transferability of MoCo-learned features to a range of downstream tasks, including object detection (PASCAL VOC, COCO), instance segmentation (COCO, LVIS, Cityscapes), keypoint detection (COCO), and dense pose estimation (COCO). In many of these tasks, MoCo pre-trained features either match or surpass their counterparts pre-trained with ImageNet supervised learning. For example:

  • On PASCAL VOC object detection, MoCo pre-trained on ImageNet-1M is comparable to supervised pre-training, and pre-training on Instagram-1B surpasses it, especially in COCO-style AP metrics.
  • On COCO detection and segmentation, MoCo consistently outperforms ImageNet supervised pre-training, with larger gains observed when fine-tuning for longer schedules.
  • On tasks like COCO keypoint detection and dense pose estimation, MoCo shows noticeable improvements over supervised pre-training.
  • On LVIS instance segmentation (a dataset with long-tailed distributions), MoCo with Instagram-1B pre-training also surpasses the supervised baseline.
  • MoCo performs comparably or better than supervised pre-training on Cityscapes semantic segmentation but slightly lags on VOC semantic segmentation.

The consistent improvement observed when pre-training on the larger, uncurated Instagram-1B dataset compared to ImageNet-1M highlights MoCo's ability to leverage large-scale data effectively in a real-world scenario.

From an implementation perspective, MoCo is a general contrastive-learning framework that can use standard network architectures (such as ResNet) without pretext-task-specific modifications, which makes transfer to downstream tasks straightforward. The queue requires managing a large buffer of encoded features, the momentum update adds only a small amount of extra logic to the training loop, and the Shuffling BN trick is essential whenever the encoders contain BN layers.
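
For example, both encoders can be instantiated from an unmodified torchvision ResNet-50 whose final fc layer produces the low-dimensional vector used for the contrastive loss, with the key encoder initialized from the query encoder and excluded from back-propagation (an illustrative sketch, not the released code; the 128-d output and 65536-key queue follow the paper's reported settings):

import torch
from torchvision import models

dim = 128                                        # output feature dimension
f_q = models.resnet50(num_classes=dim)           # query encoder
f_k = models.resnet50(num_classes=dim)           # key encoder

f_k.load_state_dict(f_q.state_dict())            # initialize key encoder from query encoder
for p in f_k.parameters():
    p.requires_grad = False                      # updated only by the momentum rule, not by gradients

queue = torch.nn.functional.normalize(
    torch.randn(dim, 65536), dim=0)              # C x K buffer of (initially random) keys
queue_ptr = torch.zeros(1, dtype=torch.long)     # pointer for FIFO replacement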

In summary, MoCo provides a practical and effective approach for unsupervised visual representation learning by building large and consistent dynamic dictionaries for contrastive learning. Its strong performance across various downstream tasks, often matching or exceeding supervised pre-training, suggests that it significantly reduces the gap between unsupervised and supervised methods and offers a viable alternative for pre-training large-scale vision models.
