
What Should Not Be Contrastive in Contrastive Learning (2008.05659v2)

Published 13 Aug 2020 in cs.CV

Abstract: Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.

Authors (4)
  1. Tete Xiao (19 papers)
  2. Xiaolong Wang (243 papers)
  3. Trevor Darrell (324 papers)
  4. Alexei A. Efros (100 papers)
Citations (284)

Summary

Insights into "What Should Not Be Contrastive in Contrastive Learning"

The paper "What Should Not Be Contrastive in Contrastive Learning" introduces a contrastive learning framework that addresses a critical limitation of existing methods: the implicit assumption of a fixed set of representational invariances. The work challenges the conventional belief that a contrastive model should be invariant to every transformation used in data augmentation.

Core Contributions

The authors propose a framework called Leave-one-out Contrastive Learning (LooC), designed to learn visual representations that capture both the invariant and the varying factors associated with specific data augmentations. The primary innovation is to construct multiple embedding spaces rather than a single one: each space is sensitive to one type of augmentation while remaining invariant to all the others. This is implemented as a multi-head network with a shared backbone, which captures and leverages information across the different augmentations without assuming prior knowledge of the invariances a downstream task requires.
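The multi-head design described above can be sketched in a few lines. The following is a minimal, illustrative sketch (not the authors' implementation): all class and variable names are hypothetical, and simple NumPy linear projections stand in for the real backbone and MLP heads. It shows the structural idea only: one shared feature vector feeding an all-invariant head plus one leave-one-out head per augmentation, with the concatenation of all spaces used for downstream tasks, as the paper reports works best.

```python
import numpy as np

rng = np.random.default_rng(0)

class LooCHeads:
    """Toy stand-in for LooC's multi-head projection on a shared backbone.

    One head produces the all-invariant embedding; each additional head
    produces a leave-one-out space that stays invariant to every
    augmentation EXCEPT one, so it retains that factor (e.g., color).
    """

    def __init__(self, feat_dim, embed_dim, n_augmentations):
        # (n_augmentations + 1) projection heads on one shared feature vector
        self.heads = [
            rng.standard_normal((feat_dim, embed_dim)) / np.sqrt(feat_dim)
            for _ in range(n_augmentations + 1)
        ]

    def forward(self, features):
        # one L2-normalized embedding per head, as is standard in
        # contrastive learning before computing similarities
        outs = []
        for W in self.heads:
            z = features @ W
            outs.append(z / np.linalg.norm(z, axis=-1, keepdims=True))
        return outs

# toy usage: 4 images, 128-d backbone features, 3 augmentation types
features = rng.standard_normal((4, 128))
model = LooCHeads(feat_dim=128, embed_dim=32, n_augmentations=3)
embeddings = model.forward(features)
print(len(embeddings))   # 4 heads: 1 all-invariant + 3 leave-one-out

# for downstream evaluation, concatenate the invariant and varying spaces
concat = np.concatenate(embeddings, axis=-1)
print(concat.shape)      # (4, 128): 4 heads x 32 dims each
```

In the actual paper each head is trained with a contrastive objective over appropriately constructed positive and negative pairs; the sketch omits training and shows only the embedding-space layout.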

Evaluation and Numerical Results

The framework was tested on various datasets including ImageNet-100, iNaturalist, CUB-200, and others, spanning tasks like fine-grained classification, few-shot learning, and general robustness on corrupted data. The results show consistent improvements in several metrics. For instance, the LooC framework showed around a 10% increase over the state-of-the-art MoCo baseline in classification tasks on the iNaturalist dataset. This performance boost illustrates the model's strong transferability and generalization capabilities without requiring specialized hand-crafted data augmentation strategies.

Implications and Future Directions

The paper's findings hold significant implications for the design and implementation of contrastive learning models. By efficiently separating the varying and invariant factors in visual representations, LooC enables the creation of more versatile models that can adapt to diverse downstream tasks. This has practical implications in fields like autonomous driving, where distinguishing between rotation and perspective is crucial.

Furthermore, this work opens up new directions in unsupervised and semi-supervised learning. Future research could expand the set of augmentations and study how they interact across domains. Applying the framework to modalities beyond vision, such as audio or text, could further broaden its impact.

Conclusion

The paper contributes a significant step forward in contrastive learning by highlighting the importance of managing augmentation-induced biases. The proposed LooC framework shows robust performance across multiple tasks and datasets, challenging the conventional approach of assuming uniform augmentation invariances. This work stands as a foundation for further innovations in learning mechanisms that adaptively leverage multi-view information without stringent assumptions on data transformation invariances.