- The paper proposes JigClu, a single-batch self-supervised method that learns visual representations by restoring a batch of images from their shuffled patches.
- It employs dual branches that cluster patches and predict their positions, leveraging both intra- and inter-image information efficiently.
- It achieves 66.4% accuracy on ImageNet linear evaluation and outperforms other single-batch methods, showcasing its data and computational efficiency.
Jigsaw Clustering for Unsupervised Visual Representation Learning
The paper addresses unsupervised visual representation learning, proposing a method called Jigsaw Clustering (JigClu) as an alternative to the widely used contrastive learning approaches. The method introduces a pretext task grounded in image reconstruction and clustering, designed to avoid a key computational inefficiency of standard contrastive learning: the need to process an extra batch of augmented views at every training step. By extracting both intra- and inter-image information from a single batch, JigClu improves learned representations while reducing computational overhead.
Overview of Jigsaw Clustering Method
JigClu divides each image in a training batch into a grid of patches (a 2 × 2 split in the paper) and shuffles the patches across the batch to form a new batch of montage images, each the same size as the originals. The task is to recover the original batch from the shuffled version alone, which forces the network to discern both intra-image and inter-image features. Two components accomplish this: a clustering branch that associates each patch with its source image, and a location branch that predicts each patch's position within that image. A sketch of the montage construction follows.
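The PyTorch sketch below illustrates how such a montage batch could be assembled. The function name, the fully random permutation, and the way the supervision targets are recovered from it are illustrative assumptions, not the authors' reference code.

```python
import torch

def make_montage_batch(images, m=2):
    """Shuffle m*m patches across the batch to build montage images.

    A minimal sketch of the batch construction described above,
    assuming an m x m grid split (m=2, as in the paper).
    """
    b, c, h, w = images.shape
    ph, pw = h // m, w // m

    # Cut every image into an m x m grid of patches: (b*m*m, c, ph, pw).
    patches = (
        images.unfold(2, ph, ph).unfold(3, pw, pw)  # (b, c, m, m, ph, pw)
        .permute(0, 2, 3, 1, 4, 5)                  # (b, m, m, c, ph, pw)
        .reshape(b * m * m, c, ph, pw)
    )

    # Random permutation over all patches in the batch; perm[i] records
    # which source patch lands in montage slot i.
    perm = torch.randperm(b * m * m)
    shuffled = patches[perm]

    # Reassemble the shuffled patches into b montages of the original size.
    montages = (
        shuffled.reshape(b, m, m, c, ph, pw)
        .permute(0, 3, 1, 4, 2, 5)                  # (b, c, m, ph, m, pw)
        .reshape(b, c, h, w)
    )

    # Recover the supervision targets from the permutation: which image
    # each patch came from, and which grid cell it originally occupied.
    image_ids = perm // (m * m)   # cluster target per patch
    positions = perm % (m * m)    # location target per patch (0..m*m-1)
    return montages, image_ids, positions
```

Given a batch of `b` images, the call returns `b` montage images plus per-patch cluster and location targets, which supervise the two branches described next.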
The clustering branch applies a supervised contrastive loss that pulls together patches originating from the same image, treating each image as a cluster. The location branch solves an image-agnostic classification problem, predicting which grid cell a patch came from regardless of its source image, which guides the network toward accurately reconstructing the disrupted batch. This combination enables more efficient learning: training uses only a single batch per step while remaining competitive with dual-batch methods. A sketch of the two objectives follows.
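A minimal sketch of the two losses, assuming per-patch embeddings from the clustering head and per-patch position logits from the location head. The contrastive term follows the standard supervised contrastive form (Khosla et al.); the temperature value and the unweighted sum of the two terms are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def jigclu_losses(embeddings, logits, image_ids, positions, tau=0.3):
    """Sketch of the two JigClu training objectives.

    embeddings: (N, d) patch features for the clustering branch.
    logits:     (N, m*m) position scores from the location branch.
    image_ids / positions: targets recovered during montage construction.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / tau  # pairwise cosine similarities, temperature-scaled
    n = z.size(0)

    # Exclude self-similarity; mark patch pairs from the same source image.
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)) & ~eye

    # Supervised contrastive loss: pull together patches of the same image,
    # normalizing each row against all other patches in the batch.
    denom = torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    log_prob = sim - denom
    cluster_loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    cluster_loss = cluster_loss.mean()

    # Location loss: plain cross-entropy over the m*m grid positions.
    location_loss = F.cross_entropy(logits, positions)

    return cluster_loss + location_loss
```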
Notable Results and Implications
The proposed method improves markedly on previous single-batch methods. JigClu achieves 66.4% top-1 accuracy under linear evaluation on ImageNet-1k, outperforming single-batch alternatives by substantial margins, and is competitive with established dual-batch methods such as MoCo v2 while processing only one batch per training step. JigClu also performs well in limited-data regimes, underscoring its data efficiency.
When transferred to datasets such as COCO and CIFAR, JigClu-pretrained models consistently outperform both unsupervised and supervised alternatives, with a significant edge on CIFAR-100, where they surpass supervised weights by 4.1%. Because the pretext task exercises both instance-level and patch-level features, the learned representations suit a range of vision tasks, including object detection and classification, converging faster and performing better than conventionally pretrained models.
Potential Impact and Future Directions
This research points to a promising direction for unsupervised visual representation learning by showing that single-batch methods can rival dual-batch techniques, opening an avenue toward more efficient, resource-conserving models. The combination of montage images and supervised clustering is a pretext-task design that future work can extend to other areas of AI, potentially influencing how self-supervised tasks and networks are constructed.
In summary, JigClu matches dual-batch methods in performance while demanding fewer resources, which reinvigorates the discussion of unsupervised learning strategies and encourages further exploration of single-batch approaches. Researchers can build on this foundation to examine the broader implications of such efficient learning paradigms, especially in environments with limited data or compute.