SCAN: Learning to Classify Images without Labels
(2005.12320v2)
Published 25 May 2020 in cs.CV and cs.LG
Abstract: Can we automatically group images into semantically meaningful clusters when ground-truth annotations are absent? The task of unsupervised image classification remains an important, and open challenge in computer vision. Several recent approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works, and advocate a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task from representation learning is employed to obtain semantically meaningful features. Second, we use the obtained features as a prior in a learnable clustering approach. In doing so, we remove the ability for cluster learning to depend on low-level features, which is present in current end-to-end learning approaches. Experimental evaluation shows that we outperform state-of-the-art methods by large margins, in particular +26.6% on CIFAR10, +25.0% on CIFAR100-20 and +21.3% on STL10 in terms of classification accuracy. Furthermore, our method is the first to perform well on a large-scale dataset for image classification. In particular, we obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime without the use of any ground-truth annotations. The code is made publicly available at https://github.com/wvangansbeke/Unsupervised-Classification.
The paper presents SCAN, a novel unsupervised image classification method that decouples feature learning from clustering through a two-step process.
It employs a self-supervised pretext task to learn invariant features and a clustering network with a tailored SCAN loss to ensure consistent, confident predictions.
Results demonstrate substantial accuracy gains on datasets like CIFAR10 and STL10 and competitive performance on ImageNet, underscoring its practical impact.
This paper introduces SCAN (Semantic Clustering by Adopting Nearest neighbors), a novel two-step approach for unsupervised image classification, aiming to automatically group images into meaningful semantic clusters without using ground-truth labels (Gansbeke et al., 2020). The core idea is to decouple feature learning from clustering, addressing two limitations of prior work: end-to-end methods are sensitive to network initialization and can latch onto low-level features, while representation-learning pipelines apply a suboptimal off-the-shelf clustering technique such as K-means to the extracted features.
Methodology:
Step 1: Pretext Task Feature Learning:
A self-supervised pretext task (e.g., instance discrimination like MoCo or SimCLR) is used to train an embedding function Φθ.
Crucially, the chosen pretext task should satisfy an invariance criterion (Eq. 1), ensuring that an image $X_i$ and its augmentations $T[X_i]$ are mapped close together in feature space: $\min_{\theta} d(\Phi_{\theta}(X_i), \Phi_{\theta}(T[X_i]))$. This promotes the learning of semantically meaningful, transformation-invariant features (a code sketch follows this step).
After training Φθ, it is used to extract feature representations for all images in the dataset D.
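To make the invariance criterion concrete, here is a minimal PyTorch sketch of a consistency objective that pulls an image and its augmentation together in feature space. The names `backbone` and `augment` are hypothetical placeholders, and this simplified loss only illustrates Eq. 1; the paper itself relies on full instance-discrimination objectives (SimCLR, MoCo) rather than this bare consistency term.

```python
import torch
import torch.nn.functional as F

def invariance_loss(backbone, images, augment):
    """Toy version of the invariance criterion: minimize the distance
    between the embeddings of X_i and of its augmentation T[X_i]."""
    z1 = F.normalize(backbone(images), dim=1)            # features of X_i
    z2 = F.normalize(backbone(augment(images)), dim=1)   # features of T[X_i]
    # Cosine distance between the two embeddings, averaged over the batch.
    return (1.0 - (z1 * z2).sum(dim=1)).mean()
```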
Step 2: Learnable Clustering with Nearest Neighbor Prior:
Mining Neighbors: For each image $X_i$, its K nearest neighbors $\mathcal{N}_{X_i}$ are identified in the feature space learned in Step 1. The paper empirically shows that these neighbors often belong to the same semantic class.
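A minimal sketch of this neighbor-mining step, assuming the pretext features have already been extracted into a NumPy array; scikit-learn is used here purely for illustration and may differ from the nearest-neighbor backend in the released code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mine_nearest_neighbors(features: np.ndarray, k: int = 20) -> np.ndarray:
    """Return, for every sample, the indices of its k nearest neighbors
    (excluding the sample itself) in the pretext feature space."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(features)
    _, indices = nn.kneighbors(features)
    return indices[:, 1:]  # column 0 is the sample itself; drop it
```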
SCAN Loss: A new clustering network $\Phi_\eta$ (with a final softmax layer over the clusters $\mathcal{C}$) is trained using the SCAN loss:

$$\Lambda = -\frac{1}{|D|} \sum_{X \in D} \sum_{k \in \mathcal{N}_X} \log \langle \Phi_\eta(X), \Phi_\eta(k) \rangle \;+\; \lambda \sum_{c \in \mathcal{C}} \Phi_\eta'^{\,c} \log \Phi_\eta'^{\,c}, \qquad \Phi_\eta'^{\,c} = \frac{1}{|D|} \sum_{X \in D} \Phi_\eta^{c}(X)$$
The first term maximizes the (log of the) dot product between the softmax predictions of an image X and each of its neighbors $k \in \mathcal{N}_X$. This encourages the network to produce consistent (same cluster assignment) and confident (close to one-hot) predictions for semantically similar images identified by the pretext task.
The second term is an entropy regularization term applied to the average prediction distribution over the dataset. It encourages cluster assignments to be uniformly distributed, preventing the model from collapsing into assigning all samples to a single cluster.
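Putting the two terms together, the following is a minimal PyTorch sketch of the SCAN objective as described above. `anchor_logits` and `neighbor_logits` are assumed to hold the clustering head's outputs for a batch of images and for one sampled neighbor of each image; the names and batching scheme are illustrative.

```python
import torch
import torch.nn.functional as F

def scan_loss(anchor_logits, neighbor_logits, entropy_weight=5.0):
    """Consistency term plus entropy regularization (sketch of the SCAN loss)."""
    p_anchor = F.softmax(anchor_logits, dim=1)      # Phi_eta(X)
    p_neighbor = F.softmax(neighbor_logits, dim=1)  # Phi_eta(k), k in N_X
    # First term: maximize the log dot product between an image's prediction
    # and its neighbor's prediction (consistent and confident assignments).
    dot = (p_anchor * p_neighbor).sum(dim=1)
    consistency = -torch.log(dot.clamp(min=1e-8)).mean()
    # Second term: entropy of the mean prediction, maximized (hence subtracted)
    # to spread samples uniformly over the clusters and avoid collapse.
    p_mean = p_anchor.mean(dim=0)
    entropy = -(p_mean * torch.log(p_mean.clamp(min=1e-8))).sum()
    return consistency - entropy_weight * entropy
```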
Self-Labeling Fine-tuning: To mitigate noise from potential false positives in the mined neighbors, a self-labeling step is performed after the initial clustering. Samples for which the network Φη produces highly confident predictions (softmax probability > threshold) are selected. Pseudo-labels are assigned based on the predicted cluster, and the network is fine-tuned using a cross-entropy loss on strongly augmented versions of these confident samples. This allows the network to refine its decision boundaries based on its most certain predictions.
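A minimal PyTorch sketch of one self-labeling update under these assumptions: a trained clustering head `model`, a weakly and a strongly augmented view of the same batch, and the 0.99 confidence threshold from the implementation details below. The paper additionally weights the cross-entropy to counter class imbalance, which is omitted here.

```python
import torch
import torch.nn.functional as F

def self_label_step(model, weak_images, strong_images, threshold=0.99):
    """Fine-tune on confidently predicted samples using their own pseudo-labels."""
    with torch.no_grad():
        probs = F.softmax(model(weak_images), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = confidence > threshold      # keep only highly confident samples
    if mask.sum() == 0:
        return None                        # nothing confident enough in this batch
    logits = model(strong_images[mask])    # predict on the strong augmentations
    return F.cross_entropy(logits, pseudo_labels[mask])
```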
Implementation Details:
Backbone: ResNet-18 for smaller datasets, ResNet-50 for ImageNet.
Pretext Task: Instance discrimination using SimCLR implementation for smaller datasets, MoCo for ImageNet. K=20 neighbors mined.
Clustering Step: Trained for 100 epochs (smaller datasets), using Adam optimizer, batch size 128, entropy weight λ=5. Strong augmentations (RandAugment) applied.
Self-Labeling Step: Trained for 200 epochs (smaller datasets), threshold 0.99, weighted cross-entropy, Adam optimizer. Strong augmentations (RandAugment) applied. For ImageNet, SGD optimizer used.
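For convenience, the hyperparameters listed above can be collected into a single configuration sketch. The key names are illustrative and do not correspond to the repository's actual config files; comments note the ImageNet variants.

```python
scan_config = {
    "backbone": "resnet18",                  # resnet50 for ImageNet
    "pretext_task": "simclr",                # moco for ImageNet
    "num_neighbors": 20,                     # K
    "clustering": {
        "epochs": 100,
        "optimizer": "adam",
        "batch_size": 128,
        "entropy_weight": 5.0,               # lambda
        "augmentation": "randaugment",
    },
    "self_labeling": {
        "epochs": 200,
        "confidence_threshold": 0.99,
        "loss": "weighted_cross_entropy",
        "optimizer": "adam",                 # sgd for ImageNet
        "augmentation": "randaugment",
    },
}
```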
Evaluation: Standard metrics like Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) are used, often employing the Hungarian matching algorithm to map predicted clusters to ground-truth classes.
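As a reference point, clustering accuracy under Hungarian matching can be computed with SciPy's linear_sum_assignment as sketched below (illustrative, not the paper's evaluation code).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Map predicted clusters to ground-truth classes with the Hungarian
    algorithm, then report the resulting classification accuracy."""
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # co-occurrence counts
    rows, cols = linear_sum_assignment(cost.max() - cost)  # maximize matches
    mapping = dict(zip(rows, cols))          # cluster index -> class index
    return float(np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)]))
```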
Experiments and Results:
Ablation Studies: Demonstrated the superiority of the two-step approach over K-means on pretext features, the effectiveness of the SCAN loss, the benefit of strong augmentations (RandAugment), and the significant improvement from self-labeling. Showed that pretext tasks satisfying the invariance criterion (Instance Discrimination) work better than those that don't (RotNet). Performance was relatively stable for different values of K (number of neighbors).
State-of-the-Art Comparison: SCAN significantly outperformed previous unsupervised classification methods on CIFAR10 (+26.6% ACC), CIFAR100-20 (+25.0% ACC), and STL10 (+21.3% ACC). It achieved performance close to supervised levels on CIFAR10 and STL10.
Overclustering: Showed robustness when the number of clusters C was overestimated (e.g., using 20 clusters for CIFAR10), with performance remaining high and even improving on CIFAR100-20, suggesting it can handle datasets with high intra-class variance.
ImageNet: Achieved strong results on ImageNet subsets (50, 100, 200 classes) and the full 1000-class dataset, outperforming K-means on pretext features. Qualitative results showed semantically coherent clusters and meaningful "prototypes" (highly confident samples). Notably, SCAN outperformed several semi-supervised methods using 1% labeled ImageNet data, despite using no labels itself.
Conclusion:
SCAN presents an effective framework for unsupervised image classification by separating representation learning from clustering and using nearest neighbors derived from self-supervised features as a prior. The SCAN loss encourages consistent and confident predictions, while self-labeling refines the results. The method achieves state-of-the-art performance across multiple benchmarks, including large-scale datasets like ImageNet, demonstrating the power of decoupling and leveraging high-quality self-supervised representations for clustering.