
SwAV: Self-Supervised Visual Clustering

Updated 5 November 2025
  • The paper introduces a novel self-supervised method that swaps cluster assignments between views to enforce transformation invariance and prevent collapse.
  • SwAV employs online optimal transport for balanced soft clustering combined with multi-crop augmentation, enabling scalable training with small batch sizes.
  • Empirical results show that SwAV outperforms traditional contrastive methods on benchmarks like ImageNet and in applications such as medical imaging.

SwAV (Swapping Assignments between Views) is a self-supervised vision representation learning framework introduced to address the limitations of conventional contrastive learning paradigms. It achieves this by leveraging online clustering and a swapped prediction mechanism to learn transformation-invariant features efficiently, without requiring explicit negative pairs or large batch sizes. SwAV and its adaptations have impacted a broad range of domains, from large-scale web image learning to medical imaging and structured regression tasks.

1. Methodological Foundations and Technical Formulation

SwAV is situated within self-supervised learning, aiming to extract high-quality features from unlabelled data. It introduces several interlocking algorithmic components:

  1. Multi-crop augmentation: Given an image $\mathbf{x}_n$, generate $V+2$ views, combining two global crops (high resolution) and $V$ local crops (lower resolution). This multi-crop policy expands the set of transformations under which invariance is encouraged, while maintaining computational efficiency.
  2. Feature projection: Each augmented view $\mathbf{x}_{nt}$ is passed through an encoder and projection head to yield a normalized vector:

$$\mathbf{z}_{nt} = f_\theta(\mathbf{x}_{nt}), \qquad \mathbf{z}_{nt} \leftarrow \frac{\mathbf{z}_{nt}}{\|\mathbf{z}_{nt}\|_2}$$

  3. Clustering via optimal transport: Maintain $K$ learnable prototype vectors $\{\mathbf{c}_k\}_{k=1}^K$. Assign features to prototypes via a soft assignment matrix $\mathbf{Q}$ by solving an entropy-regularized optimal transport problem:

$$\max_{\mathbf{Q} \in \mathcal{Q}} \operatorname{Tr}(\mathbf{Q}^\top \mathbf{C}^\top \mathbf{Z}) + \varepsilon H(\mathbf{Q})$$

subject to batch-wise equipartition constraints

$$\mathcal{Q} = \left\{ \mathbf{Q} \geq 0 \;\middle|\; \mathbf{Q}\mathbf{1}_B = \tfrac{1}{K}\mathbf{1}_K,\; \mathbf{Q}^\top\mathbf{1}_K = \tfrac{1}{B}\mathbf{1}_B \right\}$$

This is efficiently implemented with the Sinkhorn-Knopp algorithm, enforcing balanced usage of all prototypes within each batch (Caron et al., 2020, Assran et al., 2022).
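
A minimal PyTorch-style sketch of this Sinkhorn-Knopp normalization is given below; the function name, the entropy weight, and the iteration count are illustrative assumptions rather than the reference implementation.

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Entropy-regularized balanced assignment via Sinkhorn-Knopp (sketch).

    scores: (B, K) similarities z @ C^T between B features and K prototypes.
    Returns soft codes Q of shape (B, K); prototype usage is (approximately)
    balanced within the batch, and each row sums to 1.
    """
    Q = torch.exp(scores / eps).T        # (K, B), exponentiated transport kernel
    Q /= Q.sum()                         # treat Q as a joint probability matrix

    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # row marginals ...
        Q /= K                           # ... equal 1/K (equipartition)
        Q /= Q.sum(dim=0, keepdim=True)  # column marginals ...
        Q /= B                           # ... equal 1/B (one unit per sample)

    Q *= B                               # each column (sample) now sums to 1
    return Q.T                           # (B, K)
```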

  4. Swapped prediction objective: For any two views $s, t$ of the same image, SwAV optimizes the cross-entropy between feature-prototype assignments:

$$L(\mathbf{z}_t, \mathbf{z}_s) = \ell(\mathbf{z}_t, \mathbf{q}_s) + \ell(\mathbf{z}_s, \mathbf{q}_t)$$

where

$$\ell(\mathbf{z}, \mathbf{q}) = -\sum_{k} \mathbf{q}^{(k)} \log \mathbf{p}^{(k)}, \qquad \mathbf{p}^{(k)} = \frac{\exp\!\left(\tfrac{1}{\tau}\, \mathbf{z}^\top \mathbf{c}_k\right)}{\sum_{k'} \exp\!\left(\tfrac{1}{\tau}\, \mathbf{z}^\top \mathbf{c}_{k'}\right)}$$

The loss is summed over all images in a batch and over all pairs of views. Soft assignments are crucial; hard assignments degrade performance (Caron et al., 2020).
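
A minimal sketch of the swapped prediction loss for a single pair of views, reusing the sinkhorn routine above; the temperature value and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def swav_loss(z_t, z_s, prototypes, temperature=0.1):
    """Swapped prediction loss for two views of the same batch (sketch).

    z_t, z_s:   (B, D) L2-normalized features from views t and s.
    prototypes: (K, D) L2-normalized prototype matrix C.
    """
    scores_t = z_t @ prototypes.T      # (B, K) feature-prototype similarities
    scores_s = z_s @ prototypes.T

    # Codes are computed with Sinkhorn-Knopp (sketched above) and used as
    # fixed targets; gradients do not flow through them.
    q_t = sinkhorn(scores_t)
    q_s = sinkhorn(scores_s)

    # Predict the code of one view from the features of the other view.
    log_p_t = F.log_softmax(scores_t / temperature, dim=1)
    log_p_s = F.log_softmax(scores_s / temperature, dim=1)
    loss = -((q_s * log_p_t).sum(dim=1) + (q_t * log_p_s).sum(dim=1)).mean()
    return loss
```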

  5. Optimization: Jointly adapt prototype and encoder parameters via SGD. Prototypes are normalized after each update to stabilize training.
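
A minimal sketch of this update step, assuming a PyTorch optimizer that covers both the encoder and the prototype matrix; re-projecting the prototypes onto the unit sphere after each step is the normalization mentioned above.

```python
import torch
import torch.nn.functional as F

def optimization_step(optimizer, loss, prototypes):
    """One joint SGD step on encoder and prototype parameters (sketch)."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Re-project each prototype vector onto the unit sphere.
        prototypes.copy_(F.normalize(prototypes, p=2, dim=1))
```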

This architecture prevents trivial collapse by both the equipartition constraint and the swapping of assignments, without requiring negatives, momentum encoders, or large memory banks (Jha et al., 22 Feb 2024).

2. Positioning with Respect to Contrastive and Clustering-based SSL

Traditional contrastive methods (e.g., SimCLR, MoCo) rely on maximizing agreement between positive pairs (different augmentations of the same example) while repelling all other examples (“negatives”). This requires either large global batch sizes or persistent feature banks/queues, which scale poorly.

SwAV departs by:

  • Eschewing explicit negatives: The cluster structure and the assignment-swap constraint regularize the representation space, achieving a similar effect to negative pairs without quadratic pairwise computations (Caron et al., 2020, Si et al., 19 Aug 2025).
  • Providing a unifying constraint: SwAV’s assignment consistency brings it under the Generalized Learning Framework (GLF) for SSCL, where the “aligning” part encourages consistency of representation under augmentation, and the “constraining” part imposes a clustering prior (Si et al., 19 Aug 2025).
  • Soft cluster assignments as codes: Unlike DeepCluster/DINO, which may use off-line clustering, SwAV uses balanced, online clustering per mini-batch as an intrinsic part of the loss layer (Caron et al., 2020).

This positions SwAV as a unification of contrastive, clustering, and joint-embedding methods, exhibiting improved memory efficiency and stability.

3. Mathematical Mechanisms for Collapse Prevention and Inductive Bias

SwAV’s most critical inductive mechanism is the enforced uniform partitioning of assignments:

  • Every prototype is equally likely to be assigned within a batch, as enforced by Sinkhorn-normalized optimal transport steps (Assran et al., 2022, Jha et al., 22 Feb 2024).
  • The mean representation (center vector) is implicitly kept close to zero, serving as a powerful anti-collapse regularizer (Jha et al., 22 Feb 2024). Even with fixed, non-adaptive prototypes, uniform initialization on a unit sphere is sufficient to avoid collapse; however, learnable prototypes yield superior downstream performance.
  • Empirically, this constraint remains effective up to very large model and dataset scales, provided the number of prototypes matches semantic diversity (Goyal et al., 2021).

There is, however, a recognized limitation: the uniform prior may mismatch non-uniform, real-world label distributions, degrading representation quality on long-tailed datasets (Assran et al., 2022). Using power-law or empirically matched priors for cluster assignment improves feature quality on imbalanced data.
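
To make the assignment prior explicit, the Sinkhorn step from Section 1 can be generalized from uniform prototype marginals to an arbitrary prior vector, as in the sketch below; this is an illustrative generalization under stated assumptions, not the reference implementation of any cited method.

```python
import torch

@torch.no_grad()
def sinkhorn_with_prior(scores, prior, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp with a non-uniform prototype prior (sketch).

    scores: (B, K) feature-prototype similarities.
    prior:  (K,) desired marginal over prototypes, summing to 1; a uniform
            vector recovers the standard equipartition constraint.
    """
    Q = torch.exp(scores / eps).T                    # (K, B)
    Q /= Q.sum()
    B = Q.shape[1]
    for _ in range(n_iters):
        Q *= (prior / Q.sum(dim=1)).unsqueeze(1)     # rows follow the prior
        Q *= (1.0 / B) / Q.sum(dim=0, keepdim=True)  # columns stay balanced
    return (Q * B).T                                 # (B, K) soft codes
```

With prior = torch.full((K,), 1.0 / K) this reduces to the uniform case; a power-law or empirical class-frequency vector implements the matched priors discussed above.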

4. Augmentation Strategies and Efficiency Innovations

The multi-crop augmentation policy is central:

  • By combining two high-resolution “global” views with multiple small, local crops, SwAV achieves stronger transformation invariance (object/part relations) and increased data efficiency, while incurring minimal additional memory/computational burden (Caron et al., 2020); a minimal transform sketch follows this list.
  • The multi-crop strategy is especially impactful in representations requiring robustness to spatial scale or structure, such as medical imaging (Margapuri, 9 Dec 2024) and medical signal modalities (Soltanieh et al., 2023).
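
The following is a minimal sketch of such a multi-crop transform set, assuming torchvision; the crop sizes (224 and 96 pixels), scale ranges, and number of local crops are illustrative defaults rather than the exact recipe of any cited work.

```python
from torchvision import transforms

def multi_crop_transforms(n_local=6):
    """Two high-resolution global crops plus n_local low-resolution local crops (sketch)."""
    global_crop = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.14, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    local_crop = transforms.Compose([
        transforms.RandomResizedCrop(96, scale=(0.05, 0.14)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    return [global_crop, global_crop] + [local_crop] * n_local

def make_views(img, crop_transforms):
    """Apply every transform to one PIL image, yielding the V + 2 views."""
    return [t(img) for t in crop_transforms]
```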

Empirical ablations show that augmenting the crop policy with semantics-aware background augmentations further increases robustness to distributional shift and improves limited-label accuracy (Ryali et al., 2021).

SwAV enables competitive training at small batch sizes or with low intra-batch diversity by maintaining only a shallow assignment queue, rather than a full memory bank, making it scalable to resource-constrained settings (Ciga et al., 2021, Caron et al., 2020). In the resource-poor regime (e.g., small data, small images), SwAV’s efficiency advantage is more pronounced.
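
For the small-batch regime, a feature queue can be concatenated with the current batch before computing codes, as sketched below; the class name, queue length, and usage pattern are illustrative assumptions, and sinkhorn refers to the routine sketched in Section 1.

```python
import torch

class FeatureQueue:
    """Fixed-size FIFO of past features used to enlarge the Sinkhorn batch (sketch)."""

    def __init__(self, length=3840, dim=128):
        self.buffer = torch.zeros(length, dim)
        self.seen = 0          # number of features enqueued so far

    @torch.no_grad()
    def codes_for(self, z, prototypes):
        """Compute codes for the current batch, using queued features once available."""
        use_queue = self.seen >= self.buffer.shape[0]
        feats = torch.cat([self.buffer, z]) if use_queue else z
        q = sinkhorn(feats @ prototypes.T)   # routine sketched in Section 1
        return q[-z.shape[0]:]               # keep only the current batch's codes

    @torch.no_grad()
    def enqueue(self, z):
        n = z.shape[0]
        self.buffer = torch.roll(self.buffer, shifts=n, dims=0)
        self.buffer[:n] = z
        self.seen += n
```

Only the codes of the in-batch features are kept, so the queue shapes the equipartition statistics without contributing gradients.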

5. Empirical Performance and Application Domains

Image Classification: SwAV achieves strong linear-probe performance: 75.3% top-1 accuracy on ImageNet with a ResNet-50, approaching supervised pretraining on linear evaluation while surpassing it on transfer tasks, and outperforming SimCLR (70.0%), MoCo-v2 (71.1%), PIRL, and DeepCluster (Caron et al., 2020). Unlike methods relying strictly on instance discrimination, SwAV features transfer well to a broad suite of downstream tasks.

Transfer and Domain Generalization: SwAV pretraining on random, uncurated web-scale images produces features that transfer robustly. In the SEER pipeline, a RegNetY-256GF model trained on 1B Instagram images achieved 84.2% ImageNet accuracy—on par with or exceeding supervised and prior SSL SOTA—even though the pretraining dataset was entirely unlabelled and uncurated (Goyal et al., 2021). SwAV also demonstrates robust OOD generalization in medical domains such as ECG arrhythmia detection, consistently outperforming SimCLR and BYOL (Soltanieh et al., 2023).

Continual and Semi-supervised Learning: SwAV-based pretraining confers significant benefit in online continual learning (OCL) for class-incremental scenarios, yielding higher final accuracy and reduced forgetting compared to supervised initialization. These benefits are amplified as the pretraining label set shrinks (Gallardo et al., 2021). In semi-supervised adaptations, SwAV forms the kernel of "Suave," where labeled and unlabeled data are integrated through a unified cluster-class prototype mechanism, achieving state-of-the-art semi-supervised results (Fini et al., 2023).

Medical Imaging and Structured Regression: SwAV's multi-crop and cluster invariance mechanisms are particularly suited for medical images with multi-scale and variable-structure abnormalities (Margapuri, 9 Dec 2024), and have been shown to outperform supervised and alternative SSL methods on ulcerative colitis grading and ECG disease detection.

6. Variants, Extensions, and Limitations

Equivariance Adaptations: In regression tasks like gaze estimation, SwAV’s invariance under augmentation is suboptimal. SwAT adapts SwAV by introducing an explicit feature transform layer, enforcing equivariance under geometric transforms (e.g., flips, rotations), yielding substantial performance gains in cross-dataset gaze estimation (Farkhondeh et al., 2022).

Interpretability and Spatial Attention: SwAV is agnostic to spatial correspondence by default. Architectural additions such as spatial cross-attention modules can be introduced during training to promote feature localization and improve interpretability, as reflected in improved class activation maps (CAMs) and saliency, along with slight gains on linear probing and object detection (Seyfi et al., 2022).

Prior Mismatch Issues: Empirical studies confirm that SwAV’s uniform cluster prior limits performance on class-imbalanced, real-world distributions, as the model will align features with low-level properties rather than rare semantic classes. This can be mitigated by matching priors to the class distribution (e.g., power-law, empirical class histogram) (Assran et al., 2022).

A notable caveat is that SwAV’s local (view-to-view or crop-level) alignment alone does not induce explicit inter-class separability. Extensions such as adaptive distribution calibration directly enforce intra-class compactness and inter-class separation, surpassing vanilla SwAV in downstream discriminability (Si et al., 19 Aug 2025).

7. Summary Table: Core SwAV Workflow

| Component | Mechanism | Constraint/Purpose |
| --- | --- | --- |
| Data Augmentation | Multi-crop (global and local views) | Transformation invariance |
| Feature Projection | Normalized encoder + MLP head | Dimensionality reduction |
| Assignment to Prototypes | Soft allocation via Sinkhorn-Knopp OT | Equipartition, anti-collapse |
| Training Objective | Swapped cross-entropy of assignments | Invariant feature learning |
| Optimization | Joint on encoder and prototypes | Efficiency, stability |
| Deployment | No negatives, minimal memory queue | Scalability |

8. Conclusion

SwAV introduces a cluster-based self-supervised learning paradigm, employing online optimal transport assignments and a swapped prediction mechanism to address the scalability, stability, and memory limitations intrinsic to contrastive learning. Its principled use of multi-crop augmentation, soft assignment codes, and prototype normalization ensures transformation-invariant feature learning with minimal risk of representation collapse. Empirically, SwAV is competitive or superior across mainstream benchmarks and diverse application domains. Limitations arising from its uniform cluster prior and the absence of explicit class separation have given rise to extensions that further improve downstream discriminability, domain adaptability, and sample efficiency.
