SwAV: Self-Supervised Visual Learning

Updated 11 June 2026

SwAV is a self-supervised method that reinterprets visual representation learning as an online clustering problem using learnable prototypes, enabling efficient training.
It employs a swapped prediction mechanism with optimal transport and Sinkhorn iterations to ensure assignment consistency across multiple image augmentations.
Empirical results show that SwAV delivers competitive ImageNet performance and robust transferability, with extensions to semi-supervised learning and domain adaptation.

SwAV (Swapping Assignments between Views) is a self-supervised visual representation learning method that replaces traditional pairwise contrastive comparisons with online clustering and prototype consistency across data augmentations. By leveraging a "swapped prediction" mechanism and optimal-transport-based assignment, SwAV achieves scalable, memory-efficient training while avoiding feature collapse through architectural asymmetry and assignment-based regularization.

1. Conceptual Foundations and Objective

SwAV reinterprets self-supervised learning as an online clustering problem. Rather than relying on instance discrimination via explicit positive/negative feature comparisons, SwAV introduces a set of learnable prototype vectors $\{c_k\}_{k=1}^K$ in feature space. Each view of an image is mapped to a normalized feature, which is then softly assigned to prototypes by solving a balanced optimal transport problem: for a minibatch $Z=[z_1,\ldots,z_B] \in \mathbb{R}^{d \times B}$ and prototype matrix $C\in\mathbb{R}^{d\times K}$, the assignments $Q^* \in [0,1]^{K \times B}$ are determined via

$Q^* = \arg\max_{Q \geq 0} \langle Q,\, C^\top Z\rangle + \epsilon H(Q)$

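subject to the marginal constraints $Q\mathbf{1}_B = \frac{1}{K}\mathbf{1}_K$ and $Q^\top\mathbf{1}_K = \frac{1}{B}\mathbf{1}_B$, which force uniform usage of prototypes and samples within the batch. Here $H(Q) = -\sum_{k,b} Q_{kb}\log Q_{kb}$ is the entropy of the assignment and $\epsilon$ controls its smoothness; the solution is computed with a few iterations of the Sinkhorn–Knopp algorithm (Caron et al., 2020).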
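The balanced assignment step can be illustrated with a small pure-Python sketch. It assumes the usual entropic optimal transport recipe: exponentiate the scores divided by $\epsilon$, then alternately rescale rows and columns. The function name `sinkhorn`, the default values, and the toy scores are illustrative choices, not taken from a reference implementation.

```python
import math


def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a K x B score matrix into balanced soft assignments Q.

    scores[k][b] holds the similarity c_k^T z_b between prototype k and
    sample b. After the final column step every column of Q sums to 1/B,
    and the rows sum to approximately 1/K.
    """
    K, B = len(scores), len(scores[0])
    # Shift by the global max so the exponentials cannot overflow.
    top = max(max(row) for row in scores)
    Q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: give every prototype a total mass of 1/K.
        for k in range(K):
            row_sum = sum(Q[k])
            Q[k] = [q / (row_sum * K) for q in Q[k]]
        # Column step: give every sample a total mass of 1/B.
        for b in range(B):
            col_sum = sum(Q[k][b] for k in range(K))
            for k in range(K):
                Q[k][b] /= col_sum * B
    return Q


# Toy example: 2 prototypes, 4 samples (values are made up).
scores = [[0.9, 0.8, 0.1, 0.2],
          [0.1, 0.3, 0.7, 0.9]]
Q = sinkhorn(scores)
# Per-sample codes: rescale each column by B so that it sums to 1.
codes = [[Q[k][b] * len(scores[0]) for k in range(2)] for b in range(4)]
```

Shifting by the global maximum only rescales the whole matrix, which the row and column normalizations absorb, so it does not change the result.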

The core learning objective is to enforce consistency of cluster assignments across different augmentations (views) of the same image. For two views with features $z_t$ and $z_s$, SwAV computes their assignments (codes) $q_t$ and $q_s$, and the loss penalizes discrepancies between the assignment predicted from one view and the target code of the other ("swapped prediction"): $L_\text{SwAV} = \frac{1}{N} \sum_{n=1}^N \left[\ell_\text{swap}(z_{nt}, q_{ns}) + \ell_\text{swap}(z_{ns}, q_{nt})\right]$, where $\ell_\text{swap}(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}$ is the swapped-prediction cross-entropy and $p_t^{(k)}$ is the softmax over $k$ of $z_t^\top c_k / \tau$ with temperature $\tau$ (Bartler et al., 2022).
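As a minimal sketch of the swapped prediction term for one image, the code below assumes that each view's prediction is a softmax over its prototype similarities divided by a temperature, and that the codes are given (for instance from a Sinkhorn step). The helper names and the toy vectors are hypothetical.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(code, probs):
    # Cross-entropy between a target code and a predicted distribution.
    return -sum(q * math.log(p) for q, p in zip(code, probs) if q > 0)


def swapped_loss(z_t, z_s, q_t, q_s, prototypes, tau=0.1):
    """Swapped prediction loss for two views of one image.

    z_t, z_s: normalized feature vectors of the two views.
    q_t, q_s: their length-K codes, treated as fixed targets.
    prototypes: list of K prototype vectors.
    """
    def predict(z):
        logits = [sum(zi * ci for zi, ci in zip(z, c)) / tau for c in prototypes]
        return softmax(logits)

    # View t predicts the code of view s, and vice versa.
    return cross_entropy(q_s, predict(z_t)) + cross_entropy(q_t, predict(z_s))


# Toy example with two prototypes in 2-D (values are made up).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
z_t, z_s = [1.0, 0.0], [0.8, 0.6]
consistent = swapped_loss(z_t, z_s, [1.0, 0.0], [1.0, 0.0], prototypes)
inconsistent = swapped_loss(z_t, z_s, [0.0, 1.0], [0.0, 1.0], prototypes)
```

Here `consistent` is much smaller than `inconsistent`, because both views already lie closest to the first prototype.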

2. Rank Differential Mechanism and Collapse Avoidance

A central theoretical lens on SwAV is provided by the Rank Differential Mechanism (RDM) (Zhuo et al., 2023), which unifies the collapse-avoidance strategy across non-contrastive learning methods. Let $\Sigma_o$ and $\Sigma_t$ denote the output feature correlation matrices for the online branch and the target branch (in SwAV, the target branch is the one that passes through the Sinkhorn operation).

RDM Principle: Successful self-supervised methods, including SwAV, maintain a strict effective rank difference $\Delta\mathrm{erank} = \mathrm{erank}(\Sigma_t) - \mathrm{erank}(\Sigma_o) > 0$ throughout training, where

$\mathrm{erank}(\Sigma) = \exp\left(-\sum_i p_i \log p_i\right), \quad p_i = \lambda_i / \sum_j \lambda_j$

and $\lambda_i$ are eigenvalues of $\Sigma$.

SwAV's assignment-based asymmetry with the Sinkhorn operation serves as a spectral high-pass filter on the target branch, enforcing $\Delta\mathrm{erank} > 0$ and, by Theorem 4.5 in (Zhuo et al., 2023), provably averting representational collapse.

Key lemmas detailing this include:

Eigenspace Alignment: The minimizer of the alignment loss diagonalizes $\Sigma_o$ and $\Sigma_t$ in the same basis.
Spectral Filter View: The online branch is a filtered version of the target branch: in the shared eigenbasis, the spectrum of $\Sigma_o$ is obtained from that of $\Sigma_t$ through a spectral filter $g$, with $g$ monotonic.
Low-pass $\Rightarrow$ Rank Difference: If $g$ is non-constant and monotonically increasing, then $\mathrm{erank}(\Sigma_o) < \mathrm{erank}(\Sigma_t)$.
Dynamics: As training progresses, the effective rank of the features increases so long as $\Delta\mathrm{erank} > 0$ (Zhuo et al., 2023).

Empirically, SwAV exhibits substantial and persistent $\Delta\mathrm{erank}$ compared to other non-contrastive methods, with both effective ranks growing after warmup.
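To make the effective rank concrete, the sketch below assumes the common entropy-based definition: normalize the eigenvalues into a distribution $p_i = \lambda_i / \sum_j \lambda_j$ and take the exponential of its Shannon entropy. The two spectra are invented for illustration and do not come from any trained model.

```python
import math


def effective_rank(eigenvalues):
    """Exponential of the Shannon entropy of the normalized spectrum."""
    # Zero eigenvalues contribute nothing to the entropy, so drop them.
    positive = [lam for lam in eigenvalues if lam > 0]
    total = sum(positive)
    probs = [lam / total for lam in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical spectra: the target branch is flatter than the online branch.
target_spectrum = [1.0, 0.8, 0.6, 0.4]
online_spectrum = [1.0, 0.5, 0.2, 0.05]
rank_difference = effective_rank(target_spectrum) - effective_rank(online_spectrum)
```

A perfectly flat spectrum of $n$ equal eigenvalues has effective rank $n$, while a spectrum with a single nonzero eigenvalue has effective rank 1, so a flatter target spectrum yields a positive `rank_difference`.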

3. Training Pipeline, Architecture, and Algorithmic Steps

SwAV employs a convolutional backbone (typically ResNet-50) and an MLP projection head, followed by a learnable prototype matrix ($K = 3000$ is standard for ImageNet) (Caron et al., 2020, Lahrichi et al., 15 Feb 2025). The algorithm proceeds as follows:

Data Augmentation: For each image, generate several augmented views (including multi-crop, e.g. two global and several local crops).
Feature Extraction: Pass each view through encoder + projection head; normalize resulting features.
Prototype Assignment: For each view, compute similarity scores to prototypes, then obtain soft assignments by running the Sinkhorn–Knopp algorithm under constraints that equally distribute assignments across prototypes and samples.
Loss Computation: For all pairs of views, compute swapped-prediction cross-entropy between the predicted assignment of one view and the Sinkhorn code of the other.
Backpropagation: Update encoder, projector, and prototypes. Prototypes are $\ell_2$-normalized after each update.
Multi-Crop: Assignments for local views are predicted using cluster assignments from global views, improving granularity and efficiency (Caron et al., 2020).
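The view pairing implied by the multi-crop step can be sketched as follows, assuming that codes are computed only for the global crops and that every other view of the same image predicts them. The function name `swapped_pairs` and the index convention (global crops first) are illustrative assumptions.

```python
def swapped_pairs(n_global=2, n_local=4):
    """List (code_view, predicting_view) index pairs for one image.

    Views 0 .. n_global-1 are global crops and the remaining ones are
    local crops. Only global views supply target codes; every other
    view (global or local) is asked to predict them.
    """
    n_views = n_global + n_local
    return [(g, v) for g in range(n_global) for v in range(n_views) if v != g]


pairs = swapped_pairs(n_global=2, n_local=4)
# Each of the 2 global views is predicted by the 5 other views.
```

With two global and four local crops this gives ten loss terms per image, while Sinkhorn codes are needed for only two of the six views.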

Hyperparameters: Representative values include batch size $B = 4096$ (for large-scale training), temperature $\tau = 0.1$, Sinkhorn regularizer $\epsilon = 0.05$, learning rate scheduling with the LARS optimizer, a 2-layer MLP projection head (128-d output), and $K = 3000$ prototypes (Caron et al., 2020).

4. Empirical Performance and Applications

SwAV demonstrates competitive or superior performance across standard computer vision benchmarks (Caron et al., 2020, Lahrichi et al., 15 Feb 2025):

ImageNet Linear Evaluation: SwAV achieves 75.3% top-1 accuracy (ResNet-50, 800 epochs), outperforming SimCLR (70%) and MoCo-v2 (71.1%) and coming within 1.2 points of supervised pretraining (76.5%).
Transfer Learning: Outperforms the supervised baseline on Places205, PASCAL VOC07 (mAP: 88.9 vs 87.5), iNat18, and COCO detection tasks.
Semi-Supervised Learning: As part of "Suave," SwAV prototypes serve as semantic class centers, yielding state-of-the-art accuracy in low-label regimes on both CIFAR-100 (81.6% with 100 labels per class) and ImageNet (Fini et al., 2023).
Remote Sensing: SwAV pre-training on both ImageNet and Sentinel-2 (GeoNet) yields comparable few-shot downstream performance in land cover and classification tasks, often outperforming supervised ImageNet pretraining. The performance gap between in-domain and out-of-domain pre-training is usually within 0–4% (Lahrichi et al., 15 Feb 2025).

Test-Time Adaptation: TTAPS demonstrates that SwAV-trained prototypes can be leveraged for fast per-sample adaptation under distribution shift, restoring classification accuracy on corrupted data (CIFAR10-C, 80.1% vs 72.9% for supervised baselines) (Bartler et al., 2022).

5. Architectural Variants and Extensions

SwAV admits a range of modifications justified by the RDM theory (Zhuo et al., 2023). The assignment-based asymmetry may be tuned via:

Number of Sinkhorn–Knopp iterations.
Temperature sharpening ($\tau$) to increase high-pass filtering.
Replacing Sinkhorn with explicit spectral filters on the target or online branch.
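The effect of the temperature knob on a single assignment can be seen in a small sketch. It only illustrates sharpening of a softmax over prototype similarities, not the spectral argument itself; the scores and temperatures are made-up values.

```python
import math


def softmax_with_temperature(scores, tau):
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


scores = [0.9, 0.5, 0.1]  # hypothetical prototype similarities
soft = softmax_with_temperature(scores, tau=0.5)
sharp = softmax_with_temperature(scores, tau=0.05)
# A lower temperature concentrates mass on the best-matching prototype.
```

Lowering `tau` reduces the entropy of the assignment, which is the sense in which the temperature sharpens it.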

A high-level pseudo-code for an RDM-inspired SwAV variant replaces Sinkhorn with a spectral high-pass operator applied to the singular values of the code representations, with the operator chosen so as to preserve the desired positive rank differential. Empirical studies confirm these variants match or exceed original SwAV in practice (Zhuo et al., 2023).

SwAV also integrates naturally with semi-supervised workflows: setting the prototype count $K$ equal to the number of classes, the prototypes act as both clusters for unlabeled data and class centroids for labeled samples. A unified cross-entropy loss is minimized across labeled and unlabeled samples, without need for balancing hyperparameters (Fini et al., 2023).
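One way to picture the unified loss is the sketch below. It assumes that labeled samples use one-hot class targets, that unlabeled samples use soft codes supplied from elsewhere (for example a Sinkhorn step), and that both go through the same cross-entropy over prototype scores. The function names and numbers are hypothetical, and temperature scaling is omitted for brevity.

```python
import math


def log_softmax(logits):
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_norm for x in logits]


def one_hot(label, n_classes):
    return [1.0 if k == label else 0.0 for k in range(n_classes)]


def unified_cross_entropy(logits_batch, targets_batch):
    """Mean cross-entropy over labeled and unlabeled samples alike."""
    total = 0.0
    for logits, target in zip(logits_batch, targets_batch):
        log_probs = log_softmax(logits)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(logits_batch)


n_classes = 3  # the prototype count equals the number of classes
logits_batch = [[5.0, 0.0, 0.0], [0.0, 4.0, 1.0]]
# First sample is labeled (class 0); second is unlabeled with a soft code.
targets_batch = [one_hot(0, n_classes), [0.1, 0.8, 0.1]]
loss = unified_cross_entropy(logits_batch, targets_batch)
```

Because both kinds of target enter the same average, no separate weighting term between a supervised and an unsupervised loss appears in this sketch.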

6. Limitations and Comparative Insights

SwAV's computational efficiency is notable: it avoids the large memory banks or momentum encoders characteristic of instance-based contrastive methods, requires only moderate additional cost for multi-crop augmentation, and converges significantly faster at scale (e.g., 72% ImageNet top-1 accuracy after about 6 hours of training, versus ≈40 hours of training for SimCLR) (Caron et al., 2020).

However, certain limitations persist:

The need to tune prototype count $K$, temperature $\tau$, and Sinkhorn regularizer $\epsilon$ for each dataset and batch regime.
Prototypes are fixed in number; very small or large $K$ can be suboptimal.
Balanced assignment via Sinkhorn incurs a small computational overhead per batch.
In remote sensing, domain-aligned SwAV pre-training (e.g., Sentinel-2) provides only modest incremental benefit over ImageNet pre-training, especially when only RGB bands are used (Lahrichi et al., 15 Feb 2025).
Test-time adaptation using SwAV prototypes requires careful architectural choices (e.g., adaptation restricted to last encoder block), group normalization, and parameter resets for stability (Bartler et al., 2022).

A plausible implication is that, while SwAV's assignment-based approach is widely applicable and robust, further benefit from domain-aligned SSL datasets may require richer augmentations or additional channels beyond standard RGB, particularly in domains with complex variability (Lahrichi et al., 15 Feb 2025).

7. Summary Table: SwAV Core Components

Component	Standard Setting (ImageNet, ResNet-50)	References
Prototypes ($K$)	3,000	(Caron et al., 2020)
Batch size ($B$)	4,096 (large); 256 + FIFO queue (small)	(Caron et al., 2020)
Augmentation	Multi-crop (2 global, 4–6 local views)	(Caron et al., 2020)
Sinkhorn iterations	3–5	(Caron et al., 2020)
Projection head	2-layer MLP, 128-d output	(Caron et al., 2020)
Optimizer	LARS / SGD + momentum	(Caron et al., 2020, Lahrichi et al., 15 Feb 2025)
Temperature ($\tau$)	0.1	(Caron et al., 2020)
Pre-train epochs	400–800	(Caron et al., 2020, Lahrichi et al., 15 Feb 2025)

SwAV provides a scalable, theoretically justified, and empirically validated framework for self-supervised learning by enforcing view consistency via swapped assignment of learnable prototypes. It is robust to batch size and readily extends to semi-supervised learning, test-time adaptation, and domain transfer (Caron et al., 2020, Zhuo et al., 2023, Bartler et al., 2022, Lahrichi et al., 15 Feb 2025, Fini et al., 2023).