Homomorphic Self-Supervised Learning
- H-SSL is a self-supervised representation learning framework that unifies augmentation-based and augmentation-free methods through equivariant (homomorphic) encoders.
- It reformulates contrastive objectives in feature space by employing group-structured operations and fiber sampling to mimic data augmentations without explicit transformations.
- Empirical evaluations show that H-SSL achieves comparable performance to traditional methods using equivariant architectures, while non-equivariant models fail to capture its benefits.
Homomorphic Self-Supervised Learning (H-SSL) is a general framework for self-supervised representation learning that unifies augmentation-based and augmentation-free paradigms through the lens of equivariant (homomorphic) encoders. By design, H-SSL subsumes popular contrastive objectives and many traditional self-supervised losses when the feature extractor is augmentation-homomorphic with respect to a group of data transformations (Keller et al., 2022).
1. Formal Definition and Theoretical Foundation
Let $\mathcal{X}$ denote the input space (e.g., images), and let $G$ be a discrete or continuous group of "augmentations" (e.g., translations, rotations, scalings). Consider a feature extractor $f: \mathcal{X} \to \mathcal{Z}$ and a group representation $\rho: G \to \mathrm{GL}(\mathcal{Z})$. The extractor $f$ is augmentation-homomorphic (equivariant) if, for all $g \in G$ and $x \in \mathcal{X}$,
$$f(T_g x) = \rho(g)\, f(x),$$
where $T_g$ denotes the action of $g$ on $x$ in input space, and $\rho(g)$ is its corresponding "lifted" action in feature space. Since $\rho$ preserves the group structure, $f$ is a homomorphism from $\mathcal{X}$ to $\mathcal{Z}$.
This formulation enables H-SSL to operate directly in representation space using group-structure-respecting operations, rather than requiring explicit data augmentations.
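As a concrete illustration of the equivariance condition, the following minimal sketch checks it numerically for the group of 90-degree rotations. The toy "lifting" layer `lift`, the 5x5 input, and the random kernel are illustrative assumptions and are not taken from the original work; the point is only that rotating the input cyclically shifts a feature vector indexed by group elements.

```python
import numpy as np

def lift(x, kernel):
    """Toy C4 lifting layer (hypothetical): one response per 90-degree rotation of the kernel.

    Returns a length-4 feature vector indexed by the group element r in {0, 1, 2, 3}.
    """
    return np.array([np.sum(np.rot90(kernel, r) * x) for r in range(4)])

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))        # toy "image"
kernel = rng.normal(size=(5, 5))   # toy filter, same size as the image for simplicity

features = lift(x, kernel)                 # f(x)
features_rot = lift(np.rot90(x), kernel)   # f(T_g x), with g one 90-degree rotation

# In feature space the rotation acts as a cyclic shift of the group index: rho(g) f(x).
is_equivariant = np.allclose(features_rot, np.roll(features, 1))
print(is_equivariant)
```

Here each entry of `features` plays the role of a fiber: the augmented input's features can be read off the unaugmented feature vector at a shifted index.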
2. Derivation of Self-Supervised Losses within H-SSL
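Standard augmentation-based self-supervised learning (A-SSL), such as SimCLR, uses contrastive objectives by sampling paired augmented versions of inputs. For a batch $\{x_i\}_{i=1}^{N}$, with two independent augmentations $g_1, g_2 \sim G$ and projection head $h$, the SimCLR loss (in InfoNCE form) is
$$\mathcal{L}_{\text{A-SSL}} = -\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{g_1, g_2 \sim G} \left[ \log \frac{\exp\big(\mathrm{sim}(u_i, v_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(u_i, v_j)/\tau\big)} \right], \qquad u_i = h\big(f(T_{g_1} x_i)\big), \quad v_j = h\big(f(T_{g_2} x_j)\big),$$
where $\mathrm{sim}$ is cosine similarity and $\tau$ is a temperature.
When $f$ is $G$-equivariant, this loss can be restated purely in feature space by leveraging the homomorphic property: reading out the feature map on a base set $g_0$ gives
$$f(T_g x_i)(g_0) = [\rho(g)\, f(x_i)](g_0) = f(x_i)(g^{-1} g_0),$$
where the last step takes $\rho$ to act by shifting the group index of the feature map. Each input thus yields a single feature map $z_i = f(x_i)$, with "fibers" $z_i(g)$ indexed by $g \in G$. H-SSL positives are constructed by sampling fibers $z_i(g_1^{-1} g_0)$ and $z_i(g_2^{-1} g_0)$ from $z_i$, with $g_0$ a base-set of prescribed size.
The H-SSL loss is then:
$$\mathcal{L}_{\text{H-SSL}} = -\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{g_1, g_2 \sim G} \left[ \log \frac{\exp\big(\mathrm{sim}(\hat{u}_i, \hat{v}_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(\hat{u}_i, \hat{v}_j)/\tau\big)} \right], \qquad \hat{u}_i = h\big(z_i(g_1^{-1} g_0)\big), \quad \hat{v}_j = h\big(z_j(g_2^{-1} g_0)\big).$$
This form is algebraically identical to the A-SSL loss under equivariance, rendering A-SSL a special case of H-SSL.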
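The feature-space loss can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the group is spatial translation, each fiber is a single spatial location of a `(N, C, H, W)` feature map, the projection head is omitted, and the InfoNCE term is one-directional rather than the symmetric form used by SimCLR. The function names `sample_fibers`, `info_nce`, and `hssl_loss` are hypothetical.

```python
import numpy as np

def sample_fibers(z, rng):
    """Pick one random spatial location per feature map and return its channel vector.

    z has shape (N, C, H, W); the result has shape (N, C).
    """
    n, c, h, w = z.shape
    rows = rng.integers(0, h, size=n)
    cols = rng.integers(0, w, size=n)
    return z[np.arange(n), :, rows, cols]

def info_nce(a, b, temperature=0.1):
    """One-directional InfoNCE: a[i] should match b[i] against all b[j]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                        # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def hssl_loss(z, rng, temperature=0.1):
    """H-SSL-style loss: two fibers of the same feature map form a positive pair."""
    fiber_1 = sample_fibers(z, rng)
    fiber_2 = sample_fibers(z, rng)
    return info_nce(fiber_1, fiber_2, temperature)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32, 4, 4))   # stand-in for encoder outputs f(x_i)
loss = hssl_loss(z, rng)
print(float(loss))
```

Note that the encoder runs once per input; the two "views" come from indexing the same feature map rather than from two augmented forward passes.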
Selection of the group $G$ and the base set $g_0$ recovers various objectives; for instance, $G$ as the spatial-translation group with $g_0$ a single location reduces to the local DIM(L) or Greedy InfoMax objectives, and CPC emerges for appropriate choices of heads.
3. Equivalence Conditions and Failure Modes
Proposition 3.1 states that if $f$ is exactly $G$-equivariant, the A-SSL and H-SSL losses are identical under fiber sampling in feature space. The proof consists of substituting $f(T_g x) = \rho(g)\, f(x)$ and re-indexing over the group $G$.
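The substitution step can be spelled out as follows. This is a sketch under an added assumption: features are functions on the group and the feature-space action is the regular representation, $[\rho(g) z](g_0) = z(g^{-1} g_0)$, as in group-convolutional encoders. The symbols $z_i = f(x_i)$, the base location $g_0$, the projection head $h$, and the generic contrastive term $I$ are notation introduced here for illustration.

```latex
\begin{aligned}
f(T_g x_i)(g_0) &= [\rho(g)\, z_i](g_0) = z_i(g^{-1} g_0), \\
\mathbb{E}_{g_1, g_2}\, I\big(h(f(T_{g_1} x_i)(g_0)),\; h(f(T_{g_2} x_i)(g_0))\big)
  &= \mathbb{E}_{g_1, g_2}\, I\big(h(z_i(g_1^{-1} g_0)),\; h(z_i(g_2^{-1} g_0))\big).
\end{aligned}
```

Sampling augmentations $g_1, g_2$ therefore amounts to choosing which fibers of the single feature map $z_i$ are read out.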
Corollary 3.2 establishes a critical failure: if $f$ is not homomorphic (i.e., $f$ and the group action $T_g$ do not commute), then H-SSL cannot simulate the effect of input augmentations. Empirically, this results in representations that collapse or perform at chance level, in contrast to A-SSL, which still benefits from explicit input transformations.
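The failure mode can be made tangible by measuring how far a layer is from commuting with the group action. The sketch below is an illustrative assumption, not an experiment from the original work: it compares a circulant (circular-convolution) linear map with an unconstrained dense map under cyclic shifts, using a hypothetical helper `equivariance_error`.

```python
import numpy as np

def equivariance_error(f, x, g):
    """Relative gap between f(T_g x) and rho(g) f(x), with both actions a cyclic shift by g."""
    lhs = f(np.roll(x, g))
    rhs = np.roll(f(x), g)
    return np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs)

rng = np.random.default_rng(0)
n = 16
x = rng.normal(size=n)

# Equivariant map: multiplication by a circulant matrix (a circular convolution).
first_row = rng.normal(size=n)
circulant = np.stack([np.roll(first_row, i) for i in range(n)])

# Non-equivariant map: a dense layer with unconstrained weights.
dense = rng.normal(size=(n, n))

conv_error = equivariance_error(lambda v: circulant @ v, x, g=3)
dense_error = equivariance_error(lambda v: dense @ v, x, g=3)
print(conv_error, dense_error)
```

For the circulant map the error is at the level of floating-point round-off, whereas for the dense map a shifted fiber bears no fixed relation to the features of the shifted input, so fibers no longer stand in for augmented views.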
4. Empirical Evaluation
Experimental validation employed three groups: rotation (four $90^\circ$ steps), translation (±20% shifts), and scale (six downscaling factors). An equivariant backbone was used for each group (e.g., a rotation-equivariant CNN for rotation, a standard CNN for translation, and SESN for scale).
- Datasets: MNIST, CIFAR-10, Tiny ImageNet.
- Analysis: Linear-probe accuracy after SSL pretraining was measured for both A-SSL (with explicit augmentation) and H-SSL (feature-space only).
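The linear-probe protocol can be sketched as follows: the pretrained encoder is frozen and only a linear classifier is fit on its features. For brevity this example swaps in a closed-form ridge regression to one-hot targets instead of the gradient-trained softmax classifier typically used, and the features are synthetic stand-ins; `fit_linear_probe` and `probe_accuracy` are hypothetical helper names.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a linear classifier on frozen features via ridge regression to one-hot targets."""
    x = np.hstack([features, np.ones((len(features), 1))])   # append a bias column
    y = np.eye(num_classes)[labels]
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ y)

def probe_accuracy(weights, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == labels))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# Synthetic stand-in for frozen encoder outputs: two well-separated clusters.
features = rng.normal(size=(200, 8)) + 3.0 * (2 * labels[:, None] - 1)

weights = fit_linear_probe(features[:100], labels[:100], num_classes=2)
accuracy = probe_accuracy(weights, features[100:], labels[100:])
print(accuracy)
```

Because the encoder receives no gradient, the resulting accuracy reflects how linearly separable the pretrained representation already is.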
Key findings are summarized in the following table:
| Augmentation Group | CIFAR-10 (A-SSL) | CIFAR-10 (H-SSL) |
|---|---|---|
| Translation | 39.2 ± 0.5% | 36.3 ± 1.1% |
This correspondence (a gap of about three percentage points for translation on CIFAR-10) is consistent with the theoretical equivalence of A-SSL and H-SSL when equivariant architectures are used.
When equivariant layers are replaced with generic MLPs or non-equivariant CNNs, H-SSL performs at chance level or no better than the "frozen" baseline, demonstrating that equivariance is necessary for feature-space contrastive pairs.
5. Relationship to Augmentation-Based SSL and Parameter Space
H-SSL introduces two key hyperparameters absent from vanilla A-SSL:
- The base-set size $|g_0|$, determining how many fibers are grouped as a "view."
- The topographic distance between $g_1$ and $g_2$ in $G$, controlling the effective "augmentation strength."
Empirical variation shows that increasing $|g_0|$ interpolates smoothly from local (DIM(L)) to global (SimCLR) losses, yielding only minor changes in downstream linear-probe accuracy (within ±2% on CIFAR-10). Similarly, increasing the maximum allowable topographic distance first improves and then degrades performance, paralleling findings in A-SSL regarding the utility of strong augmentations.
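Both hyperparameters can be illustrated with a small view sampler for the translation group. This is a sketch under assumptions that are not taken from the original work: a view is a square `base_size` x `base_size` patch of a `(C, H, W)` feature map, and topographic distance is measured as the Chebyshev distance between patch offsets. The function name `sample_view_pair` is hypothetical.

```python
import numpy as np

def sample_view_pair(z, base_size, max_distance, rng):
    """Crop two base_size x base_size patches from one feature map z of shape (C, H, W).

    The second patch's offset lies within max_distance of the first along each
    spatial axis, clipped so that both patches stay inside the map.
    """
    c, h, w = z.shape
    max_row, max_col = h - base_size, w - base_size
    r1 = int(rng.integers(0, max_row + 1))
    c1 = int(rng.integers(0, max_col + 1))
    r2 = int(np.clip(r1 + rng.integers(-max_distance, max_distance + 1), 0, max_row))
    c2 = int(np.clip(c1 + rng.integers(-max_distance, max_distance + 1), 0, max_col))
    view_1 = z[:, r1:r1 + base_size, c1:c1 + base_size]
    view_2 = z[:, r2:r2 + base_size, c2:c2 + base_size]
    return view_1, view_2, (r1, c1), (r2, c2)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 6, 6))   # stand-in for one feature map f(x)
view_1, view_2, offset_1, offset_2 = sample_view_pair(z, base_size=2, max_distance=1, rng=rng)
print(view_1.shape, offset_1, offset_2)
```

Setting `base_size=1` gives single-location fibers, as in the local objectives, while a larger `max_distance` lets the two views drift further apart, which plays the role of a stronger augmentation.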
Temperature $\tau$, embedding dimensionality, and projection head size remain as in SimCLR, consistent with H-SSL being a generalization of existing A-SSL setups.
6. Broader Implications and Directions
Homomorphic Self-Supervised Learning offers several conceptual and practical advantages:
- Unified Perspective: Provides a principled bridge between augmentation-based and augmentation-free SSL, subsuming contrastive, alignment, uniformity, and local InfoMax variants as instantiations of a single group-equivariant InfoNCE objective.
- Generalization: Admits any group $G$ for which an equivariant feature extractor is available, potentially allowing for multi-scale, multi-orientation, or learned, data-driven symmetries.
- Novel Design Axes: Enables new forms of view sampling and hyperparameter tuning beyond batch size and temperature.
Current limitations include the difficulty of constructing backbones that are equivariant to arbitrary data augmentations; most group convolutional networks are limited to a small set of well-structured groups, such as compact Lie groups. Approaches such as learned or approximate homomorphisms (e.g., topographic VAEs, NPTNs, L-convolutions) may be needed to extend H-SSL to the diverse augmentations employed in large-scale vision tasks.
Future research directions include hybrid models combining A-SSL (where equivariance is weak) with H-SSL (where group structure is strong), learning group representations $\rho$ jointly with the encoder $f$, and adapting H-SSL to Transformer architectures by constructing permutation- or patch-equivariant layers.
In summary, H-SSL reframes the diversity of contemporary self-supervised learning objectives as the outcome of constraining architectures to respect the underlying group structure of augmentations, with the InfoNCE loss serving as a universal objective across these contexts (Keller et al., 2022).