Position: An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research (2504.13101v2)
Abstract: Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.
Summary
- The paper proposes Singular Identifiability Theory (SITh) to bridge the gap between empirical practices and idealized models in self-supervised learning.
- It extends Identifiability Theory by incorporating realistic data augmentations, finite-sample effects, training dynamics, and architectural biases.
- The study offers practical insights for designing and evaluating robust, interpretable SSL systems with improved convergence and generalization.
This paper argues that while Self-Supervised Learning (SSL) has driven significant progress in AI, its advancement is hampered by a gap between empirical practice and theoretical understanding. Current methods often yield surprisingly similar representations, a phenomenon captured by the Platonic Representation Hypothesis (PRH) (The Platonic Representation Hypothesis, 2024), but the underlying reasons remain unclear. The authors propose extending existing Identifiability Theory (IT) into a more empirically grounded framework called Singular Identifiability Theory (SITh) to accelerate SSL research.
Introduction to Identifiability Theory (IT) for Practitioners
IT provides tools to determine whether the underlying latent factors (or "ingredients") that generate the data can be recovered from the observations alone.
- Data Generating Process (DGP): IT models data generation as a process (the DGP) where latent variables z (e.g., object type, color, pose) are transformed by a function f (the "renderer" or decoder) to produce observations x (e.g., images).
- Goal: The core question is whether we can learn an encoder g such that g(x) recovers z, possibly up to some acceptable ambiguities (an equivalence class, e.g., permutation or scaling of latents).
- Example - SimCLR: The paper illustrates IT's value using SimCLR (A Simple Framework for Contrastive Learning of Visual Representations, 2020).
- SimCLR first demonstrated strong empirical performance, without a theoretical account of why it worked.
- Theoretical analysis revealed its loss optimizes for representation alignment (pulling similar views together) and uniformity (pushing different samples apart) (Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, 2020).
- Further IT work showed SimCLR identifies a specific DGP where latents are uniform on a hypersphere and augmentations follow a simple isotropic distribution (von Mises-Fisher) (Contrastive Learning Inverts the Data Generating Process, 2021). This made the implicit assumptions explicit but also highlighted their limitations (e.g., real augmentations like cropping aren't isotropic). A toy version of this setup is sketched after this list.
- Subsequent work aimed for more realistic DGPs, but gaps remain (e.g., explaining dimensional collapse).
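To make the SimCLR example concrete, the following is a minimal sketch of the setup the identifiability results analyze: latents uniform on a hypersphere, positive views generated by a small perturbation of the latent (a renormalized-Gaussian stand-in for the von Mises-Fisher conditional), and the alignment/uniformity decomposition of the contrastive objective. The function names, network sizes, and the vMF approximation are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch (not the paper's code): latents on a hypersphere, vMF-like
# positive views, and the alignment/uniformity losses of Wang & Isola (2020).
import torch
import torch.nn.functional as F

def sample_latents(n, d):
    """Latents z uniform on the unit hypersphere S^{d-1}."""
    return F.normalize(torch.randn(n, d), dim=-1)

def perturb(z, sigma=0.1):
    """Positive view: Gaussian noise added to z, then renormalized (crude vMF surrogate)."""
    return F.normalize(z + sigma * torch.randn_like(z), dim=-1)

def alignment_loss(h1, h2, alpha=2):
    """Pull the two views of the same sample together."""
    return (h1 - h2).norm(dim=1).pow(alpha).mean()

def uniformity_loss(h, t=2.0):
    """Spread representations uniformly over the hypersphere."""
    sq_dists = torch.pdist(h, p=2).pow(2)
    return torch.logsumexp(-t * sq_dists, dim=0) - torch.log(torch.tensor(float(sq_dists.numel())))

# A hypothetical "renderer" f (the DGP decoder) and encoder g; identifiability
# asks whether optimizing g on view pairs recovers z up to an acceptable ambiguity.
d_latent, d_obs = 8, 32
f = torch.nn.Sequential(torch.nn.Linear(d_latent, d_obs), torch.nn.LeakyReLU(),
                        torch.nn.Linear(d_obs, d_obs))        # fixed DGP decoder
g = torch.nn.Sequential(torch.nn.Linear(d_obs, d_obs), torch.nn.LeakyReLU(),
                        torch.nn.Linear(d_obs, d_latent))     # learnable encoder

z = sample_latents(256, d_latent)
x, x_pos = f(z), f(perturb(z))
h, h_pos = F.normalize(g(x), dim=-1), F.normalize(g(x_pos), dim=-1)
loss = alignment_loss(h, h_pos) + uniformity_loss(torch.cat([h, h_pos]))
```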
The Case for Singular Identifiability Theory (SITh)
The authors argue current IT is too idealized and propose SITh to bridge the gap by incorporating empirical realities. Key areas where current IT falls short and SITh should focus include:
- Data Augmentations:
- Practice: Augmentations are critical, and their choice often matters more than the specific SSL algorithm.
- Theory Gap: Current DGPs in IT use overly simplistic augmentation models (like an isotropic vMF) that don't reflect the complex augmentations used in practice (e.g., large crops, color jitter); the sketch after this block contrasts the two.
- SITh Goal: Develop DGPs that realistically model the augmentations used in practice, potentially explaining why certain augmentations work better than others.
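The following minimal sketch contrasts the idealized augmentation model used in current identifiability proofs with an augmentation pipeline actually used in SSL practice. It reuses the latent-space perturbation from the previous sketch and assumes a recent torchvision for tensor-based transforms; the names and parameter values are illustrative.

```python
# Sketch of the mismatch between the augmentation model assumed in IT results
# and the augmentations used in practice.
import torch
import torch.nn.functional as F
from torchvision import transforms  # requires a recent torchvision for tensor inputs

# (a) Idealized view used in identifiability proofs: a small, isotropic
#     perturbation of the *latent* (vMF-like noise around z).
def idealized_view(z, sigma=0.1):
    return F.normalize(z + sigma * torch.randn_like(z), dim=-1)

# (b) Practical view: a large random crop plus color jitter applied to the
#     *image*; which latent factors survive (position, scale, sometimes the
#     object itself) changes in a highly anisotropic, content-dependent way.
practical_view = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

image = torch.rand(3, 256, 256)   # placeholder image tensor
x_aug = practical_view(image)     # pixel-space, anisotropic augmentation
```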
- Finite Data & Batch Size:
- Practice: Data set size (scaling laws) and batch size significantly impact performance. Data diversity (distinct environments or tasks) is also crucial.
- Theory Gap: Most IT results assume infinite data and infinite batch sizes; finite-sample analysis is rare. The distinction between data size and data diversity (in the ICA sense) is often blurred in practice. The sketch after this block shows how batch size enters the contrastive objective in practice.
- SITh Goal: Provide theoretical understanding of finite-sample effects, the role of batch size, and the specific type of data diversity needed for identifiability.
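As a concrete illustration of why batch size matters, here is a hedged sketch of the InfoNCE loss computed on a finite batch: the other batch items act as negatives, so the batch size directly controls how well the uniformity/entropy term is estimated. Names and shapes are assumptions, not the paper's setup.

```python
# Sketch: InfoNCE on a finite batch. The other (batch_size - 1) items serve
# as negatives, so batch size governs the quality of the finite-sample estimate.
import torch
import torch.nn.functional as F

def info_nce(h, h_pos, temperature=0.1):
    """h, h_pos: (batch_size, dim) L2-normalized views of the same samples."""
    logits = h @ h_pos.t() / temperature   # (B, B): diagonal entries are positives
    labels = torch.arange(h.size(0))       # positive index for each row
    return F.cross_entropy(logits, labels)

for batch_size in (32, 256, 4096):         # finite-sample effect of the batch size
    h = F.normalize(torch.randn(batch_size, 128), dim=-1)
    h_pos = F.normalize(h + 0.05 * torch.randn_like(h), dim=-1)
    print(batch_size, info_nce(h, h_pos).item())
```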
- Finite Time & Training Dynamics:
- Practice: Models train for a finite number of steps; convergence speed varies; loss saturation occurs; phenomena like grokking exist. Sometimes optimal performance is reached before full convergence or apparent loss saturation.
- Theory Gap: IT focuses on the converged optimum, ignoring training dynamics, convergence rates, and issues like loss saturation near the optimum.
- SITh Goal: Analyze training dynamics, understand factors influencing convergence speed (e.g., initialization, augmentations, negative sampling), and explain phenomena like "partial" identifiability, where some latents are learned faster or better than others.
- Architecture & Inductive Biases:
- Practice: Specific architectural choices (stop-gradients, predictor networks, specific initializations, optimizers) are common and seem necessary, especially for non-contrastive methods (VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning, 2021; Emerging Properties in Self-Supervised Vision Transformers, 2021); the stop-gradient/predictor pattern is sketched after this block.
- Theory Gap: IT often ignores these architectural details or assumes simplified models (e.g., linear layers after a non-linear encoder). The role of most inductive biases (beyond function class or loss) isn't well captured.
- SITh Goal: Determine the necessary and sufficient architectural components, initializations, and optimization properties for identifiability, potentially simplifying current complex pipelines (as hinted at by methods like DIET).
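Below is a minimal sketch of the stop-gradient-plus-predictor pattern (in the style of SimSiam) that current IT largely does not account for. The module sizes and view construction are placeholders chosen only to make the snippet self-contained.

```python
# Sketch of the stop-gradient + predictor pattern common in non-contrastive SSL.
# Current IT largely does not explain why these components prevent collapse.
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
predictor = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 32))

def simsiam_style_loss(x1, x2):
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    # Stop-gradient (detach) on the target branch is the key inductive bias.
    d1 = -F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
    d2 = -F.cosine_similarity(p2, z1.detach(), dim=-1).mean()
    return 0.5 * (d1 + d2)

x = torch.randn(256, 128)
x1, x2 = x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)  # two "views"
loss = simsiam_style_loss(x1, x2)
```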
- Dimensional Collapse & Projector:
- Practice: Often, the best representations for downstream tasks come from intermediate layers rather than the final output layer (the "projector phenomenon"); the final layers seem to collapse dimensions or lose information (A Simple Framework for Contrastive Learning of Visual Representations, 2020). Regularizers or tricks are used to mitigate this; a rank-based diagnostic is sketched after this block.
- Theory Gap: IT doesn't directly explain the projector phenomenon, although some log-linear models used in IT proofs might implicitly relate to a linear projector. Why collapse happens and how to prevent it isn't fully theorized.
- SITh Goal: Provide a theoretical explanation for dimensional collapse and the projector, potentially leading to principled ways to avoid it without heuristics. Determine if the final linear layer in some IT models is the projector.
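One common way to observe the collapse described above is to compare the singular-value spectrum of embeddings before and after the projector, summarized by an entropy-based effective rank (a RankMe-style statistic). The sketch below is a generic diagnostic, not the paper's procedure; `backbone` and `projector` are placeholder modules standing in for a trained SSL model.

```python
# Sketch: diagnosing dimensional collapse via the effective rank of the
# embedding spectrum (exp of the entropy of normalized singular values).
import torch

def effective_rank(h, eps=1e-7):
    """Low effective rank indicates dimensional collapse."""
    h = h - h.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(h)
    p = s / (s.sum() + eps)
    return torch.exp(-(p * torch.log(p + eps)).sum()).item()

# Placeholders for a trained SSL model; the projector has a narrow bottleneck
# so its output illustrates a collapsed spectrum.
backbone = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU())
projector = torch.nn.Sequential(torch.nn.Linear(256, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 256))

x = torch.randn(1024, 512)       # stand-in for a batch of inputs
pre = backbone(x)                # representation (pre-projector)
post = projector(pre)            # embedding fed to the SSL loss
print("effective rank pre-projector:", effective_rank(pre))
print("effective rank post-projector:", effective_rank(post))
```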
- Generalization (OOD & Compositionality):
- Practice: Achieving robust out-of-distribution (OOD) and compositional generalization is a major goal but remains challenging empirically.
- Theory Gap: Some recent IT work addresses compositional generalization by imposing structure (e.g., additivity) on the DGP; a toy additive DGP is sketched after this block. However, this doesn't cover all scenarios or explain why empirical methods often fail.
- SITh Goal: Expand theoretical results for OOD/compositional generalization, potentially across modalities (vision-language), and understand the role of inductive biases for different data types.
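To illustrate the kind of structural assumption mentioned above, here is a toy sketch of a DGP with an additive decoder over latent "slots". The slot decoders and dimensions are illustrative assumptions, not the specific construction used in the cited identifiability results.

```python
# Sketch: a compositional DGP with an additive decoder, the kind of structural
# assumption some IT results impose to obtain compositional generalization.
import torch

d_slot, d_obs = 4, 32
f1 = torch.nn.Sequential(torch.nn.Linear(d_slot, d_obs), torch.nn.Tanh())
f2 = torch.nn.Sequential(torch.nn.Linear(d_slot, d_obs), torch.nn.Tanh())

z1, z2 = torch.randn(8, d_slot), torch.randn(8, d_slot)   # two latent "slots"
x = f1(z1) + f2(z2)   # additivity: the observation decomposes across slots, so
                      # unseen (z1, z2) combinations stay within the decoder's structure
```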
- Unifying Contrastive & Non-Contrastive Methods:
- Practice: Methods are often categorized as contrastive (CL) or non-contrastive, yet their learned representations can be very similar (see Fig. 1).
- Theory Gap: IT guarantees exist mainly for CL methods; theoretical backing for non-contrastive methods is sparse (except DIET). The underlying principles (entropy maximization, invariance) appear to be shared (Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, 2020).
- SITh Goal: Develop a unified theoretical view, potentially proving identifiability for non-contrastive methods and explaining when and why different approaches lead to similar or different representations and convergence behaviors.
- Evaluation:
- Practice: Evaluation often relies heavily on downstream ImageNet classification accuracy, which may not reflect true "universality" or capture all relevant learned factors. Aggregate statistics (e.g., rank measures) show promise but lack theoretical grounding. Large datasets with ground-truth latents are scarce.
- Theory Gap: Current evaluation practice is largely disconnected from theory, even though IT, by specifying a DGP, inherently defines which latents should be learned and thus provides a basis for principled evaluation.
- SITh Goal: Leverage the DGP concept to define more comprehensive evaluation protocols beyond single downstream tasks. Guide the creation of better benchmark datasets (potentially synthetic with known latents, like DisLib) to directly assess identifiability; a minimal version of such a check is sketched below.
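Because a DGP specifies the ground-truth latents, identifiability can be assessed directly on synthetic data by predicting each latent from the learned representation. The sketch below uses a per-latent linear probe with R² scores, which is one common protocol rather than the paper's prescribed one; the representation is a synthetic stand-in and scikit-learn is assumed.

```python
# Sketch: DGP-based evaluation on synthetic data with known latents.
# High per-latent R^2 under a linear probe indicates the latent is recovered
# up to a linear transformation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, d_latent, d_repr = 5000, 6, 64

z = rng.uniform(-1, 1, size=(n, d_latent))   # ground-truth latents from the DGP
W = rng.normal(size=(d_latent, d_repr))
h = np.tanh(z @ W)                           # stand-in for learned representations g(x)

split = n // 2
for j in range(d_latent):
    probe = LinearRegression().fit(h[:split], z[:split, j])
    score = r2_score(z[split:, j], probe.predict(h[split:]))
    print(f"latent {j}: linear-probe R^2 = {score:.3f}")
```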
Conclusion and Position
The paper concludes that relying solely on empirical scaling or algorithmic tweaks is insufficient. Progress requires bridging the theory-practice gap. SITh is proposed as a research program to achieve this by building identifiability theories grounded in the realities of SSL data, architectures, training procedures, and evaluation needs. By focusing on realistic DGPs informed by empirical observations, SITh aims to provide principled guidance for designing, evaluating, and understanding more robust, interpretable, and generalizable SSL systems.