Self-Supervised Learning in Function Space

Updated 6 October 2025
  • Self-supervised learning in function space is characterized by the direct regularization of neural functions using L2 penalties to stabilize learning and reduce catastrophic forgetting.
  • The approach leverages tools from functional analysis, kernel theory, and explicit function encoders to drastically improve sample efficiency and robustness.
  • This paradigm unifies contrastive and invariance-based methods with rigorous functional constraints, offering enhanced generalization guarantees and scalable learning strategies.

Self-supervised learning in function space encompasses both the theoretical foundations and practical strategies for learning representations or operators without labels, focusing on the properties and dynamics of neural networks as functions mapping inputs to outputs. This approach draws on tools from functional analysis, kernel theory, regularization, and geometric representation, unifying and extending self-supervised learning to emphasize invariance, stability, and sample efficiency at the level of functions rather than parameters.

1. Foundations: Measuring and Regularizing in Function Space

The act of training neural networks is fundamentally about shaping functions $f : \mathcal{X} \to \mathcal{Y}$. In function-space approaches, central attention is given to distances and dynamics directly in a space of functions, usually the $L^2$ Hilbert space with squared distance

$$\|f - g\|^2 = \int_{\mathcal{X}} |f(x) - g(x)|^2 \, d\mu(x),$$

which is empirically estimated as

$$\|f - g\|^2 \approx \frac{1}{N} \sum_{i=1}^N |f(x_i) - g(x_i)|^2 .$$

Parameter updates in weight space often correspond to nontrivial and poorly controlled functional changes, especially early in training. Empirical results demonstrate that the $L^2/\ell^2$ ratio (i.e., the magnitude of function change per unit parameter change) is high at first and then decreases and plateaus as the test error stabilizes (Benjamin et al., 2018). This observation motivates regulating the "distance traveled" in function space during training to stabilize learning and inhibit catastrophic forgetting.
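
The empirical estimate above is straightforward to compute in practice. The following PyTorch sketch (a minimal illustration, not the protocol of Benjamin et al.; the model, optimizer, and batches are assumed placeholders) estimates the function-space distance between a network before and after an update, together with the $L^2/\ell^2$ ratio discussed above.

```python
import copy
import torch

def function_space_distance(model_a, model_b, xs):
    """Monte Carlo estimate of ||f_a - f_b||^2 over a sample of inputs xs."""
    with torch.no_grad():
        diff = model_a(xs) - model_b(xs)
    return diff.pow(2).sum(dim=-1).mean()  # average squared output difference

def l2_over_ell2_ratio(model_before, model_after, xs):
    """Function change per unit parameter change (the L^2 / ell^2 ratio)."""
    func_dist = function_space_distance(model_before, model_after, xs).sqrt()
    param_dist = torch.sqrt(sum(
        (p_a - p_b).pow(2).sum()
        for p_a, p_b in zip(model_before.parameters(), model_after.parameters())
    ))
    return func_dist / param_dist

# Usage sketch: snapshot the model, take one optimizer step, then compare.
# model, optimizer, loss_fn, and the batch (x, y) are assumed to exist.
# snapshot = copy.deepcopy(model)
# loss_fn(model(x), y).backward(); optimizer.step()
# print(l2_over_ell2_ratio(snapshot, model, x))
```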

The Hilbert-Constrained Gradient Descent (HCGD) rule augments the standard loss $C_0$ with an $L^2$ penalty on the update, $C = C_0 + \lambda \, \| f_{\theta_t} - f_{\theta_t + \Delta\theta} \|$, encouraging updates that minimally interfere with already acquired invariances or representations.
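
In practice this penalty can be approximated by keeping a frozen copy of the network from the previous step and penalizing how far the current outputs drift from it. The sketch below is a simplified, hedged illustration of that idea (it is not the exact HCGD procedure from the paper; model, optimizer, loader, and loss_fn are assumed placeholders).

```python
import copy
import torch

def hcgd_style_loss(model, prev_model, x, y, task_loss_fn, lam=0.1):
    """Task loss plus an L^2 penalty on the change in outputs relative to the
    function realized before the current update (a simplification of HCGD)."""
    task_loss = task_loss_fn(model(x), y)
    with torch.no_grad():
        prev_outputs = prev_model(x)          # f_{theta_t}, held fixed
    func_change = (model(x) - prev_outputs).pow(2).sum(dim=-1).mean().sqrt()
    return task_loss + lam * func_change

# Training-loop sketch (model, optimizer, loader, loss_fn assumed to exist):
# prev_model = copy.deepcopy(model)
# for x, y in loader:
#     optimizer.zero_grad()
#     hcgd_style_loss(model, prev_model, x, y, loss_fn).backward()
#     optimizer.step()
#     prev_model = copy.deepcopy(model)
```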

2. Functional Regularization, Representation Pruning, and Sample Efficiency

Many self-supervised learning (SSL) frameworks can be unified as instances of functional regularization (Garg et al., 2020). Here, the auxiliary (self-supervised) loss is posed not only as an unsupervised target but as an explicit constraint on the function $h(\cdot)$ realized by the encoder or intermediate network components. For example, in auto-encoders, masked modeling, or other pretext tasks, the auxiliary loss takes the form $L_r(h, g; x)$, with $g$ acting as a learnable regularizer.

This leads to constrained learning problems of the form

$$\min_{f \in \mathcal{F},\, h \in \mathcal{H}} L_c(f, h; S) \quad \text{s.t.} \quad L_r(h; U) \leq \tau,$$

where $L_c$ is the supervised (prediction) loss and $L_r$ prunes the hypothesis class for $h$ to those functions compatible with the data's structure. The net effect is a dramatic reduction in the effective complexity of the function space, formalized as a reduction in covering numbers or hypothesis class cardinality, $|\mathcal{H}_{\mathcal{D}_X, L_r}(\tau)| \ll |\mathcal{H}|$. This shrinkage reduces labeled sample complexity. Explicit estimates and experiments confirm that, in synthetic and real-data settings, SSL-based functional regularization can reduce the number of labeled examples needed to reach the same error by an order of magnitude (Garg et al., 2020).
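
A common way to handle the constraint is a penalized (Lagrangian-relaxed) objective that combines the supervised loss on a small labeled set with a self-supervised regularizer on a large unlabeled set. The sketch below is one illustrative instantiation with a reconstruction pretext task; the architecture, dimensions, and weighting are assumptions, not the setup used by Garg et al. (2020).

```python
import torch
import torch.nn as nn

# Penalized version of the constrained problem: minimize L_c(f, h; S) + beta * L_r(h; U).
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))   # h
predictor = nn.Linear(64, 10)                                                  # f
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))    # g (learnable regularizer)

def combined_loss(x_labeled, y_labeled, x_unlabeled, beta=1.0):
    # Supervised loss L_c on the small labeled set S.
    sup = nn.functional.cross_entropy(predictor(encoder(x_labeled)), y_labeled)
    # Self-supervised loss L_r (here a reconstruction pretext task) on the
    # large unlabeled set U; it prunes the hypothesis class for the encoder h.
    z = encoder(x_unlabeled)
    rec = nn.functional.mse_loss(decoder(z), x_unlabeled)
    return sup + beta * rec
```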

Visualization and clustering of learned functions also demonstrate that this regularization yields function approximations that occupy a more compact, structured subset of function space rather than being scattered.

3. Self-Supervision, Invariance, and Functional Architectures

Self-supervision exploits invariances (to transformations, augmentations, or masked input), which can be formally encoded as properties of mappings ff. Modern frameworks like contrastive learning (SimCLR, MoCo, CLAMP (Zhang et al., 16 Jun 2025)) or invariance-based regularization (Kernel VICReg (Sepanj et al., 8 Sep 2025), Joint Embedding in Kernel Regime (Kiani et al., 2022)) do so by enforcing proximity of representations of augmented instances and feature decorrelation or variance spread.

These methods can be classified by the space in which they operate:

  • Euclidean space: Standard contrastive and non-contrastive SSL penalize vector differences or covariances.
  • Reproducing Kernel Hilbert Space (RKHS): Kernel VICReg lifts the loss, operating on kernel (rather than raw vector) covariances and inner products, utilizing terms like double-centered kernel matrices and Hilbert–Schmidt norms to induce nonlinearity and better capture the manifold structure inherent in the data (Sepanj et al., 8 Sep 2025).
  • Explicit Function Encoders: Neural models that learn basis functions $\{\psi_i\}$ construct explicit finite-dimensional spaces (without kernels) and allow task-agnostic but data-structure-adaptive representations; a minimal sketch follows this list. These are especially efficient at inference and offer kernel-theoretic guarantees (Low et al., 24 Sep 2025).
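
The explicit function-encoder view admits a compact sketch: a neural network outputs a fixed set of basis functions, and any observed function is summarized by the coefficients of a least-squares fit in that basis. The code below is an illustrative, simplified version of this idea (shapes, the ridge term, and the solver are assumptions, not the published method of Low et al.).

```python
import torch
import torch.nn as nn

class BasisNetwork(nn.Module):
    """A small network whose k outputs serve as learned basis functions psi_1..psi_k."""
    def __init__(self, in_dim=1, k=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, k))

    def forward(self, x):            # x: (N, in_dim) -> (N, k) basis evaluations
        return self.net(x)

def encode_function(basis, xs, ys):
    """Coefficients c minimizing ||Psi(xs) c - ys||^2 (ridge-regularized)."""
    Psi = basis(xs)                                   # (N, k)
    A = Psi.T @ Psi + 1e-4 * torch.eye(Psi.shape[1])  # (k, k)
    return torch.linalg.solve(A, Psi.T @ ys)          # (k,) or (k, out_dim)

def decode_function(basis, coeffs, xs):
    """Evaluate the encoded function at new inputs."""
    return basis(xs) @ coeffs
```

Note that inference cost is governed by the number of basis functions k, not by the size of the dataset used to learn the basis, which is the scalability property emphasized above.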

In functional transduction frameworks (Chalvidal et al., 2023), architectures are meta-trained to map few-shot examples directly to new function approximations using vector-valued RKBS machinery. Similarly, control-theoretic function learning via controlled difference equations in RKHS provides a principled optimization of mappings under general costs, including purely self-supervised or pretext losses (Colusso et al., 22 Apr 2024).

4. Dynamics, Learning Stability, and Catastrophic Forgetting

Analysis of the $L^2/\ell^2$ ratio (Benjamin et al., 2018) and explicit function-space regularization expose why function-based learning stabilizes representations. Especially in sequential or multitask learning, constraining the change in function space across tasks or updates ensures previous invariances are preserved.

In continual or open-world learning, function-space regularization and dual-parameterization (Scannell et al., 16 Mar 2024) permit compact summarization of previously learned representations (via inducing points, dual gradients), facilitating stable incorporation of new data without retraining and mitigating interference. This approach yields improved robustness to catastrophic forgetting and allows efficient adaptation in streaming and incremental scenarios (as evidenced in split-MNIST, image classification, and RL settings) (Scannell et al., 16 Mar 2024, Dhamija et al., 2021).

Moreover, methods that optimize information measures in function space, such as maximizing the log-determinant of the feature covariance (Ozsoy et al., 2022), avoid total and dimensional collapse and result in well-spread, expressive representations suitable for downstream adaptation and transfer.
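As a concrete illustration of the log-determinant idea, the hedged sketch below penalizes the negative log-determinant of the batch feature covariance, so that minimizing it spreads the representation and discourages total or dimensional collapse. It is a simplified stand-in in the spirit of such objectives, not the exact loss of Ozsoy et al. (2022).

```python
import torch

def logdet_spread_regularizer(z, eps=1e-3):
    """Encourage well-spread features by maximizing log det of the (regularized)
    covariance of a batch of embeddings z with shape (batch, dim). Returning the
    negative value lets it be minimized alongside other losses."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    cov = cov + eps * torch.eye(cov.shape[0], device=z.device)  # keep cov well-conditioned
    return -torch.logdet(cov)   # minimize this to maximize spread
```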

5. Kernel and Non-Kernel Hilbert Space Methods

Kernelization of self-supervised objectives (as in Kernel VICReg (Sepanj et al., 8 Sep 2025) and Joint Embedding in Kernel Regime (Kiani et al., 2022)) replaces Euclidean operations with their RKHS analogs—double-centered kernels, Hilbert–Schmidt norms, operator norms, and kernelized variance and covariance penalties. A central benefit is the ability to model highly nonlinear relationships and align learned representations with the true data geometry. Downstream performance, cluster isometry, and class separation consistently improve for complex and small-scale datasets, as confirmed by empirical and visualization studies.
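To make the kernelized view concrete, the sketch below computes double-centered kernel matrices and kernel-space analogs of invariance, variance, and covariance terms for two augmented views z1 and z2. It is a loose, hedged illustration of the kinds of quantities involved (the RBF kernel, thresholds, and weightings are assumptions), not the published Kernel VICReg loss.

```python
import torch

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of a and b."""
    return torch.exp(-gamma * torch.cdist(a, b).pow(2))

def double_center(K):
    """H K H with H = I - (1/n) 11^T, as used for kernelized covariance terms."""
    n = K.shape[0]
    H = torch.eye(n, device=K.device) - torch.ones(n, n, device=K.device) / n
    return H @ K @ H

def kernelized_ssl_terms(z1, z2, gamma=1.0):
    """Loose sketch of kernel-space invariance / variance / covariance terms for
    two augmented views z1, z2 of shape (batch, dim)."""
    K11, K22, K12 = (rbf_kernel(a, b, gamma) for a, b in [(z1, z1), (z2, z2), (z1, z2)])
    # Invariance: squared RKHS distance between paired embeddings of the two views.
    invariance = (K11.diag() + K22.diag() - 2 * K12.diag()).mean()
    # Variance / covariance analogs read off the double-centered kernel matrices.
    K11c, K22c = double_center(K11), double_center(K22)
    variance = torch.relu(1.0 - K11c.diag()).mean() + torch.relu(1.0 - K22c.diag()).mean()
    off_diag = K11c - torch.diag(torch.diag(K11c))
    covariance = off_diag.pow(2).sum() / (K11c.shape[0] ** 2)   # Frobenius-type penalty
    return invariance, variance, covariance
```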

Non-kernel Hilbert space approaches using neural function encoders (Low et al., 24 Sep 2025) offer explicit, data-adaptive basis sets. Progressive and train-then-prune strategies, coupled with PCA on coefficient covariances, reveal and match the intrinsic dimension of the function space, providing efficient and generalizable representations with Rademacher and PAC-Bayes generalization guarantees. Their inference-time behavior is governed by the number of basis functions, not dataset size, making them highly scalable.

6. Algorithmic and Theoretical Implications: Generalization, Regularization, and Learning Rates

Function-space approaches permit sharper analyses and algorithmic control:

  • Sample Complexity Reduction: PAC and covering-number bounds confirm that pruning function space drastically decreases the number of labeled samples needed (Garg et al., 2020).
  • Explicit Function-Space Learning Rate Measurement: Recent advances introduce efficient algorithms to measure and control the actual change in output function induced by weight updates, yielding architecture-agnostic, variance-reduced estimators of function-space learning rates (Milsom et al., 24 Feb 2025). Function-space Learning Rate Matching (FLeRM) transfers hyperparameters robustly across model scales; a naive estimator sketch follows this list.
  • Empirical Bayes Function-Space Regularization: Probabilistically-motivated regularizers such as FSEB (Rudner et al., 2023) define empirical priors over desired function outputs (e.g., via Mahalanobis distance in feature space). This enables encoding explicit functional preferences (such as smoothness, uncertainty, or task adaptation) directly and improves robustness under distribution shift, transfer, and selective prediction.
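
The following is a naive way to measure how much a single weight update moves the network in function space, expressed relative to the current output scale. It is only a rough, hedged illustration of the quantity being controlled (the estimators in the work cited above are more careful and variance-reduced, and all names here are placeholders).

```python
import copy
import torch

def function_space_lr(model, optimizer, loss_fn, probe_x, train_x, train_y):
    """Estimate the function-space 'learning rate' of one optimizer step: the
    root-mean-square change in the network's outputs on a probe batch, divided
    by the RMS output magnitude before the step."""
    before = copy.deepcopy(model)            # snapshot f_{theta_t}
    optimizer.zero_grad()
    loss_fn(model(train_x), train_y).backward()
    optimizer.step()                         # model is now f_{theta_t + delta}
    with torch.no_grad():
        f_before, f_after = before(probe_x), model(probe_x)
        delta = (f_after - f_before).pow(2).mean().sqrt()
        scale = f_before.pow(2).mean().sqrt()
    return (delta / scale).item()
```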

7. Extensions: Reinforcement Learning, Control, and Manifold Structure

Function-space ideas extend to self-supervised reinforcement learning and control, as illustrated by neural Lyapunov function approximation with function-space reward augmentation (McCutcheon et al., 19 Mar 2025), and control-theoretic systems with function evolution governed by learnable controls (Colusso et al., 22 Apr 2024).

Function space methods also connect to geometric and physical principles. The CLAMP framework (Zhang et al., 16 Jun 2025) interprets contrastive SSL as a manifold packing problem, leveraging repulsive energy losses inspired by condensed matter physics to organize augmentation sub-manifolds in embedding space—paralleling manifold organization in biological neural systems. Parameters such as ellipsoid scale in the loss provide geometric control, and the emergent separation of neural manifolds matches theoretical expectations for functional classification.
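
As a rough illustration of the packing intuition (not the CLAMP loss itself), a repulsive pairwise energy between sub-manifold centroids might look like the hedged sketch below, where overlapping pairs incur a quadratic penalty and a scale parameter loosely plays the role of the geometric control described above.

```python
import torch

def repulsive_packing_energy(centroids, radius=1.0):
    """Toy pairwise repulsive energy between centroids of augmentation
    sub-manifolds in embedding space: pairs closer than `radius` are pushed
    apart, mimicking a packing-style interaction."""
    d = torch.cdist(centroids, centroids)                     # pairwise distances
    n = centroids.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool, device=centroids.device)
    overlap = torch.relu(1.0 - d / radius)                    # nonzero only when overlapping
    return overlap[mask].pow(2).sum() / (n * (n - 1))
```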


In summary, self-supervised learning in function space synthesizes direct measurement and regularization of mappings, sample complexity reductions via functional constraints, kernel and basis-function methodologies, and insights from physics, probability, and geometry. Theoretical guarantees, empirical improvements in robustness, and practical strategies for continual, incremental, and large-scale learning are now closely tied through explicit function-space representations and analyses. These developments enable a principled and versatile toolkit for representation learning, robust transfer, and efficient inference, with broad implications for both foundational research and applied machine learning.
