WSSL: Weighted Self-Supervised Learning

Updated 1 July 2026

Weighted Self-Supervised Learning is a framework that leverages explicit weighting at task, instance, spatial, or ensemble levels to enhance unsupervised representation learning.
It applies diverse weighting strategies to optimize pretext tasks, enabling robust performance improvements in computer vision, speech, and medical imaging.
Empirical results show that methods like BetaDataWeighter and LEWEL improve accuracy, reduce training time, and mitigate issues such as data imbalance.

Weighted Self-Supervised Learning (WSSL) refers to a collection of frameworks that improve self-supervised representation learning by integrating explicit weighting mechanisms at various levels: tasks, instances, spatial regions, or ensemble heads. By allocating learnable or data-driven weights in the unsupervised stage, WSSL seeks to optimize the extraction of transferable, diverse, and robust features without reliance on large-scale manual labels. This strategy appears across domains from computer vision and speech to medical imaging, enhancing model performance, data efficiency, and robustness to imbalance, heterogeneity, or background clutter.

1. Core Principles of Weighted Self-Supervised Learning

WSSL generalizes standard self-supervised learning by assigning explicit weights to loss components or data units during pretext training. The architectural and mathematical instantiations vary:

Task-level weighting: Each auxiliary or pretext task receives a scalar weight, controlling its influence on the shared encoder (e.g., rotation, colorization, or contrastive prediction losses summed with coefficients) (Gupta et al., 2022, Fuadi et al., 2023, Zaiem et al., 2021).
Instance/sample weighting: Each data example receives a learnable or data-driven weight, e.g., via Bayesian priors, per-example kernel density, or downstream task gradients (Ericsson et al., 2020, Liu et al., 2021).
Spatial/component weighting: Spatial masks or alignment maps are learned adaptively so that feature aggregation focuses on object-relevant regions, mitigating misalignment across augmentations (Huang et al., 2022).
Ensemble/branch weighting: Multiple heads/branches are weighted per data point, promoting diversity and stability in learned representations (Ruan et al., 2022, Fuadi et al., 2023).
Uncertainty weighting: Each task’s loss is scaled by a learnable variance parameter, with the network learning the trade-off in multi-objective or multi-modal settings (Huang et al., 2023).

The generic weighted SSL objective can be formulated as

$\mathcal{L}_\text{WSSL} = \sum_{i=1}^{N} w_i \mathcal{L}_i,$

where each $\mathcal{L}_i$ is a self-supervised loss (possibly at task, instance, or spatial level) and $w_i$ is its corresponding weight.
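As a minimal sketch of the generic objective above, the weighted sum can be written in a few lines of plain Python. The function name `weighted_ssl_loss` and the numeric values are hypothetical and chosen only for illustration; in practice the losses would come from a training framework and the weights might be learned.

```python
def weighted_ssl_loss(losses, weights):
    """Return sum_i w_i * L_i for per-component losses and their weights."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical values: three pretext losses and hand-picked weights.
losses = [0.9, 1.4, 2.0]
weights = [0.5, 0.3, 0.2]
total = weighted_ssl_loss(losses, weights)
```

The same function covers task-level, instance-level, or spatial weighting; only the meaning of the index $i$ changes.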

2. Task-Weighted and Multi-Task WSSL

Multi-task WSSL combines several self-supervised pretext tasks with learnable or manually set weights:

In image inpainting, the encoder is pre-trained by a weighted sum of rotation, saturation, and sharpness classification losses, with task weights selected via grid search for optimal downstream PSNR/SSIM (Gupta et al., 2022).
In speech/audio, task weights are calibrated via optimization over the conditional Hilbert-Schmidt Independence Criterion (HSIC) using a small labeled downstream set. This produces optimal $\lambda = (\lambda_1,\ldots,\lambda_k)$ for task group selection, enforced by softmax or sparsemax (Zaiem et al., 2021).

These designs allow selective feature emphasis, encourage sparsity in auxiliary task selection, and adapt the learned representation to the downstream task.
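The difference between softmax and sparsemax normalization of task weights can be illustrated with a short sketch. The raw task scores below are hypothetical; the HSIC-based optimization that would produce them is not reproduced here. Sparsemax is implemented as the standard Euclidean projection onto the probability simplex, which can assign exactly zero weight to some tasks.

```python
import math


def softmax(scores):
    """Dense normalization: every task receives a strictly positive weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sparsemax(scores):
    """Euclidean projection onto the simplex; low-scoring tasks can get weight 0."""
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # Keep the threshold from the largest k that satisfies the support condition.
        if 1 + k * z > cumsum:
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical raw scores for three pretext tasks.
scores = [1.0, 0.8, -1.0]
dense_weights = softmax(scores)
sparse_weights = sparsemax(scores)  # third task is dropped (weight 0.0)
```

With these scores, softmax keeps all three tasks active, while sparsemax removes the lowest-scoring task entirely, which is the sparsity effect described above.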

3. Instance-Weighted and Sample-Adaptive WSSL

Instance weighting schemes assign per-example importance to unsupervised data:

BetaDataWeighter addresses domain shift by learning Bayesian per-instance weights $w_i$ (Beta priors), optimized so that weighted self-supervised representations maximize downstream classification accuracy. Once optimal, low-weighted instances can be pruned, reducing compute without accuracy loss (Ericsson et al., 2020).
Reweighted Sharpness-Aware Minimization (rwSAM) increases the regularization for rare instances in imbalanced data. Weights are assigned by inverse kernel density estimation on features; rarer examples receive higher weights, which are then used inside sharpness-aware minimization (Liu et al., 2021).

These strategies improve robustness to data imbalance, prune redundant data, and prioritize instances maximizing target performance.
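The inverse-density idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Liu et al. (2021): the Gaussian kernel, the bandwidth, the mean-one normalization, and the function name `inverse_density_weights` are all choices made here, and the sharpness-aware minimization step that would consume the weights is omitted.

```python
import math


def inverse_density_weights(features, bandwidth=1.0):
    """Weight each example by the inverse of a Gaussian kernel density estimate.

    The kernel's normalizing constant is dropped because the weights are
    rescaled to have mean 1 afterwards.
    """
    n = len(features)
    densities = []
    for x in features:
        total = 0.0
        for y in features:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2 * bandwidth ** 2))
        densities.append(total / n)  # always > 0: the self term contributes 1/n
    raw = [1.0 / d for d in densities]
    mean_raw = sum(raw) / n
    return [r / mean_raw for r in raw]


# Hypothetical 2-D features: three clustered examples and one isolated (rare) example.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
weights = inverse_density_weights(features, bandwidth=0.5)
```

The isolated example lies in a low-density region and therefore receives the largest weight, matching the intuition that rarer examples should be emphasized.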

4. Spatial and Component Weighting: Adaptive Aggregation

LEWEL (“Learning Where to Learn”) implements spatially weighted aggregation in self-supervised contrastive learning. The projection head is reinterpreted as a per-pixel map outputting alignment maps via softmax, which are then used to aggregate backbone features:

$\mathbf{W}'_k = \mathrm{softmax}(\overline{\mathbf{W}}'_k), \qquad y'_k = [\mathbf{W}'_{(k-1)h+1} \otimes \mathbf{F}'^{(1)}_{D/h},\ \ldots],$

yielding a spectrum of weighted embeddings. This spatial reweighting resolves object-background misalignment and augments both global and local feature learning, improving linear transfer, semi-supervised, and dense prediction benchmarks (Huang et al., 2022).
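The core operation, a softmax-normalized alignment map used to pool backbone features, can be sketched in plain Python. This is a simplified illustration with a single alignment map over a flattened feature map; it does not reproduce LEWEL's grouping of channels across multiple maps, and the names `spatial_softmax` and `weighted_aggregate` are hypothetical.

```python
import math


def spatial_softmax(logits):
    """Normalize per-position logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_aggregate(features, logits):
    """Pool features over positions: sum_p softmax(logits)[p] * features[p].

    features: list over spatial positions, each a list of channel values.
    logits:   one alignment logit per spatial position.
    """
    w = spatial_softmax(logits)
    channels = len(features[0])
    return [
        sum(w[p] * features[p][c] for p in range(len(features)))
        for c in range(channels)
    ]


# Hypothetical 2x2 feature map flattened to 4 positions with 2 channels.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
focused = weighted_aggregate(features, [5.0, 0.0, 0.0, 0.0])  # attends to position 0
uniform = weighted_aggregate(features, [0.0, 0.0, 0.0, 0.0])  # global average pooling
```

With uniform logits the aggregation reduces to global average pooling; a peaked alignment map instead concentrates the embedding on the selected region.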

5. Weighted Ensemble and Branch Architectures

WSSL can also refer to ensembling multiple teacher-student heads and weighting their predictions per sample:

Weighted Ensemble SSL attaches $m$ projection heads/codebooks to a single encoder; per-example weights are assigned via entropic or confidence-based softmax. Loss is formulated as

$L(x) = \sum_{i,j} \sum_y w_{ijy}(x)\,t_i(y|x)\,[-\log s_j(y|x)],$

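where $t_i$ and $s_j$ denote the teacher and student head distributions and $w_{ijy}(x)$ is the per-example weight; the weighting strategies considered include “Unif,” “Prob,” and “Ent” (Ruan et al., 2022).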

Empirically, entropy-weighted ensembles maximize head diversity and provide substantial gains in few-shot learning and linear probe performance, with minimal evaluation overhead, as only the primary encoder is retained.
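A simplified sketch of the ensemble loss is given below. It makes assumptions that are not taken from Ruan et al. (2022): the weights are restricted to matching teacher-student pairs ($i = j$) and do not depend on $y$, "unif" means equal weights, and "ent" is implemented as a softmax over negative teacher entropies with a hypothetical `temperature` parameter. The exact definitions in the paper may differ.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def cross_entropy(t, s):
    """sum_y t(y) * [-log s(y)]; requires s(y) > 0 wherever t(y) > 0."""
    return -sum(ti * math.log(si) for ti, si in zip(t, s) if ti > 0)


def head_weights(teachers, mode="unif", temperature=1.0):
    """Per-example weights over heads (assumed forms, see lead-in)."""
    m = len(teachers)
    if mode == "unif":
        return [1.0 / m] * m
    if mode == "ent":
        # Lower teacher entropy (more confident head) -> larger weight.
        scores = [-entropy(t) / temperature for t in teachers]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]
    raise ValueError(f"unknown mode: {mode}")


def ensemble_loss(teachers, students, mode="unif", temperature=1.0):
    """Weighted sum of per-head teacher-student cross-entropies for one example."""
    w = head_weights(teachers, mode, temperature)
    return sum(wi * cross_entropy(t, s) for wi, t, s in zip(w, teachers, students))


# Hypothetical predictions of two heads over three codes for a single example.
teachers = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
students = [[0.8, 0.1, 0.1], [0.5, 0.25, 0.25]]
ent_w = head_weights(teachers, mode="ent")
loss_unif = ensemble_loss(teachers, students, mode="unif")
loss_ent = ensemble_loss(teachers, students, mode="ent")
```

In this toy example the first head has the lower-entropy teacher and so receives the larger weight under the entropy-based scheme.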

6. Gated and Uncertainty-Weighted WSSL in Multi-Task Settings

Gated WSSL employs a lightweight softmax gating network to dynamically weight several SSL heads for local augmentations (e.g., rotation, flip, channel shuffle), allowing the model to focus on transformations most beneficial for classification (Fuadi et al., 2023).

In contrast, when combining several regression or mapping tasks (as in medical imaging), uncertainty-weighted multi-tasking introduces one learnable noise parameter $\sigma_k$ per task:

$L_\text{WSSL} = \frac{1}{2\sigma_1^2} L_{1\rho} + \frac{1}{2\sigma_2^2} L_2 + \log\sigma_1 + \log\sigma_2,$

so that each loss weight $1/(2\sigma_k^2)$ scales with the learned task noise and needs no manual tuning, while the $\log\sigma_k$ terms penalize arbitrarily large variances (Huang et al., 2023).
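The uncertainty-weighted objective can be sketched directly from the formula. Parameterizing by $\log\sigma_k$ is an assumption made here for numerical convenience, and the function name is hypothetical. For fixed task losses, setting the derivative with respect to $\sigma_k$ to zero gives $\sigma_k^2 = L_k$, which the example checks numerically.

```python
import math


def uncertainty_weighted_loss(loss_1, loss_2, log_sigma_1, log_sigma_2):
    """L_1 / (2 sigma_1^2) + L_2 / (2 sigma_2^2) + log sigma_1 + log sigma_2."""
    precision_1 = math.exp(-2.0 * log_sigma_1)  # 1 / sigma_1^2
    precision_2 = math.exp(-2.0 * log_sigma_2)  # 1 / sigma_2^2
    return (
        0.5 * precision_1 * loss_1
        + 0.5 * precision_2 * loss_2
        + log_sigma_1
        + log_sigma_2
    )


# Hypothetical task losses.
loss_1, loss_2 = 4.0, 0.25

# Equal weighting (sigma_1 = sigma_2 = 1).
equal = uncertainty_weighted_loss(loss_1, loss_2, 0.0, 0.0)

# Analytic optimum for fixed losses: sigma_k^2 = L_k, i.e. log sigma_k = 0.5 * log L_k.
best = uncertainty_weighted_loss(
    loss_1, loss_2, 0.5 * math.log(loss_1), 0.5 * math.log(loss_2)
)
```

At the optimum, the task with the larger loss is down-weighted and the task with the smaller loss is up-weighted, which is how the scheme balances tasks of different scale.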

7. Empirical Results, Impact, and Directions

Empirical studies demonstrate consistent gains across modalities and tasks:

On ImageNet-1K, LEWEL improves MoCo v2 by up to +1.6% and BYOL by +1.3% linear accuracy and further boosts semi-supervised and dense prediction transfer (Huang et al., 2022).
Weighted pretext selection notably increases ASR and instrument recognition metrics in speech SSL (Zaiem et al., 2021).
BetaDataWeighter prunes up to 78% of examples in STL-10, maintains or improves accuracy (+2.5%), and halves training time (Ericsson et al., 2020).
Weighted ensemble schemes boost few-shot top-1 accuracy on ImageNet-1K by +3.4 to +5.5 percentage points over base DINO/MSN (Ruan et al., 2022).
Instance weighting and rwSAM close or eliminate the generalization gap due to pre-training imbalance in visual SSL (Liu et al., 2021).

A prominent limitation is weight selection, which is often grid-searched or calibrated a priori; fully end-to-end or theoretically optimal weighting remains an open question (Gupta et al., 2022). The development of more robust, automated, and domain-agnostic weighting strategies remains a significant research direction.

Key References:

"Learning Where to Learn in Cross-View Self-Supervised Learning" (Huang et al., 2022)
"Gated Self-supervised Learning For Improving Supervised Learning" (Fuadi et al., 2023)
"WSSL: Weighted Self-supervised Learning Framework For Image-inpainting" (Gupta et al., 2022)
"Self-supervised Learning is More Robust to Dataset Imbalance" (Liu et al., 2021)
"Weighted Ensemble Self-Supervised Learning" (Ruan et al., 2022)
"Pretext Tasks selection for multitask self-supervised speech representation learning" (Zaiem et al., 2021)
"Don't Wait, Just Weight: Improving Unsupervised Representations by Learning Goal-Driven Instance Weights" (Ericsson et al., 2020)
"Uncertainty-weighted Multi-tasking for $T_{1\rho}$ and $T_2$ Mapping in the Liver with Self-supervised Learning" (Huang et al., 2023)