Uncertainty-Aware Pseudo-Label Selection

Updated 10 March 2026

The paper introduces a decision-theoretic UPS framework that leverages Bayesian and ensemble estimators for robust pseudo-label selection.
It quantifies epistemic and aleatoric uncertainty to mitigate confirmation bias and improve generalization across diverse applications.
UPS methods have demonstrated empirical accuracy gains in tasks like image recognition and medical segmentation by filtering noisy labels.

Uncertainty-Aware Pseudo-Label Selection (UPS) is a class of methods in semi-supervised learning (SSL) that augments or replaces naive pseudo-labeling strategies by explicitly quantifying predictive uncertainty and using it to decide which unlabeled samples are admitted as pseudo-labeled data during self-training. This paradigm aims to mitigate confirmation bias, improve generalization, and increase robustness, especially in regimes prone to overfitting, class imbalance, model misspecification, and distribution shift. Methods in this category incorporate epistemic and aleatoric uncertainty via Bayesian, ensemble, conformal, or neighborhood-based estimators and employ principled, often decision-theoretic, selection rules. Empirical gains have been reported across a wide range of SSL applications, including image recognition, natural language understanding, graph-based node classification, medical image segmentation, and domain adaptation.

1. Decision-Theoretic Foundations of UPS

UPS treats pseudo-label selection as a statistical decision problem, in contrast to classical heuristics that rely on confidence or entropy alone. Formally, the hypothesis space $\Theta$ contains the model parameters, and the action space $\mathcal{A}$ is the set of candidate pseudo-labeling actions $a_i$ (adding $(x_i, \hat y_i)$ to the training set). The utility is chosen as the joint log-likelihood of the new and existing labeled data under parameter $\theta$, together with the log-prior, i.e.,

$U(\theta, a_i) = \log p(\mathcal{D} \cup \{(x_i, \hat y_i)\} | \theta) + \log p(\theta)$

A Bayes-optimal action maximizes the posterior expected utility:

$a^* = \arg\max_{a_i \in \mathcal{A}} \int U(\theta, a_i)\, \pi(\theta \mid \mathcal{D})\, d\theta$

This maximization is equivalent to selecting the candidate with the highest posterior predictive score, for which a Laplace approximation yields the criterion

$S(x^*, \hat y^*) = \log p(\hat y^*|x^*, \hat\theta) - \frac{1}{2} \log |\mathcal{I}(\hat\theta)|$

where $\mathcal{I}(\hat\theta)$ is the observed Fisher information matrix, and $\hat\theta$ is the MAP estimate (Rodemann, 2023).
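
As a minimal sketch of how this criterion could be evaluated, the example below uses a two-parameter logistic regression with a zero-mean Gaussian prior. The model, the Newton solver, the `prior_precision` value, and all function names are illustrative assumptions, not details taken from Rodemann (2023). The sketch further assumes that $\hat\theta$ is refit on the data augmented with the candidate and that the curvature of the negative log-posterior at $\hat\theta$ plays the role of $\mathcal{I}(\hat\theta)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_hessian(X, theta, prior_precision):
    # Curvature of the negative log-posterior; assumed stand-in for I(theta_hat).
    p = sigmoid(X @ theta)
    return (X.T * (p * (1.0 - p))) @ X + prior_precision * np.eye(X.shape[1])

def fit_map(X, y, prior_precision=1.0, n_iter=25):
    # Newton's method for the MAP estimate of logistic regression.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ theta)) - prior_precision * theta
        hess = neg_log_posterior_hessian(X, theta, prior_precision)
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def laplace_score(X, y, x_new, y_new, prior_precision=1.0):
    # S = log p(y_new | x_new, theta_hat) - 0.5 * log det I(theta_hat),
    # with theta_hat refit on the augmented data (an assumption of this sketch).
    X_aug = np.vstack([X, x_new])
    y_aug = np.append(y, y_new)
    theta_hat = fit_map(X_aug, y_aug, prior_precision)
    p_new = sigmoid(x_new @ theta_hat)
    log_pred = np.log(p_new if y_new == 1 else 1.0 - p_new)
    info = neg_log_posterior_hessian(X_aug, theta_hat, prior_precision)
    _, logdet = np.linalg.slogdet(info)
    return log_pred - 0.5 * logdet

# Toy data: one feature plus a bias column, class 1 for positive inputs.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)
candidate = np.array([1.0, 1.8])
score_consistent = laplace_score(X, y, candidate, 1)
score_flipped = laplace_score(X, y, candidate, 0)
```

On this toy data, the candidate whose pseudo-label agrees with the labeled points receives the higher score, so it would be preferred for inclusion.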

2. Sources and Quantification of Uncertainty

UPS employs multiple sources of uncertainty:

Epistemic uncertainty reflects lack of knowledge about model parameters due to limited data. Bayesian posteriors, ensembles, and MC-Dropout are commonly used for its estimation (Rodemann, 2023, Dorigatti et al., 2022, Liu et al., 26 Mar 2025, Wang et al., 2023).
Aleatoric uncertainty models inherent data variability, often estimated via learned variance (e.g., diagonal covariance head) (Kazemi, 2022, Zhang et al., 2021).
Distributional (covariate) shift is incorporated by contrasting the likelihood of augmenting with selected pseudo-labels versus random (i.i.d.) pseudo-labels (Rodemann et al., 2023).
Calibration uncertainty is quantified via regularized conformal predictors, which provide finite-sample coverage guarantees on prediction sets (Moezzi, 2023).

Ensemble disagreement, posterior predictive variance, Laplace approximations, and conformal scores are used throughout the literature for explicit uncertainty measurement.
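
The split between epistemic and aleatoric uncertainty can be illustrated with the standard entropy decomposition for an ensemble or a set of MC-Dropout passes: the entropy of the averaged prediction (total) minus the average per-member entropy (aleatoric) gives the mutual information (epistemic). The function names and the `(members, samples, classes)` array layout below are assumptions made for this sketch.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy along the class axis.
    return -np.sum(p * np.log(p + eps), axis=-1)

def decompose_uncertainty(member_probs):
    # member_probs has shape (M, N, C): M ensemble members or dropout
    # passes, N samples, C classes.
    mean_probs = member_probs.mean(axis=0)
    total = entropy(mean_probs)                      # predictive entropy
    aleatoric = entropy(member_probs).mean(axis=0)   # expected entropy
    epistemic = total - aleatoric                    # mutual information
    return total, aleatoric, epistemic

# Two members, two samples: they agree on the first sample and
# confidently disagree on the second.
member_probs = np.array([
    [[0.9, 0.1], [0.99, 0.01]],
    [[0.9, 0.1], [0.01, 0.99]],
])
total, aleatoric, epistemic = decompose_uncertainty(member_probs)
```

In this example the first sample has essentially zero epistemic uncertainty, while the second, where the members disagree, has a large mutual information even though each member is individually confident.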

3. UPS Algorithmic Framework: Key Variants

A canonical UPS algorithm alternates between model retraining and uncertainty-based pseudo-label selection. Selection rules vary but share the following principles:

Score candidates by predictive fit penalized by model uncertainty, using, e.g., posterior predictive distributions or model entropy.
Filter using explicit uncertainty measures, either by thresholding (e.g., only admit samples with $\mathrm{MI}(x) < t_l$ or entropy below a quantile) or by continuous weighting (e.g., $w_i = \exp(-u_i)$ or $w_i = 1/(u_i U_\text{max}+1)$).
Handle multi-objective utility: Some frameworks define a vector-valued utility that simultaneously accounts for robustness to model selection, error accumulation, and covariate shift, with selection by generalized stochastic dominance or $\alpha$-cut rules on credal priors (Rodemann et al., 2023, Rodemann, 2023).
Iterative update: After pseudo-label selection, retrain the model on the augmented dataset and repeat until a stopping criterion is met.
Negative label integration: Some UPS methods also assign negative pseudo-labels (i.e., confident absence) for additional constraints in both single- and multi-label settings (Rizve et al., 2021, Moezzi, 2023).
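
The thresholding and weighting rules in the list above can be sketched in a few lines. The function names, the default quantile, and the weighted loss are assumptions for illustration; only the threshold $t_l$ and the weight $w_i = \exp(-u_i)$ come from the text.

```python
import numpy as np

def threshold_mask(uncertainty, t_l):
    # Hard selection: admit samples whose uncertainty is below t_l.
    return uncertainty < t_l

def quantile_mask(uncertainty, q=0.2):
    # Hard selection: admit the fraction q of samples with lowest uncertainty.
    return uncertainty <= np.quantile(uncertainty, q)

def exponential_weights(uncertainty):
    # Continuous alternative: w_i = exp(-u_i).
    return np.exp(-uncertainty)

def weighted_nll(probs, pseudo_labels, weights, eps=1e-12):
    # Pseudo-label loss in which uncertain samples contribute less.
    picked = probs[np.arange(len(pseudo_labels)), pseudo_labels]
    return -np.sum(weights * np.log(picked + eps)) / np.sum(weights)

u = np.array([0.05, 0.5, 0.2, 1.0, 0.01])
hard = threshold_mask(u, t_l=0.1)
by_quantile = quantile_mask(u, q=0.4)
w = exponential_weights(u)
```

Hard masks discard samples outright, whereas the weights keep every sample but shrink the influence of uncertain ones, which avoids choosing a cutoff.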

A general pseudocode template is:

for each self-training iteration:
    fit model on the current labeled set
    for each candidate (x, y_hat) in the unlabeled pool:
        compute uncertainty U(x, y_hat)
        compute predictive score S(x, y_hat)
    rank candidates by S minus an uncertainty penalty, or by multi-objective aggregation
    move the top-ranked candidates, with their pseudo-labels, into the labeled set
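
The template can be made concrete as a small runnable sketch. Everything specific here is an assumption chosen to keep the example short: a bootstrap ensemble of nearest-centroid classifiers stands in for the model, mutual information is the uncertainty, the log of the mean predicted probability is the score, and `top_k`, `penalty`, and the other parameter names are hypothetical. It also assumes every class appears in the initial labeled set.

```python
import numpy as np

def fit_centroids(X, y, n_classes, rng):
    # One ensemble member: class centroids from a stratified bootstrap,
    # so every class keeps at least one sample.
    centroids = []
    for c in range(n_classes):
        Xc = X[y == c]
        centroids.append(Xc[rng.integers(0, len(Xc), size=len(Xc))].mean(axis=0))
    return np.stack(centroids)

def predict_proba(centroids, X):
    # Softmax over negative squared distances to the class centroids.
    logits = -((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits -= logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def self_train(X_lab, y_lab, X_unl, n_classes, n_rounds=5,
               n_members=10, top_k=5, penalty=1.0, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_rounds):
        if len(X_unl) == 0:
            break
        # Fit the ensemble on the current labeled set, predict on the pool.
        probs = np.stack([
            predict_proba(fit_centroids(X_lab, y_lab, n_classes, rng), X_unl)
            for _ in range(n_members)
        ])
        mean_probs = probs.mean(axis=0)
        y_hat = mean_probs.argmax(axis=1)
        # Predictive score S and epistemic uncertainty U per candidate.
        score = np.log(mean_probs[np.arange(len(X_unl)), y_hat] + 1e-12)
        uncertainty = entropy(mean_probs) - entropy(probs).mean(axis=0)
        # Rank by penalized score and move the top candidates over.
        top = np.argsort(-(score - penalty * uncertainty))[:top_k]
        X_lab = np.vstack([X_lab, X_unl[top]])
        y_lab = np.concatenate([y_lab, y_hat[top]])
        X_unl = np.delete(X_unl, top, axis=0)
    return X_lab, y_lab
```

Each round adds at most `top_k` pseudo-labeled points, so the labeled set grows gradually and early mistakes have limited reach; a stopping criterion other than a fixed round count could be substituted.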

4. Application Domains and Workflow Adaptations

UPS has been successfully deployed in diverse learning settings:

Positive-Unlabeled learning (PU): PUUPL combines epistemic uncertainty from deep ensembles with non-negative risk minimization to control confirmation bias and improves performance on highly imbalanced data (Dorigatti et al., 2022).
Graph neural networks: Node-level uncertainty is estimated via variational encoders or perturbation-based methods (e.g., selective edge removal), with stochastic smoothing in soft pseudo-label distributions to suppress overconfident errors (Liu et al., 26 Mar 2025, Teimuri et al., 2 Feb 2025).
Domain adaptation: Uncertainty-aware selection filters noisy pseudo-labels during domain transfer, leveraging both neighborhood consistency and sample-level confidence to prioritize robust knowledge transfer (Zheng et al., 2020, Chen et al., 2024).
Segmentation and keypoint regression: Structured uncertainty (pixel/region entropy, MC-dropout, ensemble statistics) is used for selective inclusion or reweighting of pseudo-labels during mask or keypoint prediction (Bui-Tran et al., 29 Oct 2025, Yang, 24 Sep 2025, Wu et al., 13 Mar 2025).
Multi-modal tasks: Conformal prediction-based uncertainty sets and ensemble disagreement are used to provide coverage guarantees and modality-agnostic uncertainty quantification (Moezzi, 2023, Kazemi, 2022).
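
To illustrate the conformal prediction sets mentioned in the last item, the sketch below uses plain split conformal prediction with the score $1 - p_{\text{true}}$, not the regularized variant cited above. Admitting only samples whose prediction set is a singleton is an assumed selection rule for illustration, and the function names and `alpha` default are hypothetical.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score on held-out calibration data: 1 - p(true class).
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_sets(probs, q_hat):
    # Boolean (N, C) matrix: class c is in the set if 1 - p_c <= q_hat.
    return (1.0 - probs) <= q_hat

def singleton_pseudo_labels(probs, q_hat):
    # Keep only samples whose prediction set contains exactly one class.
    sets = prediction_sets(probs, q_hat)
    keep = sets.sum(axis=1) == 1
    return keep, sets.argmax(axis=1)

# Calibration set of nine samples whose true class is class 0.
p_true = np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.99, 0.6, 0.75])
cal_probs = np.column_stack([p_true, 1.0 - p_true])
cal_labels = np.zeros(9, dtype=int)
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)

unlabeled_probs = np.array([[0.97, 0.03], [0.5, 0.5]])
keep, labels = singleton_pseudo_labels(unlabeled_probs, q_hat)
```

Here the confident sample yields a singleton set and is kept, while the ambiguous sample is rejected. The coverage guarantee relies on the calibration and unlabeled data being exchangeable, which is why a held-out calibration set is needed.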

5. Confirmation Bias Mitigation and Robustness

A central benefit of UPS is the mitigation of confirmation bias, i.e., the propagation of overconfident, erroneous pseudo-labels, which matters most early in training and in low-data regimes (Rodemann, 2023). Complexity penalties (e.g., $-\frac{1}{2} \log |\mathcal{I}(\hat\theta)|$) systematically penalize sharp modes and overconfident parameter regions. Robust extensions, such as credal sets and $\alpha$-cut inference, further guard against model misspecification by preventing any single model or prior from dominating the selection. Model selection and early stopping can be integrated via explicit uncertainty tracking.

6. Empirical Performance and Practical Guidelines

UPS methods are reported to outperform naive confidence-based selection and many consistency-based SSL baselines across vision, language, and biomedical tasks:

Accuracy gains of 3–7 percentage points on CIFAR-10/CIFAR-100 relative to naive pseudo-labeling, even matching or surpassing strong augmentation-based consistency methods (Moezzi, 2023, Rizve et al., 2021).
Substantial improvements (5–15%) in regimes prone to overfitting or under severe distribution shift (Rodemann, 2023).
For PU learning and extreme imbalance, absolute AUROC or accuracy boosts of up to 3–4% (Dorigatti et al., 2022).
Reported ablation studies attribute most of the performance gain to the uncertainty-filtering step, especially in early epochs, for minority classes, and in low-resource regimes.

Key practical recommendations include:

Use epistemic uncertainty (ensemble- or dropout-based mutual information) for selection; under tight compute budgets, conformal or entropy-based scores can serve as cheaper substitutes.
Consider continuous weighting (long-tailed or exponential) of pseudo-labels rather than hard thresholds if threshold tuning is problematic (Wu et al., 13 Mar 2025).
Employ robust, multi-objective selection rules if model misspecification or distribution drift is anticipated (Rodemann et al., 2023, Rodemann, 2023).
For class imbalance, adapt per-class thresholds and selection ratios based on uncertainty statistics (Yang et al., 2024).
Re-initialize models at each self-training round to prevent the accumulation of confirmation bias from inherited label noise (Dorigatti et al., 2022, Rizve et al., 2021).
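
The class-imbalance recommendation can be sketched as a per-class selection rule: rather than one global uncertainty threshold, each predicted class keeps its own fraction of least-uncertain candidates. The function name and the `keep_ratio` parameter are assumptions for illustration and are not taken from Yang et al. (2024).

```python
import numpy as np

def per_class_select(y_hat, uncertainty, n_classes, keep_ratio=0.5):
    # For each predicted class, keep the keep_ratio fraction of candidates
    # with the lowest uncertainty (at least one per non-empty class).
    keep = np.zeros(len(y_hat), dtype=bool)
    for c in range(n_classes):
        idx = np.flatnonzero(y_hat == c)
        if idx.size == 0:
            continue
        k = max(1, int(np.ceil(keep_ratio * idx.size)))
        keep[idx[np.argsort(uncertainty[idx])[:k]]] = True
    return keep

# Four majority-class candidates and one uncertain minority-class candidate.
y_hat = np.array([0, 0, 0, 0, 1])
uncertainty = np.array([0.1, 0.2, 0.3, 0.4, 0.9])
keep = per_class_select(y_hat, uncertainty, n_classes=2, keep_ratio=0.5)
global_keep = uncertainty < 0.5
```

In this example a single global threshold of 0.5 drops the only minority-class candidate, while the per-class rule retains it along with the two most certain majority-class candidates.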

7. Limitations and Future Directions

UPS frameworks entail additional computational requirements, especially for ensemble, MC-dropout, or conformal predictor-based uncertainty estimators, though recent methods aim to minimize these overheads (single-head estimators, lightweight neighborhood-based uncertainty, and mixup with contrastive regularization) (Kazemi, 2022, Wu et al., 13 Mar 2025). The calibration of thresholding or weighting functions remains application-dependent, though continuous strategies reduce manual tuning. Some approaches require held-out calibration sets or introduce additional architecture components (e.g., conformal set layers, orthogonal certificates). Extension of UPS principles to structured outputs, regression, and continual learning remains under active investigation (Moezzi, 2023).

In summary, Uncertainty-Aware Pseudo-Label Selection provides a rigorous, theoretically grounded, and empirically validated mechanism for robustly expanding labeled data in semi-supervised learning. By leveraging explicit uncertainty quantification, Bayesian or decision-theoretic embedding, and robust multi-objective utilities, UPS methods systematically improve pseudo-label quality, generalization, and resilience to confirmation bias across a wide range of domains and data modalities (Rodemann, 2023, Rizve et al., 2021, Dorigatti et al., 2022, Liu et al., 26 Mar 2025, Cai et al., 2021, Moezzi, 2023, Rodemann et al., 2023, Bui-Tran et al., 29 Oct 2025).