Semi-Supervised Re-Labeling Methods
- Semi-supervised re-labeling is a method that augments a limited labeled dataset with high-confidence pseudo-labels from model predictions, domain priors, or learned functions.
- Techniques such as confidence-based filtering, curriculum scheduling, and graph propagation balance label noise with coverage, enhancing performance in data-scarce scenarios.
- Empirical studies demonstrate that these methods yield significant boosts in accuracy and robustness, particularly in domains with sparse or costly annotations.
Semi-supervised re-labeling is a family of techniques in semi-supervised learning (SSL) that augment a small labeled dataset with automatically generated labels (pseudo-labels) for some or all of the unlabeled data. The central idea is to exploit model predictions, domain- or graph-based priors, or learned labeling functions in order to impute high-confidence or structurally reliable labels on unlabeled examples, then retrain or fine-tune models using this augmented set. The methodology underpins a wide range of recent advances in SSL for both classification and structured prediction—particularly in data-scarce regimes and domains where annotation is expensive or infeasible.
1. Problem Formalization and Canonical Frameworks
Across this literature, one typically observes the following abstraction. There is a labeled set D_L = {(x_i, y_i)} and an unlabeled set D_U = {x_j}. The goal is to exploit D_U, often much larger than D_L, to improve generalization of a model f_θ. The prototypical semi-supervised re-labeling pipeline comprises (i) an initial model fit to D_L, (ii) selection and assignment of pseudo-labels ŷ to a candidate subset of D_U, and (iii) retraining or fine-tuning the model on D_L together with the pseudo-labeled subset, possibly with weighting or filtering.
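The three-step pipeline above can be sketched in a few lines. This is a minimal illustration, not any specific paper's method: `fit` and `predict_proba` are placeholders for an arbitrary model interface, and the confidence threshold `tau` is the simplest possible selection rule.

```python
import numpy as np

def self_train(fit, predict_proba, X_lab, y_lab, X_unlab, tau=0.9):
    """One round of the canonical pipeline: fit on the labeled set,
    pseudo-label high-confidence unlabeled points, refit on the union.
    `fit(X, y) -> model` and `predict_proba(model, X) -> (n, c) array`
    are placeholders for any model interface."""
    model = fit(X_lab, y_lab)                      # (i) initial model on D_L
    proba = predict_proba(model, X_unlab)          # class probabilities on D_U
    conf = proba.max(axis=1)
    keep = conf >= tau                             # (ii) confidence gating
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
    return fit(X_aug, y_aug)                       # (iii) retrain on the union
```

Iterative variants simply loop this function, optionally re-initializing the model between rounds.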
Methods diverge substantially in how they select candidates from D_U for labeling, how the pseudo-label ŷ is determined, how label noise is mitigated or corrected, and whether these steps are performed as a one-shot expansion, an iterative curriculum, or within special architectures (e.g., GNNs or co-training) (Gross et al., 1 Feb 2026).
2. Pseudo-label Assignment: Strategies and Theoretical Rationale
Confidence-Based Filtering
The most classical form, found in self-training and pseudo-labeling (Cascante-Bonilla et al., 2020, Radhakrishnan et al., 2023), assigns ŷ = argmax_c p_θ(c | x) to any x ∈ D_U where the model's confidence max_c p_θ(c | x) exceeds a threshold τ. Only these high-confidence pseudo-labels are admitted to the training pool, trading off coverage against label noise.
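The coverage-versus-noise trade-off can be made concrete by sweeping the threshold and measuring, on held-out predictions, what fraction of the unlabeled pool is admitted and how accurate the admitted pseudo-labels are. A small sketch (the function name is illustrative):

```python
import numpy as np

def coverage_noise_tradeoff(proba, y_true, taus):
    """For each confidence threshold tau, report the fraction of the
    unlabeled pool admitted (coverage) and the accuracy of the admitted
    pseudo-labels. Raising tau cuts noise but also cuts coverage."""
    yhat = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    out = []
    for tau in taus:
        keep = conf >= tau
        cov = keep.mean()
        acc = (yhat[keep] == y_true[keep]).mean() if keep.any() else float("nan")
        out.append((tau, cov, acc))
    return out
```

In practice y_true is unavailable on D_U, so such curves are estimated on a labeled validation split.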
Curriculum and Progressive Scheduling
Curriculum labeling (CL) (Cascante-Bonilla et al., 2020) replaces a fixed τ with an adaptive percentile threshold. Iteratively, the model pseudo-labels the easiest (highest-confidence) samples first and gradually admits harder samples by lowering the threshold according to a schedule τ_1 > τ_2 > … over cycles. After each cycle, the model is retrained from scratch to prevent confirmation bias. Progressive representative labeling (PRL) (Yan et al., 2021) further generalizes this to graph-based settings, labeling only the most "representative" (high-indegree) samples and iteratively propagating labels outward.
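The percentile scheduling idea can be sketched as follows; the linear step schedule is a simplifying assumption, not the exact schedule from the paper:

```python
import numpy as np

def curriculum_threshold(conf, step, total_steps):
    """Adaptive percentile threshold in the spirit of curriculum labeling:
    at cycle `step`, admit the top (step / total_steps) fraction of the
    unlabeled pool by confidence, so easy samples enter first."""
    frac = step / total_steps
    if frac >= 1.0:
        return -np.inf                        # final cycle: admit everything
    return np.quantile(conf, 1.0 - frac)      # keep the top-`frac` quantile

def curriculum_select(conf, step, total_steps):
    """Boolean mask over the unlabeled pool for the current cycle."""
    return conf >= curriculum_threshold(conf, step, total_steps)
```

Between cycles the model is retrained from scratch, so each wave of pseudo-labels is produced by a model that has not yet overfit its own earlier mistakes.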
Oracle-Filtered Selection
In domains where the cost of a false pseudo-label is high or model predictions are unreliable (e.g., CAPP transformer planning (Gross et al., 1 Feb 2026)), an explicit oracle classifier g is trained to predict the correctness of a candidate pseudo-label, using a rich feature vector φ(x) extracted from the model's output statistics (e.g., logit margin, entropy, temporal features). Only pairs with g(φ(x)) ≥ δ are admitted; δ is set for a target precision.
Graph- and Embedding-Based Label Propagation
Instead of relying on classification confidence, some frameworks assign pseudo-labels by leveraging geometric or topological properties in an embedding space. HDL (Ma et al., 2024) hierarchically re-labels unlabeled samples by majority-vote among k-nearest labeled neighbors in the embedding space, iteratively expanding the labeled set to maximize cascade effects. PRL (Yan et al., 2021) uses kNN-graph indegree as a robustness criterion for representativeness. Feature affinity-based methods (Ding et al., 2018) compute soft pseudo-labels directly from cosine similarities to labeled cluster centers.
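A single expansion wave of the kNN majority-vote scheme can be sketched as below; HDL iterates this, folding newly labeled points back into the labeled pool. The brute-force distance computation is for clarity only.

```python
import numpy as np

def knn_relabel(emb_lab, y_lab, emb_unlab, k=3):
    """Assign each unlabeled embedding the majority label among its k
    nearest labeled neighbors in embedding space (one wave of the
    hierarchical re-labeling motif; no classifier confidence is used)."""
    # squared Euclidean distances, unlabeled x labeled
    d = ((emb_unlab[:, None, :] - emb_lab[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest labeled points
    votes = y_lab[nn]                        # their labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

Iterating the wave (relabel, merge into the labeled set, repeat) produces the cascade effect described above, at the cost of propagating any early mistakes.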
Label propagation (LP) (Albert et al., 2020) constructs a similarity graph (e.g., from self-supervised feature encodings), relaxes labels over the entire dataset using Laplacian propagation, and selects subsets with low empirical loss as reliable pseudo-labels.
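The propagation step can be sketched with the classic iterative update F ← αSF + (1 − α)Y on a normalized affinity matrix (a standard formulation; the cited work additionally builds the graph from self-supervised features and filters by empirical loss, which is omitted here):

```python
import numpy as np

def label_propagate(W, y_lab, lab_idx, alpha=0.9, iters=100):
    """Iterative label propagation on a similarity graph W:
    F <- alpha * S @ F + (1 - alpha) * Y, where S is the symmetrically
    normalized affinity matrix and Y one-hot-encodes the known labels.
    Returns hard labels; rows of F serve as soft pseudo-labels."""
    n = W.shape[0]
    c = y_lab.max() + 1
    Y = np.zeros((n, c))
    Y[lab_idx, y_lab] = 1.0                       # clamp known labels
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-12))
    S = d_inv_sqrt @ W @ d_inv_sqrt
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)
```

Since α < 1 and S has spectral radius at most 1, the iteration converges; in practice a closed-form solve or sparse iterations replace the dense loop.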
Out-of-Distribution and Class-Mismatch
Class-mismatch scenarios (unlabeled data contains OOD classes) degrade naive pseudo-labeling. Re-balanced pseudo-labeling (RPL) (Han et al., 2023) enforces class balance among high-confidence pseudo-labels, dropping OOD samples that would otherwise cluster unnaturally. For low-confidence, likely OOD samples, semantic exploration clustering (SEC) assigns additional "extra-class" pseudo-labels via balanced optimal transport clustering.
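The re-balancing step can be sketched as a per-class top-k selection; this is a minimal version of the idea (a fixed per-class budget), not the full RPL+SEC pipeline:

```python
import numpy as np

def rebalanced_pseudo_labels(proba, per_class):
    """Class-balanced selection: instead of one global confidence
    threshold, admit the `per_class` most confident samples per
    *predicted* class. OOD samples that would pile up in a single
    class can then no longer dominate the pseudo-label pool."""
    yhat = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    keep = np.zeros(len(proba), dtype=bool)
    for c in range(proba.shape[1]):
        idx = np.where(yhat == c)[0]
        top = idx[np.argsort(conf[idx])[::-1][:per_class]]
        keep[top] = True
    return keep, yhat
```

The remaining low-confidence samples would then be routed to the clustering branch (SEC) rather than discarded outright.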
3. Label Correction, Noise Mitigation, and Confirmation Bias
Label noise is a primary challenge in re-labeling frameworks, as errors are difficult to detect and can reinforce themselves. Multiple approaches address this:
- Oracle-based filtering (Gross et al., 1 Feb 2026) actively rejects candidates whose feature-vectors contain error signatures.
- Self-distillation and graph-based correction (Xiao et al., 2024): GNN-based MLLC alternates between embedding-space propagation (enforcing feature smoothness) and class-consistency propagation (enforcing label agreement), enabling error detection and correction by exploiting structural proximities.
- Re-weighting: Losses on pseudo-labeled samples are weighted by model-calibrated confidence (Yao et al., 2022), entropy (Yao et al., 2022), or statistical measures (Grau et al., 2020).
- Restarting and curriculum (Cascante-Bonilla et al., 2020): Model parameter restarts between pseudo-labeling cycles break feedback loops, curbing confirmation bias.
- Co-training/disagreement (Nassar et al., 2021, Yao et al., 2022): Ensembling or training multiple heads with different inductive biases, and promoting agreement/disagreement, yields more robust pseudo-label pools.
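Of the mitigations above, re-weighting is the simplest to state in code. A minimal sketch, assuming confidence-based weights (one common choice among those cited):

```python
import numpy as np

def confidence_weights(proba):
    """One common weighting: each pseudo-labeled sample's weight is its
    max predicted probability, so uncertain labels contribute less."""
    return proba.max(axis=1)

def weighted_pseudo_loss(proba, pseudo_labels, weights):
    """Per-sample weighted cross-entropy over the pseudo-labeled pool;
    normalizing by the weight sum keeps the loss scale comparable
    across selection rounds."""
    eps = 1e-12
    nll = -np.log(proba[np.arange(len(proba)), pseudo_labels] + eps)
    return (weights * nll).sum() / (weights.sum() + eps)
```

Entropy-based or statistically calibrated weights slot into the same interface by swapping out `confidence_weights`.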
4. Algorithmic Variants and End-to-End Pipelines
Below, several representative pipelines exemplify the contemporary design space of semi-supervised re-labeling.
| Paper [arXiv ID] | Pseudo-label Generation | Label Correction/Selection | Retraining Paradigm |
|---|---|---|---|
| Semi-supervised CAPP (Gross et al., 1 Feb 2026) | Transformer predictions, filtered by XGBoost oracle | Oracle outputs probability; use threshold for inclusion | One-shot retraining with labeled + pseudo-labeled |
| Curriculum Labeling (Cascante-Bonilla et al., 2020) | Confidence, percentile-based curriculum | Adaptive threshold, model restarts | Iterative cycles, full restarts |
| PRL (Yan et al., 2021) | Indegree sampling on kNN graph, GNN labeler | Only high-indegree samples, progressive waves | GNN for labeling, then full model fine-tune |
| HDL (Ma et al., 2024) | Embedding-space kNN majority, hierarchical selection | Outlier-tolerant, no class-prediction used | Re-label followed by re-training |
| MLLC (Xiao et al., 2024) | Initial predictions, GNN propagation on dual graphs | Refined by alternated SLG/CLG steps, per-pixel correction | Joint GNN + backbone optimization |
Some methods are designed for specific domains (e.g., sequence prediction, semantic segmentation, remote physiological measurement), but all rely on the central re-labeling motif: generate artificial supervision on D_U, control label noise (via confidence, structure, or explicit correction), and retrain in a way that prevents runaway confirmation bias.
5. Practical Analysis and Empirical Findings
Across domains, semi-supervised re-labeling methods yield substantial improvements over both supervised-only baselines and classical self-training/pseudo-labeling. Key system-level findings include:
- Data Scarcity Sensitivity: Marginal gains from re-labeling are largest when labeled data is most scarce, e.g., +11% accuracy on capp1 (1% of data) with oracle filtering over random selection (Gross et al., 1 Feb 2026), +11.25 mIoU on Cityscapes 1/30 labeled (Xiao et al., 2024), +3–4% top-1 on ImageNet 10% labeled with PRL (Yan et al., 2021).
- Robustness to OOD/Covariate Shift: Curriculum and re-balanced pseudo-labeling substantially reduce error from OOD contamination, as shown in both controlled class-mismatch (Han et al., 2023) and curriculum-based robustness ablations (Cascante-Bonilla et al., 2020).
- Confirmation Bias Mitigation: Restarts (Cascante-Bonilla et al., 2020), data programming with bi-level LF weighting (Maheshwari et al., 2021), and cross-network exchange (Yao et al., 2022) all demonstrate improved performance and label stability compared to naive iterative re-labeling, where early mistakes reinforce recursively.
- Interpretability and Surrogate Extraction: Re-labeling pipelines can also support interpretable learning by filtering with a calibrated black-box model and distilling into a transparent surrogate; amending pseudo-labels via confidence measures or rough-set theory further improves both accuracy and model simplicity (Grau et al., 2020).
6. Domain-Specific Instances and Generalizations
While much of the literature concerns classification benchmarks, re-labeling has led to specialized innovations across data types:
- Sequential/Structured Outputs: Oracle-filtered one-shot pseudo-labeling for transformer CAPP models and process planning (Gross et al., 1 Feb 2026).
- Semantic Segmentation: Multi-level label correction leveraging dual GNNs over pixels and classes yields state-of-the-art results in highly limited label regimes (Xiao et al., 2024).
- Biomedical Signals: Curriculum pseudo-labeling with domain-specific SNR ranking, ensuring that only the highest-quality signals are enrolled as supervision (Wu et al., 6 Feb 2025).
- Graph and Manifold Learning: PRL and HDL generalize the pseudo-labeling paradigm to graph structures and deep embedding spaces, where geometric generalization supersedes raw confidence (Yan et al., 2021, Ma et al., 2024).
- Data Programming: Re-labeling via weighted aggregations of noisy labeling functions (LFs) with robust bi-level optimization (Maheshwari et al., 2021).
These designs are united by a structural approach: (i) augmenting data with automatically generated targets, (ii) gating on multi-criterion quality measures, (iii) re-training for final inference. The selection and correction strategies are usually tailored to the representational and noise characteristics of both the task and the model class.
7. Limitations, Best Practices, and Prospective Directions
Common challenges include:
- Residual Label Noise: Even strong classifiers or oracles occasionally admit erroneous pseudo-labels. Multi-stage filtering and more conservative thresholds can reduce noise, but at the cost of recall (Gross et al., 1 Feb 2026); this precision-recall trade-off is fundamental.
- Computational Overhead: Label propagation or dual-graph GNNs impose quadratic cost in sample size; scalability must be addressed (e.g., via approximate nearest-neighbor search) for very large D_U (Ma et al., 2024, Xiao et al., 2024).
- Class Imbalance/OOD Coverage: Methods such as RPL+SEC directly address these with enforced class balance and auxiliary clustering (Han et al., 2023).
- Model Dependence: Embedding-based methods require high-quality, discriminative representations. An untrained encoder or weak backbone can degrade clusterability and propagate poor labels (Ma et al., 2024).
- Optimization Instabilities: Bi-level optimization with respect to labeling function weights is nontrivial but empirically critical for robust data programming pipelines (Maheshwari et al., 2021).
Promising extensions include multi-oracle filtering, curriculum schedules with adaptive noise ramp-up, end-to-end joint optimization of model and selection mechanism, and integration with active learning and open-set detection frameworks (Gross et al., 1 Feb 2026, Radhakrishnan et al., 2023). Generalization to other generative or sequence tasks is immediate when error patterns and suitable feature extractors for selection are available.
In sum, semi-supervised re-labeling leverages automated, quality-controlled expansion of labeled data to tightly couple unlabeled data exploitation with direct mechanisms for error and bias control. These approaches have been shown to consistently yield state-of-the-art results, particularly in regimes of data scarcity, class imbalance, or distributional shift.