Class-Balanced Pseudo-labeling (CBPL)
- Class-Balanced Pseudo-labeling (CBPL) is a technique that ensures equitable pseudo-label distribution to mitigate confirmation bias in imbalanced datasets.
- It employs selection-based balancing, reweighting methods, and adaptive thresholding to adjust pseudo-label assignment and improve model performance.
- Applications span image classification, semantic segmentation, and graph node classification, achieving significant accuracy gains and reduced annotation needs.
Class-Balanced Pseudo-labeling (CBPL) encompasses a set of methodologies and principles designed to ensure that the pseudo-labels assigned in semi-supervised and active learning frameworks are distributed equitably across all target classes. This paradigm directly addresses the common problem wherein naive pseudo-labeling, especially in settings with skewed class distributions or limited labeled data, amplifies bias toward majority classes. The goal of CBPL is to integrate class balance into the pseudo-label selection, assignment, or weighting process to improve generalization, mitigate confirmation bias, and elevate the recognition performance for minority classes across diverse domains, including image classification, semantic segmentation, object detection, graph node classification, and multi-label learning.
1. Theoretical Foundations and Mechanisms
CBPL is rooted in the observation that standard pseudo-labeling procedures frequently reinforce class imbalance—confident model predictions tend to favor majority classes, leading to confirmation bias and performance degradation on minority classes. This issue emerges even in domains with initially balanced labeled distributions, where the pseudo-labels assigned to unlabeled instances deviate due to the intrinsic data structure or noise amplification (Wang et al., 2022).
The solution approaches in CBPL can be categorized into three principal mechanisms:
- Selection-based balancing: Actively selecting pseudo-labeled (or labeled) samples so that the selected set approximates a desired (often uniform) class distribution, using constraints or regularizers in the selection optimization (Bengar et al., 2021).
- Reweighting-based balancing: Introducing explicit weight modifications during training, typically scaling loss contributions proportional to inverse class frequency or via learned factors (Peng et al., 2023, Zhang et al., 19 Jul 2024).
- Adaptive thresholding: Setting per-class dynamic thresholds for pseudo-label acceptance, so that the pseudo-label class distribution aligns with labeled class proportions or with a theoretically justified target (Xie et al., 2023).
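As a concrete illustration of the adaptive-thresholding mechanism, the following minimal NumPy sketch accepts a pseudo-label only when its confidence clears a per-class bar scaled by that class's frequency in the labeled set. The linear scaling rule and the `base_threshold` value are illustrative assumptions, not the procedure of any single cited method.

```python
import numpy as np

def class_balanced_pseudo_labels(probs, labeled_class_freq, base_threshold=0.95):
    """Accept pseudo-labels using per-class thresholds (illustrative sketch).

    probs: (N, C) softmax outputs on unlabeled data.
    labeled_class_freq: (C,) empirical class counts in the labeled set.
    """
    conf = probs.max(axis=1)               # confidence of each prediction
    pred = probs.argmax(axis=1)            # candidate pseudo-label per sample
    # Illustrative assumption: the acceptance bar scales linearly with relative
    # class frequency, so rare classes face a lower threshold than head classes.
    rel_freq = labeled_class_freq / labeled_class_freq.max()
    per_class_thr = base_threshold * rel_freq
    accept = conf >= per_class_thr[pred]   # keep samples that clear their class's bar
    return pred[accept], np.flatnonzero(accept)

# Example: class 2 is rare in the labeled set, so its lower-confidence
# prediction (sample 1) is still accepted, while sample 2 is rejected.
probs = np.array([[0.97, 0.02, 0.01],
                  [0.10, 0.15, 0.75],
                  [0.50, 0.45, 0.05]])
print(class_balanced_pseudo_labels(probs, np.array([100.0, 80.0, 20.0])))
```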
CBPL frameworks can be instantiated at pseudo-labeling time, during model retraining, or within memory bank architectures for long-tailed SSL.
2. Algorithmic Design and Regularization
Several algorithmic strategies have emerged for effective CBPL implementation:
- Optimization with balancing constraints: For pool-based active learning, sample acquisition is posed as a binary programming problem that trades off the informativeness of the selected samples (e.g., their prediction entropy) against an L₁ penalty on the deviation of the selected class counts from the target distribution (Bengar et al., 2021):

$$\min_{z \in \{0,1\}^{N}} \; -\sum_{i=1}^{N} z_i\, H(x_i) \;+\; \lambda \sum_{c=1}^{C} \Big|\, \sum_{i=1}^{N} z_i\, \hat{y}_{i,c} - t_c \Big| \quad \text{subject to} \quad \sum_{i=1}^{N} z_i = B,$$

where $H(x_i)$ is the prediction entropy of sample $x_i$, $\hat{y}_{i,c} \in \{0,1\}$ indicates that class $c$ is predicted for $x_i$, $B$ is the annotation budget, and $t_c$ is the target count for class $c$ (e.g., $B/C$ for a uniform target). A greedy approximation is sketched after this list.
- KL-divergence regularization for pseudo-label distributions: In graph semi-supervised learning, minimizing the KL divergence between the empirical pseudo-label class distribution and a target distribution (usually uniform) keeps the assignments balanced (Li et al., 2022):

$$\mathcal{L}_{\mathrm{bal}} = D_{\mathrm{KL}}\big(\bar{p} \,\|\, u\big) = \sum_{c=1}^{C} \bar{p}_c \log \frac{\bar{p}_c}{u_c}, \qquad \bar{p}_c = \frac{1}{|\mathcal{U}|} \sum_{i \in \mathcal{U}} p_\theta(y = c \mid x_i),$$

where $\mathcal{U}$ is the unlabeled set and $u$ is the target distribution. A minimal implementation of this regularizer appears after this list.
- Reweighting factors for rare classes: In memory bank architectures, the probabilities used for enqueuing features into, dequeuing features from, and sampling features out of the bank are set inversely to class frequency, ensuring that underrepresented classes maintain a higher presence during training (Peng et al., 2023).
- Counterfactual debiasing and adaptive margin loss: Where pseudo-labels are intrinsically imbalanced (due to confounders rather than label scarcity alone), subtracting the logarithm of the running average class frequency from the logits and dynamically adjusting per-class classification margins yields balanced supervision without requiring external priors (Wang et al., 2022); a logit-adjustment sketch follows this list.
- Ranking-based and top-k selection: For intent classification, balanced sets of pseudo-labeled instances are created by top-k selection per class, ranked jointly by prediction confidence and distance in embedding space (Botzer et al., 2023).
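The balancing-constrained acquisition above is a binary program; in practice it can be approximated greedily. The sketch below assumes a uniform target distribution, entropy as the informativeness score, and a hypothetical trade-off weight `lam`; it is an illustrative approximation, not the exact solver of Bengar et al. (2021).

```python
import numpy as np

def greedy_balanced_acquisition(probs, budget, lam=1.0):
    """Greedy stand-in for the balanced binary selection problem (sketch).

    probs: (N, C) predicted class probabilities over the unlabeled pool.
    Picks `budget` samples one at a time, preferring high-entropy samples while
    penalizing selections that push any predicted class past a uniform target.
    """
    n, c = probs.shape
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    pred = probs.argmax(axis=1)
    target = budget / c                      # uniform per-class target count
    counts = np.zeros(c)
    chosen, available = [], set(range(n))
    for _ in range(min(budget, n)):
        best, best_score = None, -np.inf
        for i in available:
            # Illustrative penalty: only charges for exceeding the target count.
            penalty = lam * max(0.0, counts[pred[i]] + 1 - target)
            score = entropy[i] - penalty
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        counts[pred[best]] += 1
        available.remove(best)
    return np.array(chosen)
```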
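The KL-divergence balance regularizer can be written in a few lines of PyTorch; the uniform default target and the `eps` smoothing below are assumptions made for the sketch.

```python
import torch

def kl_balance_regularizer(probs, target=None, eps=1e-8):
    """KL(mean pseudo-label distribution || target) over a batch (sketch).

    probs: (N, C) softmax outputs on unlabeled samples or nodes.
    target: (C,) desired class distribution; defaults to uniform (assumption).
    """
    mean_dist = probs.mean(dim=0)          # empirical pseudo-label distribution
    if target is None:
        target = torch.full_like(mean_dist, 1.0 / mean_dist.numel())
    # eps keeps the log finite when a class receives (almost) no probability mass.
    return (mean_dist * ((mean_dist + eps) / (target + eps)).log()).sum()

# Example: add the regularizer to the usual unsupervised loss with a small weight.
probs = torch.softmax(torch.randn(32, 7), dim=1)
loss = 0.1 * kl_balance_regularizer(probs)
```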
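The counterfactual-debiasing step of subtracting log class frequency from the logits can be sketched as follows; the running pseudo-label frequency estimate and the temperature `tau` are illustrative assumptions rather than the exact formulation of Wang et al. (2022).

```python
import numpy as np

def debiased_pseudo_labels(logits, pseudo_label_freq, tau=1.0):
    """Counterfactual-style logit adjustment before pseudo-labeling (sketch).

    logits: (N, C) raw classifier outputs on unlabeled data.
    pseudo_label_freq: (C,) running counts of previously assigned pseudo-labels,
    used here as a stand-in for the pseudo-label prior (illustrative assumption).
    """
    prior = pseudo_label_freq / pseudo_label_freq.sum()
    # Subtracting tau * log(prior) lowers the scores of over-assigned (head)
    # classes and raises those of under-assigned (tail) classes.
    adjusted = logits - tau * np.log(prior + 1e-12)
    return adjusted.argmax(axis=1), adjusted
```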
3. Applications Across Modalities
CBPL has been applied or adapted in several domains:
- Image Classification and SSL: Mixup augmentation and a minimum number of labeled samples per mini-batch act as regularizers that reduce confirmation bias and improve CBPL effectiveness (Arazo et al., 2019).
- Domain Adaptive Semantic Segmentation: Online pixel-level clustering and optimal transport enforce per-class distribution alignment in the self-labeling regime, significantly improving mean IoU for long-tailed classes (Li et al., 2022).
- Graph Node Classification: LLM-based oversampling for minority nodes and dynamically weighted losses underpin robust performance under noisy and imbalanced settings (Xia et al., 24 Jul 2025).
- Object Detection (Active Learning): Box-level CBPL selects informative minority-class samples and uses task-aware soft pseudo labeling, outperforming hard labeling and standard uncertainty sampling (Liao et al., 25 Aug 2025).
- Multi-label Learning: Class-distribution-aware pseudo labeling via per-class regularized thresholds ensures the pseudo-label distribution matches the data, backed by theoretical generalization guarantees (Xie et al., 2023).
- Intent Classification: Balanced, distance-based top-k selection in embedding space prevents majority classes from dominating intent detection (Botzer et al., 2023); a selection sketch follows this list.
- Key Information Extraction: Reweighting and merged prototype clustering jointly address underestimation of tail class confidence and facilitate compact feature clusters (Zhang et al., 19 Jul 2024).
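For the intent-classification setting above, the balanced top-k selection can be sketched as ranking each predicted class's candidates by a combined confidence/embedding-distance score and keeping at most k per class. The convex score combination (`alpha`) and the use of class prototypes below are assumptions, not the exact criterion of Botzer et al. (2023).

```python
import numpy as np

def balanced_topk_selection(probs, embeddings, prototypes, k=50, alpha=0.5):
    """Keep at most k pseudo-labeled samples per predicted class (sketch).

    probs: (N, C) softmax outputs; embeddings: (N, D) sample embeddings;
    prototypes: (C, D) per-class embedding centroids (e.g., from labeled data).
    Candidates are ranked by a convex mix of confidence and closeness to the
    class prototype (alpha is an illustrative assumption).
    """
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    selected = []
    for c in range(probs.shape[1]):
        idx = np.flatnonzero(pred == c)
        if idx.size == 0:
            continue
        dist = np.linalg.norm(embeddings[idx] - prototypes[c], axis=1)
        # Normalize distances within the class so both terms live on [0, 1].
        dist_norm = (dist - dist.min()) / (dist.max() - dist.min() + 1e-12)
        score = alpha * conf[idx] + (1 - alpha) * (1.0 - dist_norm)
        selected.extend(idx[np.argsort(-score)[:k]].tolist())
    return np.array(selected)
```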
4. Empirical Results and Performance Benefits
Empirical evidence substantiates the efficacy of CBPL:
- Image Classification: On CIFAR-10/100, combining mixup with a minimum number of labeled samples per mini-batch reduced the error from ~32.10% to ~13.68% with 500 labeled samples (Arazo et al., 2019).
- Active Learning: On CIFAR-10 with severe imbalance, CBPL reduced annotation requirements by up to 10% for a target accuracy (Bengar et al., 2021); similar gains (1–3%) are observed on CIFAR-100 and Tiny ImageNet.
- Long-tailed SSL: Memory-bank based approaches showed up to 8.2% improvement on 1% labeled ImageNet127 and 4.3% on ImageNet-LT for minority classes (Peng et al., 2023).
- Domain Adaptive Object Detection: The ReDB framework improves mAP by 23.15% on nuScenes→KITTI (Chen et al., 2023).
- Graph Node Classification: LLM-based graph augmentation plus CBPL yields up to 8.03% gain in G-mean on Cora compared to advanced baselines (Xia et al., 24 Jul 2025).
- Semantic Segmentation: CPSL achieves a mean IoU of 55.7% on GTA5→Cityscapes, outperforming prior methods, especially for tail classes (Li et al., 2022).
- Key Information Extraction: CRMSP attains a 3.24% F1-score improvement over the previous state of the art on CORD (Zhang et al., 19 Jul 2024).
5. Limitations and Open Questions
Despite substantial progress, CBPL faces several challenges and opportunities:
- Pseudo-label noise: Aggressive balancing can propagate erroneous pseudo-labels, especially early in training or under extreme label scarcity. Regularization and conservative selection (e.g., energy-based scoring (Yu et al., 2022) or ranking (Botzer et al., 2023)) are critical.
- Estimation of true class proportions: Many CBPL methods rely on class-distribution estimates from small labeled sets, and the accuracy of these estimates largely determines balance quality; theoretical bounds on the estimation error (Xie et al., 2023) are reassuring but subject to variance.
- Computational efficiency: Optimization problems, clustering, and memory operations can be resource-intensive, particularly with large class counts or at pixel/object/box level.
- Dynamic domain shifts: In domain adaptation, the target class distribution may drift from source; adaptive mechanisms that do not rely on fixed priors are favorable (Wang et al., 2022).
6. Future Directions
Ongoing and future research in CBPL is exploring:
- Integration with advanced uncertainty estimation and calibration, e.g., energy-based selection or Bayesian acquisition in active learning (Yu et al., 2022, Bengar et al., 2021).
- Prototype-based and multicentric representations, to handle intra-class diversity and further reduce negative transfer (Qu et al., 2022).
- Contrastive and semantic feature alignment for tail classes in natural language and vision tasks (Zhang et al., 19 Jul 2024).
- Memory bank scalability and adaptive weighting to enable deployment in large-scale or real-time applications (Peng et al., 2023).
- Extensions to multi-label, multi-task, and graph-based scenarios, with theoretical analysis on generalization and convergence properties (Xie et al., 2023, Xia et al., 24 Jul 2025).
7. Summary Table: CBPL Mechanisms and Key Components
Mechanism | Representative Equation / Algorithm | Key Paper(s)
---|---|---
Selection-based | Binary selection minimizing (negative) informativeness plus an L₁ class-balance penalty | (Bengar et al., 2021)
Reweighting-based | Inverse-class-frequency memory bank probabilities; adaptive loss weights | (Peng et al., 2023, Zhang et al., 19 Jul 2024)
Adaptive thresholding | Per-class regularized thresholds for pseudo-label acceptance | (Xie et al., 2023)
Counterfactual debiasing | Logit adjustment by log class frequency; adaptive per-class margins | (Wang et al., 2022)
KL-divergence reg. | KL divergence between the empirical pseudo-label distribution and a target distribution | (Li et al., 2022)
Top-k ranking | Balanced per-class selection via a combined confidence/distance score | (Botzer et al., 2023)
CBPL principles have become foundational to state-of-the-art semi-supervised and active learning systems, sharply improving minority class recognition and generalization under realistic imbalanced and long-tailed data distributions.