Label Shift in Machine Learning
- Label shift is a phenomenon where the marginal distribution of class labels changes from training to testing, while the class-conditional feature distributions stay invariant.
- Robust estimation methods such as maximum likelihood, moment-matching, and Bayesian techniques enable effective adaptation and unbiased risk evaluation under label shift.
- Extensions like Conditional Probability Shift, open set adaptations, and federated learning cases reveal practical challenges and broaden the scope of traditional label shift correction.
Label shift, also known as prior probability shift, refers to the regime in supervised and semi-supervised learning where the marginal distribution over class labels changes from the training ("source") domain to the test ("target") domain, while the class-conditional feature distributions remain invariant. Formally, under label shift, $p_s(y) \neq p_t(y)$, but $p_s(x \mid y) = p_t(x \mid y)$. This phenomenon is common in numerous real-world applications, such as clinical deployment, federated learning, and high-dimensional distributed systems, and has led to a significant body of research on theoretical guarantees, estimation methodology, and principled adaptation mechanisms.
1. Formal Problem Definition and Identifiability
The fundamental assumption of label shift is that, for all $y$ in the label set $\mathcal{Y}$,
$$p_s(x \mid y) = p_t(x \mid y), \qquad \text{while } p_s(y) \neq p_t(y).$$
This is in contrast to covariate shift, where $p(x)$ changes but $p(y \mid x)$ is preserved.
Identifiability of $p_t(y)$ from unlabeled target data requires linear independence of the collection $\{p(x \mid y)\}_{y \in \mathcal{Y}}$ and invertibility of the conditional confusion matrix for black-box adaptation schemes (Chen et al., 2022). The importance weighting ratio $w(y) = p_t(y)/p_s(y)$, when consistently estimated, enables unbiased risk evaluation and posterior adjustment on the target domain even in the absence of target labels (Chow, 2022, Alexandari et al., 2019).
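The confusion-matrix route reduces to a few lines of linear algebra. Below is a minimal sketch; the function name `bbse_weights` and the toy numbers are illustrative, assuming an invertible confusion matrix estimated on held-out labeled source data:

```python
import numpy as np

def bbse_weights(confusion, target_pred_marginal):
    """Black-box shift estimation: solve C w = mu_t for importance weights.

    confusion[i, j] = P_source(predict i, true label j), estimated on a
    held-out labeled source set; target_pred_marginal[i] = fraction of
    target points the black-box classifier assigns to class i.
    """
    w = np.linalg.solve(confusion, target_pred_marginal)
    return np.clip(w, 0.0, None)  # true weights are nonnegative

# toy example: 2 classes, a slightly noisy classifier
p_s = np.array([0.5, 0.5])                # source priors
p_t = np.array([0.8, 0.2])                # target priors (unknown in practice)
acc = np.array([[0.9, 0.2],               # P(pred | true), columns = true class
                [0.1, 0.8]])
C = acc * p_s                             # joint P(pred, true) on source
mu_t = acc @ p_t                          # predicted-label marginal on target
w = bbse_weights(C, mu_t)                 # recovers p_t / p_s exactly here
```

In this noiseless construction the recovered weights equal $p_t(y)/p_s(y) = (1.6, 0.4)$; with finite samples, `C` and `mu_t` are empirical estimates and the solve is only approximately unbiased.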
2. Robust Estimation and Adaptation Algorithms
Maximum Likelihood and Semiparametric Approaches
The canonical approach is maximum likelihood estimation (MLE) of the target label proportions using either direct mixture likelihood or the EM algorithm on unlabeled target data, leveraging the mixture relationship
$$p_t(x) = \sum_{y \in \mathcal{Y}} p_t(y)\, p_s(x \mid y).$$
Under well-specified models and sufficient sample size, the MLE is $\sqrt{n}$-consistent and achieves asymptotically minimal variance (Lecestre, 5 Feb 2025, Chow, 2022). Recent work demonstrates that this estimator coincides with robust $\rho$-estimators, providing deviation bounds and showing controlled breakdown in the presence of moderate contamination or adversarial outliers (Lecestre, 5 Feb 2025). Bias-corrected calibration of classifier outputs is essential for practical effectiveness in deep learning contexts (Alexandari et al., 2019).
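A minimal batch EM sketch of this MLE, assuming calibrated source posteriors are available on the unlabeled target sample (the function name `em_label_shift` and the toy construction are ours):

```python
import numpy as np

def em_label_shift(probs_src, source_prior, n_iter=2000, tol=1e-12):
    """EM/MLE estimate of target label proportions from unlabeled target data.

    probs_src: (n, K) calibrated source posteriors p_s(y|x) on target inputs.
    E-step: reweight posteriors by pi / pi_s and renormalize per row.
    M-step: average the adjusted posteriors to get the new prior.
    """
    pi = np.asarray(source_prior, dtype=float)
    for _ in range(n_iter):
        post = probs_src * (pi / source_prior)
        post /= post.sum(axis=1, keepdims=True)   # adjusted posteriors
        pi_new = post.mean(axis=0)                # new prior estimate
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# toy check: binary feature x with p_s(x=1|y=0)=0.2, p_s(x=1|y=1)=0.9,
# source prior [0.5, 0.5], target prior [0.8, 0.2]. The source posteriors
# for x=1 and x=0 are [2/11, 9/11] and [8/9, 1/9], and the exact target
# marginal puts mass 0.34 on x=1.
probs = np.vstack([np.tile([2/11, 9/11], (34, 1)),
                   np.tile([8/9, 1/9], (66, 1))])
pi_hat = em_label_shift(probs, np.array([0.5, 0.5]))
```

Because the toy data reproduce the target mixture exactly, the EM fixed point is the true target prior; with real samples the estimate carries $O(n^{-1/2})$ noise.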
Moment-Matching and Influence Function Geometry
Efficient semiparametric estimators such as ELSA exploit influence function geometry to formulate a moment-matching system, avoiding post-hoc probability calibration. The estimator solves for weight parameters such that empirical feature moments under the source distribution, reweighted by $w(y)$, align with the corresponding moments of the target distribution (Tian et al., 2023). This approach achieves $\sqrt{n}$-consistency and supports linear system solutions for adaptation weights.
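A simplified moment-matching sketch in this spirit, using raw feature means as the moments and an ordinary least-squares solve (ELSA's actual influence-function-based estimating equations differ; all names here are illustrative):

```python
import numpy as np

def moment_match_weights(X_src, y_src, X_tgt, n_classes):
    """Least-squares moment matching for class weights w(y) = p_t(y)/p_s(y).

    Matches reweighted source feature means to the target feature mean:
        sum_y w_y * p_s(y) * E_s[x | y] = E_t[x],
    which is a linear system in the weight vector w.
    """
    mu_t = X_tgt.mean(axis=0)
    A = np.stack([(y_src == k).mean() * X_src[y_src == k].mean(axis=0)
                  for k in range(n_classes)], axis=1)
    w, *_ = np.linalg.lstsq(A, mu_t, rcond=None)
    return np.clip(w, 0.0, None)

# toy data with well-separated class means and target prior [0.8, 0.2]
X_src = np.vstack([np.tile([1.0, 0.0], (50, 1)),
                   np.tile([0.0, 1.0], (50, 1))])
y_src = np.repeat([0, 1], 50)
X_tgt = np.vstack([np.tile([1.0, 0.0], (80, 1)),
                   np.tile([0.0, 1.0], (20, 1))])
w_hat = moment_match_weights(X_src, y_src, X_tgt, n_classes=2)
```

Identifiability here requires the class-conditional feature means to be linearly independent; richer feature maps relax this in practice.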
Bayesian and Online Label Shift Quantification
Bayesian estimators treat the unknown test prior as a random variable with a Dirichlet prior, leading to maximum a posteriori estimation via EM. Dynamic and online extensions (online-FMAPLS) employ stochastic E/M steps and a linear surrogate for hyperparameter updates, enabling adaptation to streaming environments and class-imbalance in large-scale settings, while trading asymptotic accuracy for convergence rate depending on the Dirichlet scaling constant (Hu et al., 23 Nov 2025).
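The batch MAP variant can be sketched by adding Dirichlet pseudo-counts to the EM M-step. This is a simplification for illustration; online-FMAPLS replaces it with stochastic E/M steps and dynamic hyperparameter updates, and the names below are ours:

```python
import numpy as np

def map_em_prior(probs_src, source_prior, alpha, n_iter=2000):
    """MAP estimate of the target prior under a Dirichlet(alpha) prior.

    Same E-step as ML-EM; the M-step adds Dirichlet pseudo-counts
    (alpha_k - 1) to the expected class counts before normalizing.
    """
    pi = np.asarray(source_prior, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    for _ in range(n_iter):
        post = probs_src * (pi / source_prior)
        post /= post.sum(axis=1, keepdims=True)    # E-step
        counts = post.sum(axis=0) + alpha - 1.0    # M-step with pseudo-counts
        pi = np.clip(counts, 1e-12, None)
        pi /= pi.sum()
    return pi

# same binary toy as in the ML-EM sketch; true target prior is [0.8, 0.2]
probs = np.vstack([np.tile([2/11, 9/11], (34, 1)),
                   np.tile([8/9, 1/9], (66, 1))])
pi_map = map_em_prior(probs, np.array([0.5, 0.5]), alpha=np.ones(2))
pi_shrunk = map_em_prior(probs, np.array([0.5, 0.5]), alpha=np.full(2, 51.0))
```

With a flat prior (`alpha = 1`) the MAP estimate reduces to the MLE; a concentrated Dirichlet shrinks the estimate toward its mean, which stabilizes small-sample or streaming regimes at the cost of bias.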
3. Generalizations and Extensions of Label Shift
Conditional Probability Shift and Generalized Label Shift
Research has highlighted that label shift does not capture all realistic domain discrepancies. The Conditional Probability Shift Model (CPSM) generalizes label shift by allowing the conditional label law $p(y \mid z)$ to change across domains as a function of a covariate subset $z$, while enforcing invariance of $p(x \mid y, z)$. Standard LS-correction methods fail when $p_s(y) = p_t(y)$ yet $p_s(y \mid z) \neq p_t(y \mid z)$; CPSM resolves this via EM on a conditional multinomial regression (Teisseyre et al., 4 Mar 2025).
Generalized Label Shift (GLS) further considers joint changes in class priors and class-conditional covariate distributions, encompassing classical label shift, covariate shift, and intermediate settings. The Minimum Uncertainty Learning (MUL) framework aligns conditional feature distributions in RKHS and balances transfer and decision uncertainty, resulting in improved generalization error bounds and empirical performance (Luo et al., 2022).
Open Set Label Shift and Survival Analysis
Open Set Label Shift (OSLS) extends label shift to permit the appearance of new, previously unseen classes in the target domain. Algorithms must estimate both the new class prevalence and adapt classifiers using a fusion of label-shift correction on known classes and PU learning for the novel class, with identifiability guaranteed under conditions of feature support separability (Garg et al., 2022).
In survival analysis, label shift corresponds to differences in the marginal distribution of event times $T$ between training and test populations, while the conditional distribution of covariates given $T$ is preserved. Semiparametric likelihood methods combining nonparametric estimation of the shifted event-time distribution with empirical plug-in of marginal covariate distributions provide consistent, asymptotically normal inference in target populations even under right-censoring (Zong et al., 26 Jun 2025).
4. Algorithmic Frameworks and Practical Considerations
Abstention, Test-Time, and Distributed/Federated Adaptation
Metric-aware abstention under label shift—both at training and test time—requires recalibration of posterior probabilities and expected performance metrics (e.g., auROC, sensitivity at specificity) after adaptation (Alexandari et al., 2018). Naïve entropy or confidence-based rejection rules perform poorly under nonuniform shifts unless the decision thresholds are adapted using calibrated or EM-corrected posteriors.
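A minimal sketch of prior-adjusted confidence rejection, assuming an estimated weight vector $w(y)$ is already available (the threshold value and helper names are illustrative):

```python
import numpy as np

def adapt_posteriors(probs_src, w):
    """Adjust source posteriors to the target prior: p_t(y|x) ∝ p_s(y|x) w(y)."""
    post = probs_src * np.asarray(w)
    return post / post.sum(axis=1, keepdims=True)

def abstaining_predict(probs_src, w, threshold=0.7):
    """Confidence-based rejection on prior-adjusted (not raw) posteriors."""
    post = adapt_posteriors(probs_src, w)
    conf = post.max(axis=1)
    preds = post.argmax(axis=1)
    return np.where(conf >= threshold, preds, -1)  # -1 = abstain
```

On a point with raw posterior (0.6, 0.4), a naïve 0.7-confidence rule abstains; after adjusting with $w = (2.0, 0.5)$ the adjusted posterior exceeds 0.85 for class 0 and the point is confidently accepted, illustrating why thresholds must be applied post-adaptation.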
Test-Time Adaptation (TTA) frameworks have introduced modular "label shift adapters" that estimate the target label prior via exponential moving averages of soft predictions and produce batch-conditional parameter corrections via hypernetwork architectures. Integration of these adapters into entropy-minimization TTA yields significant robustness gains in scenarios with both covariate and label shift (Park et al., 2023).
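The prior-tracking component of such an adapter can be sketched as an exponential moving average over batch-mean soft predictions; this omits the hypernetwork correction entirely, and the class name and momentum value are illustrative:

```python
import numpy as np

class EMAPriorEstimator:
    """Running estimate of the target label prior from soft predictions.

    Each test batch updates an exponential moving average of the mean
    softmax output, tracking a (possibly drifting) target prior online.
    """
    def __init__(self, n_classes, momentum=0.9):
        self.pi = np.full(n_classes, 1.0 / n_classes)
        self.momentum = momentum

    def update(self, batch_probs):
        batch_prior = batch_probs.mean(axis=0)   # batch-level prior estimate
        self.pi = self.momentum * self.pi + (1 - self.momentum) * batch_prior
        self.pi /= self.pi.sum()                 # keep on the simplex
        return self.pi

# streaming batches whose mean softmax output is [0.9, 0.1]
est = EMAPriorEstimator(n_classes=2)
for _ in range(200):
    est.update(np.tile([0.9, 0.1], (8, 1)))
```

The momentum trades tracking speed against variance: higher values smooth noisy batches but lag behind abrupt prior changes.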
In federated or distributed settings, heterogeneity in class proportions (label shift) across clients severely impairs standard methods such as FedAvg. Methods such as FedPALS correct for target-specific risk by projecting aggregation weights to best match the target label distribution under size and variance constraints, yielding unbiased SGD updates when the target prior lies in the convex hull of client marginals (Zec et al., 2024). In high-dimensional settings, label-shift–robust federated feature screening (LR-FFS) constructs client-invariant screening utilities that avoid dependence on local priors, achieving sure screening and FDR control even under extreme heterogeneity (Qin et al., 31 May 2025).
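The aggregation-matching step can be sketched as a projected-gradient solve over the simplex. This is a simplification of FedPALS-style aggregation that keeps only the prior-matching objective and omits the size/variance regularization; all names are illustrative:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.clip(v - css[rho] / (rho + 1.0), 0.0, None)

def aggregation_weights(client_priors, target_prior, lr=0.1, n_iter=2000):
    """Client aggregation weights whose prior mixture best matches the target.

    Solves min_lambda ||P lambda - p_t||^2 over the simplex by projected
    gradient descent, where column k of P is client k's label marginal.
    """
    P = np.asarray(client_priors, dtype=float).T   # (K classes, M clients)
    lam = np.full(P.shape[1], 1.0 / P.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * P.T @ (P @ lam - target_prior)
        lam = project_simplex(lam - lr * grad)
    return lam

# two maximally skewed clients; target prior sits in their convex hull
lam = aggregation_weights([[1.0, 0.0], [0.0, 1.0]], np.array([0.7, 0.3]))
```

When the target prior lies in the convex hull of the client marginals, as here, the mixture matches exactly and reweighted aggregation is unbiased; outside the hull the solve returns the closest achievable mixture.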
Entropy-regularized predictors (VRLS) further improve calibration for MLE-based ratio estimation and importance-weighted risk minimization, scaling efficiently to multi-node architectures with minimal communication cost (Wu et al., 4 Feb 2025).
5. Theoretical Guarantees and Minimax Optimality
Label shift estimation achieves minimax optimal rates in both supervised and unsupervised settings. In nonparametric classification, the excess risk decomposes into contributions from (i) estimation of class-conditional densities and (ii) estimation of class priors, with the latter dominating when the number of unlabeled target samples is small. Plug-in classifiers based on density estimation and proportion matching (via MLE or distributional M-estimation) attain optimal rates for both semi-supervised and unsupervised transfer settings (Maity et al., 2020).
In robust estimation frameworks, deviation bounds scale as $O(n^{-1/2})$ in the well-specified case and degrade gracefully in the presence of contamination or adversarial outliers. Theoretical analyses have established semiparametric efficiency, information lower bounds, and convergence guarantees for practical algorithms (Lecestre, 5 Feb 2025, Chow, 2022).
Distributionally robust optimization (DRO) formulations train models to minimize worst-case expected target risk over test priors within an $f$-divergence ball around the source prior, providing explicit robustness to unseen and possibly extreme label distributions (Zhang et al., 2020).
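For a KL ball (one member of the $f$-divergence family), the inner worst-case prior has an exponential-tilt form found by a one-dimensional search. A hedged sketch, with illustrative names, of computing that worst-case prior given per-class risks:

```python
import numpy as np

def worst_case_prior(source_prior, class_risks, radius, tol=1e-10):
    """Worst-case label prior in a KL ball around the source prior.

    Solves max_pi <pi, R> s.t. KL(pi || p_s) <= radius; the maximizer is an
    exponential tilt pi ∝ p_s * exp(R / tau), with the temperature tau found
    by geometric bisection so the KL constraint is active.
    """
    p = np.asarray(source_prior, dtype=float)
    R = np.asarray(class_risks, dtype=float)

    def tilt(tau):
        logits = np.log(p) + R / tau
        logits -= logits.max()          # numerical stability
        pi = np.exp(logits)
        return pi / pi.sum()

    def kl(pi):
        q = np.clip(pi, 1e-300, 1.0)    # guard 0 * log 0
        return float(np.sum(q * np.log(q / p)))

    lo, hi = 1e-6, 1e6                  # small tau = extreme tilt, large = pi ≈ p_s
    while hi - lo > tol * hi:
        mid = np.sqrt(lo * hi)
        if kl(tilt(mid)) > radius:
            lo = mid                    # tilt too aggressive; raise temperature
        else:
            hi = mid
    return tilt(hi)

# binary example: class 0 is riskier, so the adversary upweights it
pi_star = worst_case_prior(np.array([0.5, 0.5]),
                           np.array([1.0, 0.0]), radius=0.1)
```

Training against `pi_star`-reweighted risk (alternating the inner tilt with model updates) gives the guarantee for every prior in the ball, at the cost of conservatism when the true shift is mild.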
6. Limitations, Open Challenges, and Future Directions
No single method subsumes all forms of dataset shift. Standard label shift correction fails under conditional shifts, i.e., when the class-conditional distributions $p(x \mid y)$ themselves drift. Extensions such as CPSM, GLS, or SJS provide broader modeling power, but often at the cost of additional assumptions or estimation complexity (Teisseyre et al., 4 Mar 2025, Luo et al., 2022, Chen et al., 2022). Real-world federated systems exhibit both label and covariate shift, necessitating adapters that estimate and correct both forms simultaneously (Park et al., 2023). Further challenges include the scalability of density-ratio estimation to large label spaces, streaming and nonstationary environments, and the efficient integration of shift-adaptive screening and risk estimation into end-to-end deep learning pipelines (Hu et al., 23 Nov 2025, Wu et al., 4 Feb 2025, Qin et al., 31 May 2025).
Theoretical frontiers include refined generalization error bounds accounting for finite-sample estimation and estimation error propagation through moment-matching, bilevel optimization, and importance-weighted empirical risk. Addressing non-linear, structured, and open-world forms of label shift, as well as the interplay of shift adaptation with fairness, privacy, and distributed resource constraints, remains a subject of active research.