
Robustness to Label Noise

Updated 8 February 2026
  • Robustness to Label Noise is the capacity of algorithms to maintain generalization when trained on datasets with misannotated labels.
  • Theoretical foundations include symmetric noise models and noise transition matrices that guide the design of robust classifiers.
  • Practical strategies such as robust loss functions, sample selection, and contrastive learning improve performance in noisy labeling scenarios.

Robustness to label noise refers to the property of a learning algorithm to maintain high generalization performance when trained on datasets where some fraction of the labels are incorrectly annotated. This robustness is essential in real-world scenarios where perfect labeling is often infeasible due to limitations of human annotators, automated data collection, or inherent class ambiguity. The study of robustness encompasses both foundational theoretical analyses—defining thresholds and mechanisms by which learning can resist label noise—and the development of practical algorithms that maintain task performance under various noise models.

1. Formal Definitions and Noise Models

Label noise models are characterized by how they modify the true-label distribution. The standard multiclass symmetric noise model assumes that, for each example $(X, Y)$ sampled i.i.d. from a distribution $F_{X,Y}$, the observed label $Z$ equals the true label $Y$ with probability $1-\alpha$ and is uniformly switched to one of the $K-1$ remaining classes with probability $\alpha/(K-1)$. This gives rise to a noise transition matrix $A$:

$$A_{ij} = \begin{cases} 1-\alpha, & i = j \\ \alpha/(K-1), & i \neq j \end{cases}$$
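As a concrete sketch (NumPy; the function names are illustrative, not from the cited papers), the symmetric transition matrix and a label-corruption step can be written as:

```python
import numpy as np

def symmetric_transition_matrix(K, alpha):
    """A[i, j] = P(observed label = j | true label = i)."""
    A = np.full((K, K), alpha / (K - 1))
    np.fill_diagonal(A, 1 - alpha)
    return A

def corrupt_labels(y, K, alpha, rng):
    """Resample each label from the corresponding row of A."""
    A = symmetric_transition_matrix(K, alpha)
    return np.array([rng.choice(K, p=A[label]) for label in y])

rng = np.random.default_rng(0)
A = symmetric_transition_matrix(K=4, alpha=0.3)
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a distribution
assert np.allclose(np.diag(A), 0.7)      # P(label kept) = 1 - alpha

# empirical flip rate on 10,000 examples should be close to alpha = 0.3
y_noisy = corrupt_labels(np.zeros(10000, dtype=int), K=4, alpha=0.3, rng=rng)
```
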

More generally, noise can be class-dependent (transition probabilities depend only on the class, not on the instance), or fully instance-dependent, with a general transition probability $\eta(x, y \to y')$. Uniform (feature- and class-independent) noise is a special case. The most adverse scenario is feature-dependent noise concentrated near decision boundaries, where even small overall noise rates can have severe impact (Oyen et al., 2022).

For binary and multiclass settings, the critical parameter is the noise rate at which theoretical robustness breaks down (the "tipping point"). For symmetric noise in $K$ classes, the threshold is $\alpha^* = (K-1)/K$. Above this, even the Bayes-optimal classifier cannot maintain its original decisions (Priebe et al., 2022, Oyen et al., 2022).

2. Theoretical Foundations of Robustness

The central theorem for symmetric label noise asserts that if labels are flipped symmetrically with probability $\alpha < (K-1)/K$, then any $L_1$-consistent plug-in classifier trained via empirical noisy posteriors achieves Bayes-optimal error asymptotically, regardless of whether $\alpha$ is known (Priebe et al., 2022). The argument exploits the invertibility of the symmetric noise matrix $A$ and the invariance of argmax under an increasing affine transformation (i.e., $\arg\max_k q_k(x) = \arg\max_k p_k(x)$, where $q$ and $p$ denote the noisy and clean class posteriors).
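A quick numerical illustration of this argmax invariance (a NumPy sketch, not taken from the cited paper): under symmetric noise the noisy posterior is an increasing affine function of the clean one whenever $\alpha < (K-1)/K$, so the predicted class never changes.

```python
import numpy as np

def noisy_posterior(p, alpha):
    """q_k = (1 - alpha) p_k + alpha/(K-1) (1 - p_k), applied row-wise.

    The slope 1 - alpha K/(K-1) is positive iff alpha < (K-1)/K,
    so the map is increasing and affine in each coordinate."""
    K = p.shape[-1]
    return (1 - alpha - alpha / (K - 1)) * p + alpha / (K - 1)

rng = np.random.default_rng(1)
K = 5
p = rng.dirichlet(np.ones(K), size=1000)    # random clean posteriors
for alpha in (0.1, 0.5, 0.79):              # all below (K-1)/K = 0.8
    q = noisy_posterior(p, alpha)
    assert np.allclose(q.sum(axis=1), 1.0)  # still valid posteriors
    assert np.array_equal(q.argmax(axis=1), p.argmax(axis=1))
```

At $\alpha = 0.79$ the slope is only $0.0125$, yet the Bayes decision is still preserved, which is exactly the content of the tipping-point result.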

Notably, this robustness applies both to under-parameterized networks (e.g., where the number of parameters grows sublinearly with sample size and classical universal consistency holds) and to over-parameterized ones (e.g., ReLU nets where the number of parameters exceeds $n$ but, with correct optimization, universal consistency is established). Thus, the property is independent of the architectural regime, provided the posterior estimation is $L_1$-consistent.

For instance-dependent noise, or noise "targeted" near the decision boundary in feature space, robustness can degrade rapidly even at low noise rates. The shape of $\eta(x, y \to y')$, not just its average rate, controls the margin decay and the failure of decision-boundary preservation. This underlines the limitation of classical robustness theory when the noise model is feature-dependent rather than uniform (Oyen et al., 2022).

3. Loss Functions and Algorithmic Strategies

A loss function’s robustness to label noise can be established by risk minimization properties. Strictly proper losses (e.g., cross-entropy) are robust in the sense that, under symmetric noise, the induced decision boundary is unchanged, although the model is calibrated to the induced noisy distribution rather than the true posterior (Olmin et al., 2021). Symmetric (or “noise-insensitive”) losses, for which the sum over all classes is constant for any prediction, are also robust in accuracy, but never strictly proper, and thus cannot ensure calibrated uncertainties under noise.
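For instance, mean absolute error is a standard example of a symmetric loss: summed over all $K$ possible labels it equals the constant $2(K-1)$ for any prediction, whereas the cross-entropy sum depends on the prediction. A minimal NumPy check:

```python
import numpy as np

def mae_loss(p, k):
    """L1 distance between predicted distribution p and one-hot label k."""
    onehot = np.zeros_like(p)
    onehot[k] = 1.0
    return np.abs(p - onehot).sum()

rng = np.random.default_rng(2)
K = 6
p = rng.dirichlet(np.ones(K))                    # a random prediction

mae_total = sum(mae_loss(p, k) for k in range(K))
assert np.isclose(mae_total, 2 * (K - 1))        # constant: symmetric loss

ce_total = sum(-np.log(p[k]) for k in range(K))  # varies with p: not symmetric
```
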

A recent unifying framework is $f$-divergence-based Posterior Maximization Learning (f-PML). These objectives are provably robust to symmetric label noise for any choice of $f$, including the choice recovering classic cross-entropy, without the need for loss correction or noise estimation (Novello et al., 9 Apr 2025). For more complex noise, explicit correction—either at the training stage or at the posterior-estimation stage—can fully restore the clean-data optimum if the noise model is known or reliably estimated.

Practical loss constructions include automatically learned robust objectives (e.g., Taylor-polynomial parameterizations (Gao et al., 2021)), which meta-learn loss function coefficients to flatten gradients on likely mislabeled examples and induce softer minima. Other strategies penalize the network Jacobian norm to suppress memorization of noise, thereby encouraging smoothness in the learned function (Luo et al., 2019). Consistency regularization, by enforcing agreement under augmentation, acts as a local smoothing prior, discouraging fits to noisy labels (Englesson et al., 2021).

4. Sample Selection, Pseudo-Labeling, and Contrastive Methods

Real-world label noise is rarely perfectly symmetric; empirical systems therefore combine robust loss design with explicit sample selection and pseudo-labeling. Methods employ loss-based mixture modeling to identify low-loss (likely clean) samples, assign them higher weight, and treat likely noisy samples as unlabeled in a semi-supervised learning setup (e.g., DRPL (Ortego et al., 2019), MOIT (Ortego et al., 2020), PLS (Albert et al., 2022)).

These approaches iteratively refine clean set estimates, often through multiple rounds of relabeling and semi-supervised learning. Robustness is further enhanced using interpolated or contrastive objectives, which leverage robust feature representations to accurately detect and subsequently downweight or ignore mislabelled data points, and employ mixup strategies to prevent confirmation bias during pseudo-labeling (Ortego et al., 2020, Albert et al., 2022).
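The loss-based mixture-modeling step can be sketched as a two-component one-dimensional Gaussian mixture fit by EM over per-sample losses (a hand-rolled illustration with assumed function names; real systems typically use a library GMM implementation):

```python
import numpy as np

def small_loss_selection(losses, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture to per-sample losses via EM
    and return P(clean): the responsibility of the lower-mean component."""
    mu = np.percentile(losses, [10, 90]).astype(float)  # well-separated init
    var = np.full(2, losses.var() + 1e-8)
    pi = np.array([0.5, 0.5])                           # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities under each Gaussian (log-space for safety)
        d = (losses[:, None] - mu) ** 2
        log_p = -0.5 * (np.log(2 * np.pi * var) + d / var) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step (variance update reuses d from the E-step, a common shortcut)
        nk = r.sum(axis=0) + 1e-8
        mu = (r * losses[:, None]).sum(axis=0) / nk
        var = (r * d).sum(axis=0) / nk + 1e-8
        pi = nk / len(losses)
    clean = int(np.argmin(mu))          # low-loss component = likely clean
    return r[:, clean]

rng = np.random.default_rng(3)
losses = np.concatenate([rng.normal(0.2, 0.05, 800),   # clean: low loss
                         rng.normal(1.5, 0.30, 200)])  # mislabeled: high loss
p_clean = small_loss_selection(losses)
assert p_clean[:800].mean() > 0.8 and p_clean[800:].mean() < 0.2
```

Samples with high $P(\text{clean})$ keep their labels; the rest are treated as unlabeled in the subsequent semi-supervised phase.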

Self-supervised and semi-supervised pretraining—particularly via contrastive learning—provides robust initial representations that delay or prevent the memorization of noisy labels by the classifier, allowing loss-corrected or sample-weighted objectives to succeed even at extreme noise rates (Ghosh et al., 2021).

5. Loss Correction via Noise Transition Modeling

Explicitly modeling the noise transition matrix $T$ (where $T_{ij}$ gives the probability that true label $i$ is observed as $j$) enables unbiased loss correction via “backward correction” ($T^{-1}$ acting on the vector of per-class losses) (Yang et al., 2022). This extends to both symmetric and complex class-dependent noise regimes, provided $T$ is nonsingular and can be estimated (anchor-point estimation is commonly used). Robust accuracy improves most substantially when the noise is severe and $T$ is well-estimated. Forward correction methods, which correct the prediction probabilities before loss computation, are also widely used.
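Both corrections can be sketched in a few lines (NumPy; illustrative only, assuming $T$ is known). With the identity matrix in place of $T$, both reduce to plain cross-entropy, which is a useful sanity check:

```python
import numpy as np

def backward_corrected_loss(p, z, T):
    """Backward correction: apply T^{-1} to the vector of per-class
    cross-entropy losses, then select the entry for the observed label z."""
    per_class = -np.log(p)              # loss as if each class were the label
    return np.linalg.solve(T, per_class)[z]

def forward_corrected_loss(p, z, T):
    """Forward correction: push clean predictions through T, then score
    against the noisy label: -log (T^T p)_z."""
    return -np.log(T.T @ p)[z]

K, alpha = 3, 0.2
T = np.full((K, K), alpha / (K - 1))    # symmetric-noise transition matrix
np.fill_diagonal(T, 1 - alpha)
p = np.array([0.7, 0.2, 0.1])

# sanity check: with no noise (T = I), both equal standard cross-entropy
I = np.eye(K)
assert np.isclose(backward_corrected_loss(p, 0, I), -np.log(0.7))
assert np.isclose(forward_corrected_loss(p, 0, I), -np.log(0.7))
```
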

Small “trusted” clean sets provide data-efficient means to estimate $T$ or to adversarially infer a label-correcting transformation (e.g., TrustNet, which uses an adversarial learning scheme and entropy-based sample weights to interpolate between trusting observed and inferred labels) (Ghiassi et al., 2020).

6. Robustness in Decision Trees, Boosting, and Conformal Prediction

Theoretical and empirical results demonstrate that top-down binary decision-tree learning—using Gini impurity, misclassification impurity, or twoing-rule splitting—remains robust to symmetric label noise, provided sufficient samples at each node (Ghosh et al., 2016). The same holds for majority-vote prediction at leaves. Sample complexity increases as the noise rate approaches 0.5, but the structure of the tree and the prediction rules remain unchanged in the large-sample limit.
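The binary case makes the mechanism concrete: under symmetric flips with rate $\alpha$, the Gini impurity of the noisy node distribution is an increasing affine function of the clean impurity, so impurity-gain rankings of candidate splits are preserved (the additive constant cancels in the weighted gain). The identity below is elementary algebra, shown here as a NumPy check rather than a result quoted verbatim from the cited paper:

```python
import numpy as np

def gini(p):
    """Binary Gini impurity for class-1 proportion p."""
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def noisy(p, alpha):
    """Class-1 proportion after symmetric flips with rate alpha (K = 2)."""
    return (1 - 2 * alpha) * p + alpha

alpha = 0.3
p = np.linspace(0.0, 1.0, 101)
# gini(noisy(p)) = (1 - 2 alpha)^2 * gini(p) + 2 alpha (1 - alpha)
lhs = gini(noisy(p, alpha))
rhs = (1 - 2 * alpha) ** 2 * gini(p) + 2 * alpha * (1 - alpha)
assert np.allclose(lhs, rhs)
```

Because the noisy impurity is $s^2$ times the clean impurity plus a constant (with $s = 1 - 2\alpha$), the split maximizing impurity gain on noisy data is, in the large-sample limit, the same split chosen on clean data.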

Ensemble methods such as AdaBoost are highly sensitive to label noise; robustness requires modifying the loss from a convex to a non-convex function of the margin (as in rBoost), and/or employing base learners that explicitly model label flipping (Bootkrajang et al., 2013).

Conformal prediction methods remain robust under label noise if the noise is dispersive, i.e., spreads out the non-conformity scores and does not concentrate risk. In adversarial or bounded noise settings, quantile adjustments guarantee valid risk control (Einbinder et al., 2022).

7. Empirical Regimes and Practical Recommendations

Empirical findings consistently indicate that high-capacity classifiers—particularly DNNs—can memorize random labels, leading to sharp test-accuracy drops when learning is not properly regularized. However, for symmetric noise below the theoretical threshold $(K-1)/K$, DNNs trained with standard objectives and no explicit mitigation can remain Bayes-optimal if the estimator is $L_1$-consistent (Priebe et al., 2022). Regularization methods such as early stopping, weight decay, and consistency regularization help delay overfitting, and the effect is intensified when combined with robust losses, strong augmentations, or semi-supervised relabeling.

Feature-dependent noise sharply deteriorates robustness. Under such regimes, all evaluated robust-loss and sample-selection techniques fail unless directly supplied with clean labels or explicit noise models (Oyen et al., 2022). For tasks where reliable uncertainty quantification is crucial, neither strictly proper nor robust losses guarantee calibrated probabilistic outputs under noisy supervision; post-hoc calibration or clean validation sets are necessary (Olmin et al., 2021).

Practical guidelines include:

  • For symmetric noise below the threshold $(K-1)/K$, rely on $L_1$-consistent posterior estimation combined with regularization (early stopping, weight decay, consistency regularization).
  • For class-dependent noise, estimate the transition matrix $T$ (e.g., via anchor points or a small trusted set) and apply forward or backward loss correction.
  • For real-world, mixed noise, combine robust losses with sample selection, mixup, and semi-supervised relabeling; contrastive pretraining further delays memorization.
  • When calibrated uncertainties are required, or when noise is feature-dependent, reserve clean labels for post-hoc calibration and validation.

References

All referenced results, definitions, and empirical findings are derived from the works cited inline above.
