
Robustness to Label Noise

Updated 8 February 2026
  • Robustness to Label Noise is the capacity of algorithms to maintain generalization when trained on datasets with misannotated labels.
  • Theoretical foundations include symmetric noise models and noise transition matrices that guide the design of robust classifiers.
  • Practical strategies such as robust loss functions, sample selection, and contrastive learning improve performance in noisy labeling scenarios.

Robustness to label noise refers to the property of a learning algorithm to maintain high generalization performance when trained on datasets where some fraction of the labels are incorrectly annotated. This robustness is essential in real-world scenarios where perfect labeling is often infeasible due to limitations of human annotators, automated data collection, or inherent class ambiguity. The study of robustness encompasses both foundational theoretical analyses—defining thresholds and mechanisms by which learning can resist label noise—and the development of practical algorithms that maintain task performance under various noise models.

1. Formal Definitions and Noise Models

Label noise models are characterized by how they modify the true-label distribution. The standard multiclass symmetric noise model assumes that, for each example $(X, Y)$ sampled i.i.d. from a distribution $F_{X,Y}$, the observed label $Z$ equals the true label $Y$ with probability $1-\alpha$ and is uniformly switched to one of the $K-1$ remaining classes with probability $\alpha/(K-1)$. This gives rise to a noise transition matrix $A$:

$$A_{ij} = \begin{cases} 1-\alpha, & i = j \\ \alpha/(K-1), & i \neq j \end{cases}$$
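As a concrete sketch (NumPy; the function names are illustrative, not from the cited papers), the symmetric transition matrix and a label-corruption step can be written as:

```python
import numpy as np

def symmetric_transition_matrix(K, alpha):
    """A[i, j] = P(observed label = j | true label = i)."""
    A = np.full((K, K), alpha / (K - 1))
    np.fill_diagonal(A, 1 - alpha)
    return A

def corrupt_labels(y, K, alpha, rng):
    """Resample each label from the corresponding row of A."""
    A = symmetric_transition_matrix(K, alpha)
    return np.array([rng.choice(K, p=A[label]) for label in y])

rng = np.random.default_rng(0)
A = symmetric_transition_matrix(K=4, alpha=0.3)
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a distribution
assert np.allclose(np.diag(A), 0.7)      # P(label kept) = 1 - alpha

# empirical flip rate on 10,000 examples should be close to alpha = 0.3
y_noisy = corrupt_labels(np.zeros(10000, dtype=int), K=4, alpha=0.3, rng=rng)
```
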

More generally, noise can be class-dependent (transition probabilities depend only on the class, not on the instance), or fully instance-dependent, with a general transition probability $\eta(x, y \to y')$. Uniform (feature- and class-independent) noise is a special case. The most adverse scenario is feature-dependent noise concentrated near decision boundaries, where even small overall noise rates can have severe impact (Oyen et al., 2022).

For binary and multiclass settings, the critical parameter is the noise rate at which theoretical robustness breaks down (the "tipping point"). For symmetric noise in $K$ classes, the threshold is $\alpha^* = (K-1)/K$. Above this, even the Bayes-optimal classifier cannot maintain its original decisions (Priebe et al., 2022, Oyen et al., 2022).

2. Theoretical Foundations of Robustness

The central theorem for symmetric label noise asserts that if labels are flipped symmetrically with probability $\alpha < (K-1)/K$, then any $L_1$-consistent plug-in classifier trained via empirical noisy posteriors achieves Bayes-optimal error asymptotically, regardless of whether $\alpha$ is known (Priebe et al., 2022). The argument exploits the invertibility of the symmetric noise matrix $A$ and the invariance of argmax under an increasing affine transformation (i.e., $\arg\max_k q_k(x) = \arg\max_k p_k(x)$, where $q$ and $p$ denote the noisy and clean class posteriors).
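A quick numerical illustration of this argmax invariance (a NumPy sketch, not taken from the cited paper): under symmetric noise the noisy posterior is an increasing affine function of the clean one whenever $\alpha < (K-1)/K$, so the predicted class never changes.

```python
import numpy as np

def noisy_posterior(p, alpha):
    """q_k = (1 - alpha) p_k + alpha/(K-1) (1 - p_k), applied row-wise.

    The slope 1 - alpha K/(K-1) is positive iff alpha < (K-1)/K,
    so the map is increasing and affine in each coordinate."""
    K = p.shape[-1]
    return (1 - alpha - alpha / (K - 1)) * p + alpha / (K - 1)

rng = np.random.default_rng(1)
K = 5
p = rng.dirichlet(np.ones(K), size=1000)    # random clean posteriors
for alpha in (0.1, 0.5, 0.79):              # all below (K-1)/K = 0.8
    q = noisy_posterior(p, alpha)
    assert np.allclose(q.sum(axis=1), 1.0)  # still valid posteriors
    assert np.array_equal(q.argmax(axis=1), p.argmax(axis=1))
```

At $\alpha = 0.79$ the slope is only $0.0125$, yet the Bayes decision is still preserved, which is exactly the content of the tipping-point result.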

Notably, this robustness applies both to under-parameterized networks (e.g., where the number of parameters grows sublinearly with sample size and classical universal consistency holds) and to over-parameterized ones (e.g., ReLU nets where the number of parameters exceeds $n$ but, with correct optimization, universal consistency is established). Thus, the property is independent of the architectural regime, provided the posterior estimation is $L_1$-consistent.

For instance-dependent noise, or noise "targeted" near the decision boundary in feature space, robustness can degrade rapidly even at low noise rates. The shape of $\eta(x, y \to y')$, not just its average rate, controls the margin decay and the failure of decision-boundary preservation. This underlines the limitation of classical robustness theory when the noise model is feature-dependent rather than uniform (Oyen et al., 2022).

3. Loss Functions and Algorithmic Strategies

A loss function’s robustness to label noise can be established by risk minimization properties. Strictly proper losses (e.g., cross-entropy) are robust in the sense that, under symmetric noise, the induced decision boundary is unchanged, although the model is calibrated to the induced noisy distribution rather than the true posterior (Olmin et al., 2021). Symmetric (or “noise-insensitive”) losses, for which the sum over all classes is constant for any prediction, are also robust in accuracy, but never strictly proper, and thus cannot ensure calibrated uncertainties under noise.
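For instance, mean absolute error is a standard example of a symmetric loss: summed over all $K$ possible labels it equals the constant $2(K-1)$ for any prediction, whereas the cross-entropy sum depends on the prediction. A minimal NumPy check:

```python
import numpy as np

def mae_loss(p, k):
    """L1 distance between predicted distribution p and one-hot label k."""
    onehot = np.zeros_like(p)
    onehot[k] = 1.0
    return np.abs(p - onehot).sum()

rng = np.random.default_rng(2)
K = 6
p = rng.dirichlet(np.ones(K))                    # a random prediction

mae_total = sum(mae_loss(p, k) for k in range(K))
assert np.isclose(mae_total, 2 * (K - 1))        # constant: symmetric loss

ce_total = sum(-np.log(p[k]) for k in range(K))  # varies with p: not symmetric
```
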

A recent unifying framework is $f$-divergence-based Posterior Maximization Learning (f-PML). These objectives are provably robust to symmetric label noise for any choice of $f$, including the choice recovering classic cross-entropy, without the need for loss correction or noise estimation (Novello et al., 9 Apr 2025). For more complex noise, explicit correction—either at the training stage or at the posterior-estimation stage—can fully restore the clean-data optimum if the noise model is known or reliably estimated.

Practical loss constructions include automatically learned robust objectives (e.g., Taylor-polynomial parameterizations (Gao et al., 2021)), which meta-learn loss function coefficients to flatten gradients on likely mislabeled examples and induce softer minima. Other strategies penalize the network Jacobian norm to suppress memorization of noise, thereby encouraging smoothness in the learned function (Luo et al., 2019). Consistency regularization, by enforcing agreement under augmentation, acts as a local smoothing prior, discouraging fits to noisy labels (Englesson et al., 2021).

4. Sample Selection, Pseudo-Labeling, and Contrastive Methods

Real-world label noise is rarely perfectly symmetric; empirical systems therefore combine robust loss design with explicit sample selection and pseudo-labeling. Methods employ loss-based mixture modeling to identify low-loss (likely clean) samples, assign them higher weight, and treat likely noisy samples as unlabeled in a semi-supervised learning setup (e.g., DRPL (Ortego et al., 2019), MOIT (Ortego et al., 2020), PLS (Albert et al., 2022)).

These approaches iteratively refine clean set estimates, often through multiple rounds of relabeling and semi-supervised learning. Robustness is further enhanced using interpolated or contrastive objectives, which leverage robust feature representations to accurately detect and subsequently downweight or ignore mislabelled data points, and employ mixup strategies to prevent confirmation bias during pseudo-labeling (Ortego et al., 2020, Albert et al., 2022).
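The loss-based mixture-modeling step can be sketched as a two-component one-dimensional Gaussian mixture fit by EM over per-sample losses (a hand-rolled illustration with assumed function names; real systems typically use a library GMM implementation):

```python
import numpy as np

def small_loss_selection(losses, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture to per-sample losses via EM
    and return P(clean): the responsibility of the lower-mean component."""
    mu = np.percentile(losses, [10, 90]).astype(float)  # well-separated init
    var = np.full(2, losses.var() + 1e-8)
    pi = np.array([0.5, 0.5])                           # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities under each Gaussian (log-space for safety)
        d = (losses[:, None] - mu) ** 2
        log_p = -0.5 * (np.log(2 * np.pi * var) + d / var) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step (variance update reuses d from the E-step, a common shortcut)
        nk = r.sum(axis=0) + 1e-8
        mu = (r * losses[:, None]).sum(axis=0) / nk
        var = (r * d).sum(axis=0) / nk + 1e-8
        pi = nk / len(losses)
    clean = int(np.argmin(mu))          # low-loss component = likely clean
    return r[:, clean]

rng = np.random.default_rng(3)
losses = np.concatenate([rng.normal(0.2, 0.05, 800),   # clean: low loss
                         rng.normal(1.5, 0.30, 200)])  # mislabeled: high loss
p_clean = small_loss_selection(losses)
assert p_clean[:800].mean() > 0.8 and p_clean[800:].mean() < 0.2
```

Samples with high $P(\text{clean})$ keep their labels; the rest are treated as unlabeled in the subsequent semi-supervised phase.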

Self-supervised and semi-supervised pretraining—particularly via contrastive learning—provides robust initial representations that delay or prevent the memorization of noisy labels by the classifier, allowing loss-corrected or sample-weighted objectives to succeed even at extreme noise rates (Ghosh et al., 2021).

5. Loss Correction via Noise Transition Modeling

Explicitly modeling the noise transition matrix $T$ (where $T_{ij}$ gives the probability that true label $i$ is observed as $j$) enables unbiased loss correction via “backward correction” ($T^{-1}$ acting on the vector of per-class losses) (Yang et al., 2022). This extends to both symmetric and complex class-dependent noise regimes, provided $T$ is nonsingular and can be estimated (anchor-point estimation is commonly used). Robust accuracy improves most substantially when the noise is severe and $T$ is well-estimated. Forward correction methods, which correct the prediction probabilities before loss computation, are also widely used.
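Both corrections can be sketched in a few lines (NumPy; illustrative only, assuming $T$ is known). With the identity matrix in place of $T$, both reduce to plain cross-entropy, which is a useful sanity check:

```python
import numpy as np

def backward_corrected_loss(p, z, T):
    """Backward correction: apply T^{-1} to the vector of per-class
    cross-entropy losses, then select the entry for the observed label z."""
    per_class = -np.log(p)              # loss as if each class were the label
    return np.linalg.solve(T, per_class)[z]

def forward_corrected_loss(p, z, T):
    """Forward correction: push clean predictions through T, then score
    against the noisy label: -log (T^T p)_z."""
    return -np.log(T.T @ p)[z]

K, alpha = 3, 0.2
T = np.full((K, K), alpha / (K - 1))    # symmetric-noise transition matrix
np.fill_diagonal(T, 1 - alpha)
p = np.array([0.7, 0.2, 0.1])

# sanity check: with no noise (T = I), both equal standard cross-entropy
I = np.eye(K)
assert np.isclose(backward_corrected_loss(p, 0, I), -np.log(0.7))
assert np.isclose(forward_corrected_loss(p, 0, I), -np.log(0.7))
```
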

Small “trusted” clean sets provide data-efficient means to estimate $T$ or to adversarially infer a label-correcting transformation (e.g., TrustNet, which uses an adversarial learning scheme and entropy-based sample weights to interpolate between trusting observed and inferred labels) (Ghiassi et al., 2020).

6. Robustness in Decision Trees, Boosting, and Conformal Prediction

Theoretical and empirical results demonstrate that top-down binary decision-tree learning—using Gini impurity, misclassification impurity, or twoing-rule splitting—remains robust to symmetric label noise, provided sufficient samples at each node (Ghosh et al., 2016). The same holds for majority-vote prediction at leaves. Sample complexity increases as the noise rate approaches 0.5, but the structure of the tree and the prediction rules remain unchanged in the large-sample limit.
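The binary case makes the mechanism concrete: under symmetric flips with rate $\alpha$, the Gini impurity of the noisy node distribution is an increasing affine function of the clean impurity, so impurity-gain rankings of candidate splits are preserved (the additive constant cancels in the weighted gain). The identity below is elementary algebra, shown here as a NumPy check rather than a result quoted verbatim from the cited paper:

```python
import numpy as np

def gini(p):
    """Binary Gini impurity for class-1 proportion p."""
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def noisy(p, alpha):
    """Class-1 proportion after symmetric flips with rate alpha (K = 2)."""
    return (1 - 2 * alpha) * p + alpha

alpha = 0.3
p = np.linspace(0.0, 1.0, 101)
# gini(noisy(p)) = (1 - 2 alpha)^2 * gini(p) + 2 alpha (1 - alpha)
lhs = gini(noisy(p, alpha))
rhs = (1 - 2 * alpha) ** 2 * gini(p) + 2 * alpha * (1 - alpha)
assert np.allclose(lhs, rhs)
```

Because the noisy impurity is $s^2$ times the clean impurity plus a constant (with $s = 1 - 2\alpha$), the split maximizing impurity gain on noisy data is, in the large-sample limit, the same split chosen on clean data.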

Ensemble methods such as AdaBoost are highly sensitive to label noise; robustness requires modifying the loss from a convex to a non-convex function of the margin (as in rBoost), and/or employing base learners that explicitly model label flipping (Bootkrajang et al., 2013).

Conformal prediction methods remain robust under label noise if the noise is dispersive, i.e., spreads out the non-conformity scores and does not concentrate risk. In adversarial or bounded noise settings, quantile adjustments guarantee valid risk control (Einbinder et al., 2022).

7. Empirical Regimes and Practical Recommendations

Empirical findings consistently indicate that high-capacity classifiers—particularly DNNs—can memorize random labels, leading to sharp test-accuracy drops when learning is not properly regularized. However, for symmetric noise below the theoretical threshold $(K-1)/K$, DNNs trained with standard objectives and no explicit mitigation can remain Bayes-optimal if the estimator is $L_1$-consistent (Priebe et al., 2022). Regularization methods such as early stopping, weight decay, and consistency regularization help delay overfitting, and the effect is intensified when combined with robust losses, strong augmentations, or semi-supervised relabeling.

Feature-dependent noise sharply deteriorates robustness. Under such regimes, all evaluated robust-loss and sample-selection techniques fail unless directly supplied with clean labels or explicit noise models (Oyen et al., 2022). For tasks where reliable uncertainty quantification is crucial, neither strictly proper nor robust losses guarantee calibrated probabilistic outputs under noisy supervision; post-hoc calibration or clean validation sets are necessary (Olmin et al., 2021).

Practical guidelines include:

  • For symmetric noise below the threshold $(K-1)/K$, rely on $L_1$-consistent posterior estimation combined with regularization (early stopping, weight decay, consistency regularization).
  • For class-dependent noise, estimate the transition matrix $T$ (e.g., via anchor points or a small trusted set) and apply forward or backward loss correction.
  • For real-world, mixed noise, combine robust losses with sample selection, mixup, and semi-supervised relabeling; contrastive pretraining further delays memorization.
  • When calibrated uncertainties are required, or when noise is feature-dependent, reserve clean labels for post-hoc calibration and validation.

References

All referenced results, definitions, and empirical findings are derived from the works cited inline above.
