- The paper derives sufficient conditions for noise-tolerant loss functions in multiclass classification, extending theory from binary settings.
- The methodology combines theoretical proofs for symmetric, non-uniform, and class-conditional noise with empirical validation on datasets like MNIST, CIFAR-10, and RCV1.
- Results demonstrate that networks trained with MAE maintain high accuracy under severe noise, remaining resilient at label-noise rates as high as 80%.
Robust Loss Functions under Label Noise for Deep Neural Networks
This paper addresses the challenge of training deep neural networks on datasets with label noise, a common problem in large-scale classification where labeling involves human error or unreliable sources. It investigates the robustness of different loss functions within the risk-minimization framework and derives sufficient conditions under which those loss functions remain robust in the presence of label noise.
Summary of Key Contributions
- Generalization to Multiclass Classification: The primary contribution is the extension of existing theoretical results for noise-tolerant loss functions from binary to multiclass classification problems. The authors derive sufficient conditions for a loss function to be inherently noise-tolerant under varying types of label noise (symmetric, simple non-uniform, and class-conditional).
- Robustness Conditions: The paper provides formal proofs that establish sufficient conditions for a loss function to be noise-tolerant:
- Symmetric Noise: A loss function is tolerant to symmetric label noise if it is symmetric, i.e., the loss summed over all k classes is constant: ∑_{j=1}^{k} L(f(x), j) = C for every input x and every classifier f.
- Simple Non-uniform Noise: Under simple non-uniform noise, tolerance additionally requires that the minimum of the noise-free risk be zero, i.e., some classifier achieves zero risk on the clean distribution.
- Class-conditional Noise: Under class-conditional noise, robustness is guaranteed if the loss function is symmetric, the noise-free risk attains zero, and the per-class loss values satisfy an additional boundedness condition (together with mild constraints on the noise rates).
- Empirical Validation: Experimental results demonstrate that the Mean Absolute Error (MAE) loss function is inherently robust to label noise. The authors compare MAE with other common loss functions such as Mean Squared Error (MSE) and Categorical Cross Entropy (CCE) on multiple datasets (e.g., MNIST, CIFAR-10, RCV1) under varying noise conditions.
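The symmetry condition above is easy to check numerically. The sketch below (an illustrative NumPy check of my own, not code from the paper) computes ∑_j L(p, j) for MAE and CCE on random softmax outputs: for MAE the sum is always the constant 2(k − 1), while for CCE it varies with p.

```python
import numpy as np

def mae(p, j):
    """MAE between prediction p and the one-hot vector for class j."""
    e = np.zeros_like(p)
    e[j] = 1.0
    return np.abs(p - e).sum()

def cce(p, j):
    """Categorical cross entropy for class j."""
    return -np.log(p[j])

rng = np.random.default_rng(0)
k = 5
for _ in range(3):
    p = np.exp(rng.normal(size=k))
    p /= p.sum()  # a random softmax-style probability vector
    mae_sum = sum(mae(p, j) for j in range(k))  # always 2*(k - 1)
    cce_sum = sum(cce(p, j) for j in range(k))  # depends on p
    print(f"sum_j MAE = {mae_sum:.4f}, sum_j CCE = {cce_sum:.4f}")
```

For any probability vector p, MAE against the one-hot target for class j equals 2(1 − p_j), so summing over all j gives 2(k − 1) regardless of p; MAE therefore satisfies the symmetry condition, while CCE does not.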
Strong Numerical Results
Extensive empirical tests validate the theoretical insights:
- MAE Performance: The results show that networks trained with MAE maintain high accuracy even under severe label noise (up to 80%), outperforming CCE and MSE significantly.
- Consistency and Convergence: The theorems establish that risk minimization with a noise-tolerant loss is consistent: a minimizer of the risk under the noisy label distribution is also a minimizer of the risk under the clean distribution, so training on noisy data still targets the true classifier.
Implications and Future Work
Practical Implications:
- Choice of Loss Functions: For practitioners, the findings suggest using MAE or other symmetric loss functions when training deep neural networks with noisy labels to achieve better robustness without altering the standard backpropagation framework.
- Optimizing MAE: While MAE is robust, training with it can be slow because its gradients are small on examples the network currently misclassifies (gradient saturation). These findings motivate optimized algorithms or alternative implementations that mitigate this practical drawback.
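The saturation effect can be seen directly in the gradient with respect to the true-class logit of a softmax network (closed forms from differentiating the softmax; this is my own illustrative derivation, not an excerpt from the paper): for CCE the gradient is p_y − 1, while for MAE, which equals 2(1 − p_y) on softmax outputs, it is −2·p_y·(1 − p_y) and vanishes as p_y → 0, i.e., precisely on hard or mislabeled examples.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_true_logit(p_y, loss):
    """d(loss)/d(true-class logit) for a softmax output assigning
    probability p_y to the labeled class.
    CCE: p_y - 1;  MAE (= 2*(1 - p_y) here): -2*p_y*(1 - p_y)."""
    if loss == "cce":
        return p_y - 1.0
    return -2.0 * p_y * (1.0 - p_y)

for p_y in (0.9, 0.5, 0.1, 0.01):
    print(f"p_y={p_y}: |CCE grad|={abs(grad_true_logit(p_y, 'cce')):.3f}, "
          f"|MAE grad|={abs(grad_true_logit(p_y, 'mae')):.3f}")
```

At p_y = 0.01 the CCE gradient magnitude is 0.99 while the MAE gradient is only 0.0198, consistent with the slower MAE training the paper discusses.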
Theoretical Insights:
- Noise Robustness in Multiclass Settings: The paper fills a critical gap in understanding loss function behavior in multiclass scenarios, providing a robust foundational theory for future work in this domain.
Future Developments:
- Algorithmic Improvements: Developing specialized optimization techniques tailored for symmetric loss functions like MAE to enhance training efficiency.
- Wider Applications: Exploring the application of these robust loss functions in other machine learning models and tasks, beyond neural networks, to generalize the findings further.
In conclusion, this paper offers significant contributions to both the theoretical and practical aspects of machine learning under label noise. By establishing sufficient conditions for noise-tolerant loss functions and empirically validating their robustness, it provides a concrete foundation for further research and application in real-world scenarios plagued by noisy labels.