
Regroup Median Loss for Combating Label Noise (2312.06273v1)

Published 11 Dec 2023 in cs.LG and cs.CV

Abstract: The deep model training procedure requires large-scale datasets of annotated data. Due to the difficulty of annotating a large number of samples, label noise caused by incorrect annotations is inevitable, resulting in low model performance and poor model generalization. To combat label noise, current methods usually select clean samples based on the small-loss criterion and use these samples for training. Due to some noisy samples similar to clean ones, these small-loss criterion-based methods are still affected by label noise. To address this issue, in this work, we propose Regroup Median Loss (RML) to reduce the probability of selecting noisy samples and correct losses of noisy samples. RML randomly selects samples with the same label as the training samples based on a new loss processing method. Then, we combine the stable mean loss and the robust median loss through a proposed regrouping strategy to obtain robust loss estimation for noisy samples. To further improve the model performance against label noise, we propose a new sample selection strategy and build a semi-supervised method based on RML. Compared to state-of-the-art methods, for both the traditionally trained and semi-supervised models, RML achieves a significant improvement on synthetic and complex real-world datasets. The source code of the paper has been released.


Summary

  • The paper introduces Regroup Median Loss (RML), which regroups mean and median loss values to obtain robust loss estimates and improve clean sample selection.
  • It combines a unique sample selection strategy with robust loss estimation to effectively diminish the influence of noisy labels.
  • Empirical results show the method outperforms existing techniques on both synthetic and real-world datasets, enhancing model accuracy.

Regroup Median Loss: A Novel Approach for Enhancing Robustness to Label Noise

Introduction

Label noise in large-scale datasets poses a significant challenge to training accurate deep learning models. The traditional small-loss assumption, which posits that clean samples yield smaller loss values than noisy ones, has been a cornerstone of methods designed to mitigate the effects of label noise. However, the overlap in loss values between clean and noisy samples limits the effectiveness of these approaches. To address this shortcoming, the paper introduces the Regroup Median Loss (RML) method. RML enhances the selection of clean samples and offers a robust loss estimation strategy that diminishes the influence of noisy labels on model training.

RML Methodology

RML consists of two main components: a novel sample selection strategy and a robust loss estimation method.

Sample Selection Strategy

The sample selection strategy rests on a new loss processing method that lowers the probability of picking noisy samples. For each training sample, other samples sharing its observed label are drawn, with selection probabilities determined by their processed losses so that likely-noisy samples are rarely chosen. This ensures that the selected samples are predominantly clean, providing a purer basis for the subsequent loss estimation; a rough sketch of such a selector is given below.
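As an illustration (not the paper's exact formula), the following sketch draws same-label samples with probabilities that decay as a sample's current loss grows; the softmax weighting and the temperature parameter are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def select_same_label_samples(losses, labels, target_label, k, temperature=1.0):
    """Sketch of a loss-aware selector: draw k samples that share the target's
    observed label, with probabilities that shrink as a sample's loss grows
    (high-loss samples are more likely to be mislabeled).

    The softmax weighting and `temperature` are illustrative assumptions,
    not the paper's exact loss-processing formula.
    """
    # Indices of candidates carrying the same observed label.
    candidate_idx = (labels == target_label).nonzero(as_tuple=True)[0]
    candidate_losses = losses[candidate_idx]

    # Convert losses into selection probabilities: lower loss -> higher weight.
    weights = F.softmax(-candidate_losses / temperature, dim=0)

    # Sample k candidates (with replacement for simplicity).
    chosen = torch.multinomial(weights, num_samples=k, replacement=True)
    return candidate_idx[chosen]
```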

Robust Loss Estimation

The second part of RML involves combining the mean and median loss through a strategic regrouping approach to achieve a robust loss estimation for noisy samples. By dividing selected samples into groups, calculating the mean loss for each group, and obtaining the median of these mean losses, RML ensures a stable and more robust loss estimation. This strategy corrects distorted losses effectively, mitigating the impact of label noise on model training.
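This regrouping step is essentially a median-of-means estimator over the losses of the selected same-label samples. The minimal sketch below assumes equal-size random groups and a simple average when blending the estimate with the sample's own loss; the paper's exact combination rule may differ.

```python
import torch

def regroup_median_loss(selected_losses, sample_loss, num_groups=5):
    """Sketch of the regrouping idea: split the losses of the selected
    same-label samples into random groups, average within each group, and
    take the median of the group means as a robust loss estimate.

    Blending that estimate with the sample's own loss by simple averaging
    is an assumption here, not necessarily the paper's combination rule.
    """
    # Shuffle so each group is a random subset of the selected samples.
    perm = selected_losses[torch.randperm(selected_losses.numel())]

    # Drop the remainder so the losses split evenly into `num_groups` groups.
    usable = (perm.numel() // num_groups) * num_groups
    groups = perm[:usable].view(num_groups, -1)

    # Median of the per-group means (the "median-of-means" estimate).
    robust_estimate = groups.mean(dim=1).median()

    # Blend the robust estimate with the sample's own loss (illustrative).
    return 0.5 * (robust_estimate + sample_loss)
```

Averaging within groups keeps the estimate stable, while taking the median across group means prevents a few extreme, likely mislabeled samples from dominating the final loss value.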

Theoretical Analysis and Practical Implications

A theoretical analysis demonstrates the robustness of the Regroup Median Loss, providing a foundation for its efficacy in combating label noise. The empirical results underscore RML's superiority over state-of-the-art methods across various datasets, including those with synthetic and real-world label noise. RML not only improves the accuracy of models trained on datasets with a high rate of label noise but also enhances model generalization capabilities.

Future Directions

The development of RML opens new avenues for research into methods for dealing with label noise in large-scale datasets. Future work could explore the integration of RML with other noise-robust training techniques, advanced semi-supervised learning models, and its application to a broader range of tasks beyond image classification. Additionally, further exploration into optimizing the selection and regrouping strategies within RML could yield even more significant improvements in model robustness to label noise.

Conclusion

The Regroup Median Loss method offers a principled approach to making deep learning models robust to label noise. By selecting clean samples more reliably and providing a robust loss estimate, RML significantly outperforms existing methods in mitigating the effects of label noise. Its theoretical grounding and strong empirical results establish RML as a valuable addition to the toolbox for training deep models in the presence of label noise.