- The paper demonstrates that slight label noise in pre-training can improve in-domain performance while deteriorating out-of-domain generalization.
- It employs SVD analysis to reveal that noise inflates non-dominant singular values, thereby altering the learned feature space.
- By introducing NMTune with three targeted regularizations, the approach effectively mitigates noise impact and enhances transfer learning performance.
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
The paper "Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks" addresses a critical aspect of transfer learning: the impact of label noise in large-scale pre-training datasets on downstream task performance. The pre-training-then-fine-tuning (PT-FT) paradigm is prevalent in deep learning, but pre-training data often contain noisy labels that are costly to correct and can significantly affect model generalization.
The researchers conduct a thorough empirical analysis using the ResNet-50 architecture, pre-trained on ImageNet-1K with synthetic label noise and on the noisy web-sourced YFCC15M dataset. The study finds that although a small amount of label noise in pre-training can improve performance on in-domain (ID) tasks, it hurts robustness and generalization on out-of-domain (OOD) tasks. The paper attributes this to how noise reshapes the learned feature space during pre-training.
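The synthetic-noise setup can be illustrated with a small sketch, assuming simple symmetric label flipping; the paper's exact noise model and noise rates may differ:

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fraction of labels uniformly to other classes.

    Illustrative symmetric-noise model: `noise_rate` of the samples
    have their label replaced by a different, uniformly chosen class.
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(noisy)
    # Pick which samples to corrupt (without replacement).
    flip_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in flip_idx:
        # Choose any class except the current (clean) one.
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```

Pre-training on labels corrupted this way at increasing rates is what lets the paper compare ID gains against OOD degradation.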
To mitigate these negative effects, the authors propose NMTune, a lightweight black-box tuning method. It adapts pre-trained models to downstream tasks with minimal computational overhead, using three regularization objectives that target the singular value spectrum of the feature space. NMTune outperforms linear probing and vanilla MLP tuning, improving performance on both ID and OOD tasks.
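A minimal numpy sketch of three feature-space regularizers in the spirit of NMTune; the function and variable names here are illustrative, and the exact formulations in the paper may differ:

```python
import numpy as np

def nmtune_style_losses(z_pre, z_new):
    """Three feature-space regularizers (illustrative sketch).

    z_pre: frozen pre-trained features, shape (N, D)
    z_new: transformed features after the tuning module, shape (N, D)
    """
    # 1) Consistency: keep transformed features close to the
    #    pre-trained ones so useful structure is preserved.
    l_consist = np.mean((z_new - z_pre) ** 2)

    # 2) Covariance: push off-diagonal covariance entries toward
    #    zero to decorrelate feature dimensions.
    zc = z_new - z_new.mean(axis=0)
    cov = (zc.T @ zc) / (len(z_new) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    l_cov = np.sum(off_diag ** 2) / z_new.shape[1]

    # 3) Dominant singular value: encourage a large top singular
    #    value relative to the whole spectrum (negated so that
    #    minimizing this term maximizes the ratio).
    s = np.linalg.svd(z_new, compute_uv=False)
    l_dom = -s[0] / s.sum()

    return l_consist, l_cov, l_dom
```

In practice these terms would be weighted and added to the downstream task loss while training the lightweight tuning module.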
Key Findings:
- Influence of Noise: Pre-training noise has dual effects: a small amount can improve ID task performance by encouraging the feature space to expand, yet it consistently harms OOD generalization by reducing transferability and robustness.
- Feature Space Analysis: Singular value decomposition (SVD) of the features reveals that noise inflates the singular values of non-dominant components, indicating that the model has captured noise structure, which worsens OOD generalization.
- Regularization Strategy: By introducing consistency, covariance, and dominant-singular-value regularizations, the proposed method reshapes the feature space to curtail the impact of noise.
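The SVD finding above can be probed with a simple diagnostic: the fraction of singular-value mass outside the top components, which grows when the non-dominant spectrum is inflated. This is an illustrative metric, not the paper's exact measurement:

```python
import numpy as np

def spectrum_tail_mass(features, k=1):
    """Fraction of singular-value mass outside the top-k components.

    features: feature matrix of shape (N, D). A larger tail mass
    means more energy in non-dominant directions, which the paper
    associates with captured noise structure.
    """
    s = np.linalg.svd(features, compute_uv=False)
    return s[k:].sum() / s.sum()
```

For example, a clean low-rank feature matrix has near-zero tail mass, while perturbing it with noise visibly raises the tail mass.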
Implications:
The implications of these findings are significant for large foundation models pre-trained on expansive and potentially noisy datasets. As models grow in scale, handling noisy pre-training data becomes harder, making it more important to understand and mitigate noise effects. The proposed NMTune method sets a precedent for addressing these issues without access to the original pre-training data and without the cost of fully fine-tuning such large-scale models.
Future Directions:
This work paves the way for future exploration of more sophisticated strategies for handling noise in diverse pre-training settings, potentially tackling forms of pre-training noise beyond labels. Further refinement of the proposed regularizations to adaptively balance ID and OOD performance could also be explored. Additionally, scaling the investigation to larger and more complex architectures, such as transformer models used in multi-modal tasks, represents an exciting avenue for future research.
Overall, this paper establishes an insightful foundation for understanding the dual impacts of noise in pre-training and provides practical tools for mitigating its detrimental effects — a crucial step forward in enhancing the adaptability and robustness of pre-trained models in versatile downstream applications.