
Self-training with Noisy Student improves ImageNet classification (1911.04252v4)

Published 11 Nov 2019 in cs.LG, cs.CV, and stat.ML

Abstract: We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. Code is available at https://github.com/google-research/noisystudent.

Noisy Student Training: Enhancing ImageNet Classification with Unlabeled Data

The paper "Self-training with Noisy Student improves ImageNet classification" by Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le introduces a semi-supervised learning method named Noisy Student Training. This approach significantly boosts top-1 accuracy on the ImageNet dataset and improves model robustness across several challenging test sets, leveraging the vast availability of unlabeled data.

Summary of Noisy Student Training

Noisy Student Training builds on self-training and distillation but adds two key ingredients: equal-or-larger student models and noise injected into the student during training. The training procedure comprises the following steps (a minimal code sketch follows the list):

  1. Train a Teacher Model: A teacher model is trained on labeled data using standard supervised learning techniques.
  2. Generate Pseudo Labels: The trained teacher model is utilized to generate pseudo labels for a large set of unlabeled images.
  3. Train a Student Model: A larger student model is trained on the combined dataset of labeled and pseudo-labeled images, with noise introduced during training.
  4. Iterate: The student model replaces the teacher model to repeat the process, iteratively enhancing model performance.
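
A minimal sketch of this loop, assuming hypothetical helpers (make_model, train_supervised, predict_soft_label) in place of the authors' TensorFlow/TPU training code:

```python
# Sketch of the Noisy Student loop. make_model, train_supervised, and
# predict_soft_label are hypothetical placeholders, not the paper's code.

def noisy_student_training(labeled_data, unlabeled_images, num_iterations=3):
    # 1. Train the initial teacher on labeled data with standard supervised learning.
    teacher = make_model(size=0)
    train_supervised(teacher, labeled_data, noised=False)

    for it in range(num_iterations):
        # 2. The teacher, run WITHOUT noise, assigns (soft) pseudo labels
        #    to the unlabeled images.
        pseudo_labeled = [(img, predict_soft_label(teacher, img))
                          for img in unlabeled_images]

        # 3. An equal-or-larger student is trained on labeled + pseudo-labeled
        #    data, WITH noise (dropout, stochastic depth, RandAugment) enabled.
        student = make_model(size=it + 1)
        train_supervised(student, labeled_data + pseudo_labeled, noised=True)

        # 4. The student becomes the teacher for the next round.
        teacher = student

    return teacher
```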

In the experiments, EfficientNet models are used because their capacity scales favorably. An EfficientNet model is first trained on labeled ImageNet data; the trained model then generates pseudo labels for 300 million unlabeled images, and a larger EfficientNet is trained as the student on the combination of labeled and pseudo-labeled images. During student training, noise is injected in the form of dropout, stochastic depth, and data augmentation via RandAugment.
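
The paper's implementation uses TensorFlow, but the asymmetry between the clean teacher and the noised student can be illustrated with PyTorch/torchvision primitives; the augmentation magnitudes, dropout rate, and survival probability below are assumptions for the sketch, not the paper's hyperparameters:

```python
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.ops import StochasticDepth

# Input noise for the STUDENT: RandAugment on top of standard ImageNet-style
# preprocessing (magnitudes here are illustrative, not the paper's settings).
student_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Preprocessing for the TEACHER when generating pseudo labels: no augmentation,
# so the pseudo labels are as accurate as possible.
teacher_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Model noise inside a simplified residual block: dropout and stochastic depth
# are active in train() mode and become identity in eval() mode.
class NoisyResidualBlock(nn.Module):
    def __init__(self, channels: int, drop_rate: float = 0.2, survival: float = 0.8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Dropout(p=drop_rate),                     # dropout noise
        )
        self.stochastic_depth = StochasticDepth(p=1.0 - survival, mode="row")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stochastic depth randomly drops the residual branch during training.
        return x + self.stochastic_depth(self.body(x))
```

Pseudo labels are generated with the teacher in eval() mode (noise off) on un-augmented images, while the student is optimized in train() mode (noise on); the student therefore has to reproduce the teacher's predictions under a strictly harder setting, which is what the paper credits for the student generalizing beyond its teacher.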

Experimental Results: Strong Numerical Gains

Accuracy Improvements: With Noisy Student Training, EfficientNet-L2 reaches 88.4% top-1 accuracy on ImageNet, 2.0% better than the previous state-of-the-art model, which required 3.5 billion weakly labeled Instagram images.

Robustness Gains: Noisy Student Training shows remarkable robustness improvements on challenging datasets:

  • ImageNet-A: Top-1 accuracy improved from 61.0% to 83.7%.
  • ImageNet-C: Mean corruption error (mCE) reduced from 45.7 to 28.3.
  • ImageNet-P: Mean flip rate (mFR) decreased from 27.8 to 12.2.

These results underscore the efficacy of Noisy Student Training in enhancing model robustness against various corruptions and perturbations, further validating its generalization capabilities.
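
For context on these metrics (definitions come from the ImageNet-C/P benchmarks, not from this paper): mCE normalizes a model's error under each corruption by a fixed AlexNet baseline, sums over severities, and averages over corruptions, so lower is better; mFR is normalized analogously over perturbation sequences. A minimal sketch of the standard mCE computation:

```python
def mean_corruption_error(model_err, alexnet_err):
    """Compute ImageNet-C mCE (lower is better).

    model_err, alexnet_err: dicts mapping corruption name -> list of top-1
    error rates (percent) at severities 1..5. Normalizing by AlexNet's errors
    follows the benchmark's standard definition, not code from the paper.
    """
    ces = []
    for corruption, errors in model_err.items():
        ce = 100.0 * sum(errors) / sum(alexnet_err[corruption])
        ces.append(ce)
    return sum(ces) / len(ces)
```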

Practical and Theoretical Implications

Practically, Noisy Student Training demonstrates a powerful approach to leverage unlabeled data at scale, which is more readily available than labeled data. The approach can be broadly applied across different architectures, as evidenced by improvements not just in EfficientNet but also in ResNet models.

Theoretically, this work advances the understanding of semi-supervised learning by emphasizing the roles of noise and model capacity in self-training. It shows that pseudo labels remain useful even when the unlabeled data is partly out-of-domain, and that distillation is not merely about compressing information: an equal-or-larger, noised student can learn beyond the original capabilities of its teacher.
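
Concretely, on unlabeled images the student's objective is simply cross-entropy against the teacher's soft pseudo labels, computed on noised inputs; a sketch of that loss (soft labels match the paper's default choice, but the code itself is an illustration, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Teacher logits come from clean, un-noised inputs; no gradient flows to the teacher.
    with torch.no_grad():
        soft_targets = F.softmax(teacher_logits, dim=1)
    # Student logits come from noised inputs (dropout, stochastic depth, RandAugment).
    log_probs = F.log_softmax(student_logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()
```

Unlike classic distillation, where the student is usually smaller and the goal is compression, here the student has equal or greater capacity and the injected noise forces it to learn a more robust function than its teacher.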

Future Directions

The promising results from Noisy Student Training open several pathways for future research:

  1. Scaling with More Data: Investigating the benefits of even larger unlabeled datasets.
  2. Optimizing Noise: Exploring more sophisticated noise injection techniques to further boost robustness.
  3. Cross-Task Generalization: Extending Noisy Student Training to other domains such as natural language processing and reinforcement learning.
  4. Theoretical Foundations: Further theoretical work to formalize the understanding of why and how noise contributes so effectively to semi-supervised learning.

Lastly, understanding the trade-offs between model size, noise levels, and dataset size will be crucial when adapting Noisy Student Training to practical deployments.

In summary, Noisy Student Training represents a significant step forward in semi-supervised learning, demonstrating how unlabeled data, combined with strategic noise application, can substantially enhance both accuracy and robustness of state-of-the-art models. The methodology and findings should stimulate further research and application in leveraging large-scale unlabeled datasets across different machine learning tasks.
