Introduction
Semi-supervised learning (SSL) approaches have paved the way for learning from limited labeled datasets by leveraging large volumes of unlabeled data. Emerging techniques have revitalized interest in self-training methods, which stand out in particular for their ability to place decision boundaries in low-density regions of the input space.
Self-Training Techniques
Self-training, or decision-directed learning, employs an iterative process in which a base classifier is first trained on the labeled data. This classifier is then used to pseudo-label the unlabeled samples it predicts most confidently; the pseudo-labeled data are added to the training set, and the classifier is retrained. Crucial to this approach is the pseudo-labeling strategy, which determines how the subset of pseudo-labeled examples is selected at each iteration. Early methods relied on a fixed confidence threshold to pick the most confident predictions, but recent research suggests that adaptively determining the threshold from the model's predictions yields superior performance.
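To make the iterative procedure concrete, the following is a minimal sketch of a self-training loop with a fixed confidence threshold, written with scikit-learn. The base classifier (LogisticRegression), the threshold value, and the iteration cap are illustrative assumptions rather than a reference implementation; adaptive variants would replace the fixed `threshold` with a value recomputed at each iteration.

```python
# Minimal self-training sketch: pseudo-label confident unlabeled samples,
# add them to the training set, and retrain. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_iter=10):
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_iter):
        clf.fit(X_l, y_l)                       # retrain on labeled + pseudo-labeled data
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)          # class-probability estimates on unlabeled data
        confidence = proba.max(axis=1)
        confident = confidence >= threshold     # keep only the most confident predictions
        if not confident.any():
            break
        pseudo_idx = proba[confident].argmax(axis=1)
        X_l = np.vstack([X_l, X_u[confident]])  # move pseudo-labeled samples to the training set
        y_l = np.concatenate([y_l, clf.classes_[pseudo_idx]])
        X_u = X_u[~confident]
    clf.fit(X_l, y_l)                           # final refit including the last pseudo-labels
    return clf
```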
Theoretical Advancements and Noise Management
Theoretical analyses of self-training provide distributional guarantees and finite-sample bounds, including for models such as deep neural networks. These results show that, under certain conditions on the data distribution, leveraging unlabeled data effectively reduces the generalization error of the learned hypothesis. Furthermore, explicitly accounting for label noise has given rise to techniques such as Debiased Self-Training, in which the classifier is trained with a noise-awareness component that is essential for reducing the influence of incorrectly pseudo-labeled samples.
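One generic way such a noise-awareness component can enter the objective is by down-weighting pseudo-labeled samples according to the confidence of their pseudo-labels, so that likely-noisy samples contribute less to the gradient. The sketch below illustrates this idea in PyTorch; it is a simplified stand-in, not the Debiased Self-Training objective itself, and the weighting scheme and `lambda_u` trade-off are assumptions.

```python
# Confidence-weighted loss on pseudo-labeled data: a generic noise-awareness
# illustration, not the Debiased Self-Training objective.
import torch
import torch.nn.functional as F

def noise_aware_loss(logits_l, y_l, logits_u, pseudo_y, confidence, lambda_u=1.0):
    # Standard cross-entropy on the labeled batch.
    loss_l = F.cross_entropy(logits_l, y_l)
    # Per-sample cross-entropy on the pseudo-labeled batch, weighted by the
    # confidence of each pseudo-label so likely-noisy samples contribute less.
    loss_u = F.cross_entropy(logits_u, pseudo_y, reduction="none")
    loss_u = (confidence * loss_u).mean()
    return loss_l + lambda_u * loss_u
```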
Applications Across Domains
Applications of self-training span a variety of fields, including natural language processing, computer vision, and domain-specific tasks such as genomics and anomaly detection. For example, self-training methods are being adapted to enhance text classification by using large language models (LLMs) to generate pseudo-labels. Similarly, computer vision benefits from self-training through methods such as FixMatch and Mean Teacher, which enforce consistency between predictions on perturbed versions of the same input to improve model robustness.
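The consistency idea can be sketched as a FixMatch-style unlabeled loss: pseudo-labels computed on a weakly augmented view supervise the prediction on a strongly augmented view, and only confident pseudo-labels are kept. The augmentation functions and the confidence threshold `tau` below are placeholders; this is a simplified sketch rather than the full FixMatch training recipe.

```python
# FixMatch-style consistency term: weak-view pseudo-labels supervise the
# strong-view predictions, masked by a confidence threshold.
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, tau=0.95):
    with torch.no_grad():
        weak_logits = model(weak_aug(x_unlabeled))     # predictions on the weak view
        probs = weak_logits.softmax(dim=-1)
        confidence, pseudo_y = probs.max(dim=-1)
        mask = (confidence >= tau).float()             # keep only confident pseudo-labels
    strong_logits = model(strong_aug(x_unlabeled))     # predictions on the strong view
    per_sample = F.cross_entropy(strong_logits, pseudo_y, reduction="none")
    return (mask * per_sample).mean()                  # consistency only on confident samples
```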
Empirical Assessment
Empirical studies further our understanding of self-training's practical behavior. Evaluations on well-known datasets such as CIFAR-10/100 and ImageNet show that the effectiveness of self-training depends heavily on the choice of pseudo-labeling threshold and on how pseudo-label noise is handled during training. Automatic threshold selection strategies, such as the one proposed by Feofanov et al. (2019), highlight the potential for refining the self-training process and achieving results competitive with purely supervised methods, especially when labeled data are scarce.
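As a rough illustration of automatic threshold selection, the sketch below picks, from a grid of candidate thresholds, the smallest one whose estimated pseudo-labeling error on a held-out labeled validation set stays under a target rate. This is a simplified proxy assumed for illustration only; it is not the bound-based criterion of Feofanov et al. (2019), and the grid and error target are arbitrary.

```python
# Generic threshold selection on a held-out validation set (illustrative proxy,
# NOT the criterion of Feofanov et al., 2019).
import numpy as np

def select_threshold(clf, X_val, y_val,
                     candidates=np.linspace(0.5, 0.99, 50), max_error=0.05):
    proba = clf.predict_proba(X_val)
    confidence = proba.max(axis=1)
    predictions = clf.classes_[proba.argmax(axis=1)]
    best = candidates[-1]                      # fall back to the strictest threshold
    for theta in sorted(candidates):
        selected = confidence >= theta
        if not selected.any():
            continue
        error = np.mean(predictions[selected] != y_val[selected])
        if error <= max_error:                 # lowest threshold meeting the error target
            best = theta
            break
    return best
```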
Future Perspectives
Looking ahead, self-training remains promising yet largely untapped in several emerging domains. Theoretical and algorithmic advances that mitigate distribution shift in domain adaptation tasks and account for noise in pseudo-labeling strategies are fertile ground for future research. As self-training is integrated with increasingly complex neural architectures and adapted to a broader range of domain-specific tasks, our understanding of its generalization behavior, and of learning in machine intelligence more broadly, will deepen accordingly.