Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data (2010.03622v5)

Published 7 Oct 2020 in cs.LG and stat.ML

Abstract: Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic "expansion" assumption, which states that a low probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.

Authors (4)
  1. Colin Wei (17 papers)
  2. Kendrick Shen (3 papers)
  3. Yining Chen (35 papers)
  4. Tengyu Ma (117 papers)
Citations (215)

Summary

  • The paper establishes a rigorous theoretical framework for self-training with deep networks, built on an "expansion" assumption about the data distribution.
  • It shows that input-consistency regularization provably denoises pseudolabels, improving accuracy in semi-supervised learning and unsupervised domain adaptation.
  • It provides sample complexity guarantees polynomial in margin and Lipschitzness, and supports the expansion assumption empirically with GAN-generated images.

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

This paper addresses the challenge of leveraging unlabeled data for training deep neural networks through self-training methodologies. While self-training has been empirically successful, its theoretical underpinnings have remained largely unexplored, particularly with deep network models. The work contributes a rigorous theoretical framework for understanding self-training in the context of deep learning, extending previous analyses which were limited to linear models.

Key Contributions and Assumptions

The authors introduce a core assumption, termed the "expansion assumption," which posits that any low-probability subset of the data must expand to a neighborhood whose probability is large relative to that subset. This formalizes the intuition that each class's data distribution is well connected, and it plays a critical role in guaranteeing that self-trained models achieve high accuracy. A complementary separation assumption requires minimal overlap between the neighborhoods of examples from different classes. Together, these assumptions underpin the proof that minimizers of population objectives combining pseudolabel fitting with input-consistency regularization achieve high accuracy with respect to ground-truth labels.
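For concreteness, the expansion condition can be stated as follows; this paraphrases the paper's (a, c)-expansion definition, with N(·) denoting the neighborhood map induced by the allowed input transformations (notation mine):

```latex
% (a, c)-expansion, paraphrased from the paper. P_i is the
% class-conditional distribution of class i, and N(S) is the union
% of the augmentation-induced neighborhoods of points in S.
\text{For all } S \subseteq \mathrm{supp}(P_i) \text{ with } P_i(S) \le a:
\qquad P_i\big(N(S)\big) \;\ge\; \min\{\, c \cdot P_i(S),\; 1 \,\}.
```

A larger expansion factor c means that any small set of points (for instance, those the pseudolabeler mislabels) is surrounded by proportionally more probability mass, which is what makes the denoising arguments possible.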

Theoretical Framework and Analysis

The analysis covers three settings: semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. In the first two, the objective fits a model to pseudolabels while imposing input-consistency regularization, which penalizes predictions that change under allowed transformations of the input. The theoretical results show that, under the expansion and separation assumptions, the trained model provably improves on the pseudolabeler's accuracy: enforcing consistency across transformations denoises the label errors inherited from the initial pseudolabeling model.
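Below is a minimal sketch of this style of objective in PyTorch. The helper `augment` and the weight `lam` are hypothetical, and the KL term is a common differentiable surrogate for the 0-1 agreement regularizer analyzed in the paper; this illustrates the objective's structure, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_training_step(model, optimizer, x_unlabeled, pseudolabels,
                       augment, lam=1.0):
    """One step of pseudolabel fitting plus input-consistency regularization.

    `pseudolabels` are fixed class indices produced by a previously
    learned model; `augment` (hypothetical) returns a randomly
    transformed view of the batch.
    """
    model.train()
    logits = model(x_unlabeled)
    # Fit the fixed pseudolabels.
    pseudo_loss = F.cross_entropy(logits, pseudolabels)
    # Input-consistency: predictions on a transformed view should match
    # the (detached) predictions on the original view.
    logits_aug = model(augment(x_unlabeled))
    consistency = F.kl_div(F.log_softmax(logits_aug, dim=-1),
                           F.softmax(logits, dim=-1).detach(),
                           reduction="batchmean")
    loss = pseudo_loss + lam * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Schematically, the denoising guarantee then says that if the data satisfies (a, c)-expansion and neighborhoods rarely cross class boundaries, the error of the population minimizer scales roughly like the pseudolabeler's error divided by (c - 1), plus the mass of boundary-crossing examples; constants and the exact statement are in the paper.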

For unsupervised learning, the corresponding theorem shows that, under expansion, minimizing a suitable population objective yields a classifier whose predicted class structure aligns with the ground-truth categories, up to a permutation of the label names, despite no labels being available during training.
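One schematic way to write such an objective (a paraphrase in my own notation, not the paper's exact formulation) is to minimize the input-consistency regularizer while preventing any predicted class from collapsing to negligible mass:

```latex
% Schematic unsupervised objective: prefer classifiers G that are
% constant on augmentation neighborhoods N(x), subject to every
% predicted class retaining probability mass at least rho.
\min_{G}\;\; \mathbb{E}_{x \sim P}\Big[ \mathbf{1}\big( \exists\, x' \in N(x) : G(x') \ne G(x) \big) \Big]
\quad \text{s.t.} \quad \min_{y}\; P\big(G(x) = y\big) \;\ge\; \rho .
```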

Numerical Results and Experimental Validation

A significant aspect of the paper is its provision of sample complexity guarantees for deep networks that are polynomial in the model's margin and Lipschitzness, obtained by combining the population-level results with off-the-shelf generalization bounds. These guarantees speak to the sample efficiency and generalization of models trained under the proposed framework. An experiment with GAN-generated images supports the expansion assumption empirically, indicating that it is plausible for realistic data distributions.
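To convey the shape of such guarantees, margin-based bounds of the kind the paper invokes typically control the generalization gap as follows (a heavily simplified paraphrase; the paper's all-layer-margin statement differs in its details):

```latex
% Schematic margin-based generalization gap for an L-layer network
% with weight matrices W_1, ..., W_L, margin gamma, and n unlabeled
% examples; constants and log factors suppressed (paraphrase only).
\text{generalization gap} \;\lesssim\; \widetilde{O}\!\left(
  \frac{\operatorname{poly}\big(\|W_1\|, \ldots, \|W_L\|\big)}{\gamma \sqrt{n}}
\right).
```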

Implications and Future Directions

This analysis bridges a vital theoretical gap by offering a unified framework for understanding the effectiveness of self-training with deep networks. It suggests that self-training algorithms can be refined by ensuring that the data distribution satisfies properties such as expansion and by employing input-consistency regularization.

Potential future work could involve extending these theoretical insights to more complex model architectures and exploring domain adaptation methods that rely on data distribution alignment. Further empirical studies on real-world datasets could validate the robustness of these theoretical results in varying practical scenarios beyond the controlled settings examined in this paper.

In conclusion, the paper provides a comprehensive theoretical analysis that lays the groundwork for improved self-training methodologies with deep networks. This enriches the understanding of when and why self-training works, enabling more efficient and accurate learning from unlabeled data.