- The paper introduces two self-ensembling methods, the Π-model and temporal ensembling, which achieve markedly lower error rates on standard semi-supervised benchmarks.
- Temporal ensembling accumulates network predictions across epochs into a stable training target via an exponential moving average, requiring only one forward pass per input per epoch.
- Empirical evaluations demonstrate robustness to label noise and improved performance on datasets such as CIFAR-10, SVHN, and CIFAR-100 with minimal labeled data.
Temporal Ensembling for Semi-Supervised Learning
Overview
The paper "Temporal Ensembling for Semi-Supervised Learning" introduces a novel approach aimed at enhancing the performance of neural networks when only a fraction of the training data is labeled. The focus is on self-ensembling techniques, specifically the Π-model and temporal ensembling, which leverage the inherent variability in network outputs induced by dropout and input augmentations.
Methods and Implementation
Π-Model
The Π-model evaluates each training input twice per epoch, under different dropout realizations and input augmentations. Discrepancies between the two resulting output vectors are penalized with a mean squared error term, scaled by a time-dependent weighting function w(t) and added to the standard cross-entropy loss computed on the labeled samples. Enforcing consistent predictions for perturbed versions of the same input is what lets the unlabeled data contribute to training.
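The per-batch loss is straightforward to express in code. The following is a minimal PyTorch-style sketch, not the paper's reference implementation: the function names, the `augment` callable, and the ramp-up constants (`w_max`, 80 epochs) are illustrative assumptions, although the paper does ramp w(t) up from zero along a Gaussian curve during the early epochs.

```python
import math
import torch
import torch.nn.functional as F

def rampup_weight(epoch, w_max=100.0, rampup_length=80):
    """Gaussian ramp-up for the unsupervised weight w(t): 0 -> w_max."""
    if epoch >= rampup_length:
        return w_max
    p = 1.0 - epoch / rampup_length
    return w_max * math.exp(-5.0 * p * p)

def pi_model_loss(model, x, y, labeled_mask, augment, w_t):
    """Pi-model loss for one batch (sketch).

    model        -- network in train() mode so dropout is stochastic
    x            -- batch of inputs
    y            -- integer class labels (ignored where labeled_mask is False)
    labeled_mask -- boolean tensor marking the labeled samples
    augment      -- stochastic augmentation (e.g. random translation / flip)
    w_t          -- unsupervised weight w(t) for the current epoch
    """
    # Two stochastic evaluations: different augmentations and dropout masks.
    z1 = model(augment(x))
    z2 = model(augment(x))

    # Supervised cross-entropy on the labeled subset only
    # (the batch is assumed to contain at least one labeled sample).
    supervised = F.cross_entropy(z1[labeled_mask], y[labeled_mask])

    # Consistency term: mean squared difference of the class probabilities.
    consistency = F.mse_loss(F.softmax(z1, dim=1), F.softmax(z2, dim=1))

    return supervised + w_t * consistency
```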
Temporal Ensembling
Temporal ensembling extends this idea by aggregating predictions across training epochs. Instead of evaluating each input twice per epoch, the model maintains an exponential moving average of its past predictions for every training sample and uses this accumulated average, corrected for startup bias, as the target of the unsupervised loss. This requires only one forward pass per input per epoch, and the averaged target is more stable than a single stochastic evaluation because it effectively ensembles many dropout and augmentation realizations.
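Relative to the Π-model, only two pieces change: the per-batch loss compares a single forward pass against a stored ensemble target, and the ensemble is updated once per epoch. The sketch below carries the same hedges as before (illustrative names, not the reference code); `Z` is a per-sample accumulator of shape (num_samples, num_classes) initialized to zeros, and dividing by (1 - alpha^t) removes the startup bias caused by that zero initialization. The paper uses alpha = 0.6.

```python
import torch
import torch.nn.functional as F

def temporal_ensembling_loss(model, x, y, labeled_mask, targets, w_t):
    """Per-batch loss: one stochastic forward pass, pulled toward the
    bias-corrected ensemble targets for these samples (sketch)."""
    logits = model(x)                         # single evaluation per input
    probs = F.softmax(logits, dim=1)
    supervised = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
    consistency = F.mse_loss(probs, targets)  # targets carry no gradient
    return supervised + w_t * consistency

def update_ensemble(Z, epoch_predictions, epoch, alpha=0.6):
    """After each epoch: exponential moving average of the (detached)
    softmax outputs collected during the epoch, then startup-bias
    correction to form the next epoch's targets."""
    Z = alpha * Z + (1.0 - alpha) * epoch_predictions
    targets = Z / (1.0 - alpha ** (epoch + 1))
    return Z, targets
```

Because the targets are built from predictions made in earlier epochs, they average over many dropout masks and augmentations, which is the source of the stability advantage over the Π-model's second evaluation.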
Results and Empirical Evaluation
CIFAR-10
In CIFAR-10 with 4000 labels, the Π-model achieves an error rate of 16.55±0.29% without augmentation and 12.36±0.31% with standard augmentations. Temporal ensembling further reduces the error to 12.16±0.24%. These results represent significant improvements over previous semi-supervised learning methods.
SVHN
For the SVHN dataset with 1000 labels, temporal ensembling achieves an error rate of 4.42±0.16% with augmentations, outperforming methods such as Virtual Adversarial Training and various GAN-based approaches. With only 500 labels, temporal ensembling still performs robustly, achieving 5.12±0.13%.
CIFAR-100 and Tiny Images
For CIFAR-100 with 10000 labels, temporal ensembling achieves an error rate of 38.65±0.51% with augmentations. Moreover, the inclusion of unlabeled data from the Tiny Images dataset shows further improvements, demonstrating the model's ability to generalize from diverse, unlabeled data sources.
Robustness to Label Noise
An additional experiment highlights the method's resilience to noisy labels. Temporal ensembling maintains high classification accuracy even with significant label noise, outperforming standard supervised training, which quickly degrades with increasing noise.
Implications and Future Directions
Practical Implications
The proposed methods, Π-model and temporal ensembling, provide a robust framework for semi-supervised learning. The significant drop in error rates across multiple benchmarks indicates practical applicability in scenarios with limited labeled data. These techniques can be particularly useful in domains where labeled data is expensive or difficult to obtain.
Theoretical Implications
The success of self-ensembling methods underscores the importance of consistency in predictions as a form of regularization. The results suggest potential synergies with other regularization techniques and semi-supervised learning methods, such as generative models and adversarial training frameworks.
Future Work
Looking forward, integrating self-ensembling techniques with generative models such as GANs could yield further performance gains. Extending temporal ensembling to regression tasks, or exploring its efficacy in other data modalities such as text and time series, could broaden its range of applications. Finally, investigating the use of synthetic data and its impact on model performance could provide new insights into data augmentation strategies.
Conclusion
The paper presents a compelling approach to semi-supervised learning by leveraging self-ensembling techniques. Both the Π-model and temporal ensembling demonstrate substantial improvements over previous methods, showcasing their effectiveness in mitigating the challenges posed by limited labeled data. The robustness to noisy labels further enhances their practical utility, paving the way for future advancements and applications in diverse machine learning domains.