- The paper introduces bootstrapping techniques that combine noisy labels with model predictions to enhance training robustness.
- It augments the standard prediction objective with a notion of perceptual consistency, implemented either through a reconstruction term or through bootstrapped training targets, to mitigate label noise.
- Experimental results on MNIST, TFD, and ILSVRC datasets show significant gains in accuracy, precision-recall, and mean Average Precision.
Training Deep Neural Networks on Noisy Labels with Bootstrapping
This paper addresses the challenge of training deep neural networks (DNNs) in the presence of noisy and incomplete labels, a significant hurdle in scaling up visual object recognition and detection systems. Traditional supervised deep learning systems depend heavily on large datasets with unambiguous, accurate labels, yet real-world data often carry incomplete and noisy annotations, especially at the scale of modern high-resolution image collections.
Proposed Methodology
The authors introduce a generic framework to enhance the robustness of DNNs against label noise by incorporating a notion of perceptual consistency into the training objective. The methodology involves augmenting the conventional predictive objective with a consistency term. A prediction is considered consistent if similar predictions are made given similar inputs, where similarity is defined by deep network features. The key contributions can be outlined through two main approaches:
- Consistency via Reconstruction: The first approach models the "true" label as a latent variable and leverages reconstruction error as an additional training signal. The network learns to reconstruct the input from these latent labels, effectively modeling the noise distribution by using an additional matrix mapping model predictions to noisy training labels.
- Bootstrapping: The second, simpler approach does not require explicit noise modeling. Instead, it updates the training targets dynamically using the model's own predictions, forming a convex combination of the noisy training labels and the model's current predictions. Two bootstrapping variants are used (a code sketch follows this list):
- Soft Bootstrapping, where the targets blend the noisy labels with the model's predicted class probabilities.
- Hard Bootstrapping, where the targets blend the noisy labels with a one-hot vector for the model's most confident (MAP) prediction.
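As a concrete illustration, below is a minimal PyTorch-style sketch of the two bootstrapped losses, assuming logits are raw class scores and noisy_targets are one-hot noisy labels. The function names, the default beta values, and the choice to treat the prediction term as a fixed target are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_bootstrap_loss(logits, noisy_targets, beta=0.95):
    """Cross-entropy against a convex combination of the noisy one-hot
    labels and the model's own predicted class probabilities."""
    log_q = F.log_softmax(logits, dim=1)
    q = log_q.exp().detach()  # treat current predictions as fixed targets
    targets = beta * noisy_targets + (1.0 - beta) * q
    return -(targets * log_q).sum(dim=1).mean()

def hard_bootstrap_loss(logits, noisy_targets, beta=0.8):
    """Cross-entropy against a convex combination of the noisy one-hot
    labels and a one-hot vector for the most confident (MAP) prediction."""
    log_q = F.log_softmax(logits, dim=1)
    z = F.one_hot(logits.argmax(dim=1), num_classes=logits.size(1)).float()
    targets = beta * noisy_targets + (1.0 - beta) * z
    return -(targets * log_q).sum(dim=1).mean()
```

With beta = 1 both losses reduce to ordinary cross-entropy on the noisy labels; lowering beta shifts more weight onto the model's own predictions.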
Experimental Results
The proposed methods were evaluated on several datasets: MNIST, Toronto Faces Database (TFD), and the ILSVRC2014 detection challenge data.
- MNIST with Noisy Labels: In experiments where labels were corrupted with varying probabilities, the bootstrapped models, particularly bootstrap-recon and bootstrap-hard, significantly outperformed baseline models trained with only the standard prediction objective, demonstrating substantial robustness to label noise (a sketch of the label-corruption setup follows this list).
- Toronto Faces Database Emotion Recognition: Models trained with bootstrapping achieved higher emotion-recognition accuracy than both standard supervised training and other state-of-the-art weakly-supervised methods, with bootstrap-recon and bootstrap-hard giving the best results.
- ILSVRC2014 Object Detection: The methods were applied to the MultiBox network for object detection, particularly focusing on person detection. Bootstrapping methods significantly improved precision-recall performance compared to the baseline and the previously employed heuristic of suppressing the top-K most confident predictions. The gains were also observed in the broader context of the full 200-category ILSVRC2014 detection task, where bootstrapping improved mean Average Precision (mAP) and recall rates.
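To make the MNIST noise-injection protocol concrete, the following hypothetical sketch corrupts each training label with a fixed probability by replacing it with a uniformly chosen different class. The function name and the uniform corruption model are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def corrupt_labels(labels, noise_prob, num_classes=10, seed=0):
    """Replace each label, with probability `noise_prob`, by a different
    class drawn uniformly at random (illustrative corruption model).
    Expects `labels` as a NumPy integer array."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_prob
    # Draw a random offset in [1, num_classes - 1] so the new label always differs.
    offsets = rng.integers(1, num_classes, size=flip.sum())
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels
```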
Implications and Future Directions
This work implies that incorporating perceptual consistency into the training objectives of DNNs can significantly mitigate the adverse effects of noisy and incomplete labels. Practically, this suggests that large-scale vision systems can benefit from cheaper, less exhaustive labeling processes without sacrificing model performance. The proposed bootstrapping approach is particularly appealing due to its simplicity and ease of integration into existing deep learning workflows.
Future research could refine this approach by learning dynamic policies for the consistency parameter β or extending this methodology to scenarios involving situated agents. Another potential direction is leveraging a broader range of unlabeled and poorly-labeled data to further enhance training for large-scale object detection tasks.
The presented techniques mark a solid step towards more resilient and scalable deep learning models in vision applications, paving the way for further advancements in handling real-world, imperfect data in various AI domains.