- The paper introduces bootstrapping techniques that combine noisy labels with model predictions to enhance training robustness.
- It augments the standard prediction objective with a notion of perceptual consistency, implemented either through a reconstruction term or through bootstrapped training targets, to mitigate label noise.
- Experimental results on MNIST, TFD, and ILSVRC datasets show significant gains in accuracy, precision-recall, and mean Average Precision.
Training Deep Neural Networks on Noisy Labels with Bootstrapping
This paper addresses the challenge of training deep neural networks (DNNs) in the presence of noisy and incomplete labels, a significant hurdle in scaling up visual object recognition and detection systems. Traditional supervised deep learning systems depend heavily on large datasets with unambiguous, accurate labels, yet real-world data often carry incomplete and noisy annotations, especially at the scale of modern high-resolution image collections.
Proposed Methodology
The authors introduce a generic framework to enhance the robustness of DNNs against label noise by incorporating a notion of perceptual consistency into the training objective. The methodology involves augmenting the conventional predictive objective with a consistency term. A prediction is considered consistent if similar predictions are made given similar inputs, where similarity is defined by deep network features. The key contributions can be outlined through two main approaches:
- Consistency via Reconstruction: The first approach models the "true" label as a latent variable and leverages reconstruction error as an additional training signal. The network learns to reconstruct the input from these latent labels, effectively modeling the noise distribution by using an additional matrix mapping model predictions to noisy training labels.
- Bootstrapping: The second, simpler approach does not require explicit noise modeling. Instead, it updates the training targets dynamically using the model's own predictions, forming a convex combination of the noisy training labels and the model's current predictions. Two bootstrapping variants are used (a code sketch follows this list):
- Soft Bootstrapping, where the targets blend the noisy labels with the model's predicted class probabilities.
- Hard Bootstrapping, where the targets blend the noisy labels with a one-hot vector for the model's most confident (MAP) prediction.
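As a concrete illustration, below is a minimal PyTorch-style sketch of the two bootstrapped losses, assuming logits are raw class scores and noisy_targets are one-hot noisy labels. The function names, the default beta values, and the choice to treat the prediction term as a fixed target are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_bootstrap_loss(logits, noisy_targets, beta=0.95):
    """Cross-entropy against a convex combination of the noisy one-hot
    labels and the model's own predicted class probabilities."""
    log_q = F.log_softmax(logits, dim=1)
    q = log_q.exp().detach()  # treat current predictions as fixed targets
    targets = beta * noisy_targets + (1.0 - beta) * q
    return -(targets * log_q).sum(dim=1).mean()

def hard_bootstrap_loss(logits, noisy_targets, beta=0.8):
    """Cross-entropy against a convex combination of the noisy one-hot
    labels and a one-hot vector for the most confident (MAP) prediction."""
    log_q = F.log_softmax(logits, dim=1)
    z = F.one_hot(logits.argmax(dim=1), num_classes=logits.size(1)).float()
    targets = beta * noisy_targets + (1.0 - beta) * z
    return -(targets * log_q).sum(dim=1).mean()
```

With beta = 1 both losses reduce to ordinary cross-entropy on the noisy labels; lowering beta shifts more weight onto the model's own predictions.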
Experimental Results
The proposed methods were evaluated on several datasets: MNIST, Toronto Faces Database (TFD), and the ILSVRC2014 detection challenge data.
- MNIST with Noisy Labels: In experiments where labels were corrupted with varying probabilities, the bootstrapped models, particularly bootstrap-recon and bootstrap-hard, significantly outperformed baseline models trained with only the standard prediction objective, demonstrating substantial robustness to label noise (a sketch of the label-corruption setup follows this list).
- Toronto Faces Database Emotion Recognition: Models trained with bootstrapping achieved higher emotion-recognition accuracy than both standard supervised training and other state-of-the-art weakly-supervised methods, with bootstrap-recon and bootstrap-hard giving the best results.
- ILSVRC2014 Object Detection: The methods were applied to the MultiBox network for object detection, particularly focusing on person detection. Bootstrapping methods significantly improved precision-recall performance compared to the baseline and the previously employed heuristic of suppressing the top-K most confident predictions. The gains were also observed in the broader context of the full 200-category ILSVRC2014 detection task, where bootstrapping improved mean Average Precision (mAP) and recall rates.
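To make the MNIST noise-injection protocol concrete, the following hypothetical sketch corrupts each training label with a fixed probability by replacing it with a uniformly chosen different class. The function name and the uniform corruption model are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def corrupt_labels(labels, noise_prob, num_classes=10, seed=0):
    """Replace each label, with probability `noise_prob`, by a different
    class drawn uniformly at random (illustrative corruption model).
    Expects `labels` as a NumPy integer array."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_prob
    # Draw a random offset in [1, num_classes - 1] so the new label always differs.
    offsets = rng.integers(1, num_classes, size=flip.sum())
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels
```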
Implications and Future Directions
This work implies that incorporating perceptual consistency into the training objectives of DNNs can significantly mitigate the adverse effects of noisy and incomplete labels. Practically, this suggests that large-scale vision systems can benefit from cheaper, less exhaustive labeling processes without sacrificing model performance. The proposed bootstrapping approach is particularly appealing due to its simplicity and ease of integration into existing deep learning workflows.
Future research could refine this approach by learning dynamic policies for the consistency parameter β or extending this methodology to scenarios involving situated agents. Another potential direction is leveraging a broader range of unlabeled and poorly-labeled data to further enhance training for large-scale object detection tasks.
The presented techniques mark a solid step towards more resilient and scalable deep learning models in vision applications, paving the way for further advancements in handling real-world, imperfect data in various AI domains.