- The paper presents a multi-task network that jointly cleans noisy labels and classifies images using shared ConvNet features.
- It leverages a small set of clean annotations to mitigate noise in large datasets, significantly improving mean average precision.
- The approach effectively captures label dependencies on the Open Images dataset, paving the way for scalable, cost-efficient image annotation.
Learning From Noisy Large-Scale Datasets With Minimal Supervision
The paper "Learning From Noisy Large-Scale Datasets With Minimal Supervision" presents a method for exploiting a large, noisily annotated image dataset alongside a much smaller set of cleanly annotated images to train convolutional neural networks (ConvNets) efficiently. The authors propose an approach that mitigates annotation noise in order to improve image classification performance. This research addresses a critical challenge in computer vision: the dependency on vast collections of cleanly labeled data, which are expensive and time-consuming to obtain.
Key Contributions
The primary contribution is a semi-supervised learning framework that leverages a small number of clean annotations to infer and reduce the noise in a much larger set of noisy annotations. This is accomplished through a multi-task model that jointly learns to clean noisy labels and to classify images.
- Multi-task Network Architecture: The researchers introduce a network design with two key components that share visual features extracted by a ConvNet: a label cleaning network and an image classifier. The label cleaning network learns to map noisy labels to clean ones, conditioned on the image features, while the classifier predicts image labels using the cleaned labels as training targets.
- Evaluation on the Open Images Dataset: The approach is evaluated on Open Images, a large-scale dataset with substantial annotation noise, containing roughly 9 million images annotated with over 6,000 classes. Across varying noise levels, the authors demonstrate superior classification performance compared to traditional fine-tuning.
- Handling Label Dependencies and Noise: The network captures dependencies between labels by assuming the noise is structured: errors tend to follow relational patterns among labels rather than being randomly distributed, and the model exploits this statistical structure during training.
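To make the two-head design concrete, below is a minimal NumPy sketch of the forward pass and joint supervision signal. All dimensions, weight matrices, and the single-layer heads are illustrative simplifications for exposition, not the paper's actual architecture (which builds on a deep ConvNet backbone and is trained end to end).

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 64, 10  # toy feature dimension and number of classes (illustrative)

# Stand-in for shared ConvNet features over a batch of 4 images.
features = rng.standard_normal((4, D))
# Multi-label noisy annotations (1 = label present in the noisy set).
noisy_labels = (rng.random((4, C)) < 0.3).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Label cleaning head: maps (image features, noisy labels) -> cleaned labels.
W_clean = rng.standard_normal((D + C, C)) * 0.01
cleaned = sigmoid(np.concatenate([features, noisy_labels], axis=1) @ W_clean)

# Classification head: predicts labels from the shared features alone.
W_cls = rng.standard_normal((D, C)) * 0.01
predictions = sigmoid(features @ W_cls)

# Joint supervision (sketch): the cleaning head would be supervised by the
# small clean subset, while the classifier treats the cleaned labels as
# soft targets via a binary cross-entropy loss.
cls_loss = -np.mean(cleaned * np.log(predictions + 1e-9)
                    + (1 - cleaned) * np.log(1 - predictions + 1e-9))
print(predictions.shape, float(cls_loss))
```

Because both heads share the same features, gradients from the classification loss and the cleaning loss jointly shape the representation, which is the core of the multi-task setup.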
Strong Numerical Results
Quantitative evaluations show that the proposed model outperforms a direct fine-tuning strategy, yielding significant improvements in mean average precision (mAP). Gains were especially notable across categories with widely varying noise levels, ranging from roughly 20% to 80% false-positive annotations.
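Since the headline metric is mean average precision, here is a small, self-contained sketch of how mAP is typically computed for multi-label classification: average precision per class (mean precision at each true-positive rank), then a mean over classes. The paper's exact evaluation protocol may differ in details.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of precision evaluated at each true positive."""
    order = np.argsort(-scores)          # rank examples by descending score
    hits = labels[order].astype(bool)    # which ranked examples are positives
    if not hits.any():
        return 0.0
    cum_hits = np.cumsum(hits)
    # Precision at the rank of each true positive (ranks are 1-based).
    precision_at_hit = cum_hits[hits] / (np.nonzero(hits)[0] + 1)
    return float(precision_at_hit.mean())

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average of per-class AP over all classes (columns)."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))
```

For example, with one class where the top-ranked and third-ranked examples are positives, AP is (1/1 + 2/3) / 2 = 5/6.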
Implications and Speculation on Future Developments
Practically, this method reduces reliance on large volumes of clean data, paving the way for cost-effective, scalable machine learning systems. Theoretically, it suggests potential for transferring learned noise patterns across domains or adapting them in settings with limited labeled data. The consistent improvement across categories also indicates applicability to a diverse array of tasks, including those beyond image classification.
Future extensions could include exploring richer interactions between label and image features, such as employing higher-dimensional interactions via bilinear pooling or other advanced feature fusion strategies. Additionally, adapting the framework for cross-domain label transformation could offer exciting possibilities in applications like cross-modal retrieval or multimodal learning environments.
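As one illustration of the "higher-dimensional interactions" idea, basic bilinear pooling of an image-feature vector and a label-embedding vector is simply their flattened outer product, which exposes every pairwise feature interaction to the downstream layers. The function below is a toy sketch of that operation, not anything from the paper; practical systems typically use compact approximations (e.g. count sketches) to keep the output dimensionality manageable.

```python
import numpy as np

def bilinear_pool(img_feat, label_feat):
    """Flattened outer product: every pairwise interaction between the
    image-feature vector and the label-embedding vector."""
    return np.outer(img_feat, label_feat).ravel()
```

Note the output size is the product of the two input sizes, which is why dimensionality reduction usually follows this step.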
Conclusion
By effectively utilizing noisy data, this work takes a step toward overcoming the scalability hurdles in current ConvNet training paradigms. It offers a well-rounded solution that incrementally refines labels, reduces noise, and ultimately enhances classification accuracy, setting the stage for further innovation in semi-supervised and unsupervised learning. The findings hold promise for applications that require high fidelity in automatic image annotation and understanding.