The Devil of Face Recognition is in the Noise

Published 31 Jul 2018 in cs.CV | (1807.11649v1)

Abstract: The growing scale of face recognition datasets empowers us to train strong convolutional networks for face recognition. While a variety of architectures and loss functions have been devised, we still have a limited understanding of the source and consequence of label noise inherent in existing datasets. We make the following contributions: 1) We contribute cleaned subsets of popular face databases, i.e., MegaFace and MS-Celeb-1M datasets, and build a new large-scale noise-controlled IMDb-Face dataset. 2) With the original datasets and cleaned subsets, we profile and analyze label noise properties of MegaFace and MS-Celeb-1M. We show that a few orders more samples are needed to achieve the same accuracy yielded by a clean subset. 3) We study the association between different types of noise, i.e., label flips and outliers, with the accuracy of face recognition models. 4) We investigate ways to improve data cleanliness, including a comprehensive user study on the influence of data labeling strategies to annotation accuracy. The IMDb-Face dataset has been released on https://github.com/fwang91/IMDb-Face.

Summary

  • The paper shows that cleaning label noise from large face datasets substantially improves CNN performance in face recognition.
  • Label flips are found to harm model accuracy more than outliers do, and a clean subset can match the accuracy of a much larger noisy dataset.
  • The release of the noise-controlled IMDb-Face dataset and refined annotation strategies offers a promising path to more robust face recognition systems.

Analyzing the Impact of Label Noise on Face Recognition

The paper "The Devil of Face Recognition is in the Noise" investigates the source and consequences of label noise in large-scale face recognition datasets. Label noise is pervasive in these datasets and can significantly degrade the performance of the convolutional neural networks (CNNs) trained on them.

Contributions and Methodology

The authors have made several important contributions:

  1. Dataset Cleaning and Creation: Cleaned subsets of the MegaFace and MS-Celeb-1M datasets were developed, along with a new large-scale dataset, IMDb-Face, which is controlled for noise. The new dataset is publicly available and contains 1.7M images of 59K celebrities sourced from the IMDb website.
  2. Noise Analysis: Using both the original datasets and the cleaned subsets, the authors profiled the label noise in MegaFace and MS-Celeb-1M. Their findings suggest that a noisy dataset needs a few orders of magnitude more samples to reach the accuracy obtained with a clean subset, which underlines how costly noise is.
  3. Impact on Face Recognition Models: The study examines how two types of noise, label flips and outliers, relate to model accuracy. The authors found that face recognition models are hurt more by label flips than by outliers.
  4. Data Annotation Strategies: Various strategies for data labeling were explored to improve annotation accuracy. A comprehensive user study was conducted, showing that annotation accuracy is correlated with the time spent processing each image, providing insights that could optimize future labeling practices.
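
To make the two noise types concrete, here is a minimal sketch of how they could be simulated on a list of identity labels. It assumes the usual definitions: a label flip gives an image the label of another identity in the dataset, while an outlier is an image of someone outside the label set that still carries one of its labels. The function name `inject_label_noise` and its parameters are hypothetical and do not come from the paper.

```python
import random


def inject_label_noise(labels, num_classes, flip_rate, outlier_rate, seed=0):
    """Simulate label flips and outliers on integer labels in [0, num_classes).

    Returns (noisy_labels, kinds), where kinds[i] is "clean", "flip", or
    "outlier". Hypothetical helper for illustration only.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")
    if flip_rate < 0 or outlier_rate < 0 or flip_rate + outlier_rate > 1.0:
        raise ValueError("rates must be non-negative and sum to at most 1")

    rng = random.Random(seed)
    n = len(labels)
    indices = list(range(n))
    rng.shuffle(indices)

    n_flip = round(n * flip_rate)
    n_outlier = round(n * outlier_rate)

    noisy = list(labels)
    kinds = ["clean"] * n

    # Label flip: the image keeps its content but receives the label of a
    # different identity that exists in the dataset.
    for i in indices[:n_flip]:
        wrong = rng.randrange(num_classes - 1)
        if wrong >= labels[i]:
            wrong += 1  # skip over the true label
        noisy[i] = wrong
        kinds[i] = "flip"

    # Outlier: the label stays, but the image would be swapped for one of an
    # identity outside the label set. Only the position is marked here; the
    # caller is assumed to substitute the image itself.
    for i in indices[n_flip:n_flip + n_outlier]:
        kinds[i] = "outlier"

    return noisy, kinds


labels = [i % 10 for i in range(1000)]
noisy, kinds = inject_label_noise(labels, num_classes=10,
                                  flip_rate=0.2, outlier_rate=0.1, seed=1)
```

Training the same model on data corrupted at matched flip and outlier rates is one way to compare the effect of each noise type on accuracy.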

Results and Implications

The experiments show that label noise wastes much of the training data in existing datasets. For instance, models trained on only 32% of the cleaned MegaFace subset or 20% of the cleaned MS-Celeb-1M subset perform comparably to those trained on the respective full original datasets. These results highlight the efficiency gains available from cleaner data.

In comparing the new IMDb-Face dataset with others, the research shows that performance is competitive despite its somewhat smaller size. The study thus underscores the importance of data quality over sheer quantity in training machine learning models for complex tasks like face recognition.

IMDb-Face was collected from movie screenshots and posters rather than through search engine queries, and the paper presents this as a less noise-prone way to gather face data. The approach could also yield more varied datasets, which would help the robustness of face recognition algorithms.

Future Directions

The research points to several avenues for future work:

  • Developing advanced learning algorithms that can inherently handle noise without significant performance degradation.
  • Expanding the scope of clean, verified datasets to cover a broader spectrum of use cases in face recognition.
  • Evaluating the reliability and efficiency of various annotation strategies within different contexts and scales, particularly for real-world deployment.

Conclusion

Overall, the paper addresses the challenge of label noise in face recognition datasets and quantifies how much it costs in training data and accuracy. The released IMDb-Face dataset gives the community a noise-controlled resource for developing cleaner data and more noise-tolerant algorithms, which matters most for applications that require high accuracy and reliability.