Learning Face Representation from Scratch (1411.7923v1)

Published 28 Nov 2014 in cs.CV

Abstract: Pushed by big data and deep convolutional neural networks (CNNs), the performance of face recognition is becoming comparable to that of humans. Using private large-scale training datasets, several groups have achieved very high performance on LFW, i.e., 97% to 99%. While there are many open-source implementations of CNNs, no large-scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than algorithms. To solve this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large-scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on this database, we use an 11-layer CNN to learn discriminative representations and obtain state-of-the-art accuracy on LFW and YTF. The publication of CASIA-WebFace will attract more research groups to this field and accelerate the development of face recognition in the wild.

Summary

The paper introduces the CASIA-WebFace dataset, enabling deep CNN training with 500K images across 10K subjects.
It proposes a semi-automatic data collection method using IMDb and multi-view detectors for accurate face annotation.
An 11-layer CNN architecture achieves 97.73% on LFW and 92.24% on YTF, demonstrating robust face recognition performance.

Learning Face Representation from Scratch

The research paper "Learning Face Representation from Scratch," authored by Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li, addresses a data gap in face recognition: groups with private large-scale training sets report much stronger deep convolutional neural network (CNN) results than groups without them, because no comparable dataset is publicly available. The authors introduce CASIA-WebFace, a large-scale dataset comprising approximately 10,000 subjects and 500,000 images, intended to support research on face recognition in unconstrained ("in the wild") conditions.

Overview of Contributions

The paper makes several notable contributions:

CASIA-WebFace Dataset: The dataset facilitates the training of deep CNNs, democratizing access to large-scale data for researchers and bridging the gap created by private datasets.
Semi-Automatic Data Collection Method: A semi-automatic approach is proposed for collecting and annotating face images from the internet, leveraging the IMDb database to curate and label images effectively.
Baseline Deep CNN Architecture: The paper details an 11-layer CNN, trains it on CASIA-WebFace, and reports its accuracy on the LFW and YTF benchmarks as a baseline for future work.

Dataset Collection and Annotation

The CASIA-WebFace dataset is built by crawling images from IMDb, focusing on celebrities born between 1940 and 2014. Faces are detected with a multi-view face detector and then grouped by identity using similarity scores from a face recognition engine, followed by manual correction. Automating detection and grouping keeps the process scalable, while the manual pass keeps the labels accurate.
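The semi-automatic idea can be illustrated with a minimal sketch. It is not the authors' procedure: it replaces clustering with a simpler two-threshold triage against a per-subject seed face, and the `Face` class, the `triage_faces` function, the score range, and both threshold values are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Face:
    image_id: str
    # Similarity to the subject's seed face, as returned by some face
    # recognition engine. Assumed here to lie in [0, 1] (an assumption).
    similarity: float


def triage_faces(faces, accept=0.8, reject=0.4):
    """Split candidate faces into auto-accepted, manual-review, and rejected.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    accepted, review, rejected = [], [], []
    for face in faces:
        if face.similarity >= accept:
            accepted.append(face.image_id)   # confident match: label automatically
        elif face.similarity >= reject:
            review.append(face.image_id)     # uncertain: send to a human annotator
        else:
            rejected.append(face.image_id)   # confident non-match: discard
    return accepted, review, rejected


candidates = [Face("img_001", 0.93), Face("img_002", 0.55), Face("img_003", 0.12)]
accepted, review, rejected = triage_faces(candidates)
print(accepted, review, rejected)
```

Only the middle band reaches a human annotator, which is what keeps a pipeline of this kind tractable at the scale of hundreds of thousands of images.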

Methodology for Face Representation Learning

The baseline CNN architecture integrates state-of-the-art techniques such as:

Very Deep Architecture: Inspired by the success of deep models in various computer vision tasks, the network is designed with 10 convolutional layers and one fully connected layer.
Small Filters and ReLU Neurons: The convolutional layers use small 3x3 filters, and Rectified Linear Units (ReLU) provide the non-linearity.
Dropout Regularization: To mitigate overfitting, a dropout layer is included before the fully connected layer.
Low-Dimensional Representations: The final face representation is kept low-dimensional so that it is compact yet still discriminative.
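To make the layer count and the case for small filters concrete, the sketch below traces the spatial size of a feature map through a stack of ten 3x3 convolutions. Several details are assumptions not stated in this summary: the grouping into five stages of two convolutions each, the 100-pixel input, padding of 1, 2x2 pooling with ceiling rounding, and a global pooling step at the end. All function names are hypothetical.

```python
import math

# Assumed layout: 5 stages x 2 convolutions = 10 conv layers, plus 1 FC layer.
STAGES = 5
CONVS_PER_STAGE = 2


def conv3x3_out(size):
    # 3x3 kernel, stride 1, padding 1 (assumed): spatial size is preserved.
    return (size + 2 * 1 - 3) // 1 + 1


def pool2x2_out(size):
    # 2x2 pooling, stride 2, ceiling rounding (assumed).
    return math.ceil((size - 2) / 2) + 1


def trace_spatial_size(input_size):
    """Return the feature-map side length at the input and after each stage."""
    sizes = [input_size]
    size = input_size
    for stage in range(STAGES):
        for _ in range(CONVS_PER_STAGE):
            size = conv3x3_out(size)
        if stage == STAGES - 1:
            size = 1  # assumed global pooling collapses the map to one vector
        else:
            size = pool2x2_out(size)
        sizes.append(size)
    return sizes


def weights_per_channel_pair(kernel, layers):
    # Weights needed per (input channel, output channel) pair, ignoring biases.
    return layers * kernel * kernel


weight_layers = STAGES * CONVS_PER_STAGE + 1   # 10 conv + 1 fully connected
sizes = trace_spatial_size(100)                # 100-pixel input is an assumption
stacked_3x3 = weights_per_channel_pair(3, 2)   # two 3x3 layers, 5x5 receptive field
single_5x5 = weights_per_channel_pair(5, 1)    # one 5x5 layer, same receptive field
print(weight_layers, sizes, stacked_3x3, single_5x5)
```

Two stacked 3x3 layers cover the same 5x5 receptive field as a single 5x5 layer with fewer weights per channel pair and an extra non-linearity in between, which is the usual argument for small filters in deep stacks.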

Experimental Evaluation

The performance of the proposed CNN is evaluated on two popular benchmarks: LFW (Labeled Faces in the Wild) and YTF (YouTube Faces). Key results from the paper include:

LFW Results: The CNN achieves a verification accuracy of 97.73%, surpassing DeepFace's ensemble of seven networks and approaching DeepID2's results with fewer networks.
YTF Results: The method attains an accuracy of 92.24%, indicating high generalization capability across different data sources and conditions.
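Verification accuracy on pair-based benchmarks such as LFW is the fraction of face pairs correctly classified as "same person" or "different person". The sketch below assumes cosine similarity as the scoring function and a fixed threshold; the feature vectors, the threshold, and the function names are illustrative and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verification_accuracy(pairs, threshold):
    """Fraction of (feat_a, feat_b, is_same) pairs classified correctly."""
    correct = 0
    for feat_a, feat_b, is_same in pairs:
        predicted_same = cosine_similarity(feat_a, feat_b) >= threshold
        if predicted_same == is_same:
            correct += 1
    return correct / len(pairs)


# Toy 2-D features; real face representations have far more dimensions.
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),    # similar vectors, same identity
    ([1.0, 0.0], [0.0, 1.0], False),   # orthogonal vectors, different identity
    ([1.0, 0.0], [0.6, 0.8], True),    # borderline match, same identity
    ([1.0, 1.0], [1.0, 0.9], False),   # look-alike impostor, misclassified
]
accuracy = verification_accuracy(toy_pairs, threshold=0.5)
print(accuracy)
```

In practice the threshold is chosen on held-out training folds rather than fixed by hand, so that the reported accuracy is not tuned on the test pairs.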

Additionally, evaluations using the BLUFR protocol demonstrate the robustness of the trained model in verification and identification scenarios at varying false acceptance rates (FAR).
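Reporting at a fixed FAR means choosing the score threshold so that at most a given fraction of impostor pairs is accepted, then measuring how many genuine pairs still pass. The following is a minimal sketch of that computation under simple assumptions (higher score means more similar, ties resolved conservatively); it is not the BLUFR reference implementation, and the function name and toy scores are hypothetical.

```python
import math


def verification_rate_at_far(genuine_scores, impostor_scores, target_far):
    """Return (verification rate, threshold) at a false accept rate <= target_far.

    A pair is accepted when its score is strictly greater than the threshold.
    """
    impostors = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs allowed to pass; the epsilon guards against
    # floating-point products such as 0.29 * 100 landing just below an integer.
    allowed = math.floor(target_far * len(impostors) + 1e-9)
    if allowed >= len(impostors):
        threshold = float("-inf")        # every pair may be accepted
    else:
        threshold = impostors[allowed]   # first impostor score that must fail
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores), threshold


genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.0, -0.1]
rate, threshold = verification_rate_at_far(genuine, impostor, target_far=0.1)
print(rate, threshold)
```

Lowering `target_far` raises the threshold and can only lower the verification rate, which is why results at small FAR values are a stricter test than plain pair accuracy.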

Theoretical and Practical Implications

The publication of the CASIA-WebFace dataset has significant implications:

Standardization: It provides a common ground for evaluating face recognition algorithms, promoting reproducibility and comparability in research.
Data Diversity and Scale: The extensive dataset enables the training of more complex models that are better suited to handle real-world variations in pose, lighting, and occlusions.
Research Acceleration: By making large-scale data available, the paper encourages more groups to explore advancements in deep learning algorithms for face recognition.

Future Directions

Future research can build upon the foundation laid by this work in several ways:

Dataset Expansion: Utilizing commercial image search engines to further enhance the dataset's diversity and scale.
Advanced Annotation Techniques: Developing more sophisticated tools and methods for accurate and efficient data annotation.
Optimized Networks: Exploring optimized single-network architectures or ensembles to push the boundaries of face recognition accuracy.

Conclusion

"Learning Face Representation from Scratch" shows that a publicly available large-scale face dataset is enough to train a deep CNN to competitive accuracy on LFW and YTF. CASIA-WebFace, together with the semi-automatic collection method and the 11-layer baseline network, gives later work a shared dataset and a reference point to build on.
