
Do We Really Need to Collect Millions of Faces for Effective Face Recognition? (1603.07057v2)

Published 23 Mar 2016 in cs.CV

Abstract: Face recognition capabilities have recently made extraordinary leaps. Though this progress is at least partially due to ballooning training set sizes -- huge numbers of face images downloaded and labeled for identity -- it is not clear if the formidable task of collecting so many images is truly necessary. We propose a far more accessible means of increasing training data sizes for face recognition systems. Rather than manually harvesting and labeling more faces, we simply synthesize them. We describe novel methods of enriching an existing dataset with important facial appearance variations by manipulating the faces it contains. We further apply this synthesis approach when matching query images represented using a standard convolutional neural network. The effect of training and testing with synthesized images is extensively tested on the LFW and IJB-A (verification and identification) benchmarks and Janus CS2. The performances obtained by our approach match state of the art results reported by systems trained on millions of downloaded images.

Authors (5)
  1. Iacopo Masi (28 papers)
  2. Anh Tuan Tran (17 papers)
  3. Jatuporn Toy Leksut (2 papers)
  4. Tal Hassner (48 papers)
  5. Gerard Medioni (33 papers)
Citations (362)

Summary

Analyzing "Do We Really Need to Collect Millions of Faces for Effective Face Recognition?"

The paper "Do We Really Need to Collect Millions of Faces for Effective Face Recognition?" challenges the prevailing assumption that ever-larger labeled datasets are essential for training high-performance face recognition systems. Rather than collecting and labeling extensive face sets from the web, the authors propose augmenting an existing dataset through face synthesis. This summary provides an overview of the methodology, the experimental results, and the implications of the findings.

Face Synthesis and Data Augmentation

The crux of the proposed methodology lies in synthesizing facial images to enhance dataset variability without altering subject identity. The authors enrich an existing dataset by introducing three types of variation:

  1. Pose Variations: Using 3D face modeling techniques, new images are rendered at different yaw angles. This simulates varying viewpoints and generates novel poses, expanding the intra-class variability crucial for training robust recognition models (a geometric sketch of this rendering step follows the list).
  2. 3D Shape Variations: The same face is rendered with several generic 3D face shapes. Because subtle differences in 3D shape have little effect on perceived identity, these renderings add appearance variation without changing the subject's label.
  3. Expression Variations: Expressions are altered by synthesizing a closed-mouth version of each face while preserving the other attributes of the image.
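
The paper's rendering pipeline is not reproduced in this summary; the following is a minimal numpy sketch of the geometric idea behind pose synthesis, assuming a generic 3D face shape that is rotated to a target yaw angle and projected with a weak-perspective camera. The landmark coordinates and yaw angles below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def yaw_rotation(theta_deg):
    """Rotation matrix about the vertical (y) axis for a yaw of theta_deg degrees."""
    t = np.deg2rad(theta_deg)
    return np.array([[ np.cos(t), 0.0, np.sin(t)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(t), 0.0, np.cos(t)]])

def project(points_3d, scale=1.0):
    """Weak-perspective projection: drop depth and scale the remaining x, y coordinates."""
    return scale * points_3d[:, :2]

# Placeholder "generic" 3D face shape: a handful of 3D landmark positions.
# A real pipeline would use a dense generic 3D face model with the source
# image mapped onto it as texture before re-rendering.
generic_face = np.array([
    [-30.0,  30.0, 20.0],   # left eye
    [ 30.0,  30.0, 20.0],   # right eye
    [  0.0,   0.0, 40.0],   # nose tip
    [-25.0, -30.0, 15.0],   # left mouth corner
    [ 25.0, -30.0, 15.0],   # right mouth corner
])

# Synthesize new poses by rotating the shape to several target yaw angles
# and projecting back to 2D; each projection defines where the original
# face texture would land in the synthesized view.
for yaw in (0, 22, 40, 75):          # illustrative angles only
    rotated = generic_face @ yaw_rotation(yaw).T
    landmarks_2d = project(rotated)
    print(f"yaw={yaw:3d} deg ->\n{landmarks_2d.round(1)}")
```

A full implementation would additionally fit the generic model to the detected facial landmarks of each photo, map the photo onto the 3D surface as texture, and re-render it from each new viewpoint.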

Convolutional Neural Network (CNN) Training

The recognition pipeline fine-tunes a CNN based on the VGG architecture on the augmented dataset. At test time, the same synthesis is applied to query images: softmax scores are fused across a face and its synthesized views, and CNN features extracted from multiple images of the same subject are pooled ("video pooling") to further improve recognition performance.
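
The paper's exact fusion rules are not detailed in this summary; the sketch below illustrates, under simple assumptions (mean pooling of features and mean fusion of scores), how video pooling and softmax score fusion are commonly realized. The extract_features function is a stand-in for a forward pass through the fine-tuned CNN, and the 256-dimensional features and toy templates are placeholders.

```python
import numpy as np

def extract_features(images):
    """Placeholder for a forward pass through the fine-tuned CNN.
    Returns one feature vector per input image (here: random 256-D vectors)."""
    rng = np.random.default_rng(len(images))
    return rng.normal(size=(len(images), 256))

def video_pool(features):
    """'Video pooling': collapse the features of all images/frames of one
    subject (or template) into a single vector by element-wise averaging."""
    return features.mean(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse_softmax_scores(score_matrix):
    """Fuse per-view softmax score vectors (one row per real or synthesized
    view) into a single class-score vector by averaging."""
    return score_matrix.mean(axis=0)

# Toy templates: lists of images (represented only by their count here).
template_a = ["img"] * 4     # e.g. four frames of the same subject
template_b = ["img"] * 6

feat_a = video_pool(extract_features(template_a))
feat_b = video_pool(extract_features(template_b))
print("template similarity:", cosine_similarity(feat_a, feat_b))

# Score fusion across three synthesized views of one query image,
# each producing a softmax distribution over ten identities.
per_view_scores = np.random.default_rng(1).dirichlet(np.ones(10), size=3)
print("fused class scores:", fuse_softmax_scores(per_view_scores).round(3))
```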

Experimental Validation

The approach was validated on prominent benchmarks, including LFW, IJB-A (verification and identification), and Janus CS2. The authors present an in-depth evaluation of each augmentation technique and show that recognition accuracy obtained with synthesized training data is competitive with state-of-the-art systems trained on millions of downloaded images. The pipeline starts from roughly 495K images and expands them through synthesis to about 2.4 million, yet matches the performance of models trained on far larger face collections.

Practical and Theoretical Implications

The paper's contributions suggest an alternative to the extensive labeled datasets traditionally demanded by state-of-the-art systems. Using synthesized data mitigates several issues, including the financial and logistical burden of dataset collection and the privacy concerns associated with harvesting and labeling face images from the web.

Moreover, the paper opens avenues for future exploration of domain-specific data augmentation techniques that leverage knowledge of the target domain to synthesize data naturally and effectively. Examples include synthesizing age progression or altering lighting conditions, which could further diversify training datasets for face recognition systems.

Conclusion

This paper envisions a paradigm shift in face recognition by underlining that synthetic data can significantly complement, and potentially replace, large real-world datasets. The synthesis approach offers considerable advantages in scalability, ethical data handling, and recognition performance, marking a substantial contribution to computer vision. The insights provided here could extend beyond face recognition, prompting similar domain-specific augmentation in other areas where training data must capture complex patterns of variability.