- The paper demonstrates that modifying Vision Transformers to use overlapping patches preserves inter-patch context and achieves performance comparable to CNNs on large-scale datasets.
- The approach relies on large-scale training, as shown by experiments on the MS-Celeb-1M and CASIA-WebFace datasets, highlighting data dependency for optimal Transformer performance.
- The study encourages future research to enhance efficiency and adversarial robustness in face recognition by integrating self-attention mechanisms with traditional CNN strengths.
Face Transformer for Recognition: An Analytical Overview
The paper "Face Transformer for Recognition" by Yaoyao Zhong and Weihong Deng explores the applicability of Transformer architectures in the domain of facial recognition, traditionally dominated by convolutional neural networks (CNNs). This investigation is fundamentally motivated by the Transformer’s proficiency in handling various tasks in the natural language processing domain and, increasingly, in computer vision.
Research Motivation and Methodology
The core hypothesis addressed in the paper is whether Transformers can match or exceed the performance of CNNs in face recognition tasks. The paper specifically considers whether the Transformer's self-attention mechanism, whose complexity is O(n²) with respect to sequence length, can be a viable alternative to CNNs' convolutional operations when applied to face recognition on large-scale datasets.
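For intuition on that quadratic cost: self-attention materializes an n × n score matrix over the token sequence. The following minimal NumPy sketch (illustrative only, not taken from the paper) makes the O(n²) term explicit:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over n tokens.

    The (n, n) score matrix is what makes the cost O(n^2) in
    sequence length, in both time and memory.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n, n) <- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d)

# Example: 197 tokens (196 patches + [CLS]) with dimension 64
n, d = 197, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (197, 64)
```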
To harness the potential of Transformers for face recognition, the authors introduce a modification to the Vision Transformer (ViT). They hypothesize that the standard ViT tokenizer may discard significant inter-patch information by splitting the image into non-overlapping patches. They therefore generate patches with a sliding window whose stride is smaller than the patch size, so that adjacent patches overlap slightly and more inter-patch context is preserved.
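A minimal PyTorch sketch of such an overlapping tokenizer, implemented here as a strided convolution with stride smaller than the kernel size (the patch size, stride, and embedding dimension below are illustrative assumptions, not the paper's exact values):

```python
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Tokenize an image with overlapping patches.

    Using stride < patch_size makes adjacent patches share pixels,
    preserving inter-patch context; stride == patch_size recovers
    the standard non-overlapping ViT tokenizer.
    """
    def __init__(self, img_size=112, patch_size=8, stride=6,
                 in_chans=3, embed_dim=512):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride)
        self.num_patches = ((img_size - patch_size) // stride + 1) ** 2

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H', W')
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

tokens = OverlappingPatchEmbed()(torch.randn(2, 3, 112, 112))
print(tokens.shape)  # torch.Size([2, 324, 512])
```

Because the overlap is controlled by a single stride parameter, the change is a drop-in replacement for the standard patch embedding layer.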
Experimental Setup
The Face Transformer models were trained on the large-scale MS-Celeb-1M database and the medium-scale CASIA-WebFace database. Evaluation was performed across several benchmarks commonly used to assess face recognition systems, including LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB, and IJB-C.
Notable implementation details include the CosFace loss function, chosen for its discriminative power, and an SGD optimizer with cosine learning rate decay. Model performance was benchmarked against ResNet-100, a strong CNN baseline for face recognition.
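For concreteness, here is a minimal PyTorch sketch of the CosFace (large margin cosine) loss; the scale s and margin m below are the defaults from the original CosFace paper and may differ from the settings used in this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceLoss(nn.Module):
    """Large-margin cosine loss (CosFace). The scale s and margin m
    are illustrative defaults, not necessarily this paper's values."""
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        self.W = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, feats, labels):
        # Cosine similarity between L2-normalized features and class weights
        cos = F.linear(F.normalize(feats), F.normalize(self.W))  # (B, C)
        # Subtract the margin m from the target-class cosine only
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)

# Example usage with a 512-d embedding and 10 identities
loss_fn = CosFaceLoss(feat_dim=512, num_classes=10)
loss = loss_fn(torch.randn(4, 512), torch.randint(0, 10, (4,)))
print(loss.item())
```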
Key Results and Observations
The empirical results show that Face Transformer models achieve performance comparable to ResNet-100 when trained on the large-scale MS-Celeb-1M dataset. However, the Transformer models performed poorly when trained on the smaller CASIA-WebFace dataset, indicating that they require large amounts of data to train effectively. The overlapping patch strategy outperformed the standard ViT, supporting the hypothesis that incorporating inter-patch information improves results.
The Transformer models showed no substantial advantage in adversarial robustness on TALFW, whose adversarial examples were generated with CNN surrogate models; the architectural mismatch between the attacked CNNs and the evaluated Transformers did not translate into meaningfully better resistance. Furthermore, an analysis using Attention Rollout revealed that the models focus on facial regions, confirming the expected attention patterns.
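Attention Rollout (Abnar & Zuidema, 2020) aggregates attention across layers by recursively multiplying per-layer attention maps while accounting for residual connections. A minimal sketch, assuming access to the model's per-layer attention tensors:

```python
import torch

def attention_rollout(attentions):
    """Attention Rollout: propagate attention through the network by
    recursively multiplying per-layer attention maps, adding the
    identity to account for residual connections.

    attentions: list of (B, heads, n, n) tensors, one per layer.
    Returns (B, n, n); result[:, 0, 1:] is the [CLS] token's attention
    over image patches, which can be reshaped into a heatmap.
    """
    result = None
    for attn in attentions:
        a = attn.mean(dim=1)                 # average over heads
        a = a + torch.eye(a.size(-1))        # residual connection
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalize rows
        result = a if result is None else a @ result
    return result

# Example with random attention maps: 12 layers, 197 tokens (196 + [CLS])
maps = [torch.rand(1, 8, 197, 197).softmax(dim=-1) for _ in range(12)]
cls_attn = attention_rollout(maps)[0, 0, 1:]   # (196,) patch relevances
print(cls_attn.view(14, 14).shape)             # 14 x 14 heatmap
```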
Implications and Future Directions
This paper supports the notion that with substantial data, Transformers can be powerful alternatives to CNNs for face recognition tasks. The paper’s findings underscore the importance of data scale in the success of Transformer models in visual domains.
Theoretically, the results suggest that traditional face recognition systems could be redesigned around self-attention mechanisms, opening new avenues for model improvement. Practically, they point future research toward optimizing Transformers for efficiency and integrating them with adversarial training techniques to improve robustness.
Conclusion
In conclusion, the application of Transformer architectures to face recognition, as explored in this paper, presents a promising research trajectory. While the necessity for large datasets remains a significant consideration, the work paves the way for more robust, flexible, and accurate face recognition systems. Future work could aim to refine these models for smaller datasets and explore hybrid architectures combining the strengths of CNNs and Transformers.