
Toward High Quality Facial Representation Learning (2309.03575v1)

Published 7 Sep 2023 in cs.CV

Abstract: Face analysis tasks have a wide range of applications, but the universal facial representation has only been explored in a few works. In this paper, we explore high-performance pre-training methods to boost the face analysis tasks such as face alignment and face parsing. We propose a self-supervised pre-training framework, called **Mask Contrastive Face (MCF)**, with mask image modeling and a contrastive strategy specially adjusted for face domain tasks. To improve the facial representation quality, we use the feature map of a pre-trained visual backbone as a supervision item and use a partially pre-trained decoder for mask image modeling. To handle face identity during the pre-training stage, we further use random masks to build contrastive learning pairs. We conduct the pre-training on the LAION-FACE-cropped dataset, a variant of LAION-FACE 20M, which contains more than 20 million face images from Internet websites. For efficient pre-training, we evaluate our framework's pre-training performance on a small part of LAION-FACE-cropped and verify its superiority under different pre-training settings. Our model pre-trained with the full pre-training dataset outperforms the state-of-the-art methods on multiple downstream tasks. Our model achieves 0.932 NME$_{diag}$ for AFLW-19 face alignment and a 93.96 F1 score for LaPa face parsing. Code is available at https://github.com/nomewang/MCF.


Summary

  • The paper proposes a self-supervised MCF framework that combines mask image modeling with contrastive learning to enhance facial representation from unlabeled data.
  • Its methodology reconstructs feature maps instead of pixels, preserving facial context and improving intra-class compactness and inter-class separability.
  • Experimental results on the LAION-FACE-cropped dataset show notable gains with a face alignment NME of 0.932 and an F1 score of 93.96 for face parsing.

Toward High-Quality Facial Representation Learning

The paper "Toward High-Quality Facial Representation Learning" proposes a self-supervised learning model named Mask Contrastive Face (MCF), specifically focused on enhancing performance for facial analysis tasks. The primary objective of this research is to develop an effective pre-training approach for face alignment, face parsing, and other face-centric applications, using vast amounts of unlabeled data.

Proposed Methodology: MCF Framework

The researchers introduce a self-supervised pre-training framework, MCF, which combines mask image modeling with a contrastive learning strategy tailored for face-domain tasks. This methodology aims to extend the capabilities of existing facial representation models by solving the problems associated with supervised learning, such as high labeling costs and poor generalization.

Mask Contrastive Learning

The MCF framework integrates mask image modeling with contrastive learning. The mask image modeling branch uses the feature map of a pre-trained visual backbone as the supervision target and a partially pre-trained decoder for reconstruction, optimizing the model for facial representation learning. The process starts by randomly masking parts of the input image and using the remaining patches to predict the masked regions. Diverging from prior methods, the framework avoids direct pixel reconstruction and instead reconstructs feature maps, which carry richer semantic information for learning.
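The masked feature-map reconstruction objective can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 75% mask ratio, and the random feature tensors standing in for backbone outputs are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Randomly choose which patch indices to hide from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return perm[:num_masked]  # indices of masked patches

def feature_reconstruction_loss(pred_feats, teacher_feats, masked_idx):
    """MSE between predicted and teacher feature maps, computed on
    masked patches only (not on raw pixels)."""
    diff = pred_feats[masked_idx] - teacher_feats[masked_idx]
    return float(np.mean(diff ** 2))

# Toy example: 196 patches (14x14 grid), 512-dim features, 75% masked.
num_patches, dim = 196, 512
masked = random_patch_mask(num_patches, mask_ratio=0.75, rng=rng)
teacher = rng.standard_normal((num_patches, dim))  # stand-in for pre-trained backbone features
student = teacher + 0.1 * rng.standard_normal((num_patches, dim))  # stand-in for decoder output
loss = feature_reconstruction_loss(student, teacher, masked)
```

The key point the sketch captures is that the regression target is a feature map from a frozen, pre-trained backbone rather than pixel values, so the model is supervised with semantically richer signals.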

Contrastive Learning Mechanism

To handle face identity during the pre-training stage, the authors use random masks to generate contrastive pairs. Unlike conventional methods that rely on random cropping to form positive pairs, the proposed approach preserves key facial context: both views retain the full face structure and differ only in which patches are randomly masked. This enhances the intra-class compactness and inter-class separability of the learned features.

Dataset and Experimental Setup

To validate the effectiveness of their framework, the authors utilized the LAION-FACE-cropped dataset, a variant of LAION-FACE 20M that includes more than 20 million face images sourced from the web. To gauge pre-training efficiency, they first benchmarked the framework on a subset of this dataset under various pre-training settings, verifying its advantages at small scale before scaling up to the full dataset.

Results and Evaluation

The proposed MCF model, when pre-trained on the full dataset, achieves significant advancements in downstream tasks. Notably, the model records an NME$_{diag}$ of 0.932 on the AFLW-19 face alignment task and an F1 score of 93.96 for LaPa face parsing. These results represent an improvement over existing state-of-the-art methods. Comparisons with other facial representation learning models, including supervised and text-supervised methods such as FaRL, showcase the model's robustness and efficiency.
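For reference, the NME$_{diag}$ metric as commonly used on AFLW normalizes the mean landmark error by the diagonal of the ground-truth face bounding box (lower is better); the percentage scaling below is the usual convention, assumed here rather than quoted from the paper:

$$
\mathrm{NME}_{diag} \;=\; \frac{100}{N}\sum_{i=1}^{N}\frac{1}{L}\sum_{j=1}^{L}\frac{\lVert p_{ij}-\hat{p}_{ij}\rVert_2}{\sqrt{w_i^2+h_i^2}}
$$

where $N$ is the number of test images, $L$ the number of landmarks per face, $p_{ij}$ and $\hat{p}_{ij}$ the ground-truth and predicted landmark positions, and $w_i$, $h_i$ the width and height of the $i$-th ground-truth bounding box.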

Practical and Theoretical Implications

The research contributes significantly to both practical applications and theoretical understanding. Practically, the MCF framework enhances face alignment and parsing tasks by providing a robust self-supervised pre-training method that leverages large-scale unlabeled data. Theoretically, the hybrid approach of mask image modeling with a partially pre-trained decoder and the novel contrastive learning strategy offers a new perspective on improving facial representation learning without relying on annotated datasets.

Potential Future Developments

The MCF framework sets the stage for future research aimed at optimizing self-supervised learning in specific domains. Potential future developments could explore fine-tuning the components of this framework for higher specialization in different facial analysis tasks or integrating additional contextual information to further improve the model's performance. Another promising direction could be extending the methodology to other domains that require high-quality representation learning but face limitations in labeled data availability.

Conclusion

The authors' exploration of high-performance pre-training methods through the MCF framework provides an impactful contribution to the field of facial representation learning. By addressing key issues associated with supervised learning and efficiently utilizing large-scale unlabeled data, this research paves the way for more adaptable and robust facial analysis systems. The numerical results and comparative evaluations underscore the significant performance gains achieved by the proposed methodology, highlighting its potential as a cornerstone for future advancements in AI-driven facial analysis.