An Academic Overview of "MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition"
The paper "MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition" presents an innovative approach to dynamic facial expression recognition (DFER) leveraging large-scale self-supervised learning. This paper provides a detailed exposition of a novel self-supervised method named MAE-DFER, which aims to mitigate the limitations of labeled data scarcity in traditional supervised learning paradigms.
Key Contributions
- Introduction of MAE-DFER: MAE-DFER extends masked autoencoder pre-training to video-based dynamic facial expression recognition. Inspired by the success of VideoMAE, the authors tailor the masking-and-reconstruction recipe to the specific challenges of DFER (a masking sketch is given after this list).
- Local-Global Interaction Transformer (LGI-Former): A pivotal contribution is an efficient Transformer encoder, LGI-Former, designed to cut the computational cost that plain Vision Transformers (ViT) incur during fine-tuning. LGI-Former restricts self-attention to local regions and couples those regions through a small set of representative tokens that exchange global context, balancing computational cost and performance (see the block sketch after this list).
- Joint Masked Appearance and Motion Modeling: MAE-DFER goes beyond appearance reconstruction by also modeling temporal motion, which is critical for capturing dynamic expressions. By adding frame-difference signals as an extra reconstruction target, the model efficiently captures both static appearance and dynamic motion cues (a sketch of the joint objective follows the list).
- Empirical Validation: Extensive experiments on six DFER datasets, covering both in-the-wild and lab-controlled settings, show that MAE-DFER consistently surpasses state-of-the-art supervised methods. The reported gains are substantial, e.g. +6.30% UAR on DFEW and +8.34% UAR on MAFW, substantiating the benefit of large-scale self-supervised pre-training.
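To make the pre-training recipe concrete, the following is a minimal PyTorch sketch of VideoMAE-style tube masking, the mechanism MAE-DFER inherits for hiding most of the input before reconstruction. The function name, token counts, and the 0.9 masking ratio are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of VideoMAE-style tube masking: the same spatial patches
# are hidden in every temporal token, so the encoder cannot simply copy a
# masked patch from a neighbouring frame. Shapes and ratio are assumptions.
import torch

def tube_mask(batch_size: int, temporal_tokens: int, spatial_patches: int,
              mask_ratio: float = 0.9) -> torch.Tensor:
    """Return a boolean mask of shape (B, T, P); True marks a masked token."""
    num_masked = int(mask_ratio * spatial_patches)
    noise = torch.rand(batch_size, spatial_patches)        # per-sample randomness
    ids = noise.argsort(dim=1)                              # random permutation
    spatial = torch.zeros(batch_size, spatial_patches, dtype=torch.bool)
    batch_idx = torch.arange(batch_size).unsqueeze(1)
    spatial[batch_idx, ids[:, :num_masked]] = True          # hide num_masked patches
    # Repeat the same spatial pattern across all temporal tokens -> a "tube".
    return spatial.unsqueeze(1).expand(-1, temporal_tokens, -1)

# Example: a 16-frame clip tokenised into 8 temporal x 196 spatial tokens.
mask = tube_mask(batch_size=2, temporal_tokens=8, spatial_patches=196)
print(mask.shape, mask.float().mean().item())               # ~90% of tokens masked
```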
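The efficiency idea behind LGI-Former can also be illustrated in code. The sketch below follows the description above: local tokens attend only within their own region, one learnable representative token per region summarizes it, the representatives exchange information globally, and the result is fed back to the local tokens. The dimensions, the additive fusion step, and the module layout are assumptions; the authors' actual block may differ in detail.

```python
# Hedged sketch of one LGI-Former-style block. Local attention costs
# O(R * (N+1)^2) and global attention O(R^2), instead of O((R*N)^2) for
# full self-attention over all tokens.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, rep: torch.Tensor):
        """x: (B, R, N, C) local tokens grouped into R regions of N tokens;
        rep: (B, R, C) one representative token per region."""
        B, R, N, C = x.shape
        # 1) Local intra-region self-attention; the representative token is
        #    prepended so it can summarise its region.
        tokens = torch.cat([rep.reshape(B * R, 1, C), x.reshape(B * R, N, C)], dim=1)
        h = self.norm1(tokens)
        tokens = tokens + self.local_attn(h, h, h, need_weights=False)[0]
        rep, x = tokens[:, :1], tokens[:, 1:]
        # 2) Global inter-region self-attention among the R representative tokens only.
        rep = rep.reshape(B, R, C)
        g = self.norm2(rep)
        rep = rep + self.global_attn(g, g, g, need_weights=False)[0]
        # 3) Local-global interaction: broadcast the globally-updated representative
        #    back to its region's tokens (a simple additive fusion here).
        x = x.reshape(B, R, N, C) + rep.unsqueeze(2)
        x = x + self.mlp(self.norm3(x))
        return x, rep

# Usage: 2 clips, 8 regions of 49 tokens each, embedding size 512.
block = LocalGlobalBlock(dim=512, num_heads=8)
x, rep = block(torch.randn(2, 8, 49, 512), torch.zeros(2, 8, 512))
```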
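Finally, the joint appearance and motion objective can be summarized as two reconstruction losses computed only on masked positions: one on raw pixel patches and one on frame-difference signals. The loss weight, the padding of the first time step, and the tensor layout below are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of the joint masked appearance + motion objective.
# `lambda_motion` and the (B, T, P, D) layout are assumptions for illustration.
import torch
import torch.nn.functional as F

def joint_reconstruction_loss(pred_appearance: torch.Tensor,
                              pred_motion: torch.Tensor,
                              frames: torch.Tensor,
                              mask: torch.Tensor,
                              lambda_motion: float = 1.0) -> torch.Tensor:
    """frames: (B, T, P, D) patchified pixels; mask: (B, T, P) bool, True = masked;
    pred_appearance / pred_motion: decoder outputs with the same shape as frames."""
    appearance_target = frames
    # Motion target: differences between temporally adjacent tokens; the first
    # step is prepended with a copy of itself so its difference is zero
    # (an assumption, not necessarily the paper's exact construction).
    motion_target = torch.diff(frames, dim=1, prepend=frames[:, :1])
    loss_app = F.mse_loss(pred_appearance[mask], appearance_target[mask])
    loss_mot = F.mse_loss(pred_motion[mask], motion_target[mask])
    return loss_app + lambda_motion * loss_mot
```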
Implications and Future Directions
The implications of this research are multifaceted. Practically, MAE-DFER offers a robust framework capable of leveraging vast unlabeled datasets, promising substantial improvements in applications requiring emotional intelligence from machines, such as human-computer interaction, healthcare, and educational tools.
From a theoretical perspective, the work raises interesting questions about how self-supervised learning paradigms should be adapted to specific domains, particularly tasks that require an understanding of temporal dynamics. The introduction of LGI-Former further suggests that Transformer architectures can be tailored for substantially better efficiency without compromising performance.
Looking ahead, future work might explore scaling MAE-DFER to larger datasets and larger architectures to capture nuanced expressions across more diverse demographics. Additionally, combining MAE-DFER with multimodal data (e.g., speech and physiological signals) could further enhance its applicability in real-world scenarios.
In conclusion, the paper presents a significant step forward in dynamic facial expression recognition, offering an efficient, scalable, and effective approach that challenges existing methodologies and opens avenues for future exploration in self-supervised video analysis.