Insightful Overview of "HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition"
The paper "HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition" introduces HiCMAE, a self-supervised learning framework specifically designed for audio-visual emotion recognition (AVER). It addresses the limitations of previous supervised approaches in AVER, which are significantly constrained by data scarcity. Instead, HiCMAE leverages a hierarchical approach to learn from vast amounts of unlabeled audio-visual data, combining masked data modeling with contrastive learning to enhance the learning of audio-visual emotional representations.
Key Concepts and Architecture
HiCMAE combines two self-supervised techniques to improve representation learning:
- Masked Data Modeling: Borrowing from the success of masked autoencoders, HiCMAE masks portions of the audio and visual inputs and reconstructs them, forcing the model to capture deep semantic information.
- Contrastive Learning: By focusing on cross-modal correspondences, HiCMAE applies contrastive learning across multiple levels of representation, strengthening the fusion of the audio and visual modalities (a minimal sketch of both objectives follows this list).
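The sketch below illustrates, in PyTorch-style pseudocode, how these two pre-training signals can be expressed. It is not the authors' implementation: the function names, the mask ratio, and the temperature value are illustrative assumptions; only the overall structure (masked reconstruction plus a symmetric cross-modal InfoNCE loss) follows the combination described above.

```python
# Minimal sketch of the two pre-training objectives, assuming tokenized
# audio/visual inputs of shape (B, N, D). Not the HiCMAE codebase.
import torch
import torch.nn.functional as F

def random_mask(tokens, mask_ratio=0.75):
    """Keep a random subset of tokens; return the kept tokens and their indices."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    keep_idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

def masked_reconstruction_loss(reconstruction, targets, keep_idx):
    """MSE computed only on masked positions (the masked-autoencoder objective).
    `reconstruction` is assumed to cover all N positions."""
    B, N, D = targets.shape
    masked = torch.ones(B, N, dtype=torch.bool, device=targets.device)
    masked.scatter_(1, keep_idx, False)            # False at kept (visible) positions
    return F.mse_loss(reconstruction[masked], targets[masked])

def cross_modal_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/visual clips are positives, all others negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature               # (B, B) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In a full pipeline the two losses would be summed, typically with a weighting coefficient, and, as noted above, the contrastive term is applied at several representation levels rather than only at the final one.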
In a significant departure from conventional approaches, HiCMAE introduces hierarchical skip connections between its encoder and decoder, allowing intermediate layers to contribute actively to representation learning. The architecture also incorporates hierarchical feature fusion during fine-tuning, integrating multi-level features to boost downstream task performance.
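Below is a hedged sketch of how such hierarchical skip connections and multi-level feature fusion can be wired up. The module names, layer counts, and the simple project-and-add skip are illustrative assumptions, not HiCMAE's exact architecture.

```python
# Toy encoder-decoder with hierarchical skips, plus multi-level fusion for fine-tuning.
# All hyperparameters and module choices are placeholders.
import torch
import torch.nn as nn

class HierarchicalEncoderDecoder(nn.Module):
    """Encoder whose intermediate outputs feed matching decoder layers (skip connections)."""
    def __init__(self, dim=256, depth=4, num_heads=4):
        super().__init__()
        make_block = lambda: nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.ModuleList([make_block() for _ in range(depth)])
        self.decoder = nn.ModuleList([make_block() for _ in range(depth)])
        self.skip_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        enc_feats = []
        for block in self.encoder:
            x = block(x)
            enc_feats.append(x)                        # retain every intermediate level
        y = x
        for i, block in enumerate(self.decoder):
            # hierarchical skip connection: inject the matching encoder level into the decoder
            y = block(y + self.skip_proj[i](enc_feats[-(i + 1)]))
        return y, enc_feats

def fuse_multilevel_features(enc_feats):
    """Fine-tuning-time fusion: pool each encoder level over tokens and concatenate."""
    pooled = [feats.mean(dim=1) for feats in enc_feats]    # one (B, dim) vector per level
    return torch.cat(pooled, dim=-1)                       # (B, dim * num_levels)
```

During fine-tuning, a classification head would consume the output of `fuse_multilevel_features`, giving it access to both lower- and higher-level representations rather than only the final layer, in the spirit of the hierarchical fusion described above.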
Experimental Evidence
The paper reports comprehensive experiments conducted on nine diverse datasets covering both categorical and dimensional emotion recognition tasks, establishing HiCMAE's superior performance over both state-of-the-art supervised and self-supervised models. Empirically, HiCMAE demonstrated substantial improvements in accuracy and recall across several benchmarks, such as MAFW, DFEW, and CREMA-D, highlighting its powerful learning capabilities. Of particular note is the improvement in recognition accuracy for emotions poorly supported by existing datasets, suggesting the framework's robustness against data imbalance challenges commonly faced in AVER.
Implications and Speculation for Future Work
The proposed HiCMAE framework opens up promising avenues in AVER by effectively exploiting unlabeled data, thus circumventing the need for the large labeled datasets traditionally required in supervised learning. The inherent scalability of the self-supervised approach positions it well for broader application in diverse real-world scenarios.
The hierarchical feature alignment and skip connections employed in HiCMAE might also pave the way for further research into hierarchical multi-modal learning, potentially impacting related fields such as action recognition and multi-modal language processing. Future work could explore extending this hierarchical contrastive framework to other domains where multi-modal data plays a crucial role.
Furthermore, the promising results presented in this paper suggest that continued scaling of such self-supervised models may improve performance further, building on advances in computational resources and algorithm design.
In conclusion, HiCMAE stands as a substantial contribution to the field of emotion recognition, illustrating potent strategies for leveraging self-supervision in multi-modal contexts. It offers not only a boost in emotion recognition performance but also informs future directions in multi-modal machine learning research.