The paper presents an approach to multimodal speech emotion recognition that leverages "BERT-like" self-supervised learning (SSL) models for processing speech and text inputs. These models are fine-tuned jointly to improve the extraction of emotional cues from both modalities. Evaluated on three benchmark datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), the approach improves on state-of-the-art (SOTA) performance metrics, demonstrating the efficacy of these SSL architectures in affective computing.
Key Contributions:
- Modality-Specific SSL Models:
- The paper employs "BERT-like" SSL architectures suited to each modality: the text component uses RoBERTa, a BERT variant that drops the next-sentence-prediction objective, while the speech component uses vq-wav2vec to discretize raw audio into tokens that a Speech-BERT model then encodes. Both SSL models are pretrained on large unlabeled corpora, enabling robust representation learning without extensive labeled data (a sketch of both encoders follows this list).
- Fusion Mechanisms:
- The paper investigates two fusion strategies for combining speech and text embeddings:
- Shallow Fusion: A straightforward approach that concatenates the pooled classification (CLS-style) tokens from each modality and passes them through a simple prediction head; it delivered superior performance with fewer parameters.
- Co-Attentional Fusion: A more complex method that enables deeper interaction between modalities through a co-attentional mechanism, in which each modality's query vectors attend over the key-value pairs of the other modality to facilitate detailed cross-modal interaction (both fusion variants are sketched after this list).
- Joint Fine-Tuning:
- In contrast to prior methods that use pretrained models as static feature extractors, the researchers fine-tune Speech-BERT and RoBERTa jointly with the downstream classifier. This retains the benefits of SSL pretraining while specializing the representations for emotion recognition, yielding higher accuracy and F1 scores across multiple benchmark datasets (see the fine-tuning sketch after this list).
- Ablation Studies:
- The experiments assess, among other factors, the impact of each fusion strategy with fine-tuned versus frozen SSL backbones, and compare unimodal with multimodal performance, confirming that multimodal integration substantially outperforms unimodal baselines.
- Performance Metrics:
- The approach achieves top performance on metrics such as binary accuracy (BA), F1-score, 7-class accuracy, and mean absolute error (MAE) across the tested datasets, establishing new benchmarks for multimodal emotion recognition (an illustrative metric computation follows this list).
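The following sketch illustrates the two modality-specific encoders. The text side uses RoBERTa via the HuggingFace `transformers` library; the speech side is only a stand-in for the paper's vq-wav2vec + Speech-BERT pipeline (distributed through fairseq), so its vocabulary size and layer counts are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the two modality-specific encoders (assumptions noted below).
import torch
import torch.nn as nn
from transformers import RobertaModel


class TextEncoder(nn.Module):
    """RoBERTa encoder; returns the <s> (CLS-style) token embedding."""

    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # pooled <s> token


class SpeechEncoder(nn.Module):
    """Stand-in for vq-wav2vec tokens -> Speech-BERT; sizes are hypothetical."""

    def __init__(self, vocab_size=320, dim=768, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable CLS-style token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, speech_tokens):  # (batch, seq) of discrete speech codes
        x = self.embed(speech_tokens)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return x[:, 0]  # pooled CLS-style embedding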
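Both fusion strategies can be sketched in a few lines; layer widths, dropout, and the pooling applied to the co-attended sequences are assumptions rather than the paper's exact choices.

```python
# Hedged sketches of the two fusion strategies.
import torch
import torch.nn as nn


class ShallowFusion(nn.Module):
    """Concatenate the two pooled CLS-style embeddings and classify."""

    def __init__(self, dim=768, num_classes=4, hidden=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, text_cls, speech_cls):
        return self.head(torch.cat([text_cls, speech_cls], dim=-1))


class CoAttentionFusion(nn.Module):
    """Each modality queries the other: text queries attend over speech
    keys/values and vice versa, then the attended summaries are classified."""

    def __init__(self, dim=768, heads=8, num_classes=4):
        super().__init__()
        self.text_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, text_seq, speech_seq):  # (batch, seq, dim) sequences
        t2s, _ = self.text_to_speech(query=text_seq, key=speech_seq, value=speech_seq)
        s2t, _ = self.speech_to_text(query=speech_seq, key=text_seq, value=text_seq)
        pooled = torch.cat([t2s.mean(dim=1), s2t.mean(dim=1)], dim=-1)
        return self.head(pooled)
```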
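A minimal joint fine-tuning step, under the assumption that both encoders and the fusion head share a single optimizer so gradients flow back into the SSL backbones instead of treating them as frozen feature extractors; the learning rate is an illustrative value.

```python
# Joint fine-tuning sketch, reusing the classes defined above.
import torch
import torch.nn as nn

text_enc, speech_enc, fusion = TextEncoder(), SpeechEncoder(), ShallowFusion()
params = (list(text_enc.parameters()) + list(speech_enc.parameters())
          + list(fusion.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-5)  # small LR is typical for SSL backbones
criterion = nn.CrossEntropyLoss()


def train_step(input_ids, attention_mask, speech_tokens, labels):
    text_enc.train(); speech_enc.train(); fusion.train()
    logits = fusion(text_enc(input_ids, attention_mask), speech_enc(speech_tokens))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()  # gradients reach both SSL encoders, not just the head
    optimizer.step()
    return loss.item()
```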
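For the CMU-MOSI/CMU-MOSEI-style sentiment labels in [-3, 3], the reported metrics can be computed roughly as below; the exact protocol (for example, how zero labels are handled in the binary metrics) varies between papers, so treat this as an assumption-laden sketch using scikit-learn.

```python
# Illustrative metric computation for regression-style sentiment labels.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error


def sentiment_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    nonzero = y_true != 0  # common convention: drop neutral labels for binary metrics
    bin_true, bin_pred = y_true[nonzero] > 0, y_pred[nonzero] > 0
    return {
        "BA": accuracy_score(bin_true, bin_pred),
        "F1": f1_score(bin_true, bin_pred),
        "Acc7": accuracy_score(np.clip(np.round(y_true), -3, 3),
                               np.clip(np.round(y_pred), -3, 3)),
        "MAE": mean_absolute_error(y_true, y_pred),
    }
```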
Implications:
- Because Speech-BERT and RoBERTa share similar architectural properties, the overall model design remains simple, carrying principles proven in NLP over to speech tasks. The success of BERT-like architectures in emotion recognition offers an adaptable framework that may extend to other complex multimodal tasks.
- The paper exemplifies the potential of SSL paradigms in scenarios with limited labeled data, utilizing powerful pre-trained models to achieve improved performance in downstream applications with minimal additional computational resources.
Overall, this work provides a viable pathway for improving emotion recognition systems by using advanced SSL methods tailored for multimodal data, with implications for both academia and practical applications in affective computing systems.