The paper presents an approach to multimodal speech emotion recognition that leverages "BERT-like" self-supervised learning (SSL) models for processing speech and text inputs. These models are fine-tuned jointly to improve the extraction of emotional cues from both modalities. Evaluated on three benchmark datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), the approach improves on state-of-the-art (SOTA) performance metrics, demonstrating the efficacy of these SSL architectures in affective computing.
Key Contributions:
- Modality-Specific SSL Models:
- The paper employs "BERT-like" SSL architectures suited to each modality: the text component uses RoBERTa, a BERT variant that drops the next-sentence-prediction objective, while the speech component uses vq-wav2vec to discretize raw audio into tokens that a Speech-BERT model then encodes. Both SSL models are pretrained on large unlabeled corpora, enabling robust representation learning without extensive labeled data (a sketch of both encoders follows this list).
- Fusion Mechanisms:
- The paper investigates two fusion strategies for combining speech and text embeddings:
- Shallow Fusion: A straightforward approach that concatenates the pooled classification (CLS-style) tokens from each modality and passes them through a simple prediction head; it delivered superior performance with fewer parameters.
- Co-Attentional Fusion: A more complex method that enables deeper interaction between modalities through a co-attentional mechanism, in which each modality's query vectors attend over the key-value pairs of the other modality to facilitate detailed cross-modal interaction (both fusion variants are sketched after this list).
- Joint Fine-Tuning:
- In contrast to prior methods that use pretrained models as static feature extractors, the researchers fine-tune Speech-BERT and RoBERTa jointly with the downstream classifier. This retains the benefits of SSL pretraining while specializing the representations for emotion recognition, yielding higher accuracy and F1 scores across multiple benchmark datasets (see the fine-tuning sketch after this list).
- Ablation Studies:
- The experiments assess, among other factors, the impact of each fusion strategy with fine-tuned versus frozen SSL backbones, and compare unimodal with multimodal performance, confirming that multimodal integration substantially outperforms unimodal baselines.
- Performance Metrics:
- The approach achieves top performance on metrics such as binary accuracy (BA), F1-score, 7-class accuracy, and mean absolute error (MAE) across the tested datasets, establishing new benchmarks for multimodal emotion recognition (an illustrative metric computation follows this list).
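The following sketch illustrates the two modality-specific encoders. The text side uses RoBERTa via the HuggingFace `transformers` library; the speech side is only a stand-in for the paper's vq-wav2vec + Speech-BERT pipeline (distributed through fairseq), so its vocabulary size and layer counts are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the two modality-specific encoders (assumptions noted below).
import torch
import torch.nn as nn
from transformers import RobertaModel


class TextEncoder(nn.Module):
    """RoBERTa encoder; returns the <s> (CLS-style) token embedding."""

    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # pooled <s> token


class SpeechEncoder(nn.Module):
    """Stand-in for vq-wav2vec tokens -> Speech-BERT; sizes are hypothetical."""

    def __init__(self, vocab_size=320, dim=768, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable CLS-style token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, speech_tokens):  # (batch, seq) of discrete speech codes
        x = self.embed(speech_tokens)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return x[:, 0]  # pooled CLS-style embedding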
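Both fusion strategies can be sketched in a few lines; layer widths, dropout, and the pooling applied to the co-attended sequences are assumptions rather than the paper's exact choices.

```python
# Hedged sketches of the two fusion strategies.
import torch
import torch.nn as nn


class ShallowFusion(nn.Module):
    """Concatenate the two pooled CLS-style embeddings and classify."""

    def __init__(self, dim=768, num_classes=4, hidden=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, text_cls, speech_cls):
        return self.head(torch.cat([text_cls, speech_cls], dim=-1))


class CoAttentionFusion(nn.Module):
    """Each modality queries the other: text queries attend over speech
    keys/values and vice versa, then the attended summaries are classified."""

    def __init__(self, dim=768, heads=8, num_classes=4):
        super().__init__()
        self.text_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, text_seq, speech_seq):  # (batch, seq, dim) sequences
        t2s, _ = self.text_to_speech(query=text_seq, key=speech_seq, value=speech_seq)
        s2t, _ = self.speech_to_text(query=speech_seq, key=text_seq, value=text_seq)
        pooled = torch.cat([t2s.mean(dim=1), s2t.mean(dim=1)], dim=-1)
        return self.head(pooled)
```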
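A minimal joint fine-tuning step, under the assumption that both encoders and the fusion head share a single optimizer so gradients flow back into the SSL backbones instead of treating them as frozen feature extractors; the learning rate is an illustrative value.

```python
# Joint fine-tuning sketch, reusing the classes defined above.
import torch
import torch.nn as nn

text_enc, speech_enc, fusion = TextEncoder(), SpeechEncoder(), ShallowFusion()
params = (list(text_enc.parameters()) + list(speech_enc.parameters())
          + list(fusion.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-5)  # small LR is typical for SSL backbones
criterion = nn.CrossEntropyLoss()


def train_step(input_ids, attention_mask, speech_tokens, labels):
    text_enc.train(); speech_enc.train(); fusion.train()
    logits = fusion(text_enc(input_ids, attention_mask), speech_enc(speech_tokens))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()  # gradients reach both SSL encoders, not just the head
    optimizer.step()
    return loss.item()
```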
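For the CMU-MOSI/CMU-MOSEI-style sentiment labels in [-3, 3], the reported metrics can be computed roughly as below; the exact protocol (for example, how zero labels are handled in the binary metrics) varies between papers, so treat this as an assumption-laden sketch using scikit-learn.

```python
# Illustrative metric computation for regression-style sentiment labels.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error


def sentiment_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    nonzero = y_true != 0  # common convention: drop neutral labels for binary metrics
    bin_true, bin_pred = y_true[nonzero] > 0, y_pred[nonzero] > 0
    return {
        "BA": accuracy_score(bin_true, bin_pred),
        "F1": f1_score(bin_true, bin_pred),
        "Acc7": accuracy_score(np.clip(np.round(y_true), -3, 3),
                               np.clip(np.round(y_pred), -3, 3)),
        "MAE": mean_absolute_error(y_true, y_pred),
    }
```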
Implications:
- Because Speech-BERT and RoBERTa share similar architectural properties, the overall model design remains simple, carrying principles proven in NLP over to speech tasks. The success of BERT-like architectures in emotion recognition offers an adaptable framework that may extend to other complex multimodal tasks.
- The paper exemplifies the potential of SSL paradigms in scenarios with limited labeled data, utilizing powerful pre-trained models to achieve improved performance in downstream applications with minimal additional computational resources.
Overall, this work provides a viable pathway for improving emotion recognition systems by using advanced SSL methods tailored for multimodal data, with implications for both academia and practical applications in affective computing systems.