Multimodal Speech Emotion Recognition Using Audio and Text (1810.04635v1)

Published 10 Oct 2018 in cs.CL

Abstract: Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.

Citations (271)

Summary

  • The paper introduces a dual recurrent encoder model that integrates audio signals and textual data for improved emotion prediction.
  • It evaluates various model variations, with MDRE achieving a weighted average precision of 0.718 on the IEMOCAP dataset.
  • The study demonstrates that combining multimodal inputs in SER enhances robustness and paves the way for future affective computing developments.

Analyzing Multimodal Speech Emotion Recognition Utilizing Audio and Text

The paper "Multimodal Speech Emotion Recognition Using Audio and Text," authored by Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung proposes a novel approach for speech emotion recognition (SER) that leverages a deep dual recurrent encoder model. This model integrates both audio signals and textual data to provide a more comprehensive understanding of speech emotion compared to traditional audio-only models. The objective of this research aligns with broader efforts in affective computing to develop robust emotion classifiers, which are foundational components in enhancing human-computer interaction through emotional dialogue systems.

Methodological Contribution

The paper introduces an architecture that encodes audio and text sequences independently using dual recurrent neural networks (RNNs) and then combines these encodings to predict an emotion class. The authors detail several model variations: the Audio Recurrent Encoder (ARE), the Text Recurrent Encoder (TRE), the Multimodal Dual Recurrent Encoder (MDRE), and an extension with an attention mechanism (MDREA). The models increase in complexity, each drawing on different input modalities to improve classification performance.

Key Methodological Components:

  • Audio Recurrent Encoder (ARE): Encodes audio signals with an RNN over mel-frequency cepstral coefficient (MFCC) and prosodic features.
  • Text Recurrent Encoder (TRE): Transcribes speech into text via automatic speech recognition (ASR) and encodes the resulting token sequence with an RNN.
  • Multimodal Dual Recurrent Encoder (MDRE): Combines the audio and text encodings within a unified framework to improve emotion prediction (see the sketch after this list).
  • MDREA: Adds an attention mechanism that focuses on the textual components of speech carrying strong emotional signals, guided by context from the audio encoding.
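
To make the dual-encoder structure concrete, below is a minimal PyTorch sketch of the MDRE idea: one recurrent encoder summarizes the audio-feature sequence, another summarizes the transcript tokens, and their final states are concatenated and passed to a classifier. This is an illustration under assumed shapes and hyperparameters (feature dimensions, hidden sizes, GRU cells, and the fusion layers are placeholders), not the authors' implementation.

```python
# Minimal sketch of a dual recurrent encoder (MDRE-style), assuming PyTorch.
# Layer sizes, feature dimensions, and cell types are illustrative placeholders.
import torch
import torch.nn as nn

class AudioRecurrentEncoder(nn.Module):
    """ARE: encodes frame-level audio features (e.g., MFCC + prosodic) with an RNN."""
    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, audio_feats):            # (batch, frames, feat_dim)
        _, h = self.rnn(audio_feats)           # last hidden state summarizes the utterance
        return h.squeeze(0)                    # (batch, hidden)

class TextRecurrentEncoder(nn.Module):
    """TRE: encodes the transcript token sequence with an embedding layer and an RNN."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids):              # (batch, tokens)
        _, h = self.rnn(self.embed(token_ids))
        return h.squeeze(0)                    # (batch, hidden)

class MDRE(nn.Module):
    """Multimodal Dual Recurrent Encoder: concatenate both encodings, then classify."""
    def __init__(self, n_classes=4, hidden=128):
        super().__init__()
        self.audio_enc = AudioRecurrentEncoder(hidden=hidden)
        self.text_enc = TextRecurrentEncoder(hidden=hidden)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, audio_feats, token_ids):
        fused = torch.cat([self.audio_enc(audio_feats),
                           self.text_enc(token_ids)], dim=-1)
        return self.classifier(fused)          # logits over {angry, happy, sad, neutral}

# Example forward pass with random placeholder inputs.
model = MDRE()
audio = torch.randn(8, 300, 39)               # 8 utterances, 300 frames, 39 features
tokens = torch.randint(0, 10000, (8, 25))     # 8 transcripts, 25 tokens each
logits = model(audio, tokens)                 # shape: (8, 4)
```

In the MDREA variant, the simple concatenation would be replaced by an attention step, roughly weighting the text encoder's hidden states with scores derived from the audio encoding before classification.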

Experimental Results

The proposed models were evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, a standard benchmark in emotion recognition research, using four emotion categories: happy, sad, angry, and neutral. They outperformed previous state-of-the-art methods, with accuracies ranging from 68.8% to 71.8%, and the MDRE achieved the best result. When the audio signals were paired with transcripts generated by a state-of-the-art ASR system, the MDRE model continued to perform well, demonstrating the robustness of combining multimodal data for emotion prediction.

Performance Highlights:

  • The MDRE model achieved a weighted average precision (WAP) of 0.718, outperforming previous methodologies (a short sketch of this metric follows the list).
  • Incorporation of textual data through the TRE model significantly improved accuracy for emotions like happiness, highlighting the importance of semantic content.
  • The MDREA was consistent with the authors' hypothesis but fell short of the MDRE, most likely because the attention mechanism is harder to tune and the dataset is relatively small.
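
For reference, a weighted average precision can be computed with scikit-learn as sketched below, averaging per-class precision with weights proportional to each class's support; whether this exactly matches the paper's WAP computation is an assumption, and the labels here are invented placeholders.

```python
# Sketch of a weighted average precision (WAP) computation, assuming scikit-learn.
# The label lists are placeholder examples, not data from the paper.
from sklearn.metrics import precision_score

y_true = ["angry", "happy", "sad", "neutral", "happy", "sad"]
y_pred = ["angry", "happy", "sad", "neutral", "sad",   "sad"]

# "weighted" averages per-class precision, weighting each class by its support in y_true.
wap = precision_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"WAP: {wap:.3f}")
```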

Implications and Future Directions

This research offers significant insights into multimodal data integration for SER, emphasizing the advantages of incorporating text alongside audio when classifying emotions. By mitigating the bias toward the neutral class often observed in audio-only models, the paper paves the way for future research in multimodal machine learning.

Future work could explore broader applications by integrating video data to enrich the feature space for emotion recognition. Additionally, refining the attention mechanism to align more directly with the temporal and semantic aspects of multimodal inputs may further improve predictive accuracy. Expanding this methodology could contribute to better human-computer interaction, AI-driven customer service applications, and enhanced user experience in digital communication platforms.