Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech (1706.00612v1)

Published 2 Jun 2017 in cs.CL

Abstract: Speech emotion recognition is an important and challenging task in the realm of human-computer interaction. Prior work proposed a variety of models and feature sets for training a system. In this work, we conduct extensive experiments using an attentive convolutional neural network with multi-view learning objective function. We compare system performance using different lengths of the input signal, different types of acoustic features and different types of emotion speech (improvised/scripted). Our experimental results on the Interactive Emotional Motion Capture (IEMOCAP) database reveal that the recognition performance strongly depends on the type of speech data independent of the choice of input features. Furthermore, we achieved state-of-the-art results on the improvised speech data of IEMOCAP.

Citations (214)

Summary

  • The paper introduces an Attentive Convolutional Neural Network (ACNN) for speech emotion recognition and analyzes the impact of input features, signal length, and acted versus improvised speech.
  • Experiments show Log Mel filter-banks and MFCCs yield superior performance, and the ACNN maintains strong accuracy even with signal lengths as short as 2 seconds.
  • Results indicate better performance on improvised speech compared to scripted, with notable errors in distinguishing between similar emotions like 'angry' and 'happy'.

Attentive Convolutional Neural Network for Speech Emotion Recognition

The paper "Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech" presents a detailed investigation into the efficacy of an Attentive Convolutional Neural Network (ACNN) model for speech emotion recognition. The authors, Michael Neumann and Ngoc Thang Vu, conduct comprehensive experiments on the IEMOCAP database, which reveal the pivotal roles of types of speech data, input features, and signal lengths in system performance.

Overview

Speech emotion recognition (SER) poses considerable challenges for human-computer interaction systems due to the nuanced nature of emotional expression and the limited availability of labeled data. This research employs an ACNN that combines the representational power of CNNs with an attention mechanism that weights time frames by their relevance. This combination is intended to help the network identify and emphasize the emotionally salient segments of an utterance.

Methodology

The ACNN model consists of a convolutional layer, a pooling layer, and an attention mechanism, followed by a softmax output layer. This architecture allows the network to dynamically emphasize the most informative portions of the input, which aids emotion classification.
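For concreteness, a minimal sketch of such an attention-pooled CNN is shown below. PyTorch is assumed here, and the layer sizes, the single-head attention, the 40-dimensional log Mel input, and the four output classes are illustrative choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveCNN(nn.Module):
    """Minimal sketch: convolution -> pooling -> attention over time -> softmax logits."""
    def __init__(self, n_features=40, n_filters=100, kernel_width=5, n_classes=4):
        super().__init__()
        # 1-D convolution over time, spanning the full feature dimension
        self.conv = nn.Conv1d(n_features, n_filters, kernel_size=kernel_width, padding=2)
        # Scoring layer that assigns one attention weight per (pooled) time frame
        self.att = nn.Linear(n_filters, 1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, x):
        # x: (batch, time, n_features) -> (batch, n_features, time) for Conv1d
        h = F.relu(self.conv(x.transpose(1, 2)))   # (batch, n_filters, time)
        h = F.max_pool1d(h, kernel_size=2)         # temporal max pooling
        h = h.transpose(1, 2)                      # (batch, time', n_filters)
        alpha = torch.softmax(self.att(h), dim=1)  # attention weights per frame
        context = (alpha * h).sum(dim=1)           # attention-weighted summary
        return self.out(context)                   # logits for softmax/cross-entropy
```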

The authors adopt a multi-view (MV) learning strategy, leveraging both categorical emotion labels and the continuous activation/valence dimensions. This dual representation is intended to help the model capture both the discrete emotion category and the underlying activation intensity and valence of an utterance.
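The sketch below illustrates one way such a combined objective could look. How the continuous dimensions are handled (regression vs. binned classification) and the loss weights are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn as nn

# Hypothetical multi-view objective: a categorical emotion head plus
# regression heads for activation and valence, combined with fixed weights.
ce = nn.CrossEntropyLoss()
mse = nn.MSELoss()

def multi_view_loss(class_logits, act_pred, val_pred,
                    class_target, act_target, val_target,
                    w_class=1.0, w_dim=0.5):
    # Categorical view: cross-entropy over the emotion classes
    loss_class = ce(class_logits, class_target)
    # Dimensional view: activation and valence treated as continuous targets
    loss_dim = mse(act_pred, act_target) + mse(val_pred, val_target)
    return w_class * loss_class + w_dim * loss_dim
```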

Input Features and Signal Length

The investigation evaluates several acoustic feature types: log Mel filter-banks, MFCCs, a prosody feature set, and the eGeMAPS configuration. The results show that performance varies considerably across feature sets: log Mel filter-banks and MFCCs consistently outperform the prosody and eGeMAPS features, likely because they capture the spectral characteristics most useful for discriminating emotions.
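As a reference point, the snippet below sketches how the two best-performing feature types could be computed. It assumes librosa, a 16 kHz sampling rate, a 25 ms window with a 10 ms shift, 40 Mel bands, and 13 MFCCs; these parameter choices are illustrative, not taken from the paper.

```python
import librosa

def extract_features(path, sr=16000, n_mels=40, n_mfcc=13):
    """Compute log Mel filter-banks and MFCCs for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                      # (n_mels, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # (n_mfcc, frames)
    return log_mel.T, mfcc.T                                # (frames, features)
```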

A focus of this paper is the effect of input signal length on recognition accuracy. The experiments reveal that ACNN models maintain robust performance with abbreviated signal lengths, down to 2 seconds. This finding is particularly promising for real-time applications where early emotion prediction is valuable.
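Working with a fixed input length requires cropping longer utterances and padding shorter ones. The helper below is a simple sketch of that step; the 200-frame target (roughly 2 seconds at a 10 ms frame shift) and zero-padding are illustrative assumptions.

```python
import numpy as np

def fix_length(frames, target_frames=200):
    """Crop or zero-pad a (frames, features) matrix to a fixed length,
    e.g. 200 frames is roughly 2 seconds at a 10 ms frame shift."""
    if len(frames) >= target_frames:
        return frames[:target_frames]
    pad = np.zeros((target_frames - len(frames), frames.shape[1]))
    return np.vstack([frames, pad])
```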

Results and Analysis

The results indicate that performance is heavily contingent on the nature of the speech data. The ACNN achieves state-of-the-art recognition on improvised speech, suggesting its suitability for nuanced, natural discourse. Performance on scripted speech is noticeably lower, underscoring how strongly results depend on the spontaneity of the speech.

Moreover, the experiments reveal notable error patterns. Confusions are particularly high between classes with similar activation, such as 'angry' and 'happy', pointing to the intrinsic difficulty of separating acoustically similar emotional states.

Implications and Future Directions

This research contributes meaningful insights to the speech emotion recognition domain by demonstrating the need for adaptable models that can exploit diverse feature sets and account for variability in the speech data. Furthermore, achieving strong results with audio segments as short as 2 seconds opens avenues for developing responsive, real-time affective computing systems.

Future work may extend this model to other datasets, potentially integrating larger and more diverse corpora to improve generalizability. In addition, a deeper analysis of the observed error patterns and strategies to mitigate them could further improve recognition accuracy. Integrating contextual and multi-modal data could also enable richer emotional inference, enhancing the robustness of human-computer interaction systems.