Emotion Recognition in Audio and Video Using Deep Neural Networks
The paper "Emotion Recognition in Audio and Video Using Deep Neural Networks" by Mandeep Singh and Yuan Fang from Stanford University examines the convergence of audio and video modalities through deep neural networks to enhance the efficacy of emotion recognition. This paper is pivotal, considering humans' natural capability to interpret information from multiple domains, such as speech and visual cues, to perceive emotions—a task still fraught with challenges in the domain of machine understanding.
Methodology and Approach
The paper leverages the IEMOCAP dataset, containing audiovisual data, to explore the potential of various neural network architectures, particularly combinations of CNNs and RNNs, in recognizing emotions. The researchers employed several configurations, including CNN only, a CNN+RNN hybrid, and a more sophisticated CNN+RNN+3DCNN model, the latter integrating audio spectrograms and video data.
The authors systematically evaluated the following architectures:
- Audio Models: Three architectures were developed (CNN, CNN+LSTM, and CNN+RNN), each extracting patterns from audio spectrograms to infer the underlying emotional state. Of these, the CNN+RNN architecture performed best, reaching 54% accuracy across four emotion categories.
- Audio+Video Models: Building on the audio-only models, the research incorporated video data, leading to the more intricate CNN+RNN+3DCNN architecture. This model aimed to exploit the complementary relationship between audio and visual cues, reaching 71.75% accuracy when the task was limited to three emotion categories. (A hedged sketch of such a two-branch model follows this list.)
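To make the two-branch design concrete, here is a minimal PyTorch sketch in the spirit of the CNN+RNN audio path and 3D-CNN video path described above. The layer sizes, input shapes, the GRU choice, and fusion by simple concatenation are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only: layer sizes, input shapes, and the concatenation
# fusion are assumptions, not the exact architecture from the paper.
import torch
import torch.nn as nn


class AudioCNNRNN(nn.Module):
    """2D CNN over a spectrogram followed by a GRU over the time axis."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=hidden,
                          batch_first=True)

    def forward(self, spec):                           # spec: (B, 1, n_mels, T)
        feats = self.cnn(spec)                         # (B, 32, n_mels/4, T/4)
        feats = feats.permute(0, 3, 1, 2).flatten(2)   # (B, T/4, 32 * n_mels/4)
        _, h = self.rnn(feats)                         # h: (1, B, hidden)
        return h.squeeze(0)                            # (B, hidden)


class Video3DCNN(nn.Module):
    """Small 3D CNN over a short stack of face-cropped frames."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, clip):                           # clip: (B, 3, frames, H, W)
        return self.fc(self.cnn(clip).flatten(1))      # (B, out_dim)


class FusionClassifier(nn.Module):
    """Concatenate audio and video embeddings, then classify emotions."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.audio = AudioCNNRNN()
        self.video = Video3DCNN()
        self.head = nn.Linear(128 + 128, n_classes)

    def forward(self, spec, clip):
        return self.head(torch.cat([self.audio(spec), self.video(clip)], dim=1))


if __name__ == "__main__":
    model = FusionClassifier(n_classes=4)
    spec = torch.randn(2, 1, 64, 200)                  # batch of spectrograms
    clip = torch.randn(2, 3, 16, 64, 64)               # batch of 16-frame clips
    print(model(spec, clip).shape)                     # torch.Size([2, 4])
```

Fusing by concatenating the two embeddings keeps each branch independent, which makes it straightforward to train or ablate the audio-only model on its own.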
The work included extensive preprocessing to align the audio and visual streams correctly, a prerequisite for a valid multimodal approach. The authors also addressed class imbalance, most notably in the happiness category, by oversampling the under-represented examples.
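As a rough illustration of the oversampling step, the snippet below duplicates minority-class examples at random until every class matches the largest one; the authors' exact resampling scheme is not detailed here, so treat this as an assumption-level sketch.

```python
# Hedged sketch of random oversampling for class imbalance; the authors'
# exact resampling procedure may differ.
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Duplicate examples of minority classes until every class matches
    the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in counts}
    out_x, out_y = list(samples), list(labels)
    for c, idxs in by_class.items():
        for _ in range(target - counts[c]):
            i = rng.choice(idxs)                 # resample with replacement
            out_x.append(samples[i])
            out_y.append(labels[i])
    return out_x, out_y

# Example: "happy" is under-represented relative to the other classes.
x = ["clip%d" % i for i in range(10)]
y = ["neutral"] * 4 + ["angry"] * 3 + ["sad"] * 2 + ["happy"]
bx, by = oversample(x, y)
print(Counter(by))   # every class now has 4 examples
```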
Numerical Results
The research reports the following numerical results:
- Accuracy Achievements: The CNN+RNN model reached 54% accuracy on the four-emotion task, comparable to existing published results. The combined CNN+RNN+3DCNN model reached 71.75% accuracy when the task was restricted to the sad, angry, and neutral emotions.
- Confusion Matrix Analysis: The happiness category remained the hardest to recognize, pointing to limits in data diversity and to the intrinsic difficulty of that class; a brief per-class evaluation sketch follows this list.
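To reproduce this kind of per-class analysis, a confusion matrix can be computed with scikit-learn as sketched below; the labels and predictions are dummy placeholders, not the paper's results.

```python
# Illustrative only: the labels and predictions below are dummy values, not
# the paper's results; the point is how a per-class confusion matrix is built.
from sklearn.metrics import confusion_matrix, classification_report

classes = ["angry", "happy", "neutral", "sad"]
y_true = ["angry", "happy", "neutral", "sad", "happy", "neutral"]
y_pred = ["angry", "neutral", "neutral", "sad", "sad", "neutral"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)                                  # rows = true class, cols = predicted
print(classification_report(y_true, y_pred, labels=classes, zero_division=0))
```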
Implications and Future Directions
The findings underscore both the progress and the challenges in multimodal emotion recognition. The improved accuracy of the CNN+RNN+3DCNN model bodes well for integrating heterogeneous data streams to enhance AI perception capabilities. The paper also highlights the need for more carefully designed data augmentation strategies, advanced noise reduction techniques, and facial feature extraction through automated video frame analysis.
Future research directions include:
- Enhanced Noise Removal: Investigating advanced noise-filtering algorithms to refine audio spectrograms and potentially improve the classification accuracy.
- Facial Feature Focus: Refining the facial data component by employing automated cropping and stronger face-detection algorithms, which may address current shortcomings attributed to actors not facing the camera (see the sketch after this list).
- Data Augmentation Techniques: Leveraging more sophisticated data augmentation (beyond simple rotations and crops) to simulate more diverse emotional expressions and improve model generalization.
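As one possible starting point for the face-cropping direction, the snippet below uses OpenCV's bundled Haar-cascade detector to crop the largest detected face from each video frame. The paper does not prescribe this particular detector, and the video file name here is hypothetical, so this is only a hedged sketch.

```python
# Assumption-level sketch: OpenCV's Haar cascade is one possible face detector;
# the paper does not specify which detection method to use.
import cv2

# Load the frontal-face Haar cascade that ships with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_largest_face(frame_bgr, margin=0.2):
    """Return the largest detected face crop, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                            # e.g. actor not facing the camera
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    pad_w, pad_h = int(w * margin), int(h * margin)
    y0, y1 = max(0, y - pad_h), min(frame_bgr.shape[0], y + h + pad_h)
    x0, x1 = max(0, x - pad_w), min(frame_bgr.shape[1], x + w + pad_w)
    return frame_bgr[y0:y1, x0:x1]

# Usage: read frames from a video and keep only the cropped faces.
cap = cv2.VideoCapture("session_clip.mp4")     # hypothetical file name
ok, frame = cap.read()
while ok:
    face = crop_largest_face(frame)
    if face is not None:
        face = cv2.resize(face, (64, 64))      # fixed size for the video branch
    ok, frame = cap.read()
cap.release()
```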
This paper contributes meaningfully to the ongoing discourse on audio-visual fusion for emotion recognition, offering insight into the synergy between auditory and visual signals in AI models. While hurdles remain, particularly the need to restrict the set of emotion categories to reach strong accuracy, integrating more robust visual analysis offers a promising path toward better emotion recognition systems.