Emotion Recognition in Speech Using Cross-Modal Transfer in the Wild
The paper "Emotion Recognition in Speech Using Cross-Modal Transfer in the Wild" introduces an innovative approach to addressing the challenges of speech emotion recognition without relying on labeled audio datasets. The authors propose a method for leveraging cross-modal transfer learning, utilizing the correlation between facial expressions and speech emotions to distill emotion recognition capabilities from visual data to audio data.
Methodology
The approach is a teacher-student framework in which a Convolutional Neural Network (CNN) trained for facial emotion recognition (the teacher) supervises a separate network for speech emotion recognition (the student). The key hypothesis is that the emotional content of a person's speech tends to mirror their facial expressions while speaking. By capitalizing on large volumes of unlabelled video data, the authors transfer emotional information from the visual to the auditory domain through distillation.
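To make the framework concrete, here is a minimal sketch of one cross-modal training step, assuming PyTorch. The tiny CNNs, tensor shapes, and eight-class emotion set are placeholders rather than the authors' architecture, and for brevity the loss uses the teacher's hard prediction; the paper's temperature-softened targets are sketched after the method list below.

```python
# Minimal sketch of one cross-modal distillation step, assuming PyTorch.
# The small CNNs, tensor shapes, and 8-class emotion set are placeholders,
# not the architecture used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8  # assumed number of emotion categories

def tiny_cnn(in_channels: int) -> nn.Module:
    """Placeholder CNN standing in for the real teacher/student backbones."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, NUM_EMOTIONS),
    )

teacher = tiny_cnn(in_channels=3).eval()   # facial-emotion CNN, kept frozen
student = tiny_cnn(in_channels=1)          # audio CNN over spectrograms
for p in teacher.parameters():
    p.requires_grad_(False)                # the teacher only supplies targets

optimizer = torch.optim.SGD(student.parameters(), lr=1e-2)

# One synthetic batch: face frames from the video and the aligned speech
# segments represented as spectrograms (shapes are assumptions).
face_frames = torch.randn(4, 3, 224, 224)
spectrograms = torch.randn(4, 1, 257, 300)

with torch.no_grad():
    teacher_logits = teacher(face_frames)          # visual emotion predictions
hard_targets = teacher_logits.argmax(dim=1)        # simplification: hard labels

student_logits = student(spectrograms)             # audio emotion predictions
loss = F.cross_entropy(student_logits, hard_targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```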
- Teacher Model Development: The teacher network, based on the Squeeze-and-Excitation architecture, is trained on facial emotion datasets to achieve state-of-the-art performance. Its frame-level emotion predictions are pooled over each face-track to produce emotion labels for the video data (a pooling sketch follows this list).
- Student Model Training: The student network, operating on speech segments represented as spectrograms, uses the emotion labels provided by the teacher to learn effective audio emotion embeddings. Notably, training uses a temperature-controlled softmax that softens the predicted emotion distribution, a technique adapted from prior work on neural network distillation (a loss sketch also follows this list).
- Dataset Utilization (EmoVoxCeleb): The research leverages VoxCeleb, a collection of diverse, naturalistic speaking face-tracks, to perform cross-modal labelling without manual annotation. Using SyncNet to ensure synchronization between the audio and visual streams, the authors build the EmoVoxCeleb dataset by applying the trained teacher model to predict emotions across millions of frames.
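As referenced in the teacher item above, the sketch below shows one plausible way to pool frame-level teacher predictions into a single track-level emotion label. The shapes, class count, and the mean/max pooling options are assumptions for illustration, not the paper's exact configuration.

```python
# Pool per-frame emotion logits from the teacher into one prediction per
# face-track; shapes and pooling choices are illustrative assumptions.
import torch

def pool_track_logits(frame_logits: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Aggregate frame-level logits of shape (num_frames, num_classes)
    into a single track-level logit vector."""
    if mode == "mean":
        return frame_logits.mean(dim=0)
    if mode == "max":
        return frame_logits.max(dim=0).values
    raise ValueError(f"unknown pooling mode: {mode}")

# Example: a 25-frame face-track scored over 8 assumed emotion classes.
track_logits = pool_track_logits(torch.randn(25, 8), mode="mean")
predicted_emotion = int(track_logits.argmax())
```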
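The student item above mentions a temperature-controlled softmax. Below is a hedged sketch of such a distillation loss in the style of Hinton et al.; the default temperature, the class count in the example, and the T-squared gradient scaling are assumptions rather than the paper's exact settings.

```python
# Temperature-softened distillation loss: the student is pushed toward the
# teacher's softened emotion distribution rather than a single hard label.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T**2 to keep gradient magnitudes comparable."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_student, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Example with assumed shapes: a batch of 4 segments, 8 emotion classes.
loss = distillation_loss(torch.randn(4, 8), torch.randn(4, 8))
```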
Results and Evaluation
The effectiveness of the cross-modal distillation is validated through extensive experiments on benchmark datasets. The results show that the student model performs significantly above chance and approaches the performance of audio emotion recognition models trained on labeled data, demonstrating that video datasets can reduce reliance on audio labels.
- Benchmark Performance: Evaluations on the RML and eNTERFACE datasets show that the student achieves classification accuracies well above random baselines, albeit without matching the teacher model (an illustrative evaluation sketch follows). This gap highlights the ongoing challenge of reaching parity between the audio and visual modalities in emotion recognition.
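To illustrate the kind of comparison reported above, the sketch below scores frozen student-style embeddings with a simple linear probe against a uniform-chance baseline. The synthetic features, the six-class label set, and the scikit-learn classifiers are all assumptions for illustration and do not reproduce the paper's exact evaluation protocol.

```python
# Compare a linear probe on (synthetic) student embeddings against a
# uniform-chance baseline; data and classifier choices are assumptions.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(600, 128))   # stand-in for student embeddings
labels = rng.integers(0, 6, size=600)    # stand-in 6-class emotion labels

X_tr, X_te, y_tr, y_te = train_test_split(features, labels,
                                          test_size=0.3, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
chance = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr)

print("linear probe accuracy:   ", probe.score(X_te, y_te))
print("chance baseline accuracy:", chance.score(X_te, y_te))
```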
Implications and Future Directions
The proposed cross-modal transfer method offers a promising avenue for augmenting speech emotion recognition systems using vast, freely available video resources. By minimizing reliance on hand-annotated audio data, this technique could facilitate the development of robust emotion recognition models for real-world applications, where labeled data is scarce or impractical to obtain.
The paper suggests several pathways for future work, including adapting cross-modal supervision to other types of unlabelled video data, incorporating broader emotional diversity, and investigating non-speech facial movements as additional supervisory signals. The techniques and findings have potential implications for advancing the state of the art in emotion recognition systems, particularly those intended for deployment in diverse and uncontrolled environments.
In conclusion, this research presents a compelling framework for cross-modal transfer learning, emphasizing the effective use of visual data to enrich the audio emotion recognition landscape. The approach's scalability and reliance on naturally occurring data position it as a valuable contribution to the field, opening new opportunities for development and application in artificial intelligence.