- The paper proposes a novel multi-task framework that integrates adversarial autoencoders with auxiliary tasks to bolster speech emotion recognition (SER) performance.
- It leverages gender and speaker recognition to enhance feature learning and regularize the primary emotion recognition task.
- The approach outperforms single-task models and achieves state-of-the-art results on datasets such as IEMOCAP and MSP-IMPROV.
Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
The paper "Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition" introduces a novel framework to address the challenge of limited emotion-centric datasets that currently hinder the efficacy of Speech Emotion Recognition (SER) systems. The framework leverages the strengths of multi-task learning (MTL) and adversarial autoencoders (AAEs) to improve the performance on SER tasks significantly. By incorporating auxiliary tasks like gender identification and speaker recognition, which can be trained on more extensive and readily available datasets, the primary SER task benefits from improved feature learning through shared representations.
Technical Approach
The approach rests on a multi-task learning framework that includes gender and speaker recognition as auxiliary tasks, paired with an adversarial autoencoder for robust and discriminative feature extraction. This setup facilitates semi-supervised learning by integrating both labeled and unlabeled data within the autoencoder and classification networks, improving generalization and enhancing SER performance.
- Adversarial Autoencoder (AAE): The AAE combines a traditional autoencoder with adversarial training to learn latent representations that conform to a desired prior distribution. This component learns powerful unsupervised features that feed the emotion, gender, and speaker classifiers.
- Multi-Task Learning Framework: The auxiliary tasks share the speech modality with the primary task, so all tasks tap a common pool of data; this regularizes the model and encourages high-level discriminative features.
- Semi-Supervised Learning: By combining the unsupervised learning capabilities of the autoencoder with supervised multi-task classification networks, the framework puts data that lacks emotion labels to work on reconstruction and the auxiliary tasks, bolstering primary SER performance (a minimal sketch of the full setup follows this list).
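To make the setup concrete, here is a minimal PyTorch sketch of a multi-task adversarial autoencoder: a shared encoder whose latent code a discriminator pulls toward a Gaussian prior, a decoder for reconstruction, and separate classification heads for emotion, gender, and speaker. The feature and latent dimensions, label counts, the Gaussian prior, and the loss weight are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a multi-task adversarial autoencoder (PyTorch).
# Dimensions, label counts, the Gaussian prior, and loss weights are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM = 128, 32                 # assumed feature/latent sizes
N_EMOTIONS, N_GENDERS, N_SPEAKERS = 4, 2, 10   # assumed label cardinalities

def mlp(d_in, d_hidden, d_out):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

encoder = mlp(FEAT_DIM, 256, LATENT_DIM)        # shared representation
decoder = mlp(LATENT_DIM, 256, FEAT_DIM)        # reconstruction path
disc = mlp(LATENT_DIM, 64, 1)                   # prior vs. encoded codes
emotion_head = mlp(LATENT_DIM, 64, N_EMOTIONS)  # primary task
gender_head = mlp(LATENT_DIM, 64, N_GENDERS)    # auxiliary task
speaker_head = mlp(LATENT_DIM, 64, N_SPEAKERS)  # auxiliary task

bce, xent = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
ae_params = [p for m in (encoder, decoder, emotion_head, gender_head, speaker_head)
             for p in m.parameters()]
ae_opt = torch.optim.Adam(ae_params, lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(x, y_emo, y_gen, y_spk, emo_mask):
    """One semi-supervised step. `emo_mask` marks utterances that carry
    emotion labels; every utterance still contributes to reconstruction,
    adversarial prior matching, and the auxiliary tasks."""
    z = encoder(x)

    # 1) Discriminator: prior samples are "real", encoded codes are "fake".
    z_prior = torch.randn_like(z)               # assumed Gaussian prior
    real, fake = torch.ones(len(x), 1), torch.zeros(len(x), 1)
    d_loss = bce(disc(z_prior), real) + bce(disc(z.detach()), fake)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Autoencoder + heads: reconstruct, fool the discriminator, and
    #    classify; the emotion loss uses only the labeled subset.
    recon = ((decoder(z) - x) ** 2).mean()
    adv = bce(disc(z), real)
    aux = xent(gender_head(z), y_gen) + xent(speaker_head(z), y_spk)
    emo = xent(emotion_head(z[emo_mask]), y_emo[emo_mask]) if emo_mask.any() \
          else torch.zeros(())
    loss = recon + adv + emo + 0.5 * aux        # 0.5: assumed task weight
    ae_opt.zero_grad(); loss.backward(); ae_opt.step()
    return loss.item()
```

Note how the auxiliary heads and the reconstruction term receive gradients from every utterance, which is how data without emotion labels still shapes the shared encoder.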
Results and Discussion
Compared against single-task frameworks and against methods that rely on transfer learning or synthetic data generation, the multi-task semi-supervised approach yields noticeable improvements in SER accuracy. The framework achieves state-of-the-art results on public datasets such as IEMOCAP and MSP-IMPROV, improving both weighted accuracy (WA, plain overall accuracy) and unweighted accuracy (UA, the mean of per-class recalls) for categorical and dimensional emotion recognition tasks; the two metrics are contrasted in the sketch below.
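UA is the stricter metric when emotion classes are imbalanced, because rare classes count as much as frequent ones. A small self-contained sketch (the labels are invented purely for illustration):

```python
# Weighted accuracy (WA) vs. unweighted accuracy (UA) for SER.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: plain accuracy over all utterances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def unweighted_accuracy(y_true, y_pred):
    """UA: recall computed per class, then averaged over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Example: 'sad' is rare, so misclassifying it drags UA down more than WA.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]   # 0 = neutral, 1 = sad (illustrative)
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]
print(weighted_accuracy(y_true, y_pred))    # 0.875
print(unweighted_accuracy(y_true, y_pred))  # (1.0 + 0.5) / 2 = 0.75
```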
Critically, the inclusion of auxiliary tasks effectively increases the size and diversity of the training data, which is pivotal given the paucity of labeled emotional speech. The architecture also shows promising results in cross-corpus evaluations, underscoring its ability to generalize across datasets without corpus-specific tuning; a sketch of such a protocol follows.
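A cross-corpus protocol can be as simple as the loop below: train entirely on one corpus, then evaluate on the other with all hyperparameters frozen. The loader and the nearest-centroid classifier here are hypothetical stand-ins with synthetic data so the harness runs end to end; the paper's multi-task model would take the classifier's place.

```python
# Illustrative cross-corpus harness. `load_corpus` and NearestCentroid
# are hypothetical stand-ins, not the paper's pipeline.
import numpy as np

rng = np.random.default_rng(0)

def load_corpus(name, n=200, dim=8, classes=4):
    """Hypothetical loader: returns synthetic (features, emotion labels)."""
    return rng.normal(size=(n, dim)), rng.integers(0, classes, size=n)

class NearestCentroid:
    """Stand-in classifier; the multi-task AAE model would go here."""
    def fit(self, x, y):
        self.centroids = np.stack([x[y == c].mean(axis=0) for c in np.unique(y)])
        return self
    def predict(self, x):
        return np.argmin(((x[:, None] - self.centroids) ** 2).sum(axis=-1), axis=1)

for train_name, test_name in [("IEMOCAP", "MSP-IMPROV"), ("MSP-IMPROV", "IEMOCAP")]:
    x_tr, y_tr = load_corpus(train_name)
    x_te, y_te = load_corpus(test_name)
    model = NearestCentroid().fit(x_tr, y_tr)   # no corpus-specific tuning
    wa = (model.predict(x_te) == y_te).mean()
    print(f"{train_name} -> {test_name}: WA={wa:.3f}")
```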
Implications
The novel use of AAEs in a multi-task framework paves the way for future exploration of generative models in SER, potentially integrating reinforcement learning to enhance interactive applications in domains such as healthcare, customer service, and entertainment. Furthermore, practical deployment of such an SER framework would significantly benefit domains reliant on human-computer interaction by providing systems capable of nuanced emotion recognition with greater accuracy and adaptability.
Ultimately, this paper contributes to the ongoing development of robust SER systems, aligning technical advances in machine learning with real-world applicability in affective computing. Future investigations could explore a wider range of auxiliary tasks and incorporate continuous emotion dimensions to further enrich emotion recognition systems' contextual understanding. The reliance on MTL with unlabeled data points to a promising direction for other domain-specific classification challenges constrained by limited data.