The paper presents a transformer-based multi-task framework for learning generalized multi-modal embeddings from audio, visual, and textual inputs, with an application to emotion recognition. The paper is motivated by the prevalence of large-scale embedding techniques in natural language processing and computer vision, and it extends these principles to jointly handle multi-modal data where the scarcity of labeled emotion datasets necessitates transfer learning from large-scale datasets.
The approach is characterized by the following technical aspects:
- Multi-Task Pretraining:
The model is trained using two auxiliary tasks:
- Character-Level Automatic Speech Recognition (ASR): Serves as the audio-domain analog of the language modeling task, providing semantic structure.
- Person Identification: Exploits speaker-discriminative information available in open datasets (e.g., VoxCeleb2) to encourage the network to capture robust identity features.
These tasks are combined using a weighted loss function (with weights 0.8 for ASR and 0.2 for person identification), guiding the training process toward a representation that is resilient and generalizable across downstream tasks.
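As an illustration, the two objectives could be combined as a simple weighted sum of per-task losses. The sketch below is a minimal, assumption-laden example (the use of cross-entropy for both tasks, the tensor shapes, and the function names are invented for illustration); only the 0.8/0.2 weights come from the paper.

```python
import torch.nn as nn

# Illustrative multi-task loss combination; only the 0.8/0.2 weights are
# reported in the paper. The loss types, shapes, and padding handling
# are assumptions.
asr_criterion = nn.CrossEntropyLoss(ignore_index=0)  # char-level ASR targets
spk_criterion = nn.CrossEntropyLoss()                # person-ID targets

W_ASR, W_SPK = 0.8, 0.2  # weights reported in the paper

def multitask_loss(asr_logits, asr_targets, spk_logits, spk_targets):
    """asr_logits: (batch, time, chars); spk_logits: (batch, n_speakers)."""
    asr_loss = asr_criterion(asr_logits.flatten(0, 1), asr_targets.flatten())
    spk_loss = spk_criterion(spk_logits, spk_targets)
    return W_ASR * asr_loss + W_SPK * spk_loss
```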
- Cross-Modal Attention and Transformer Architecture:
The network uses a cross-modal transformer architecture that projects features from the visual and textual domains into the audio space via cross-modal attention modules. The attention mechanism is the standard scaled dot-product attention,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimensionality of the keys and queries.
- Encoder Configuration: The core encoder consists of an audio-specific subnetwork plus two additional transformer modules that fuse the visual and textual modalities. The final embeddings are formed as a weighted sum of the modality-specific features, with weights of 0.4 for audio, 0.4 for visual, and 0.2 for text (an illustrative sketch of this fusion follows below).
- Decoder Details: For the ASR task a Transformer decoder is employed, while speaker classification in the person identification task uses an affine transformation over the averaged embeddings.
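A minimal sketch of the cross-modal fusion, assuming a PyTorch implementation: single multi-head attention layers stand in for the paper's 4-layer cross-modal transformers, and everything except the audio-as-query design, the 512-dimensional/4-head setting, and the 0.4/0.4/0.2 weights is an assumption.

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Project video/text features into the audio space with cross-modal
    attention (audio as query), then fuse with fixed modality weights.
    Illustrative sketch only: one attention layer per modality replaces
    the full cross-modal transformer stacks."""

    def __init__(self, d_model=512, n_heads=4):
        super().__init__()
        self.video_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.weights = {"audio": 0.4, "video": 0.4, "text": 0.2}  # from the paper

    def forward(self, audio, video, text):
        # All inputs: (batch, time, d_model), already projected to d_model.
        video_in_audio, _ = self.video_to_audio(query=audio, key=video, value=video)
        text_in_audio, _ = self.text_to_audio(query=audio, key=text, value=text)
        return (self.weights["audio"] * audio
                + self.weights["video"] * video_in_audio
                + self.weights["text"] * text_in_audio)
```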
- Downstream Task and Transfer Learning:
The learned embeddings are fine-tuned on the emotion recognition task using the CMU-MOSEI dataset. The dataset comprises six emotion classes (happy, sad, angry, disgust, surprise, and fear), with labels binarized from Likert-scale annotations. The experimental evaluation contrasts two architectures:
- A late fusion model employing bidirectional GRUs for each modality,
- The transformer-based model described above.
The transformer baseline demonstrates superior performance over the late fusion approach, with numerical improvements in the majority of the emotion categories. For instance, it achieves 67.8% weighted accuracy and an F1-score of 67.5% for the happy emotion, with comparable gains in the other categories.
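For reference, the late fusion baseline could look roughly like the following sketch (hidden size, temporal average pooling, and the single linear classifier head are assumptions; the per-modality feature dimensions are those listed under the implementation details).

```python
import torch
import torch.nn as nn

class LateFusionGRU(nn.Module):
    """Sketch of the late-fusion baseline: one bidirectional GRU per modality,
    temporal pooling, concatenation, and a linear classifier. Hidden size,
    pooling strategy, and classifier head are illustrative assumptions."""

    def __init__(self, dims=None, hidden=128, n_classes=6):
        super().__init__()
        dims = dims or {"audio": 200, "video": 4096, "text": 300}
        self.grus = nn.ModuleDict({
            name: nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
            for name, dim in dims.items()
        })
        self.classifier = nn.Linear(2 * hidden * len(dims), n_classes)

    def forward(self, inputs):
        # inputs: dict of (batch, time, dim) tensors, one per modality.
        pooled = [gru(inputs[name])[0].mean(dim=1) for name, gru in self.grus.items()]
        return self.classifier(torch.cat(pooled, dim=-1))  # one logit per emotion
```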
- Ablation Studies:
The paper also conducts ablation experiments to assess the impact of missing modalities:
- When individual modalities (video or text) are removed (by assigning them a zero weight during inference, as sketched after this list), the learned embedding representations continue to outperform the baseline system.
- Notably, the impact of the text modality is less pronounced, owing to the use of pre-trained GloVe (Global Vectors for Word Representation) embeddings, which already encode substantial semantic information.
- The analysis indicates an absolute improvement of 8.6% in weighted average accuracy when text is added to audio, as opposed to a 3.3% improvement when visual data is incorporated.
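Under the illustrative CrossModalFusion sketch above, dropping a modality at inference amounts to zeroing its fusion weight; whether the remaining weights are renormalized is not specified here, so they are left unchanged.

```python
def drop_modality(fusion, modality):
    """Remove one modality's contribution at inference time (illustrative only)."""
    fusion.weights[modality] = 0.0

# Example: evaluate with audio and text only.
# drop_modality(model.fusion, "video")
```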
- Experimental Setup and Implementation Details:
- Datasets: VoxCeleb2 is used for multi-task training, with non-English utterances filtered out using likelihood scores from a TDNN ASR model trained on Librispeech. CMU-MOSEI is used to evaluate the downstream emotion recognition task.
- Feature Extraction: Audio is represented by 200-dimensional features (40-dimensional LFBE vectors stacked over 5 frames), visual features are 4096-dimensional (extracted from VGG-16), and text features are 300-dimensional GloVe vectors.
- Model Configuration: The transformer encoder uses 4 layers with a feature dimension of 512, 4 attention heads, and a feed-forward layer of 200 dimensions. The cross-modal transformer employs 4 layers to project video and text into the audio domain.
- Training Protocol: The model is implemented in PyTorch and trained with learning schedules inspired by prior transformer-based speech recognition models.
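For concreteness, the reported encoder hyperparameters map onto a standard PyTorch transformer encoder roughly as follows; the linear projection from the 200-dimensional stacked LFBE inputs to the 512-dimensional model space, and the dummy batch shapes, are assumptions.

```python
import torch
import torch.nn as nn

# Audio input: 40-dim LFBE stacked over 5 frames -> 200-dim vectors (reported).
audio_proj = nn.Linear(5 * 40, 512)  # projection to the model dimension (assumed)

# Encoder: 4 layers, d_model=512, 4 attention heads, 200-dim feed-forward (reported).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=4, dim_feedforward=200, batch_first=True)
audio_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Example forward pass on a dummy batch of 100 stacked-LFBE frames.
x = torch.randn(8, 100, 200)                 # (batch, time, stacked LFBE)
embeddings = audio_encoder(audio_proj(x))    # (8, 100, 512)
```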
- Conclusions and Future Directions:
The paper demonstrates that multi-task pretraining on large-scale multi-modal datasets can yield robust embeddings that enhance performance on the emotion recognition task, achieving state-of-the-art results on CMU-MOSEI. The numerical improvements, such as an absolute increase of over 1% in weighted accuracy for certain emotion classes, underscore the efficacy of the approach. Future work is suggested to address the architecture's dependency on the audio modality during inference, possibly by integrating unsupervised methods (e.g., skip-thought or self-supervised training techniques) and by expanding the multi-task objectives to include visual tasks such as landmark detection for further generalization.
This multi-modal, multi-task framework represents a well-structured attempt at leveraging large-scale datasets for improved embedding learning, and it shows promise in scenarios where certain modalities may be missing during inference.