The paper presents a transformer-based multi-task framework for learning generalized multi-modal embeddings from audio, visual, and textual inputs, with an application to emotion recognition. The paper is motivated by the prevalence of large-scale embedding techniques in natural language processing and computer vision, and it extends these principles to jointly handle multi-modal data where the scarcity of labeled emotion datasets necessitates transfer learning from large-scale datasets.
The approach is characterized by the following technical aspects:
- Multi-Task Pretraining:
The model is trained using two auxiliary tasks:
- Character-Level Automatic Speech Recognition (ASR): Serves as the audio-domain analog of the language modeling task, providing semantic structure.
- Person Identification: Exploits speaker-discriminative information available in open datasets (e.g., VoxCeleb2) to encourage the network to capture robust identity features.
These tasks are combined using a weighted loss function (with weights 0.8 for ASR and 0.2 for person identification), guiding the training process toward a representation that is resilient and generalizable across downstream tasks.
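As an illustration, the two objectives could be combined as a simple weighted sum of per-task losses. The sketch below is a minimal, assumption-laden example (the use of cross-entropy for both tasks, the tensor shapes, and the function names are invented for illustration); only the 0.8/0.2 weights come from the paper.

```python
import torch.nn as nn

# Illustrative multi-task loss combination; only the 0.8/0.2 weights are
# reported in the paper. The loss types, shapes, and padding handling
# are assumptions.
asr_criterion = nn.CrossEntropyLoss(ignore_index=0)  # char-level ASR targets
spk_criterion = nn.CrossEntropyLoss()                # person-ID targets

W_ASR, W_SPK = 0.8, 0.2  # weights reported in the paper

def multitask_loss(asr_logits, asr_targets, spk_logits, spk_targets):
    """asr_logits: (batch, time, chars); spk_logits: (batch, n_speakers)."""
    asr_loss = asr_criterion(asr_logits.flatten(0, 1), asr_targets.flatten())
    spk_loss = spk_criterion(spk_logits, spk_targets)
    return W_ASR * asr_loss + W_SPK * spk_loss
```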
- Cross-Modal Attention and Transformer Architecture:
The network uses a cross-modal transformer architecture that projects features from the visual and textual domains into the audio space via cross-modal attention modules. The attention mechanism is the standard scaled dot-product attention,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimensionality of the keys and queries.
- Encoder Configuration: The core encoder consists of an audio-specific subnetwork plus two additional transformer modules that fuse the visual and textual modalities. The final embeddings are formed as a weighted sum of the modality-specific features, with weights of 0.4 for audio, 0.4 for visual, and 0.2 for text (an illustrative sketch of this fusion follows below).
- Decoder Details: For the ASR task a Transformer decoder is employed, while speaker classification in the person identification task uses an affine transformation over the averaged embeddings.
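A minimal sketch of the cross-modal fusion, assuming a PyTorch implementation: single multi-head attention layers stand in for the paper's 4-layer cross-modal transformers, and everything except the audio-as-query design, the 512-dimensional/4-head setting, and the 0.4/0.4/0.2 weights is an assumption.

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Project video/text features into the audio space with cross-modal
    attention (audio as query), then fuse with fixed modality weights.
    Illustrative sketch only: one attention layer per modality replaces
    the full cross-modal transformer stacks."""

    def __init__(self, d_model=512, n_heads=4):
        super().__init__()
        self.video_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.weights = {"audio": 0.4, "video": 0.4, "text": 0.2}  # from the paper

    def forward(self, audio, video, text):
        # All inputs: (batch, time, d_model), already projected to d_model.
        video_in_audio, _ = self.video_to_audio(query=audio, key=video, value=video)
        text_in_audio, _ = self.text_to_audio(query=audio, key=text, value=text)
        return (self.weights["audio"] * audio
                + self.weights["video"] * video_in_audio
                + self.weights["text"] * text_in_audio)
```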
- Downstream Task and Transfer Learning:
The learned embeddings are fine-tuned on the emotion recognition task using the CMU-MOSEI dataset. The dataset comprises six emotion classes (happy, sad, angry, disgust, surprise, and fear), with labels binarized from Likert-scale annotations. The experimental evaluation contrasts two architectures:
- A late fusion model employing bidirectional GRUs for each modality,
- The transformer-based model described above.
The transformer baseline demonstrates superior performance over the late fusion approach, with numerical improvements in the majority of the emotion categories. For instance, it achieves 67.8% weighted accuracy and an F1-score of 67.5% for the happy emotion, with comparable gains in the other categories.
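For reference, the late fusion baseline could look roughly like the following sketch (hidden size, temporal average pooling, and the single linear classifier head are assumptions; the per-modality feature dimensions are those listed under the implementation details).

```python
import torch
import torch.nn as nn

class LateFusionGRU(nn.Module):
    """Sketch of the late-fusion baseline: one bidirectional GRU per modality,
    temporal pooling, concatenation, and a linear classifier. Hidden size,
    pooling strategy, and classifier head are illustrative assumptions."""

    def __init__(self, dims=None, hidden=128, n_classes=6):
        super().__init__()
        dims = dims or {"audio": 200, "video": 4096, "text": 300}
        self.grus = nn.ModuleDict({
            name: nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
            for name, dim in dims.items()
        })
        self.classifier = nn.Linear(2 * hidden * len(dims), n_classes)

    def forward(self, inputs):
        # inputs: dict of (batch, time, dim) tensors, one per modality.
        pooled = [gru(inputs[name])[0].mean(dim=1) for name, gru in self.grus.items()]
        return self.classifier(torch.cat(pooled, dim=-1))  # one logit per emotion
```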
- Ablation Studies:
The paper also conducts ablation experiments to assess the impact of missing modalities:
- When individual modalities (video or text) are removed (by assigning them a zero weight during inference, as sketched after this list), the learned embedding representations continue to outperform the baseline system.
- Notably, the impact of the text modality is less pronounced, owing to the use of pre-trained GloVe (Global Vectors for Word Representation) embeddings, which already encode substantial semantic information.
- The analysis indicates an absolute improvement of 8.6% in weighted average accuracy when text is added to audio, as opposed to a 3.3% improvement when visual data is incorporated.
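Under the illustrative CrossModalFusion sketch above, dropping a modality at inference amounts to zeroing its fusion weight; whether the remaining weights are renormalized is not specified here, so they are left unchanged.

```python
def drop_modality(fusion, modality):
    """Remove one modality's contribution at inference time (illustrative only)."""
    fusion.weights[modality] = 0.0

# Example: evaluate with audio and text only.
# drop_modality(model.fusion, "video")
```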
- Experimental Setup and Implementation Details:
- Datasets: VoxCeleb2 is used for multi-task training, with non-English utterances filtered out using likelihood scores from a TDNN ASR model trained on Librispeech. CMU-MOSEI is used to evaluate the downstream emotion recognition task.
- Feature Extraction: Audio is represented by 200-dimensional features (40-dimensional LFBE vectors stacked over 5 frames), visual features are 4096-dimensional (extracted from VGG-16), and text features are 300-dimensional GloVe vectors.
- Model Configuration: The transformer encoder uses 4 layers with a feature dimension of 512, 4 attention heads, and a feed-forward layer of 200 dimensions. The cross-modal transformer employs 4 layers to project video and text into the audio domain.
- Training Protocol: The model is implemented in PyTorch and trained with learning schedules inspired by prior transformer-based speech recognition models.
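For concreteness, the reported encoder hyperparameters map onto a standard PyTorch transformer encoder roughly as follows; the linear projection from the 200-dimensional stacked LFBE inputs to the 512-dimensional model space, and the dummy batch shapes, are assumptions.

```python
import torch
import torch.nn as nn

# Audio input: 40-dim LFBE stacked over 5 frames -> 200-dim vectors (reported).
audio_proj = nn.Linear(5 * 40, 512)  # projection to the model dimension (assumed)

# Encoder: 4 layers, d_model=512, 4 attention heads, 200-dim feed-forward (reported).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=4, dim_feedforward=200, batch_first=True)
audio_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Example forward pass on a dummy batch of 100 stacked-LFBE frames.
x = torch.randn(8, 100, 200)                 # (batch, time, stacked LFBE)
embeddings = audio_encoder(audio_proj(x))    # (8, 100, 512)
```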
- Conclusions and Future Directions:
The paper demonstrates that multi-task pretraining on large-scale multi-modal datasets can yield robust embeddings that enhance performance on the emotion recognition task, achieving state-of-the-art results on CMU-MOSEI. The numerical improvements, such as an absolute increase of over 1% in weighted accuracy for certain emotion classes, underscore the efficacy of the approach. Future work is suggested to address the architecture's dependency on the audio modality during inference, possibly by integrating unsupervised methods (e.g., skip-thought or self-supervised training techniques) and by expanding the multi-task objectives to include visual tasks such as landmark detection for further generalization.
This multi-modal, multi-task framework represents a well-structured attempt at leveraging large-scale datasets for improved embedding learning, and it shows promise in scenarios where certain modalities may be missing during inference.