Bridging Modalities: Knowledge Distillation and Masked Training for Translating Multi-Modal Emotion Recognition to Uni-Modal, Speech-Only Emotion Recognition (2401.03000v1)
Abstract: This paper presents an innovative approach to address the challenges of translating multi-modal emotion recognition models to a more practical and resource-efficient uni-modal counterpart, specifically focusing on speech-only emotion recognition. Recognizing emotions from speech signals is a critical task with applications in human-computer interaction, affective computing, and mental health assessment. However, existing state-of-the-art models often rely on multi-modal inputs, incorporating information from multiple sources such as facial expressions and gestures, which may not be readily available or feasible in real-world scenarios. To tackle this issue, we propose a novel framework that leverages knowledge distillation and masked training techniques.
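The two techniques the abstract names can be illustrated in miniature. The sketch below is a hedged illustration, not the paper's implementation: `distillation_loss` is the standard soft-label knowledge-distillation objective (Hinton et al., 2015), and `mask_modalities` shows the general idea of masked training by randomly zeroing a non-speech modality so a model can be pushed toward speech-only operation. All function names, shapes, and the masking probability are hypothetical choices for this example.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # Soft-label KD loss: KL(teacher || student) over temperature-softened
    # distributions, scaled by T^2 as in Hinton et al. (2015).
    p = softmax(teacher_logits, T)   # multi-modal teacher's soft targets
    q = softmax(student_logits, T)   # speech-only student's predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean() * T ** 2)

def mask_modalities(audio_feats, video_feats, p_mask=0.5, rng=None):
    # Masked training (illustrative): randomly zero the non-speech modality
    # per example so the model learns not to depend on it.
    rng = rng if rng is not None else np.random.default_rng(0)
    keep = rng.random(len(video_feats)) >= p_mask   # False -> mask this example
    return audio_feats, video_feats * keep[:, None]
```

In a training loop, the multi-modal teacher's logits would supply the soft targets, while the student sees only the audio stream; the masking step lets the teacher itself be trained to tolerate missing visual input before distillation.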