
Bridging Modalities: Knowledge Distillation and Masked Training for Translating Multi-Modal Emotion Recognition to Uni-Modal, Speech-Only Emotion Recognition (2401.03000v1)

Published 4 Jan 2024 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: This paper presents an innovative approach to address the challenges of translating multi-modal emotion recognition models to a more practical and resource-efficient uni-modal counterpart, specifically focusing on speech-only emotion recognition. Recognizing emotions from speech signals is a critical task with applications in human-computer interaction, affective computing, and mental health assessment. However, existing state-of-the-art models often rely on multi-modal inputs, incorporating information from multiple sources such as facial expressions and gestures, which may not be readily available or feasible in real-world scenarios. To tackle this issue, we propose a novel framework that leverages knowledge distillation and masked training techniques.
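The abstract only names the two core ingredients of the framework: knowledge distillation from a multi-modal teacher and masked training that suppresses the non-speech modalities. As a rough illustration of what those pieces typically look like (not the paper's exact formulation), a soft-label distillation loss and a modality-masking step could be sketched as follows; the `temperature`, `alpha`, and `p_mask` parameters and the `mask_modalities` helper are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-label knowledge distillation (in the style of Hinton et al., 2015).

    Mixes a KL-divergence term between temperature-softened teacher and
    student distributions with ordinary cross-entropy on the hard labels.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def mask_modalities(text_feats, video_feats, p_mask=0.5):
    """Illustrative masked-training step: randomly zero out the non-speech
    modalities so the student learns to predict from speech features alone."""
    if torch.rand(1).item() < p_mask:
        text_feats = torch.zeros_like(text_feats)
        video_feats = torch.zeros_like(video_feats)
    return text_feats, video_feats
```

In this kind of setup, the multi-modal teacher would see all modalities while the speech-only student is trained on the distillation objective, with masking gradually pushing it toward relying solely on the speech stream.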

