
Learning Alignment for Multimodal Emotion Recognition from Speech (1909.05645v2)

Published 6 Sep 2019 in cs.CL, cs.SD, and eess.AS

Abstract: Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Furthermore, although emotion recognition would benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build separate models for the two input sources and combine them at the decision level, but this approach ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
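To make the core idea concrete, here is a minimal PyTorch sketch of word-to-frame attention alignment as the abstract describes it: each text word attends over all speech frames, and the attention-weighted speech context is fused with the word embedding. The module name, feature dimensions, and the additive (Bahdanau-style) scoring function are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalAlignment(nn.Module):
    """Sketch: align speech frames to text words via additive attention.
    Dimensions and scoring function are illustrative assumptions."""

    def __init__(self, speech_dim=128, text_dim=300, attn_dim=128):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, attn_dim)
        self.text_proj = nn.Linear(text_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, speech, text):
        # speech: (B, T_s, speech_dim) frame-level acoustic features
        # text:   (B, T_w, text_dim)   word embeddings
        s = self.speech_proj(speech).unsqueeze(1)   # (B, 1,   T_s, A)
        t = self.text_proj(text).unsqueeze(2)       # (B, T_w, 1,   A)
        # One alignment score per (word, frame) pair.
        scores = self.score(torch.tanh(s + t)).squeeze(-1)  # (B, T_w, T_s)
        weights = torch.softmax(scores, dim=-1)     # attend over frames
        # Each word gets a speech-context vector: weighted sum of frames.
        speech_ctx = torch.bmm(weights, speech)     # (B, T_w, speech_dim)
        # Fuse the aligned speech context with the word embedding.
        return torch.cat([text, speech_ctx], dim=-1)  # (B, T_w, text+speech)

# Usage: the fused word-level sequence would then feed a sequential model
# (e.g., an LSTM) followed by an emotion classifier.
aligner = CrossModalAlignment()
fused = aligner(torch.randn(2, 500, 128), torch.randn(2, 20, 300))
print(fused.shape)  # torch.Size([2, 20, 428])
```

The point of aligning at the word level, rather than fusing utterance-level vectors, is that it preserves the temporal interaction between modalities that decision-level fusion discards.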

Authors (6)
  1. Haiyang Xu (67 papers)
  2. Hui Zhang (405 papers)
  3. Kun Han (39 papers)
  4. Yun Wang (229 papers)
  5. Yiping Peng (13 papers)
  6. Xiangang Li (46 papers)
Citations (115)
