MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction (2401.13260v2)

Published 24 Jan 2024 in cs.CL, cs.MM, cs.SD, and eess.AS

Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue of this approach is that ASR errors from the text modality can worsen the performance of SER. Previous studies have proposed using an auxiliary ASR error detection task to adaptively assign weights to each word in ASR hypotheses. However, this approach has limited improvement potential because it does not address the coherence of semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making their fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multimodal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1%.
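
The abstract describes a multi-task architecture: audio and ASR-text encoders, a fusion module that learns shared cross-modal representations, and three objectives (utterance-level emotion classification, token-level AED, and AEC). The PyTorch sketch below is a rough illustration of such a setup only; the projected WavLM/BERT-style input features, dimensions, Transformer fusion layer, heads, and equal loss weighting are all assumptions for clarity, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of a multi-task SER model in the spirit of
# MF-AED-AEC: audio and ASR-text features, a shared fusion space, and three heads
# (emotion classification, token-level ASR error detection, ASR error correction).
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, d_model=256, n_emotions=4, vocab_size=30522):
        super().__init__()
        # Project pre-extracted features (e.g., WavLM audio / BERT text embeddings,
        # both 768-dim here by assumption) into a shared dimension.
        self.audio_proj = nn.Linear(768, d_model)
        self.text_proj = nn.Linear(768, d_model)
        # Cross-modal fusion over the concatenated sequences (an assumption standing
        # in for the paper's fusion method that learns shared representations).
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        # Task heads.
        self.emotion_head = nn.Linear(d_model, n_emotions)   # utterance-level SER
        self.aed_head = nn.Linear(d_model, 2)                 # per-token correct/error
        self.aec_head = nn.Linear(d_model, vocab_size)        # per-token corrected word

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, Ta, 768), text_feats: (B, Tt, 768)
        a = self.audio_proj(audio_feats)
        t = self.text_proj(text_feats)
        fused = self.fusion(torch.cat([a, t], dim=1))         # (B, Ta+Tt, d_model)
        text_part = fused[:, a.size(1):, :]                   # positions of the ASR tokens
        emotion_logits = self.emotion_head(fused.mean(dim=1)) # pooled utterance representation
        aed_logits = self.aed_head(text_part)                 # detect erroneous tokens
        aec_logits = self.aec_head(text_part)                 # predict corrected tokens
        return emotion_logits, aed_logits, aec_logits

# Joint training sums the three losses; equal weighting is an assumption.
model = MultiTaskSER()
audio, text = torch.randn(2, 50, 768), torch.randn(2, 20, 768)
emo, aed, aec = model(audio, text)
loss = (nn.functional.cross_entropy(emo, torch.tensor([0, 1]))
        + nn.functional.cross_entropy(aed.reshape(-1, 2), torch.randint(0, 2, (2 * 20,)))
        + nn.functional.cross_entropy(aec.reshape(-1, 30522), torch.randint(0, 30522, (2 * 20,))))
loss.backward()
```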

Authors (4)
  1. Jiajun He (28 papers)
  2. Xiaohan Shi (4 papers)
  3. Xingfeng Li (4 papers)
  4. Tomoki Toda (106 papers)
Citations (11)