
Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition (2403.19224v1)

Published 28 Mar 2024 in cs.SD and eess.AS

Abstract: The mainstream paradigm of speech emotion recognition (SER) is to identify a single emotion label for an entire utterance. This line of work neglects emotion dynamics at fine temporal granularity and mostly fails to leverage the linguistic information in the speech signal explicitly. In this paper, we propose the Emotion Neural Transducer (ENT) for fine-grained speech emotion recognition with joint automatic speech recognition (ASR) training. We first extend the typical neural transducer with an emotion joint network to construct an emotion lattice for fine-grained SER. We then propose lattice max pooling on the alignment lattice to help distinguish emotional from non-emotional frames. To adapt fine-grained SER to the transducer inference manner, we further let blank, the special symbol of ASR, serve as an underlying emotion indicator as well, yielding the Factorized Emotion Neural Transducer. For typical utterance-level SER, our ENT models outperform state-of-the-art methods on IEMOCAP while maintaining a low word error rate. Experiments on IEMOCAP and the recent speech emotion diarization dataset ZED also demonstrate the superiority of fine-grained emotion modeling. Our code is available at https://github.com/ECNU-Cross-Innovation-Lab/ENT.
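To make the abstract's two core ideas concrete, here is a minimal NumPy sketch of (a) an emotion joint network that scores every point of a transducer alignment lattice and (b) lattice max pooling, which trains on the single most emotional lattice point. This is not the authors' implementation: the function names, the tanh joint, and the exact pooling rule are assumptions reconstructed from the abstract alone; see the linked repository for the real code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def emotion_joint(enc, pred, W_enc, W_pred, W_out):
    """Emotion joint network (sketch): combine acoustic states (T, H)
    and label/prediction states (U, H) into emotion logits at every
    (t, u) lattice point, mirroring a transducer's joint network."""
    joint = np.tanh(enc[:, None, :] @ W_enc + pred[None, :, :] @ W_pred)  # (T, U, H)
    return joint @ W_out  # (T, U, E): emotion lattice

def lattice_max_pool_loss(logits, label):
    """Lattice max pooling (sketch): pick the lattice point with the
    highest posterior for the utterance-level emotion label and apply
    cross-entropy there, so one strongly emotional frame suffices."""
    probs = softmax(logits, axis=-1)            # (T, U, E)
    t_star = probs[..., label].ravel().argmax() # most emotional point
    p = probs.reshape(-1, probs.shape[-1])[t_star, label]
    return -np.log(p + 1e-9)
```

With random weights, `emotion_joint` yields a `(T, U, E)` emotion lattice and `lattice_max_pool_loss` returns a positive scalar; in training, the acoustic and prediction states would come from the shared transducer encoder and predictor rather than random tensors.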

Authors (5)
  1. Siyuan Shen (22 papers)
  2. Yu Gao (281 papers)
  3. Feng Liu (1213 papers)
  4. Hanyang Wang (17 papers)
  5. Aimin Zhou (43 papers)
Citations (4)
