Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition (2309.10294v1)
Abstract: In this paper, we explored how to boost speech emotion recognition (SER) with a state-of-the-art speech pre-trained model (PTM), data2vec, the text generation technique GPT-4, and the speech synthesis technique Azure TTS. First, we investigated the representation ability of different speech self-supervised pre-trained models and found that data2vec provides strong representations for SER. Second, we employed a powerful large language model (LLM), GPT-4, and an emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompts and the dataset construction pipeline to obtain high-quality synthetic emotional speech data. Third, we studied different ways of augmenting SER with the synthetic speech, including random mixing, adversarial training, transfer learning, and curriculum learning. Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of our method compared with other data augmentation methods and with augmentation using other synthetic data.
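To make the first step concrete, the sketch below probes a frozen data2vec encoder with a small linear head, in the spirit of the representation-ability comparison described in the abstract. It is a minimal sketch, not the authors' code: the Hugging Face checkpoint name, the mean-pooling readout, and the 4-class IEMOCAP-style label set are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoProcessor, Data2VecAudioModel

# Frozen speech PTM; only the small head below would be trained.
processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")
encoder = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base-960h")
encoder.eval()

class EmotionProbe(nn.Module):
    """Mean-pool frame-level PTM features, then classify into 4 emotions
    (angry / happy / neutral / sad, the usual IEMOCAP setup)."""
    def __init__(self, hidden_size=768, num_classes=4):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, waveforms):  # list of 1-D numpy arrays at 16 kHz
        inputs = processor(waveforms, sampling_rate=16000,
                           return_tensors="pt", padding=True)
        with torch.no_grad():  # keep the PTM frozen
            frames = encoder(**inputs).last_hidden_state  # (B, T, 768)
        return self.head(frames.mean(dim=1))  # utterance-level logits
```

Swapping in wav2vec 2.0, HuBERT, or WavLM only changes the two `from_pretrained` names, which is one way the different PTMs could be compared under an identical head.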
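For the second step, GPT-4-generated emotional text is paired with Azure's emotional neural TTS. The sketch below shows one plausible wiring, assuming the `openai` and `azure-cognitiveservices-speech` Python SDKs; the prompt wording, the `en-US-JennyNeural` voice, and the mapping from emotion labels to SSML `express-as` styles are illustrative assumptions, not the paper's exact configuration.

```python
from openai import OpenAI
import azure.cognitiveservices.speech as speechsdk

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def emotional_sentences(emotion: str, n: int = 5) -> list[str]:
    """Ask GPT-4 for short sentences whose content matches `emotion`."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Write {n} short conversational sentences a "
                              f"person feeling {emotion} might say, one per "
                              f"line, without numbering."}])
    return resp.choices[0].message.content.strip().splitlines()

def synthesize(text: str, style: str, out_path: str) -> None:
    """Render `text` with an Azure neural voice in the given emotional
    style via SSML's mstts:express-as extension."""
    ssml = (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>"
        "<voice name='en-US-JennyNeural'>"
        f"<mstts:express-as style='{style}'>{text}</mstts:express-as>"
        "</voice></speak>")
    config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    audio = speechsdk.audio.AudioOutputConfig(filename=out_path)
    speechsdk.SpeechSynthesizer(config, audio).speak_ssml_async(ssml).get()

# Example: build "sad" utterances. Style names must be ones the chosen
# voice actually supports (e.g. 'sad', 'angry', 'cheerful' for Jenny).
for i, line in enumerate(emotional_sentences("sad")):
    synthesize(line, "sad", f"synthetic_sad_{i}.wav")
```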
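For the third step, the abstract lists four ways of folding the synthetic speech into training. The sketch below illustrates two of them under stated assumptions: a standard gradient-reversal layer for the adversarial-training variant (following domain-adversarial training of neural networks, cited in the reference list), and a single sampling probability `p_synth` that covers random mixing when fixed and a synthetic-to-real curriculum when annealed. Transfer learning would correspond to pre-training on synthetic data alone and then fine-tuning on IEMOCAP.

```python
import random
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass, negated gradient in the backward pass,
    so the encoder is pushed toward features a domain classifier cannot
    use to tell real speech from synthetic speech."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def domain_adversarial_loss(features, domain_head, is_synthetic, lambd=0.1):
    """Auxiliary loss added to the usual emotion loss: domain_head predicts
    real (0) vs. synthetic (1) from gradient-reversed features.
    `is_synthetic` is a LongTensor of 0/1 domain labels."""
    logits = domain_head(GradReverse.apply(features, lambd))
    return torch.nn.functional.cross_entropy(logits, is_synthetic)

def mixed_batch(real_items, synth_items, p_synth, batch_size=32):
    """Random mixing: each example is drawn from the synthetic pool with
    probability p_synth. Annealing p_synth from high to low over epochs
    yields a simple synthetic-first curriculum."""
    return [random.choice(synth_items if random.random() < p_synth
                          else real_items)
            for _ in range(batch_size)]
```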
- “Survey of deep representation learning for speech emotion recognition,” IEEE Trans. on Affective Computing, 2021.
- “SUPERB: Speech processing universal performance benchmark,” Proc. of Interspeech, 2021.
- “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, 2008.
- “wav2vec: Unsupervised pre-training for speech recognition,” Proc. of Interspeech, 2019.
- “vq-wav2vec: Self-supervised learning of discrete speech representations,” Proc. of ICLR, 2020.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Proc. of NeurIPS, 2020.
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, 2021.
- “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE J. of Selected Topics in Signal Processing, 2022.
- “data2vec: A general framework for self-supervised learning in speech, vision and language,” Proc. of ICML, 2022.
- “MT4SSL: Boosting self-supervised speech representation learning by integrating multiple targets,” Proc. of Interspeech, 2023.
- “TESSP: Text-enhanced self-supervised speech pre-training,” arXiv preprint arXiv:2211.13443, 2022.
- “MMSpeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition,” Proc. of Interspeech, 2023.
- “Pushing the limits of unsupervised unit discovery for SSL speech representation,” Proc. of Interspeech, 2023.
- “Reducing barriers to self-supervised learning: HuBERT pre-training with academic compute,” Proc. of Interspeech, 2023.
- “Large-scale self-supervised speech representation learning for automatic speaker verification,” Proc. of ICASSP, 2022.
- “Integrating emotion recognition with speech recognition and speaker diarisation for conversations,” Proc. of Interspeech, 2023.
- “Emotion recognition from speech using wav2vec 2.0 embeddings,” Proc. of Interspeech, 2021.
- “Exploration of a self-supervised speech model: A study on emotional corpora,” Proc. of SLT, 2022.
- “Speech emotion recognition using self-supervised features,” Proc. of ICASSP, 2022.
- “Exploring wav2vec 2.0 fine-tuning for improved speech emotion recognition,” Proc. of ICASSP, 2023.
- “Speaker normalization for self-supervised speech emotion recognition,” Proc. of ICASSP, 2022.
- “A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2021.
- “Dawn of the Transformer era in speech emotion recognition: closing the valence gap,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 2023.
- “Towards paralinguistic-only speech representations for end-to-end speech emotion recognition,” Proc. of Interspeech, 2023.
- “Vesper: A compact and effective pretrained model for speech emotion recognition,” arXiv preprint arXiv:2307.10757, 2023.
- “Audio augmentation for speech recognition,” Proc. of Interspeech, 2015.
- “SpecAugment: A simple data augmentation method for automatic speech recognition,” Proc. of Interspeech, 2019.
- “mixup: Beyond empirical risk minimization,” Proc. of ICLR, 2018.
- “Direct modelling of speech emotion from raw speech,” Proc. of Interspeech, 2019.
- “On the robustness of speech emotion recognition for human-robot interaction with deep neural networks,” Proc. of IROS, 2018.
- “Multitask learning from augmented auxiliary data for improving speech emotion recognition,” IEEE Trans. on Affective Computing, 2022.
- “Deep architecture enhancing robustness to noise, adversarial attacks, and cross-corpus setting for speech emotion recognition,” Proc. of Interspeech, 2020.
- “x-vectors meet emotions: A study on dependencies between emotion and speaker recognition,” Proc. of ICASSP, 2020.
- “Effects of data augmentations on speech emotion recognition,” Sensors, 2022.
- “CycleGAN-based emotion style transfer as data augmentation for speech emotion recognition,” Proc. of Interspeech, 2019.
- “StarGAN for emotional speech conversion: Validated by data augmentation of end-to-end emotion recognition,” Proc. of ICASSP, 2020.
- “A preliminary study on augmenting speech emotion recognition using a diffusion model,” Proc. of Interspeech, 2023.
- “Refashioning emotion recognition modelling: The advent of generalised large models,” arXiv preprint arXiv:2308.11578, 2023.
- OpenAI, “GPT-4 technical report,” 2023.
- “EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance,” Proc. of ICASSP, 2023.
- “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021.
- “SSML: A speech synthesis markup language,” Speech Communication, 1997.
- “Domain-adversarial training of neural networks,” J. of Machine Learning Research, 2016.
- Ziyang Ma
- Wen Wu
- Zhisheng Zheng
- Yiwei Guo
- Qian Chen
- Shiliang Zhang
- Xie Chen