EMO-Codec: An In-Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations (2407.15458v4)
Abstract: Neural codec models reduce speech-transmission delay and serve as the foundational tokenizers for speech language models (speech LMs). Preserving emotional information through codecs is crucial for effective communication and context understanding, yet emotion loss in existing codecs has received little study. This paper evaluates neural and legacy codecs with both subjective and objective methods on emotion datasets such as IEMOCAP, identifying which codecs best preserve emotional information under various bitrate scenarios. We found that training codec models on both English and Chinese data had limited success in retaining emotional information in Chinese. Moreover, resynthesizing speech through these codecs degrades speech emotion recognition (SER) performance, particularly for emotions such as sadness, depression, fear, and disgust. Human listening tests confirmed these findings. This work guides future speech technology development to ensure that new codecs maintain the integrity of emotional information in speech.
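The objective protocol the abstract describes, resynthesizing speech through a codec and then scoring it with an SER model, can be sketched in a few lines. Below is a minimal Python sketch assuming the HuggingFace `facebook/encodec_24khz` codec and the `superb/wav2vec2-base-superb-er` SER probe as illustrative stand-ins for the codecs and recognizers studied in the paper; the input file name and the 6 kbps bandwidth setting are hypothetical.

```python
# Sketch: compare an SER model's prediction on an utterance before and
# after codec resynthesis. Checkpoints below are illustrative stand-ins,
# not necessarily the exact models or protocol used in the paper.
import librosa
import torch
from transformers import AutoProcessor, EncodecModel, pipeline

CODEC_ID = "facebook/encodec_24khz"        # one neural codec; the paper evaluates several
SER_ID = "superb/wav2vec2-base-superb-er"  # 4-class IEMOCAP SER probe (assumed stand-in)

codec = EncodecModel.from_pretrained(CODEC_ID)
codec_proc = AutoProcessor.from_pretrained(CODEC_ID)
ser = pipeline("audio-classification", model=SER_ID)


def resynthesize(wav, sr=24_000, bandwidth=6.0):
    """Encode and decode a mono waveform through the codec at a target bitrate (kbps)."""
    inputs = codec_proc(raw_audio=wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = codec(**inputs, bandwidth=bandwidth)  # encode, quantize, decode
    return out.audio_values.squeeze().cpu().numpy()


# Hypothetical input file; the 24 kHz EnCodec variant expects 24 kHz mono audio.
wav24, _ = librosa.load("emotional_utterance.wav", sr=24_000, mono=True)
resynth24 = resynthesize(wav24)

# The SER probe was trained on 16 kHz audio, so resample before scoring.
to16 = lambda x: librosa.resample(x, orig_sr=24_000, target_sr=16_000)
before, after = ser(to16(wav24))[0], ser(to16(resynth24))[0]
print(f"original:       {before['label']} ({before['score']:.2f})")
print(f"resynthesized:  {after['label']} ({after['score']:.2f})")
```

Sweeping `bandwidth` over the codec's supported bitrates and aggregating label flips per emotion class would reproduce, in miniature, the kind of bitrate-versus-emotion analysis the paper reports.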
Authors: Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Yu Tsao