EMO-Codec: An In-Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations (2407.15458v4)
Abstract: Neural codec models reduce speech-transmission delay and serve as the foundational tokenizers for speech language models (speech LMs). Preserving emotional information through codecs is crucial for effective communication and context understanding, yet emotion loss in existing codecs has received little study. This paper evaluates neural and legacy codecs with both subjective and objective methods on emotion datasets such as IEMOCAP, identifying which codecs best preserve emotional information under various bitrate scenarios. We found that training codec models on both English and Chinese data had limited success in retaining emotional information in Chinese. Moreover, resynthesizing speech through these codecs degrades speech emotion recognition (SER) performance, particularly for emotions such as sadness, depression, fear, and disgust. Human listening tests confirmed these findings. This work guides future speech technology development to ensure that new codecs maintain the integrity of emotional information in speech.
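The objective protocol the abstract describes, resynthesizing speech through a codec and then scoring it with an SER model, can be sketched in a few lines. Below is a minimal Python sketch assuming the HuggingFace `facebook/encodec_24khz` codec and the `superb/wav2vec2-base-superb-er` SER probe as illustrative stand-ins for the codecs and recognizers studied in the paper; the input file name and the 6 kbps bandwidth setting are hypothetical.

```python
# Sketch: compare an SER model's prediction on an utterance before and
# after codec resynthesis. Checkpoints below are illustrative stand-ins,
# not necessarily the exact models or protocol used in the paper.
import librosa
import torch
from transformers import AutoProcessor, EncodecModel, pipeline

CODEC_ID = "facebook/encodec_24khz"        # one neural codec; the paper evaluates several
SER_ID = "superb/wav2vec2-base-superb-er"  # 4-class IEMOCAP SER probe (assumed stand-in)

codec = EncodecModel.from_pretrained(CODEC_ID)
codec_proc = AutoProcessor.from_pretrained(CODEC_ID)
ser = pipeline("audio-classification", model=SER_ID)


def resynthesize(wav, sr=24_000, bandwidth=6.0):
    """Encode and decode a mono waveform through the codec at a target bitrate (kbps)."""
    inputs = codec_proc(raw_audio=wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = codec(**inputs, bandwidth=bandwidth)  # encode, quantize, decode
    return out.audio_values.squeeze().cpu().numpy()


# Hypothetical input file; the 24 kHz EnCodec variant expects 24 kHz mono audio.
wav24, _ = librosa.load("emotional_utterance.wav", sr=24_000, mono=True)
resynth24 = resynthesize(wav24)

# The SER probe was trained on 16 kHz audio, so resample before scoring.
to16 = lambda x: librosa.resample(x, orig_sr=24_000, target_sr=16_000)
before, after = ser(to16(wav24))[0], ser(to16(resynth24))[0]
print(f"original:       {before['label']} ({before['score']:.2f})")
print(f"resynthesized:  {after['label']} ({after['score']:.2f})")
```

Sweeping `bandwidth` over the codec's supported bitrates and aggregating label flips per emotion class would reproduce, in miniature, the kind of bitrate-versus-emotion analysis the paper reports.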
Authors: Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Yu Tsao