TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition (2404.12979v2)
Abstract: One persistent challenge in Speech Emotion Recognition (SER) is ubiquitous environmental noise, which frequently degrades SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. We then use clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of the enhanced speech during model training. Experimental results show that TRNet substantially improves the system's robustness in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
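The two refinement levels named in the abstract can be pictured as two auxiliary training terms added to the emotion classification loss. The sketch below is only an illustration under stated assumptions: the abstract does not specify the loss functions, so mean squared error for both refinement terms, the weights `alpha` and `beta`, and all function names are hypothetical and are not taken from the paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized flat lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Cross-entropy of one example, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def two_level_loss(enh_spec, clean_spec, enh_repr, clean_repr,
                   logits, label, alpha=1.0, beta=1.0):
    """Emotion loss plus two refinement terms.

    Assumption: both refinement terms are MSE against the clean reference,
    and alpha/beta are hypothetical weights; the paper may differ.
    """
    low = mse(enh_spec, clean_spec)    # level 1: spectrogram distortion
    high = mse(enh_repr, clean_repr)   # level 2: representation shift
    return cross_entropy(logits, label) + alpha * low + beta * high

# Usage: flattened toy spectrograms and representations for one utterance.
loss = two_level_loss(
    enh_spec=[0.5, 1.0], clean_spec=[0.5, 1.5],
    enh_repr=[0.2, 0.4], clean_repr=[0.2, 0.4],
    logits=[2.0, 0.0, 0.0], label=0,
)
```

When the enhanced inputs match the clean references exactly, both refinement terms vanish and only the classification loss remains. That is consistent with the abstract's claim that clean-condition performance is not compromised, though this toy example does not demonstrate that result.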