Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition (2401.10536v1)
Abstract: The Swin-Transformer has demonstrated remarkable success in computer vision through its hierarchical, Transformer-based feature representation. In speech signals, emotional information is distributed across features at different scales, e.g., word, phrase, and utterance. Drawing on this inspiration, this paper presents a hierarchical speech Transformer with shifted windows, called Speech Swin-Transformer, that aggregates multi-scale emotional features for speech emotion recognition (SER). Specifically, we first divide the speech spectrogram in the time domain into segment-level patches, each composed of multiple frame-level patches. These segment-level patches are then encoded by a stack of Swin blocks, in which a local-window Transformer explores local inter-frame emotional information across the frame patches of each segment. We further design a shifted-window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, a patch-merging operation aggregates segment-level emotional features into a hierarchical speech representation, expanding the Transformer's receptive field from the frame level to the segment level. Experimental results demonstrate that the proposed Speech Swin-Transformer outperforms state-of-the-art methods.
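The pipeline described above (local-window attention, shifted-window attention, then patch merging) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the identity Q/K/V projections, window size of 4, feature dimension of 8, and two-stage depth are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window):
    """Self-attention restricted to non-overlapping windows of frame patches.
    x: (T, d) frame features; T is assumed divisible by `window`."""
    T, d = x.shape
    out = np.empty_like(x)
    for s in range(0, T, window):
        w = x[s:s + window]                    # one local window (window, d)
        scores = w @ w.T / np.sqrt(d)          # identity Q/K/V projections for brevity
        out[s:s + window] = softmax(scores) @ w
    return out

def swin_block(x, window):
    """One Swin-style pair: local-window attention, then shifted-window
    attention (cyclic roll by half a window) so attention windows straddle
    the previous window boundaries."""
    x = x + window_attention(x, window)        # local inter-frame attention
    shift = window // 2
    rolled = np.roll(x, -shift, axis=0)        # shift so new windows cross old edges
    x = x + np.roll(window_attention(rolled, window), shift, axis=0)
    return x

def patch_merge(x):
    """Concatenate each pair of adjacent patches along time: halves the
    sequence length, doubles the channels, expanding the receptive field."""
    T, d = x.shape
    return x.reshape(T // 2, 2 * d)

# Toy forward pass over a "spectrogram" of 16 frame patches with 8 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))
for _ in range(2):                             # two hierarchical stages
    feats = swin_block(feats, window=4)
    feats = patch_merge(feats)
print(feats.shape)  # → (4, 32)
```

Each stage halves the temporal resolution while widening the channels, which is how the receptive field grows from frame-level toward segment-level; a full model would add learned projections, feed-forward layers, and a classification head on the final pooled features.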