Speaker Characterization by means of Attention Pooling (2405.04096v1)
Abstract: State-of-the-art deep learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures usually combine a feature extractor front-end with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors recently proposed a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers; this has proven to be an effective way to select the most relevant features the front-end captures from the speech signal. In this paper we report excellent experimental results obtained by adapting this architecture to other speaker characterization tasks, namely emotion recognition, sex classification, and COVID-19 detection.
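To make the pooling mechanism concrete, below is a minimal PyTorch sketch of the double multi-head self-attention pooling idea described in the abstract: a first attention stage weights frames within each head, and a second stage weights the resulting per-head context vectors to produce one fixed-length utterance vector. All names (`DoubleMHAPooling`, `frame_scorer`, `head_scorer`), dimensions, and the scaling factor are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleMHAPooling(nn.Module):
    """Sketch of double multi-head self-attention pooling.

    Stage 1: each head attends over time and yields a context vector.
    Stage 2: a single attention over the per-head context vectors
    produces one fixed-length utterance vector.
    Layer names and sizes are assumptions, not the authors' code.
    """

    def __init__(self, feat_dim: int = 512, n_heads: int = 8):
        super().__init__()
        assert feat_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = feat_dim // n_heads
        # One scoring vector per head for the frame-level attention.
        self.frame_scorer = nn.Parameter(torch.randn(n_heads, self.head_dim))
        # Scoring vector for the head-level (second) attention.
        self.head_scorer = nn.Parameter(torch.randn(self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level features from the front-end.
        b, t, _ = x.shape
        h = x.view(b, t, self.n_heads, self.head_dim)  # split into heads
        # Frame-level attention weights per head: (batch, time, heads).
        scores = torch.einsum("bthd,hd->bth", h, self.frame_scorer)
        w = F.softmax(scores / self.head_dim ** 0.5, dim=1)
        # Per-head context vectors: (batch, heads, head_dim).
        ctx = torch.einsum("bth,bthd->bhd", w, h)
        # Head-level attention combines the head outputs: (batch, heads).
        head_scores = torch.einsum("bhd,d->bh", ctx, self.head_scorer)
        hw = F.softmax(head_scores / self.head_dim ** 0.5, dim=1)
        # Fixed-length utterance vector: (batch, head_dim).
        return torch.einsum("bh,bhd->bd", hw, ctx)


# Usage: pool variable-length frame features into fixed-length vectors.
pool = DoubleMHAPooling(feat_dim=512, n_heads=8)
frames = torch.randn(4, 300, 512)  # e.g. 300 frames from a CNN front-end
utt_vec = pool(frames)             # -> torch.Size([4, 64])
```

For the downstream characterization tasks, such a fixed-length vector would feed a small stack of fully connected layers with a task-specific output (emotion classes, sex, or COVID-19 status), the same way speaker embeddings feed a verification back-end.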
- D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
- W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 2018, pp. 74–81.
- W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5791–5795.
- Y. Jung, Y. Kim, H. Lim, Y. Choi, and H. Kim, “Spatial pyramid encoding with convex length normalization for text-independent speaker verification,” arXiv preprint arXiv:1906.08333, 2019.
- D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep Neural Network Embeddings for Text-Independent Speaker Verification,” in Proc. Interspeech 2017, 2017, pp. 999–1003.
- Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification,” in Proc. Interspeech 2018, 2018, pp. 3573–3577.
- M. India, P. Safari, and J. Hernando, “Self Multi-Head Attention for Speaker Recognition,” in Proc. Interspeech 2019, 2019, pp. 4305–4309.
- D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, “Spoken Language Recognition using X-vectors,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 2018, pp. 105–111.
- R. Pappagari, T. Wang, J. Villalba, N. Chen, and N. Dehak, “X-vectors meet emotions: A study on dependencies between emotion and speaker recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7169–7173.
- H. Meng, T. Yan, F. Yuan, and H. Wei, “Speech emotion recognition from 3D log-mel spectrograms with deep learning network,” IEEE Access, vol. 7, pp. 125868–125881, 2019.
- A. A. Alnuaim, M. Zakariah, C. Shashidhar, W. A. Hatamleh, H. Tarazi, P. K. Shukla, and R. Ratna, “Speaker gender recognition based on deep neural networks and ResNet50,” Wireless Communications and Mobile Computing, vol. 2022, pp. 1–13, Mar. 2022.
- A. Tursunov, M. Mustaqeem, J. Y. Choeh, and S. Kwon, “Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms,” Sensors, vol. 21, p. 5892, Sep. 2021.
- C. Bartz, T. Herold, H. Yang, and C. Meinel, “Language identification using deep convolutional recurrent neural networks,” in Neural Information Processing, 2017, pp. 880–889.
- D. Wang, S. Ye, X. Hu, S. Li, and X. Xu, “An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model,” in Proc. Interspeech 2021, 2021, pp. 3266–3270.
- F. Weninger, Y. Sun, J. Park, D. Willett, and P. Zhan, “Deep Learning Based Mandarin Accent Identification for Accent Robust ASR,” in Proc. Interspeech 2019, 2019, pp. 510–514.
- N. Cummins, A. Baird, and B. W. Schuller, “Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning,” Methods, vol. 151, pp. 41–54, Aug. 2018.
- M. India, P. Safari, and J. Hernando, “Double multi-head attention for speaker verification,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6144–6148.
- T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM,” in Proc. Interspeech 2017, 2017, pp. 949–953.
- Y. Liu, L. He, and J. Liu, “Large margin softmax loss for speaker verification,” arXiv preprint arXiv:1904.03479, 2019.
- J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech 2018, 2018.
- A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Proc. Interspeech 2017, 2017, pp. 2616–2620.
- D. Aromí, “Predicting emotion in speech: a deep learning approach using attention mechanisms,” Univ. Politècnica de Catalunya, B.S. Degree Thesis, 2021.
- D. Garriga, “Deep learning for speaker characterization,” Univ. Politècnica de Catalunya, B.S. Degree Thesis, 2022.
- D. Marchan, “Diseño e implementación de un sistema de deep learning para la detección de covid por la tos con aumento de datos” [Design and implementation of a deep learning system for COVID detection from cough with data augmentation], Univ. Politècnica de Catalunya, B.S. Degree Thesis, 2022.
- R. Cowie and R. R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech Communication, vol. 40, no. 1, pp. 5–32, Apr. 2003.
- M. M. H. El Ayadi, M. S. Kamel, and F. Karray, “Speech emotion recognition using Gaussian mixture vector autoregressive models,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, vol. 4, 2007, pp. IV–957–IV–960.
- A. Nogueiras, J. B. Mariño, A. Bonafonte, and A. Moreno, “Speech emotion recognition using hidden Markov models,” in EUROSPEECH 2001 - SCANDINAVIA - 7th European Conference on Speech Communication and Technology, 2001, pp. 2679–2682.
- R. Xia and Y. Liu, “Using i-vector space model for emotion recognition,” in Proc. Interspeech 2012, 2012, pp. 2230–2233.
- J. Kim, G. Englebienne, K. P. Truong, and V. Evers, “Towards Speech Emotion Recognition “in the Wild” Using Aggregated Corpora and Deep Multi-Task Learning,” in Proc. Interspeech 2017, 2017, pp. 1113–1117.
- Y. Li, T. Zhao, and T. Kawahara, “Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning,” in Proc. Interspeech 2019, 2019, pp. 2803–2807.
- V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, and A. Nogueiras, “Interface databases: Design and collection of a multilingual emotional speech database,” in Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), May 2002.
- M. Buyukyilmaz and A. O. Cibikdiken, “Voice gender recognition using deep learning,” in Proceedings of 2016 International Conference on Modeling, Simulation and Optimization Technologies and Applications (MSOTA2016), 2016, pp. 409–411.
- F. Ertam, “An effective gender recognition approach using voice data via deeper lstm networks,” Applied Acoustics, vol. 156, pp. 351–358, Aug. 2019.
- “Mozilla Catalan Common Voice dataset,” Jul. 2022. [Online]. Available: https://commonvoice.mozilla.org/ca
- “Aina: La nostra llengua és la teva veu” [Aina: Our language is your voice], Jul. 2022. [Online]. Available: https://www.projecteaina.cat
- M. A. Nessiem, M. M. Mohamed, H. Coppock, A. Gaskell, and B. W. Schuller, “Detecting COVID-19 from breathing and coughing sounds using deep neural networks,” in 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), Jun. 2021, pp. 183–188.
- A. B. Nassif, I. Shahin, M. Bader, A. Hassan, and N. Werghi, “COVID-19 detection systems using deep-learning algorithms based on speech and image data,” Mathematics, vol. 10, no. 4, p. 564, 2022.
- B. Schuller, A. Batliner, C. Bergler, and C. Mascolo, “The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates,” in Proc. Interspeech 2021, 2021.
Authors: Federico Costa, Miquel India, Javier Hernando