Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems (2402.19443v1)
Abstract: Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has thus benefited from these scientific and technological advances, particularly for acoustic modeling, which now integrates deep neural network architectures. However, these performance gains have come with increased complexity in the information learned and conveyed through these black-box architectures. Following extensive work on neural network interpretability, we propose in this article a protocol that aims to determine which information is encoded in an ASR acoustic model (AM) and where it is located. To do so, we evaluate AM performance on a defined set of probing tasks that use intermediate representations (here, extracted at different layer levels) as input. From the performance variations across these targeted tasks, we can formulate hypotheses about which information is enhanced or attenuated at different stages of the architecture. Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification. The analysis shows that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment, or speaker identity. Overall, the lower hidden layers appear useful for structuring the information, while the upper ones tend to discard information that is not useful for phoneme recognition.
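To make the probing protocol concrete, here is a minimal sketch of the idea described in the abstract: extract the activations of a chosen hidden layer of the acoustic model and train a simple classifier (probe) on them for a downstream task such as gender or emotion classification. The model wrapper, the `layers` attribute, and the probe design are illustrative assumptions, not the authors' actual TDNN-F/Kaldi pipeline.

```python
import torch
import torch.nn as nn

def extract_layer(acoustic_model, features, layer_index):
    """Return the activations of one hidden layer for a batch of input features.

    `acoustic_model` is assumed to expose its hidden layers as `acoustic_model.layers`
    (a hypothetical interface used here only for illustration).
    """
    activations = []
    hook = acoustic_model.layers[layer_index].register_forward_hook(
        lambda module, inp, out: activations.append(out.detach())
    )
    with torch.no_grad():
        acoustic_model(features)
    hook.remove()
    return activations[0]  # shape assumed: (batch, time, hidden_dim)

class LinearProbe(nn.Module):
    """Simple probe trained on frozen intermediate representations.

    One probe is trained per layer and per task (speaker, gender, emotion, ...);
    comparing their accuracies across layers suggests where each kind of
    information is enhanced or attenuated in the acoustic model.
    """
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden):
        # Mean-pool over time to obtain an utterance-level embedding.
        return self.classifier(hidden.mean(dim=1))
```

Training each probe with a standard cross-entropy loss while keeping the acoustic model frozen isolates the contribution of the layer itself, so performance differences can be attributed to the representation rather than to the probe.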