What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis (2107.00439v3)
Abstract: Deep neural networks are inherently opaque and challenging to interpret. Unlike models built on hand-crafted features, it is difficult to comprehend which concepts these networks learn and how they interact. This understanding is crucial not only for debugging but also for ensuring fairness in ethical decision-making. In this study, we conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework [1]. Specifically, we analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification. We perform layer-wise and neuron-wise analyses, probing for speaker, language, and channel properties. Our study aims to answer the following questions: i) what information is captured within the representations? ii) how is it represented and distributed? and iii) can we identify a minimal subset of the network that possesses this information? Our results reveal several novel findings: i) channel and gender information are distributed across the network; ii) the information is redundantly available in neurons with respect to a task; iii) complex properties such as dialectal information are encoded only in task-oriented pretrained networks; iv) such information is localised in the upper layers; v) we can extract a minimal subset of neurons encoding a pre-defined property; vi) salient neurons are sometimes shared between properties; vii) our analysis highlights the presence of biases (for example, gender) in the network. Our cross-architectural comparison indicates that: i) the pretrained models capture speaker-invariant information, and ii) CNN models are competitive with Transformer models in encoding various understudied properties.
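To make the probing methodology concrete, the sketch below shows how a layer-wise probe with neuron selection might be implemented: a linear classifier with an elastic-net penalty (in the spirit of Zou and Hastie's regularizer, cited below) is trained on utterance-level representations, and neurons are ranked by absolute probe weight to extract a salient subset. The synthetic activations, layer width, binary label, and top-k heuristic are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of a layer-wise probing classifier with neuron selection.
# Illustrative assumptions (not from the paper): synthetic activations stand
# in for real utterance-level representations, the probed property is binary
# (e.g., gender), and the top-k weight-ranking heuristic is one common way
# to pick salient neurons, not necessarily the authors' exact procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utts, n_neurons = 2000, 512                 # hypothetical layer width
X = rng.normal(size=(n_utts, n_neurons))      # stand-in for layer activations
y = (X[:, :10].sum(axis=1) > 0).astype(int)   # property carried by few neurons

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear probe with an elastic-net penalty: the L1 term encourages sparse
# weights, which helps isolate a minimal subset of property-encoding neurons.
probe = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.1, max_iter=2000)
probe.fit(X_tr, y_tr)
print("full-layer probe accuracy:", probe.score(X_te, y_te))

# Rank neurons by absolute probe weight, keep the top-k as the salient
# subset, and re-probe on that subset to check it retains the property.
k = 20
top = np.argsort(-np.abs(probe.coef_[0]))[:k]
sub_probe = LogisticRegression(max_iter=1000)
sub_probe.fit(X_tr[:, top], y_tr)
print(f"top-{k} neuron probe accuracy:", sub_probe.score(X_te[:, top], y_te))
```

In practice, the same probe would be trained separately on each layer's representations from the real pretrained model (for example, wav2vec 2.0 hidden states), and the accuracy profile across layers would indicate where speaker, language, or channel information is localised.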
- Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure, arXiv preprint arXiv:1711.10203 (2018).
- Deep learning for computer vision: A brief review, Computational intelligence and neuroscience (2018).
- Deep speech 2: End-to-end speech recognition in English and Mandarin, in: International conference on machine learning, 2016, pp. 173–182.
- EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding, in: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, 2015, pp. 167–174.
- Wav2letter++: A fast open-source speech recognition system, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 6460–6464.
- Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4960–4964.
- Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR, Proc. Interspeech 2021 (2021).
- Arabic Code-Switching Speech Recognition using Monolingual Data, Proc. Interspeech 2021 (2021).
- wav2vec 2.0: A framework for self-supervised learning of speech representations, arXiv preprint arXiv:2006.11477 (2020).
- Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 6419–6423.
- TERA: Self-supervised learning of transformer encoder representation for speech, arXiv preprint arXiv:2007.06028 (2020).
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation, arXiv preprint arXiv:2005.08575 (2020).
- End-to-end language identification using high-order utterance representation with bilinear pooling, International Speech Communication Association (2017).
- Deep Language: A comprehensive deep learning approach to end-to-end language recognition, in: Odyssey, 2016, pp. 109–116.
- End-to-end text-dependent speaker verification, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 5115–5119.
- VoxCeleb: A Large-Scale Speaker Identification Dataset, Proc. Interspeech 2017 (2017) 2616–2620.
- Deep Neural Network Embeddings for Text-Independent Speaker Verification., in: Interspeech, 2017, pp. 999–1003.
- Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition, in: Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 98–104.
- X-vectors: Robust DNN embeddings for speaker recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5329–5333.
- F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608 (2017).
- Z. C. Lipton, The mythos of model interpretability, Queue 16 (2018) 31–57.
- What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 6309–6317.
- What do Neural Machine Translation Models Learn about Morphology?, in: ACL (1), 2017.
- Understanding and improving morphological learning in the neural machine translation decoder, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 142–151.
- Q.-s. Zhang, S.-C. Zhu, Visual interpretability for deep learning: a survey, Frontiers of Information Technology & Electronic Engineering 19 (2018) 27–39.
- M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.
- AutoAblation: Automated Parallel Ablation Studies for Deep Learning, in: Proceedings of the 1st Workshop on Machine Learning and Systems, 2021, pp. 55–61.
- Identifying and Controlling Important Neurons in Neural Machine Translation, in: International Conference on Learning Representations (ICLR), 2019.
- D. Harwath, J. Glass, Learning word-like units from joint audio-visual analysis, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 506–517.
- Discovering latent concepts learned in BERT, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=POTMtpYI1xH.
- On the transformation of latent space in fine-tuned NLP models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Abu Dhabi, UAE, 2022, pp. 1495–1516. URL: https://aclanthology.org/2022.emnlp-main.97.
- The 2013 speaker recognition evaluation in mobile environment, in: 2013 International Conference on Biometrics (ICB), 2013, pp. 1–8. doi:10.1109/ICB.2013.6613025.
- G. Beguš, A. Zhou, Interpreting intermediate convolutional layers of CNNs trained on raw speech, arXiv e-prints (2021) arXiv–2104.
- Analyzing Learned Representations of a Deep ASR Performance Prediction Model, in: BlackboxNLP Workshop, EMNLP 2018, 2018.
- What does an End-to-End Dialect Identification Model Learn about Non-dialectal Information?, Proc. Interspeech 2020 (2020) 462–466.
- What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure, arXiv preprint arXiv:2101.00387 (2021).
- What does the speaker embedding encode?, in: Interspeech, 2017, pp. 1497–1501.
- Interpreting and explaining deep neural networks for classification of audio signals, arXiv preprint arXiv:1807.03418 (2018).
- H. Ghader, C. Monz, What does Attention in Neural Machine Translation Pay Attention to?, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 30–39.
- Analyzing Individual Neurons in Pre-trained Language Models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4865–4880.
- Analyzing linguistic knowledge in sequential model of sentence, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 826–835.
- Region-Based Convolutional Networks for Accurate Object Detection and Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2016) 142–158. doi:10.1109/TPAMI.2015.2437384.
- Synthesizing the preferred inputs for neurons in neural networks via deep generator networks, Advances in neural information processing systems 29 (2016) 3387–3395.
- Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6541–6549.
- Why neural translations are the right length, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2278–2282.
- TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP, in: R. P. Adams, V. Gogate (Eds.), Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, AUAI Press, 2020, p. 197.
- J. Frankle, M. Carbin, The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, in: International Conference on Learning Representations, 2018.
- Unsupervised pre-training of bidirectional speech encoders via masked reconstruction, arXiv preprint arXiv:2001.10603 (2020).
- Automatic identification of gender from speech, in: Proceedings of Speech Prosody, 2016, pp. 84–88.
- Are Pre-trained Convolutions Better than Pre-trained Transformers?, arXiv preprint arXiv:2105.03322 (2021).
- Similarity analysis of self-supervised speech representations, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3040–3044.
- Autoregressive predictive coding: A comprehensive study, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1380–1390.
- Comparative layer-wise analysis of self-supervised speech models, arXiv preprint arXiv:2211.03929 (2022).
- Dissecting Contextual Word Embeddings: Architecture and Representation, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1499–1509.
- Does string-based neural MT learn source syntax?, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1526–1534.
- Deep RNNs Encode Soft Hierarchical Syntax, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 14–19.
- NeuroX: A toolkit for analyzing individual neurons in neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 9851–9852.
- Context-Aware Neural Machine Translation Learns Anaphora Resolution, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1264–1274.
- LSTMs Exploit Linguistic Attributes of Data, in: Proceedings of The Third Workshop on Representation Learning for NLP, 2018, pp. 180–186.
- Assessing the ability of LSTMs to learn syntax-sensitive dependencies, Transactions of the Association for Computational Linguistics 4 (2016) 521–535.
- S. A. Chowdhury, R. Zamparelli, RNN simulations of grammaticality judgments on long-distance dependencies, in: Proceedings of the 27th international conference on computational linguistics, 2018, pp. 133–144.
- Colorless Green Recurrent Networks Dream Hierarchically, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1195–1205.
- Probing for semantic evidence of composition by means of simple classification tasks, in: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, 2016, pp. 134–139.
- What you can cram into a single vector: Probing sentence embeddings for linguistic properties, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
- P. Merlo, Probing word and sentence embeddings for long-distance dependencies effects in French and English, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 158–172.
- Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49–72.
- Neuron-level interpretation of deep NLP models: A survey, Transactions of the Association for Computational Linguistics 10 (2022) 1285–1303. URL: https://aclanthology.org/2022.tacl-1.74. doi:10.1162/tacl_a_00519.
- Representations of language in a model of visually grounded speech signal, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 613–622.
- Exploring how deep neural networks form phonemic categories, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- On the Role of Nonlinear Transformations in Deep Neural Network Acoustic Models., in: Interspeech, 2016, pp. 803–807.
- Learning Weakly Supervised Multimodal Phoneme Embeddings, Proc. Interspeech 2017 (2017) 2218–2222.
- Do RNN States Encode Abstract Phonological Processes?, arXiv preprint arXiv:2104.00789 (2021).
- Probing speech emotion recognition transformers for linguistic knowledge, arXiv preprint arXiv:2204.00400 (2022).
- Z. Wu, S. King, Investigating gated recurrent networks for speech synthesis, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 5140–5144.
- Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries, arXiv preprint arXiv:1703.07588 (2017).
- Object detectors emerge in deep scene CNNs, arXiv preprint arXiv:1412.6856 (2014).
- GAN dissection: Visualizing and understanding generative adversarial networks, arXiv preprint arXiv:1811.10597 (2018).
- H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2005) 301–320.
- J. Hewitt, P. Liang, Designing and Interpreting Probes with Control Tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2733–2743.
- Information-theoretic probing for linguistic structure, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4609–4622.
- E. Voita, I. Titov, Information-theoretic probing with minimum description length, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2020.
- Analyzing redundancy in pretrained transformer models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4908–4926.
- Connecting Arabs: bridging the gap in dialectal speech recognition, Communications of the ACM 64 (2021) 124–129.
- ADI17: A Fine-Grained Arabic Dialect Identification Dataset, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8244–8248.
- The MGB-5 Challenge: Recognition and Dialect Identification of Dialectal Arabic Speech, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2019, pp. 1026–1033.
- Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model, in: 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2018, pp. 1007–1013.
- VoxCeleb2: Deep speaker recognition, arXiv preprint arXiv:1806.05622 (2018).
- Common Voice: A Massively-Multilingual Speech Corpus, arXiv preprint arXiv:1912.06670 (2019).
- Speech recognition challenge in the wild: Arabic MGB-3, in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2017, pp. 316–322.
- Translations of the CALLHOME Egyptian Arabic corpus for conversational speech translation, in: Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2014.
- Multilingual speech recognition: The 1996 Byblos CALLHOME system, in: Fifth European Conference on Speech Communication and Technology, 1997.
- A survey on bias and fairness in machine learning, ACM Computing Surveys (CSUR) 54 (2021) 1–35.
- On the effect of dropping layers of pre-trained transformer models, Computer Speech and Language 77 (2023) 101429. URL: https://www.sciencedirect.com/science/article/pii/S0885230822000596. doi:10.1016/j.csl.2022.101429.
- To tune or not to tune? Adapting pretrained representations to diverse tasks, in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Association for Computational Linguistics, Florence, Italy, 2019, pp. 7–14.
- S. Hameed, Filter-Wrapper Combination and Embedded Feature Selection for Gene Expression Data, International Journal of Advances in Soft Computing and its Applications 10 (2018) 90–105.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353–355.
Authors: Shammur Absar Chowdhury, Nadir Durrani, Ahmed Ali