Acoustic models of Brazilian Portuguese Speech based on Neural Transformers (2312.09265v1)
Abstract: An acoustic model trained on a large amount of unlabeled data provides a self-supervised speech representation that is useful for solving downstream tasks, possibly after fine-tuning the model on each task. In this work, we build an acoustic model of Brazilian Portuguese speech using a Transformer neural network. The model was pretrained on more than 800 hours of Brazilian Portuguese speech, using a combination of pretraining techniques. Using a labeled dataset collected for the detection of respiratory insufficiency in Brazilian Portuguese speakers, we fine-tune the pretrained Transformer on three tasks: respiratory insufficiency detection, gender recognition, and age group classification. We compare the performance of pretrained Transformers on these tasks with that of Transformers trained without pretraining and observe a significant improvement. In particular, respiratory insufficiency detection achieves the best results reported so far, which suggests that this kind of acoustic model is a promising tool for speech-as-biomarker approaches. Moreover, gender recognition performance is comparable to that of state-of-the-art models for English.