
MeWEHV: Mel and Wave Embeddings for Human Voice Tasks (2209.14078v2)

Published 28 Sep 2022 in cs.SD and eess.AS

Abstract: A recent trend in speech processing is the use of embeddings created by machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated by a pre-trained raw audio waveform encoder model with deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate the performance of MeWEHV on three tasks: speaker, language, and accent identification. For speaker identification, we use the VoxCeleb1 dataset and present YouSpeakers204, a new, publicly available dataset for English speaker identification containing 19,607 audio clips from 204 speakers across six accents. Its balanced composition allows other researchers to build new models that are robust to multiple accents. For language identification, we use the VoxForge and Common Language datasets. Finally, for accent identification, we use the Latin American Spanish Corpora (LASC) and Common Voice datasets. Our approach yields a significant increase in the performance of state-of-the-art models on all the tested datasets, with a low additional computational cost.
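The core of the pipeline described above is a two-branch fusion: one branch reuses frame-level embeddings from a pre-trained raw-waveform encoder, the other extracts deep features from MFCCs with a CNN, and the two are combined into a single utterance-level embedding. A minimal sketch of that fusion idea, assuming random arrays stand in for both branches and mean+std pooling is used to obtain fixed-size vectors (the paper's exact CNN architecture and fusion layer may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two branches:
# (1) frame-level embeddings from a pre-trained raw-waveform encoder
#     (e.g. a wav2vec 2.0-style model): (frames, dim)
wave_emb = rng.standard_normal((49, 768))
# (2) deep features a CNN might extract from the MFCCs of the same
#     utterance (MFCC frames are typically denser): (frames, dim)
mfcc_feat = rng.standard_normal((98, 64))

def pool(frames: np.ndarray) -> np.ndarray:
    """Mean + std pooling over time -> fixed-size vector of length 2*dim."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Fuse the two branches by concatenating their pooled vectors into the
# final utterance embedding fed to a task-specific classifier head.
mewehv_emb = np.concatenate([pool(wave_emb), pool(mfcc_feat)])
print(mewehv_emb.shape)  # (2*768 + 2*64,) = (1664,)
```

Because the waveform encoder is frozen and reused, only the small MFCC-CNN branch and the classifier head need training, which is consistent with the low additional computational cost the abstract reports.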

Authors (4)
  1. Andrés Carofilis (14 papers)
  2. Laura Fernández-Robles (3 papers)
  3. Enrique Alegre (11 papers)
  4. Eduardo Fidalgo (10 papers)
Citations (1)