Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice (2311.15582v1)
Abstract: The Consensus Auditory-Perceptual Evaluation of Voice is a widely employed tool in clinical voice quality assessment that is important for streamlining communication among clinical professionals and for benchmarking further treatment decisions. Because the assessment currently relies on experienced clinicians, it tends to be inconsistent and is therefore difficult to standardize. To address this problem, we propose to leverage lightly weighted automatic audio parameter extraction to increase the clinical relevance, reduce the complexity, and enhance the interpretability of voice quality assessment. The proposed method utilizes age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing. A classical machine learning approach is employed. The results reveal that our approach performs similarly to state-of-the-art (SOTA) methods and outperforms the latent representations obtained from popular pre-trained audio models. This approach provides insights into the feasibility of different feature extraction approaches for voice evaluation. Audio parameters such as jitter and HNR prove suitable for characterizing voice quality attributes such as roughness and strain. Conversely, pre-trained models exhibit limitations in effectively addressing noise-related scoring. This study contributes toward more comprehensive and precise voice quality evaluation by exploring diverse assessment methodologies.
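The abstract names a small, interpretable feature set (jitter, absolute jitter, shimmer, HNR, zero crossing, plus age and sex) fed to a classical learner. The sketch below is a hedged illustration of how such a pipeline could be assembled, not the authors' published implementation: the praat-parselmouth calls and pitch-range settings (75-500 Hz), the numpy zero-crossing computation, the CAPE-V severity target, and the random-forest regressor are all assumptions made for demonstration.

```python
# Minimal sketch (assumed pipeline, not the paper's exact method):
# extract jitter, absolute jitter, shimmer, HNR, and zero-crossing rate
# with praat-parselmouth, then fit a classical scikit-learn regressor.
import numpy as np
import parselmouth
from parselmouth.praat import call
from sklearn.ensemble import RandomForestRegressor


def extract_parameters(wav_path, f0min=75, f0max=500):
    """Return [jitter, absolute jitter, shimmer, HNR, zero-crossing rate]."""
    snd = parselmouth.Sound(wav_path)
    point_process = call(snd, "To PointProcess (periodic, cc)", f0min, f0max)

    # Standard Praat jitter/shimmer settings (period floor/ceiling, factors).
    jitter_local = call(point_process, "Get jitter (local)",
                        0, 0, 0.0001, 0.02, 1.3)
    jitter_abs = call(point_process, "Get jitter (local, absolute)",
                      0, 0, 0.0001, 0.02, 1.3)
    shimmer_local = call([snd, point_process], "Get shimmer (local)",
                         0, 0, 0.0001, 0.02, 1.3, 1.6)

    # Mean harmonic-to-noise ratio from a cross-correlation harmonicity object.
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, f0min, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    # Zero-crossing rate per sample, computed directly from the waveform.
    samples = snd.values[0]
    zcr = np.mean(np.abs(np.diff(np.sign(samples)))) / 2.0

    return [jitter_local, jitter_abs, shimmer_local, hnr, zcr]


def train(recordings):
    """Fit a regressor on (wav_path, age, sex, severity) tuples.

    `recordings` and the 0-100 CAPE-V severity target are hypothetical;
    sex is assumed to be already encoded numerically (e.g., 0/1).
    """
    X, y = [], []
    for wav_path, age, sex, severity in recordings:
        X.append(extract_parameters(wav_path) + [age, sex])
        y.append(severity)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.array(X), np.array(y))
    return model
```

A support vector or linear regressor could be swapped in for the random forest; the main point of the sketch is that the seven-dimensional feature vector keeps the model small and each input directly interpretable.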