BWSNet: Automatic Perceptual Assessment of Audio Signals (2309.02592v2)
Abstract: This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst Scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints that interpret trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of social attitudes in speech and of timbral qualities. For both datasets, our results show that the structure of the latent space is faithful to human judgements.
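To illustrate what "interpreting trial-wise ordinal relations as distance comparisons" could look like, here is a minimal sketch in Python with NumPy. It is not the cost function from the paper, which the abstract does not specify. It assumes one plausible reading: within a trial, the items judged best and worst should lie farther apart in the embedded space than any other pair of items from that trial. The function names `pairwise_distances` and `bws_trial_loss`, the hinge form, and the `margin` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def pairwise_distances(z):
    """Euclidean distance matrix for embeddings z of shape (k, d)."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def bws_trial_loss(z, best, worst, margin=0.1):
    """Hinge-style penalty for one BWS trial (illustrative assumption).

    z      : array of shape (k, d), embeddings of the k items in the trial
    best   : index of the item judged "best"
    worst  : index of the item judged "worst"
    margin : hypothetical slack by which the best-worst distance should
             exceed every other pairwise distance in the trial
    """
    d = pairwise_distances(z)
    d_best_worst = d[best, worst]
    k = z.shape[0]
    loss = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            if {i, j} == {best, worst}:
                continue  # skip the reference pair itself
            # Penalize any pair that is not closer than the best-worst pair.
            loss += max(0.0, margin + d[i, j] - d_best_worst)
    return loss


# Four items embedded on a line; item 3 judged best, item 0 judged worst.
z = np.array([[0.0], [1.0], [2.0], [3.0]])
consistent = bws_trial_loss(z, best=3, worst=0)
inconsistent = bws_trial_loss(z, best=1, worst=2)
```

In this toy case the first call incurs no penalty because the best and worst items sit at the two ends of the line, while the second call is penalized because the chosen pair is adjacent. In a trainable model the same comparison would be written with a differentiable framework so that gradients can reach the embedding network.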