Crowdsourced and Automatic Speech Prominence Estimation (2310.08464v2)
Abstract: The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis as well as for training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation, as well as how neural prominence estimation improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.