Prosody Analysis of Audiobooks (2310.06930v2)
Abstract: Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for predicting prosody properties (pitch, volume, and rate of speech) from narrative text using language modeling. Our predicted prosody attributes correlate much better with human audiobook readings than results from a state-of-the-art commercial TTS system: our predicted pitch shows a higher correlation with human reading for 22 out of the 24 books, while our predicted volume attribute proves more similar to human reading for 23 out of the 24 books. Finally, we present a human evaluation study to quantify the extent to which people prefer prosody-enhanced audiobook readings over commercial text-to-speech systems.
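The per-book comparison described in the abstract can be illustrated with a minimal sketch. It assumes each book is represented by per-segment values of one prosody attribute (for example pitch) for the human reading, the model prediction, and the commercial TTS output, and it assumes Pearson correlation as the measure; the abstract does not name the correlation statistic or the segmentation. The names `pearson`, `count_wins`, and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def count_wins(books):
    """Count books where the model correlates better with the human
    reading than the baseline does.

    `books` maps a book name to a (human, model, baseline) triple of
    per-segment attribute values (an assumed data layout).
    """
    wins = 0
    for human, model, baseline in books.values():
        if pearson(model, human) > pearson(baseline, human):
            wins += 1
    return wins


# Toy data for illustration only; these are not results from the paper.
books = {
    "book_a": ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], [2.0, 2.0, 2.1, 1.9]),
    "book_b": ([0.5, 0.1, 0.9, 0.4], [0.4, 0.2, 0.8, 0.5], [0.5, 0.1, 0.9, 0.4]),
}
wins = count_wins(books)
```

Under these assumptions, a statement such as "higher correlation for 22 out of the 24 books" would correspond to `count_wins` returning 22 over 24 test books.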
Authors: Charuta Pethe, Yunting Yin, Steven Skiena, Bach Pham, Felix D Childress