High-Fidelity Neural Phonetic Posteriorgrams (2402.17735v1)
Abstract: A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.
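To make the definition concrete, the following is a minimal sketch of a PPG as a matrix of per-frame categorical distributions, together with one possible frame-wise pronunciation distance. The abstract does not specify how the distance is computed, so the use of Jensen-Shannon divergence here is an assumption, as are the function names `logits_to_ppg` and `framewise_js_distance` and the toy dimensions.

```python
import numpy as np

def logits_to_ppg(logits):
    """Turn (frames, phonemes) logits into a PPG with a per-frame softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def framewise_js_distance(ppg_a, ppg_b, eps=1e-12):
    """Mean per-frame Jensen-Shannon divergence (base 2) between two
    equal-length PPGs. Assumed metric, not taken from the abstract."""
    mean = 0.5 * (ppg_a + ppg_b)

    def kl(p, q):
        # Per-frame KL divergence; eps guards against log(0).
        return np.sum(p * np.log2((p + eps) / (q + eps)), axis=1)

    js = 0.5 * kl(ppg_a, mean) + 0.5 * kl(ppg_b, mean)
    return float(js.mean())

# Toy example: 5 frames over 4 hypothetical phoneme classes.
rng = np.random.default_rng(0)
ppg = logits_to_ppg(rng.normal(size=(5, 4)))
other = logits_to_ppg(rng.normal(size=(5, 4)))
distance = framewise_js_distance(ppg, other)
```

Each row of `ppg` sums to one, so every frame is a valid distribution over phonemes. With base-2 logarithms the distance lies between 0 (identical PPGs) and 1 (frames that place all mass on different phonemes). Comparing utterances of different lengths would additionally require a time alignment, which this sketch omits.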