High-Fidelity Neural Phonetic Posteriorgrams
Abstract: A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.
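To make the definition concrete, the following is a minimal sketch of a PPG as a per-frame categorical distribution, together with one possible pronunciation distance between two time-aligned PPGs. The abstract does not state which distance is used, so the averaged framewise Jensen-Shannon divergence here is an assumption chosen for illustration. The function names, the frame count, and the 40-unit phoneme inventory are likewise hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn one frame of phoneme logits into a categorical distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; lies in [0, 1] for base-2 logs."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [0.5 * (x + y) for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def pronunciation_distance(ppg_a, ppg_b):
    """Average framewise divergence between two time-aligned PPGs.

    Assumption: both PPGs have the same number of frames and the same
    phoneme inventory. Real utterances would need alignment first.
    """
    if not ppg_a or len(ppg_a) != len(ppg_b):
        raise ValueError("PPGs must be non-empty and have equal frame counts")
    total = sum(js_divergence(p, q) for p, q in zip(ppg_a, ppg_b))
    return total / len(ppg_a)


# Hypothetical data: 100 frames over a 40-unit phoneme inventory.
random.seed(0)
FRAMES, PHONEMES = 100, 40
ppg_a = [softmax([random.gauss(0, 1) for _ in range(PHONEMES)]) for _ in range(FRAMES)]
ppg_b = [softmax([random.gauss(0, 1) for _ in range(PHONEMES)]) for _ in range(FRAMES)]

distance = pronunciation_distance(ppg_a, ppg_b)
```

Each row of `ppg_a` sums to one, which is what makes the representation interpretable: every entry is the probability of a specific phoneme at that frame. A distance of zero means the two PPGs agree at every frame, and a distance of one means they never place probability on the same phoneme.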