
High-Fidelity Neural Phonetic Posteriorgrams (2402.17735v1)

Published 27 Feb 2024 in eess.AS and cs.SD

Abstract: A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.


Summary

  • The paper introduces a novel, interpretable PPG representation that enhances speech synthesis by independently controlling pitch and pronunciation.
  • It utilizes a convolutional and Transformer-based architecture alongside a new Jensen–Shannon divergence metric to optimize PPG fidelity.
  • Comprehensive evaluations, including MUSHRA tests, validate its superior performance in voice conversion, pronunciation editing, and overall speech quality.

Enhancing Speech Synthesis with High-Fidelity Neural Phonetic Posteriorgrams

Introduction to Phonetic Posteriorgrams (PPGs)

Phonetic posteriorgrams (PPGs) represent a time-varying categorical distribution over acoustic units of speech, such as phonemes. They are widely used in applications like voice conversion and pronunciation editing because they separate pronunciation features from speaker identity. This paper introduces an interpretable PPG representation that improves the quality and utility of PPGs in speech synthesis: high-quality PPGs permit independent control over pitch and pronunciation, supporting voice conversion, pronunciation interpolation, and phoneme editing.
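As a concrete illustration, a PPG can be viewed as a T x K matrix in which each row is a softmax-normalized distribution over K phoneme categories (the paper's representation uses 40 categories). The four-class logits below are invented toy values, not from the paper:

```python
import math

def softmax(logits):
    """Convert unnormalized scores to a categorical distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A PPG is a T x K matrix: one categorical distribution over K
# phoneme categories per time frame (K = 40 in the paper; 4 here).
frame_logits = [
    [2.0, 0.1, -1.0, 0.3],   # frame 0: toy scores for 4 phoneme classes
    [0.2, 1.5, 0.0, -0.5],   # frame 1
]
ppg = [softmax(f) for f in frame_logits]

# Each row sums to 1, so every frame is an interpretable distribution.
for row in ppg:
    assert abs(sum(row) - 1.0) < 1e-9
```

Because each frame is a proper probability distribution, the representation stays interpretable: reading down a column traces one phoneme's posterior probability over time.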

Network Architecture and Training

The network for inferring PPGs consists of an input convolution layer, a stack of Transformer encoder layers, and an output convolution layer that produces a distribution over 40 phoneme categories per frame. The paper compares several audio input encodings, measures the effect of each on PPG accuracy, and evaluates different network configurations to select the setting that yields the highest-quality PPGs.
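The data flow through such a network can be sketched in plain NumPy, using pointwise projections in place of convolutions and a single unparameterized attention layer standing in for the Transformer encoder stack. Apart from the 40 phoneme classes, every dimension below is an illustrative assumption, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_IN, D_MODEL, K = 100, 768, 256, 40   # frames, input dim, hidden dim, phonemes

x = rng.standard_normal((T, D_IN))        # per-frame audio features (assumed shape)

# 1. Input layer, shown as a pointwise projection for brevity.
W_in = rng.standard_normal((D_IN, D_MODEL)) * 0.02
h = x @ W_in

# 2. One self-attention layer standing in for the encoder stack.
Wq, Wk, Wv = (rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(3))
q, k, v = h @ Wq, h @ Wk, h @ Wv
scores = q @ k.T / np.sqrt(D_MODEL)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
h = h + attn @ v                          # residual connection

# 3. Output projection to 40 phoneme classes; softmax per frame -> PPG.
W_out = rng.standard_normal((D_MODEL, K)) * 0.02
logits = h @ W_out
ppg = np.exp(logits - logits.max(axis=-1, keepdims=True))
ppg /= ppg.sum(axis=-1, keepdims=True)

assert ppg.shape == (T, K)                # one 40-way distribution per frame
```

The point of the sketch is the shape discipline: whatever the input encoding, the model maps T frames of features to a T x 40 matrix of per-frame phoneme distributions.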

Comprehensive Evaluation

Objective Evaluation

The paper presents a comprehensive three-part evaluation framework to assess:

  • The accuracy of PPGs against various audio input representations.
  • The efficacy in disentangling pitch and pronunciation.
  • The subjective quality of synthesized speech from the proposed PPGs.

Each part of the evaluation isolates one aspect of PPG performance and its effect on downstream synthesis. Notably, the paper introduces an interpretable pronunciation distance based on the Jensen-Shannon divergence between PPGs, providing a principled way to measure pronunciation accuracy and disentanglement.
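A minimal sketch of such a divergence-based distance, assuming base-2 logarithms (which bound the Jensen-Shannon divergence to [0, 1]) and simple frame-wise averaging; the paper's exact alignment and aggregation scheme may differ:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def pronunciation_distance(ppg_a, ppg_b):
    """Average frame-wise JSD between two equal-length PPGs
    (frame averaging is an assumption of this sketch)."""
    assert len(ppg_a) == len(ppg_b)
    return sum(js_divergence(p, q) for p, q in zip(ppg_a, ppg_b)) / len(ppg_a)

# Identical PPGs are at distance zero.
same = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
assert pronunciation_distance(same, same) == 0.0

# Frames with disjoint support are maximally distant (1.0 in bits).
a = [[1.0, 0.0], [1.0, 0.0]]
b = [[0.0, 1.0], [0.0, 1.0]]
assert abs(pronunciation_distance(a, b) - 1.0) < 1e-9
```

Because the distance operates directly on phoneme posteriors rather than on text, it naturally supports language-agnostic pronunciation comparison.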

Subjective Evaluation

Beyond objective measures, the research employs a MUSHRA-type listening test to compare the quality of speech synthesized from different PPG representations. Pairing objective and subjective evaluation gives a fuller picture of how each representation performs in realistic use.

Novel Contributions and Future Directions

Contributions

This paper makes three key contributions to the field:

  1. An interpretable PPG representation that offers competitive pitch modification and superior interpretability.
  2. A novel metric for measuring pronunciation distance, fostering language-agnostic pronunciation analysis.
  3. Demonstrated capability of interpretable PPGs in enabling fine-grained control over pronunciation, opening new possibilities in speech synthesis and editing.
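Fine-grained pronunciation control can be framed as operations on the per-frame distributions: because the probability simplex is convex, a linear blend of two frames is itself a valid distribution, and a chosen phoneme's posterior can be boosted and renormalized. The blending scheme and toy frames below are illustrative; the paper's exact editing operations may differ:

```python
def lerp_frames(p, q, t):
    """Blend two phoneme distributions; valid for any t in [0, 1]."""
    return [(1 - t) * pi + t * qi for pi, qi in zip(p, q)]

def boost_phoneme(frame, idx, factor):
    """Scale one phoneme's probability, then renormalize the frame."""
    edited = list(frame)
    edited[idx] *= factor
    total = sum(edited)
    return [x / total for x in edited]

pat = [0.8, 0.1, 0.1]   # toy frame: mass mostly on phoneme 0
bat = [0.1, 0.8, 0.1]   # toy frame: mass mostly on phoneme 1

# Interpolation between frames stays on the simplex.
halfway = lerp_frames(pat, bat, 0.5)
assert abs(sum(halfway) - 1.0) < 1e-9
assert lerp_frames(pat, bat, 0.0) == pat   # endpoints are recovered

# Boosting phoneme 2 raises its posterior while keeping a valid frame.
emphasized = boost_phoneme(pat, 2, 4.0)
assert abs(sum(emphasized) - 1.0) < 1e-9
assert emphasized[2] > pat[2]
```

Edited frames like these can then be fed to the PPG-conditioned synthesizer, which is what makes fine-grained pronunciation editing possible without modifying text or audio directly.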

Future Directions

The paper also outlines directions for future research, particularly applications such as accent coaching and mispronunciation detection. More broadly, the results open paths for AI-driven audio editing and for linguistic analysis built on interpretable pronunciation representations.

Conclusion

The paper presents an interpretable PPG representation that improves speech synthesis quality and enables independent control over pitch and pronunciation. Its evaluations and proposed pronunciation distance lay a foundation for future work in speech synthesis and editing, and the open-source release of the code and PPG representations supports wider adoption and exploration in the research community.
