High-Fidelity Neural Phonetic Posteriorgrams

Published 27 Feb 2024 in eess.AS and cs.SD | (2402.17735v1)

Abstract: A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.

Summary

The paper introduces a novel, interpretable PPG representation that enhances speech synthesis by independently controlling pitch and pronunciation.
It uses a convolutional and Transformer-based architecture to infer PPGs and introduces a pronunciation distance based on the Jensen–Shannon divergence between PPGs.
Objective evaluations and a MUSHRA-type listening test assess PPG accuracy, pitch and pronunciation disentanglement, and the quality of synthesized speech.

Enhancing Speech Synthesis with High-Fidelity Neural Phonetic Posteriorgrams

Introduction to Phonetic Posteriorgrams (PPGs)

A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech, such as phonemes. PPGs are widely used in voice conversion and pronunciation editing because they separate pronunciation features from speaker identity. This paper introduces an interpretable PPG representation with improved quality. A speech synthesizer trained on these PPGs gains independent control over pitch and pronunciation, which supports voice conversion, pronunciation interpolation, and phoneme editing.
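
To make the definition concrete, the following minimal sketch builds a toy PPG by applying a softmax to each frame of unnormalized scores. The three-phoneme inventory, the logit values, and the function names are illustrative assumptions, not taken from the paper or its released package.

```python
import math


def softmax(logits):
    """Convert one frame of unnormalized scores into a categorical distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def posteriorgram(frame_logits):
    """Map per-frame logits (frames x phonemes) to a PPG of the same shape."""
    return [softmax(frame) for frame in frame_logits]


# Hypothetical three-phoneme inventory and two frames of made-up logits
phonemes = ["aa", "iy", "s"]
logits = [[2.0, 0.5, -1.0], [-0.5, 0.0, 3.0]]

ppg = posteriorgram(logits)

# Each frame is a distribution, so the most likely phoneme can be read off directly
most_likely = [phonemes[frame.index(max(frame))] for frame in ppg]
```

Each row of `ppg` sums to one, which is what makes the representation interpretable: every entry is the estimated probability of one phoneme at one frame.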

Network Architecture and Training

The network that infers PPGs consists of an input convolution layer, a stack of Transformer encoder layers, and an output convolution layer that produces scores over 40 phoneme categories. The paper compares several audio input encodings and reports how each affects PPG accuracy. It also evaluates different network configurations to select the best-performing setting.
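
The layer ordering described above can be sketched in PyTorch as follows. Only the overall structure (convolution, Transformer encoder layers, output convolution over 40 phonemes) comes from the summary. The class name, input feature size, hidden width, kernel size, number of layers, and number of attention heads are all placeholder assumptions, and the sketch omits details such as positional encoding and padding masks.

```python
import torch
from torch import nn


class PPGNetwork(nn.Module):
    """Hypothetical sketch: convolution -> Transformer encoder -> convolution."""

    def __init__(
        self,
        input_channels=80,    # assumed feature size (e.g., mel bands)
        hidden_channels=256,  # assumed hidden width
        num_layers=5,         # assumed number of encoder layers
        num_heads=2,          # assumed number of attention heads
        kernel_size=5,        # assumed convolution kernel size
        num_phonemes=40,      # phoneme categories, as stated in the summary
    ):
        super().__init__()
        padding = kernel_size // 2  # keeps the number of frames unchanged
        self.input_conv = nn.Conv1d(
            input_channels, hidden_channels, kernel_size, padding=padding)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_channels,
            nhead=num_heads,
            dim_feedforward=4 * hidden_channels,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output_conv = nn.Conv1d(
            hidden_channels, num_phonemes, kernel_size, padding=padding)

    def forward(self, features):
        # features: (batch, input_channels, frames)
        hidden = self.input_conv(features)
        # The encoder expects (batch, frames, channels) when batch_first=True
        hidden = self.encoder(hidden.transpose(1, 2)).transpose(1, 2)
        # Per-frame phoneme logits: (batch, num_phonemes, frames)
        return self.output_conv(hidden)


model = PPGNetwork()
logits = model(torch.randn(2, 80, 50))
ppg = torch.softmax(logits, dim=1)  # normalize over the phoneme axis
```

Applying the softmax over the phoneme axis turns the logits into one categorical distribution per frame, which is the PPG itself.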

Comprehensive Evaluation

Objective Evaluation

The paper presents a comprehensive three-part evaluation framework to assess:

The accuracy of PPGs against various audio input representations.
The efficacy in disentangling pitch and pronunciation.
The subjective quality of synthesized speech from the proposed PPGs.

Each part targets a distinct aspect of PPG performance and its effect on speech synthesis. Notably, the paper introduces an interpretable pronunciation distance based on the Jensen–Shannon divergence between PPGs, which it uses to measure pronunciation accuracy and disentanglement.
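
As a rough illustration of a Jensen–Shannon-based pronunciation distance, the sketch below computes the divergence frame by frame and averages the result over time. The averaging step, the use of base-2 logarithms, and the function names are assumptions made for this example. The paper's exact formulation may differ.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two categorical distributions."""
    # The mixture is nonzero wherever p or q is nonzero, so the KL terms are finite
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def pronunciation_distance(ppg_a, ppg_b):
    """Average framewise JS divergence between two time-aligned PPGs."""
    if not ppg_a or len(ppg_a) != len(ppg_b):
        raise ValueError("PPGs must be non-empty and have the same number of frames")
    per_frame = [js_divergence(a, b) for a, b in zip(ppg_a, ppg_b)]
    return sum(per_frame) / len(per_frame)


# Toy two-phoneme PPGs with two frames each
same = pronunciation_distance([[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.1], [0.2, 0.8]])
opposite = pronunciation_distance([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With base-2 logarithms the divergence is symmetric and bounded between 0 (identical distributions) and 1 (distributions with no overlap), so the distance is directly comparable across utterances.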

Subjective Evaluation

Beyond objective measures, the paper reports a subjective evaluation: a MUSHRA-type listening test that compares the quality of speech synthesized from different PPG representations. Together, the objective and subjective results are used to validate the proposed PPG representation.

Novel Contributions and Future Directions

Contributions

This paper makes three key contributions to the field:

An interpretable PPG representation that offers competitive pitch modification and superior interpretability.
A novel metric for measuring pronunciation distance, fostering language-agnostic pronunciation analysis.
Demonstrated capability of interpretable PPGs in enabling fine-grained control over pronunciation, opening new possibilities in speech synthesis and editing.
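
One simple way to picture fine-grained pronunciation control is to blend two time-aligned PPGs. The sketch below uses plain linear interpolation, which keeps every frame a valid distribution. This is an illustrative assumption and not necessarily the interpolation scheme used in the paper. The function name and inputs are hypothetical.

```python
def interpolate_ppgs(ppg_a, ppg_b, weight):
    """Blend two time-aligned PPGs; weight=0 gives ppg_a and weight=1 gives ppg_b."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if len(ppg_a) != len(ppg_b):
        raise ValueError("PPGs must have the same number of frames")
    return [
        [(1.0 - weight) * a + weight * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(ppg_a, ppg_b)
    ]


# Toy example: one frame that moves from the first phoneme toward the second
source = [[1.0, 0.0]]
target = [[0.0, 1.0]]
halfway = interpolate_ppgs(source, target, 0.5)
```

Because a convex combination of two distributions is itself a distribution, the blended frames can be passed to a PPG-conditioned synthesizer without renormalizing.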

Future Directions

The paper also identifies directions for future research, particularly applying PPGs to accent coaching and mispronunciation detection. Beyond speech synthesis, the work may be useful for linguistics and AI-driven audio editing.

Conclusion

The paper introduces an interpretable PPG representation that improves speech synthesis quality and gives independent control over pitch and pronunciation. The authors release the code and PPG representations as an open-source package, which makes the work easier for the research community to adopt and build on.
