
PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS (2302.12391v3)

Published 24 Feb 2023 in eess.AS, cs.LG, and cs.SD

Abstract: Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code, audio samples, and demo are available at https://github.com/anonymous-pits/pits.

Citations (3)

Summary

  • The paper presents a novel TTS model that achieves robust pitch control without relying on direct fundamental frequency modeling.
  • The paper leverages variational inference with Yingram encoding to enhance pitch diversity and improve synthesis naturalness.
  • Experimental results demonstrate that PITS outperforms models like VITS and FastSpeech 2 in maintaining natural intonation under pitch shifts.

Insights into PITS: Variational Pitch Inference Without Fundamental Frequency for End-to-End Pitch-Controllable TTS

The paper introduces PITS, a novel text-to-speech (TTS) model aimed at achieving robust pitch control without relying on direct fundamental frequency (f0) modeling. Whereas previous TTS models that model f0 directly tend to produce low pitch variance, PITS applies variational inference to gain fine-grained control over pitch while maintaining naturalness and high-quality synthesis.

Technical Approach and Enhancements

PITS builds upon the framework of VITS, implementing several key innovations:

  1. Yingram Encoding and Decoding: The heart of PITS is the Yingram encoder and decoder, which capture pitch-related information without explicitly modeling f0. The Yingram is an acoustic feature derived from the YIN algorithm that represents pitch together with its harmonics, addressing the challenges of f0, which may be undefined or ambiguous in certain contexts.
  2. Variational Inference Framework: By employing a conditional variational autoencoder (VAE), PITS diversifies pitch representations dynamically. The Yingram encoder contributes to the posterior and enables pitch control by shifting latent variables during synthesis, a mechanism unavailable to traditional f0-based models. This approach provides a richer spectrum of pitch variation, essential for nuanced TTS output.
  3. Pitch-Shifted Waveform Synthesis: A notable innovation is the pitch-sliding mechanism, where waveforms are systematically synthesized with pitch-shifts to ensure the model's robustness in handling variable pitch. This is bolstered by the Yingram reconstruction loss and adversarial pitch-shifted losses, which guide the model to maintain audio quality even under pitch transformations.
  4. Q-VAE and Single-Stage Training: A quantized variational autoencoder (Q-VAE) was explored to disentangle pitch from linguistic content by quantizing the output of the STFT encoder. Although this component showed promise during exploratory stages, experiments revealed significant quality degradation when it was applied, marking it as an area for future optimization.
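To make the Yingram feature in point 1 concrete, the sketch below computes one Yingram frame with NumPy: YIN's cumulative mean normalized difference (CMND) function, sampled by linear interpolation at lags corresponding to MIDI-spaced pitch bins. Window size, bin range, and function names here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def midi_to_lag(m, sr):
    """Lag in samples corresponding to a (fractional) MIDI note m."""
    freq = 440.0 * 2.0 ** ((m - 69.0) / 12.0)
    return sr / freq

def yingram_frame(frame, sr, midi_bins):
    """One Yingram frame: YIN's cumulative mean normalized difference
    (CMND), sampled at lags of MIDI-spaced bins. Low values indicate
    strong periodicity at that bin's pitch. (Illustrative sketch.)"""
    w = len(frame) // 2          # fixed analysis window inside the frame
    tau_max = w
    # YIN difference function d(tau) over a window of length w
    d = np.zeros(tau_max)
    for tau in range(1, tau_max):
        diff = frame[:w] - frame[tau:tau + w]
        d[tau] = np.dot(diff, diff)
    # cumulative mean normalized difference: d'(tau) = d(tau)*tau / sum_{j<=tau} d(j)
    cmnd = np.ones(tau_max)
    running = 0.0
    for tau in range(1, tau_max):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0 else 1.0
    # sample CMND at each bin's fractional lag by linear interpolation
    out = np.empty(len(midi_bins))
    for i, m in enumerate(midi_bins):
        tau = midi_to_lag(m, sr)
        lo, hi = int(np.floor(tau)), int(np.ceil(tau))
        if hi >= tau_max:
            out[i] = 1.0             # lag outside the analyzable range
        elif lo == hi:
            out[i] = cmnd[lo]
        else:
            out[i] = cmnd[lo] + (cmnd[hi] - cmnd[lo]) * (tau - lo)
    return out
```

Feeding a pure 220 Hz sine through this produces a pronounced dip at the bin near MIDI note 57 (220 Hz), which is exactly the periodicity cue the Yingram encoder consumes instead of an explicit f0 value.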
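The latent-shifting idea in points 2 and 3 can be pictured as moving a fixed-size crop window along the MIDI-spaced bin axis of the Yingram-scope latent before decoding. The following sketch shows only that indexing idea; the variable names, bin resolution, and crop API are assumptions for illustration, not PITS's actual code.

```python
import numpy as np

def crop_scope(z, start_bin, scope):
    """Take a fixed-size band of MIDI-spaced channels from the latent
    z (bins x frames); the decoder only ever sees `scope` rows."""
    assert 0 <= start_bin and start_bin + scope <= z.shape[0]
    return z[start_bin:start_bin + scope]

def shifted_scope(z, base_start, scope, semitones, bins_per_semitone=1):
    """Pitch-shift by sliding the crop window by a whole number of
    semitones' worth of bins; the latent values themselves are untouched,
    and the shift direction depends on the bin-ordering convention."""
    return crop_scope(z, base_start + semitones * bins_per_semitone, scope)
```

During adversarial training, waveforms decoded from such shifted crops are judged by a discriminator, which is what pushes the model to keep audio quality stable across the pitch shifts described above.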

Experimental Validation and Implications

Empirical results underline the efficacy of PITS in synthesizing high-quality, pitch-controllable speech. The approach was rigorously tested against contemporary TTS models like VITS and FastSpeech 2, showcasing superior naturalness and pitch versatility with minimal degradation in intelligibility or speaker identity. In systematic evaluations, PITS maintained competitive mean opinion scores (MOS) while offering substantially greater pitch flexibility, an achievement partly credited to the adversarial training strategies used.

Broader Implications and Future Considerations

The advent of PITS sets an advanced precedent for pitch modeling in TTS systems. By circumventing the traditional dependency on f0, this work paves the way for TTS systems that can handle diverse stylistic demands, language intonations, and expressive deliveries more effectively.

Future extensions could involve refining the integration of Yingram features with Q-VAE elements to enhance expressiveness without quality loss. Moreover, leveraging the Yingram approach for cross-linguistic TTS applications may unlock further potential given the universality of harmonic structures captured therein.

In conclusion, PITS represents an essential step forward in text-to-speech technology, articulating how variational methods can effectively expand the expressiveness of automated speech synthesis by decoupling and diversifying pitch content without compromising speech naturalness or intelligibility. This fosters a rich ground for future explorations into integrating more sophisticated pitch dynamics into AI-driven audio and language processing systems.