Period VITS: Pitch Modeling for Emotional TTS

Updated 5 April 2026

The paper introduces explicit periodicity modeling via sample-level sinusoidal sources that stabilize pitch and enhance naturalness.
The architecture combines a text-conditional VAE with normalizing flows and adversarial training to jointly optimize spectral content and prosody.
Experimental results show Period VITS achieves MOS ratings near natural recordings, outperforming baseline TTS systems in expressive contexts.

Period VITS (Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis) is an end-to-end text-to-speech (TTS) architecture designed to address instability and artifacts in pitch generation, particularly for emotional speech datasets exhibiting a high diversity of prosody and pronunciation. Developed as an extension of the VITS framework, Period VITS introduces an explicit modeling of periodicity via sample-level sinusoidal sources derived from predicted pitch and voicing features. The model jointly optimizes spectral content and prosody within a variational inference and adversarial training paradigm, resulting in output speech that is both more natural and pitch-stable under expressive conditions (Shirahata et al., 2022).

1. Architectural Overview

Period VITS is grounded in VITS, which implements a text-conditional variational autoencoder (VAE) with a normalizing-flow–augmented prior and a HiFi-GAN–inspired non-autoregressive waveform decoder. The architecture introduces two latent variables: $z$, encoding spectral content, and $y$, explicitly encoding prosody (pitch and voicing). The key innovation is the explicit generation of a sample-level periodic source informed by a frame pitch predictor, whose outputs condition the waveform decoder directly. The encoder stack comprises (i) a text encoder that processes phoneme sequences alongside optional speaker, emotion, and accent embeddings, (ii) a frame prior network (FPN) of six one-dimensional residual convolutional layers that upsamples phoneme-level inputs to frame-level prior parameters $(\mu(c), \sigma(c))$, and (iii) a series of normalizing flows augmenting the prior $p(z|c)$:

$p(z|c)=\mathcal{N}(f(z);\mu(c),\sigma(c))\left|\det\frac{\partial f(z)}{\partial z}\right|.$
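The change-of-variables formula above can be illustrated with a toy flow. The sketch below is a minimal illustration, not the paper's flow stack: it assumes a hypothetical elementwise affine map $f(z) = \text{scale} \cdot z + \text{shift}$, treats $\sigma(c)$ as a standard deviation, and uses the names `gaussian_logpdf` and `flow_prior_logprob`, none of which come from the source.

```python
import math


def gaussian_logpdf(u, mu, sigma):
    """Log-density of a scalar Gaussian N(u; mu, sigma^2)."""
    return (-0.5 * math.log(2.0 * math.pi)
            - math.log(sigma)
            - 0.5 * ((u - mu) / sigma) ** 2)


def flow_prior_logprob(z, mu, sigma, scale, shift):
    """log p(z|c) for a toy elementwise affine flow f(z) = scale * z + shift.

    Each dimension contributes the Gaussian log-density of f(z) plus
    log|df/dz| = log|scale|, which is the log-determinant term.
    """
    total = 0.0
    for z_i, mu_i, sigma_i in zip(z, mu, sigma):
        u_i = scale * z_i + shift  # f(z): map z into the Gaussian base space
        total += gaussian_logpdf(u_i, mu_i, sigma_i) + math.log(abs(scale))
    return total
```

With `scale=1` and `shift=0` the flow is the identity and the result is the plain Gaussian log-density; any other scale shifts the log-density by `log|scale|` per dimension.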

The frame pitch predictor consists of five 1-D convolutional layers (kernel size 5, dropout 0.3) that, with speaker and emotion conditioning, output the predicted frame-level pitch $\hat{F}_0$ and a binary voicing flag $\hat{v}$.

2. Mathematical Formulation

Training optimizes a composite objective involving variational and adversarial losses. The evidence lower bound (ELBO) for the conditional log-likelihood $\log p(x|c)$ is:

$\log p(x|c) = \log\iint p(x,z,y|c)\,dz\,dy \geq \mathbb{E}_{q(z|x)q(y|x)}[\log p(x|z,y)] - D_{KL}(q(z|x)\|p(z|c)) - D_{KL}(q(y|x)\|p(y|c)).$

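With $q(y|x)=\delta(y - y^{gt})$, the final term reduces to a supervised L2 pitch and voicing error:

$L_{pitch} = \|\log F_0 - \log \hat{F}_0\|_2 + \|v - \hat{v}\|_2,$

where $F_0$ and $v$ are the ground-truth pitch and voicing contained in $y^{gt}$, and $\hat{F}_0$ and $\hat{v}$ are the outputs of the frame pitch predictor.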
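As a rough illustration of this supervised term, the sketch below computes an L2 error on log-pitch plus an L2 error on voicing flags for one utterance. It makes two assumptions that the text does not state: the log-pitch error is evaluated only on frames whose ground-truth voicing is 1 (so that the logarithm of an unvoiced, zero-valued pitch is never taken), and a small `eps` floor guards the logarithm. The function names are hypothetical.

```python
import math


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in values))


def pitch_loss(f0_true, f0_pred, v_true, v_pred, eps=1e-5):
    """L2 log-F0 error (voiced frames only, an assumption) + L2 voicing error."""
    log_err = [
        math.log(max(ft, eps)) - math.log(max(fp, eps))
        for ft, fp, vt in zip(f0_true, f0_pred, v_true)
        if vt  # skip frames that are unvoiced in the ground truth
    ]
    voicing_err = [vt - vp for vt, vp in zip(v_true, v_pred)]
    return l2_norm(log_err) + l2_norm(voicing_err)
```

For example, predicting 100 Hz where the ground truth is 200 Hz on a single voiced frame gives a loss of `log(2)`, independent of the absolute pitch range, which is the usual motivation for working in the log domain.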

Period VITS adopts least-squares GAN adversarial losses. Discriminator and generator losses are:

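Discriminator:

$L_{adv}(D) = \mathbb{E}_{(x,z,y)}\left[(D(x) - 1)^2 + D(G(z,y))^2\right]$

Generator adversarial:

$L_{adv}(G) = \mathbb{E}_{(z,y)}\left[(D(G(z,y)) - 1)^2\right]$

Here $D$ is the discriminator and $G(z,y)$ is the waveform produced by the decoder from the latents $z$ and $y$.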

Feature-matching loss is computed as an L1 distance over internal discriminator features.
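The least-squares GAN and feature-matching terms can be sketched on plain lists of discriminator outputs. This is a minimal illustration of the standard least-squares formulation, not the paper's training code; the function names, the per-layer averaging, and the representation of features as nested lists are all assumptions made for the example.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: push D(real) to 1 and D(fake) to 0."""
    real_term = sum((d - 1.0) ** 2 for d in real_scores) / len(real_scores)
    fake_term = sum(d ** 2 for d in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adv_loss(fake_scores):
    """Least-squares generator loss: push D(fake) to 1."""
    return sum((d - 1.0) ** 2 for d in fake_scores) / len(fake_scores)


def feature_matching_loss(real_features, fake_features):
    """Sum over layers of the mean absolute (L1) feature difference."""
    total = 0.0
    for real_layer, fake_layer in zip(real_features, fake_features):
        diffs = [abs(r - f) for r, f in zip(real_layer, fake_layer)]
        total += sum(diffs) / len(diffs)
    return total
```

A discriminator that scores real audio at 1 and generated audio at 0 has zero loss, while the generator's loss in that same situation is 1, which is what drives the two networks against each other.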

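Total loss:

$L = \lambda_{recon} L_{recon} + \lambda_{kl} L_{kl} + \lambda_{pitch} L_{pitch} + \lambda_{dur} L_{dur} + \lambda_{adv} L_{adv}(G) + \lambda_{fm} L_{fm}(G)$

with one multiplicative weight per term: $\lambda_{recon}$, $\lambda_{kl}$, $\lambda_{pitch}$, $\lambda_{dur}$, $\lambda_{adv}$, $\lambda_{fm}$.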

3. Explicit Pitch and Periodicity Modeling

Period VITS directly synthesizes periodic structure by converting predicted frame-level $F_0$ and voicing into a sample-level sinusoidal source, $e_{sin}[n]$, where:

$e_{sin}[n] = v[n]\,\sin\left(2\pi \sum_{k=1}^{n} \frac{F_0[k]}{f_s} + \phi\right)$

with $f_s = 24$ kHz and $\phi$ as initial phase. Sample-level $F_0[n]$ and $v[n]$ are generated via nearest-neighbor upsampling of frame-level predictions. The periodicity generator stacks the sinusoidal source with its accompanying sample-level inputs and down-samples the result via transposed convolutions aligned to the waveform decoder’s upsampling schedule ([6, 5, 2, 2, 2]), producing multi-scale periodic features added at each upsample block in the HiFi-GAN decoder. This explicit periodic signal injection acts as a pitch anchor, enforcing stable harmonic structure even under expressive emotional prosody.
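The conversion from frame-level pitch to a sample-level sinusoid can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: the phase is accumulated sample by sample from the upsampled pitch, unvoiced samples are set to zero, the initial phase defaults to zero, and the function names and default arguments (24 kHz, 240 samples per frame) are chosen for the example.

```python
import math


def upsample_nearest(frames, hop):
    """Nearest-neighbor upsampling: repeat each frame-level value `hop` times."""
    return [value for value in frames for _ in range(hop)]


def sinusoidal_source(f0_frames, voiced_frames, sample_rate=24000, hop=240,
                      phase0=0.0):
    """Build a sample-level sine whose instantaneous frequency follows F0."""
    f0 = upsample_nearest(f0_frames, hop)
    voiced = upsample_nearest(voiced_frames, hop)
    source = []
    phase = phase0
    for f0_n, v_n in zip(f0, voiced):
        # Running sum of 2*pi*F0/fs gives the phase at this sample.
        phase += 2.0 * math.pi * f0_n / sample_rate
        # Gate by voicing (an assumption): unvoiced samples carry no sinusoid.
        source.append(math.sin(phase) if v_n else 0.0)
    return source
```

Calling `sinusoidal_source([100.0, 0.0], [1, 0])` yields 480 samples: one full 100 Hz cycle (240 samples at 24 kHz) followed by 240 zeros for the unvoiced frame. Accumulating phase, rather than evaluating the sine from the frame index, keeps the waveform continuous when the pitch changes between frames.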

4. Training Methodology

Training is performed in a fully end-to-end manner: from text $c$ through the prior encoder(s) to the decoder $p(x|z,y)$, backpropagating all reconstruction, latent, GAN, and duration losses without the need to pretrain a separate vocoder or acoustic model. Training uses the Adam optimizer with default $\beta$ values, dynamic batch sizing (average 26 utterances), and a fixed frame shift and window (10 ms / 40 ms; the 10 ms shift corresponds to 240 samples per frame at 24 kHz), and continues until convergence. Duration alignment is supervised via $L_{dur}$, computed on manually labeled phoneme durations.
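The frame arithmetic can be checked in a few lines. The sample rate, frame shift, window, and decoder upsampling schedule are taken from the text; the per-block downsampling factors at the end are an inference from that schedule (how far a sample-level signal must be reduced to match each decoder block), not values stated in the source.

```python
from itertools import accumulate
from math import prod
import operator

SAMPLE_RATE = 24000           # Hz
FRAME_SHIFT_MS = 10
WINDOW_MS = 40
UPSAMPLE_FACTORS = [6, 5, 2, 2, 2]  # decoder upsampling schedule

hop = SAMPLE_RATE * FRAME_SHIFT_MS // 1000   # samples per frame shift
win = SAMPLE_RATE * WINDOW_MS // 1000        # samples per analysis window

# Samples per frame after each decoder upsample block.
samples_per_frame = list(accumulate(UPSAMPLE_FACTORS, operator.mul))

# Inferred reduction needed to align a sample-level signal with each block.
downsample = [hop // s for s in samples_per_frame]

print(hop, win, prod(UPSAMPLE_FACTORS))  # 240 960 240
print(samples_per_frame)                 # [6, 30, 60, 120, 240]
print(downsample)                        # [40, 8, 4, 2, 1]
```

The product of the upsampling factors equals the 240-sample frame shift, which is why the decoder maps one latent frame to exactly one hop of waveform.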

5. Experimental Protocol and Performance Metrics

Experiments are conducted on a Japanese emotional speech corpus comprising 15 professional speakers (5 male, 10 female), each contributing data in three styles: neutral (4000 utterances), happy (1000), and sad (1000); 50 utterances per style are set aside for validation and test. For end-to-end models, the input is a 513-dimensional linear spectrogram; for cascade baselines, an 80-dimensional log-Mel spectrogram with appended pitch/voicing features is used. Ground-truth $F_0$ and voicing are extracted using the ITFTE vocoder. Evaluation metrics include:

Mean Opinion Score (MOS) for naturalness, rated (1–5) by 18 native speakers, over 180 audio samples per system.
Pitch stability, assessed via visual inspection of spectrogram harmonics for consistency.
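A MOS value with an interval of the form "mean ± half-width" can be computed as sketched below. This is a generic illustration, assuming a normal-approximation 95% confidence interval (z = 1.96) over independent ratings; the source does not say how its intervals were derived, and the function name is hypothetical.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err
```

For example, `mos_with_ci([5, 4, 5, 4, 5])` returns the mean rating together with the half-width that would be printed after the ± sign.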

6. Comparative Results and Analysis

Period VITS achieves MOS on par with natural recordings for neutral and sad speech:

System	Neutral	Happy	Sad
Reference	4.66 ± 0.04	4.77 ± 0.04	4.71 ± 0.04
VITS	2.78 ± 0.07	2.82 ± 0.08	3.24 ± 0.08
FPN-VITS	3.85 ± 0.07	3.69 ± 0.07	3.77 ± 0.07
CAT-P-VITS	3.79 ± 0.07	3.63 ± 0.07	3.66 ± 0.07
Sine-P-VITS	4.63 ± 0.04	4.54 ± 0.04	4.68 ± 0.04
P-VITS (ours)	4.66 ± 0.04	4.62 ± 0.04	4.69 ± 0.04
FS2+P-HiFi-GAN	4.50 ± 0.05	4.18 ± 0.06	4.40 ± 0.05

No statistically significant MOS difference at the 5% level is observed between Period VITS and reference speech in the neutral and sad categories (Student’s $t$-test). Systems that leverage frame-level pitch only (CAT-P-VITS) show limited improvement, whereas sample-level sinusoidal injection (P-VITS) delivers a substantial gain in pitch stability and naturalness. Qualitative visualizations indicate Period VITS generates harmonics that remain straight and stable, closely replicating reference signals, in contrast to the “wobbly” pitch evident in weaker baselines.
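As a rough sanity check on the table (not a substitute for the t-test reported above), the sketch below compares the Reference and P-VITS rows and tests whether their "mean ± half-width" intervals overlap. The numbers are copied from the table; the overlap criterion and all names are assumptions made for this illustration.

```python
# (mean, half-width) per style, copied from the MOS table.
REFERENCE = {"Neutral": (4.66, 0.04), "Happy": (4.77, 0.04), "Sad": (4.71, 0.04)}
P_VITS = {"Neutral": (4.66, 0.04), "Happy": (4.62, 0.04), "Sad": (4.69, 0.04)}


def intervals_overlap(a, b):
    """True if the intervals mean +/- half-width of a and b intersect."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return max(mean_a - half_a, mean_b - half_b) <= min(mean_a + half_a,
                                                        mean_b + half_b)


overlap = {style: intervals_overlap(REFERENCE[style], P_VITS[style])
           for style in REFERENCE}
gap = {style: round(REFERENCE[style][0] - P_VITS[style][0], 2)
       for style in REFERENCE}

print(overlap)  # {'Neutral': True, 'Happy': False, 'Sad': True}
print(gap)      # {'Neutral': 0.0, 'Happy': 0.15, 'Sad': 0.02}
```

The intervals overlap for neutral and sad speech but not for happy speech, which is consistent with the statement that parity with reference recordings holds in the neutral and sad categories.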

7. Significance and Implications

Period VITS demonstrates that explicit modeling of pitch and voicing through sample-level periodicity features, combined with joint optimization in a variational-adversarial architecture, can eliminate pitch artifacts and instability in synthetic speech across a range of expressive styles. The end-to-end design obviates the need for separate acoustic and neural vocoder stages, simplifying the pipeline while matching or exceeding the quality of strong TTS baselines on both neutral and emotional corpora. A plausible implication is that the method’s sample-level periodicity modeling may become essential in future high-fidelity TTS architectures dealing with diverse and highly expressive datasets (Shirahata et al., 2022).

References (1)

Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis (2022)
