PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping

Published 8 Nov 2022 in eess.AS, cs.AI, cs.SD, and eess.SP | (2211.04610v2)

Abstract: Previous generative adversarial network (GAN)-based neural vocoders are trained to reconstruct the exact ground truth waveform from the paired mel-spectrogram and do not consider the one-to-many relationship of speech synthesis. This conventional training causes overfitting for both the discriminators and the generator, leading to the periodicity artifacts in the generated audio signal. In this work, we present PhaseAug, the first differentiable augmentation for speech synthesis that rotates the phase of each frequency bin to simulate one-to-many mapping. With our proposed method, we outperform baselines without any architecture modification. Code and audio samples will be available at https://github.com/mindslab-ai/phaseaug.

Citations (11)

Summary

  • The paper introduces PhaseAug, which augments GAN-based vocoder training by rotating audio phases to simulate one-to-many mapping.
  • It demonstrates improved metrics, including a 3% reduction in periodicity error and enhanced performance with less training data.
  • The approach requires no architectural changes, offering a versatile tool to mitigate overfitting and boost speech synthesis quality.

PhaseAug: A Differentiable Augmentation for Speech Synthesis

The paper "PhaseAug: A Differentiable Augmentation for Speech Synthesis" introduces a novel methodology aimed at addressing specific challenges within the domain of GAN-based neural vocoders. The authors identify two key limitations in existing systems: discrimination overfitting and the assumption of one-to-one mapping from mel-spectrograms to raw waveforms.

Core Contributions

The primary contribution of the paper is PhaseAug, a differentiable augmentation that creates a more robust training environment for speech synthesis models. PhaseAug rotates the phase of each frequency bin of the audio signal, thereby simulating a one-to-many mapping between mel-spectrograms and waveforms. Because the rotation is differentiable, it can be applied on the fly during training, mitigating the discriminator overfitting observed in GAN frameworks such as HiFi-GAN and MelGAN. A minimal sketch of the core operation is given below.
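
The following PyTorch sketch illustrates per-bin phase rotation of a waveform via the STFT. The function name, STFT parameters, and the way rotation angles are sampled are illustrative assumptions for exposition, not the authors' exact implementation (see the official repository for that).

```python
import torch

def phase_rotate(wav: torch.Tensor, n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Rotate the phase of every frequency bin by a random angle (illustrative sketch)."""
    window = torch.hann_window(n_fft, device=wav.device)
    # Complex STFT: shape (batch, n_fft // 2 + 1, frames)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window, return_complex=True)
    # One random rotation angle per frequency bin, shared across frames.
    phi = 2 * torch.pi * torch.rand(spec.size(-2), 1, device=wav.device)
    rotated = spec * torch.exp(1j * phi)
    # Back to the time domain; the whole path stays differentiable.
    return torch.istft(rotated, n_fft, hop_length=hop, window=window, length=wav.size(-1))

# Example: augmented = phase_rotate(torch.randn(4, 22050))  # batch of 1-second clips at 22.05 kHz
```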

Numerical Results

The authors provide compelling evidence of PhaseAug's efficacy across different training conditions. Evaluation relies on metrics such as Mean Absolute Error (MAE), periodicity error, and Mean Opinion Score (MOS). Importantly, models trained with PhaseAug show improved periodicity metrics, with reductions averaging around 3%. This decrease matters because it correlates directly with fewer perceptible artifacts in the generated audio. Furthermore, even with reduced datasets (e.g., 10% and 1% of the standard data volume), models trained with PhaseAug outperformed their non-augmented counterparts, most visibly in the pairwise preference evaluation, where PhaseAug-trained models were favored.

Practical and Theoretical Implications

From a practical standpoint, PhaseAug presents a straightforward yet effective enhancement for existing GAN-based vocoder architectures. It is noteworthy that PhaseAug's implementation does not require architectural modifications, implying ease of integration into current systems with minimal overhead. This positions PhaseAug as a valuable tool for improving vocoder performance, especially in scenarios where data volume is a limiting factor.
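
As a rough illustration of that ease of integration, the sketch below shows how a differentiable waveform augmentation could slot into a generic LSGAN-style vocoder training step (least-squares adversarial losses of the kind used by HiFi-GAN) without touching either model. Here `generator`, `discriminator`, and `augment` are placeholders, not the paper's actual training code.

```python
import torch

def adversarial_losses(generator, discriminator, augment, mel, real_wav):
    """LSGAN-style losses with the same differentiable augmentation applied
    to both real and generated audio before the discriminator (illustrative)."""
    fake_wav = generator(mel)                              # mel-spectrogram -> waveform
    # Discriminator branch: the generator output is detached here.
    d_real = discriminator(augment(real_wav))
    d_fake = discriminator(augment(fake_wav.detach()))
    d_loss = torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)
    # Generator branch: gradients flow through the discriminator *and* the
    # augmentation back into the generator, which is why differentiability matters.
    g_loss = torch.mean((discriminator(augment(fake_wav)) - 1) ** 2)
    return d_loss, g_loss
```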

Theoretically, the introduction of a phase-based augmentation strategy opens new avenues for research in phase processing within neural vocoders. By demonstrating that discriminators can generalize better with augmented data that accounts for phase variability, PhaseAug challenges the conventional assumptions about the rigidity of vocoder training paradigms.

Speculation on Future Developments

As the field of speech synthesis and enhancement progresses, PhaseAug could serve as a pivotal step towards addressing more complex one-to-many mapping problems inherent in speech processing, such as in end-to-end text-to-speech systems. The authors suggest potential applications in neural audio upsampling, indicating a broader scope for PhaseAug beyond its initial implementation.

Additionally, the notion of extending PhaseAug for shift-equivariant models holds promise. By employing PhaseAug for differentiable time-shift operations, future research could explore its applicability in refining time-domain features for models beyond neural vocoding, contributing to advancements in alias-free GAN architectures.
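
To make the time-shift idea concrete, the sketch below implements a differentiable fractional time shift as a linear phase ramp in the frequency domain. This is a generic signal-processing construction offered as an illustration of the direction suggested above, not a component of PhaseAug itself.

```python
import torch

def fractional_time_shift(wav: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
    """Shift `wav` along its last axis by `shift` samples (non-integer allowed)."""
    n = wav.size(-1)
    spec = torch.fft.rfft(wav)                                # one-sided spectrum
    freqs = torch.fft.rfftfreq(n, d=1.0, device=wav.device)   # cycles per sample
    ramp = torch.exp(-2j * torch.pi * freqs * shift)          # linear phase ramp
    return torch.fft.irfft(spec * ramp, n=n)

# Example: delay a signal by half a sample, differentiably with respect to `shift`.
# y = fractional_time_shift(torch.randn(22050), torch.tensor(0.5, requires_grad=True))
```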

Conclusion

In summary, PhaseAug represents a significant contribution to the field of speech synthesis, addressing long-standing issues of discriminator overfitting and the rigid one-to-one mapping assumption. Its integration into GAN-based vocoders yields noticeable improvements in both objective and subjective metrics, and its broader potential applications could shape future research in audio processing and synthesis. This work aligns with ongoing efforts to use data augmentation to improve generalization in machine learning models, particularly in audio signal processing.
