PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation (2408.07547v1)

Published 14 Aug 2024 in cs.SD, cs.AI, cs.LG, eess.AS, and eess.SP

Abstract: Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown their powerful generative performance in other domains; however, they stay out of the limelight due to slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, this requires more computational costs. To reduce this issue, we also propose a single period-conditional universal estimator that can feed-forward parallel by period-wise batch inference. Additionally, we utilize discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrated that our model outperforms the previous models both in Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at https://github.com/sh-lee-prml/PeriodWave.



The paper "PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation" introduces PeriodWave, a universal waveform generation model. It targets a gap the authors identify in existing generators: no architecture explicitly disentangles the natural periodic features of high-resolution waveform signals.

Waveform generation commonly relies on GAN-based models for their speed, even though they are vulnerable to train-inference mismatch, as in two-stage text-to-speech, where they tend to produce artifacts. Diffusion-based models offer strong generative performance, but their slow sampling has limited their use in waveform generation, especially in real-time settings. PeriodWave is proposed to address both limitations by combining a period-aware flow matching estimator with the discrete wavelet transform (DWT).
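To make "flow matching" concrete, here is a minimal sketch of the generic idea: an estimator is trained to regress a vector field along a path from noise to data, and generation integrates that field with an ODE solver. The straight-line path, the fixed-step Euler solver, and every function name below are illustrative assumptions, not details taken from the paper; an oracle field stands in for a trained network.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_vector_field(x0, x1):
    """Regression target along that path: d x_t / dt = x1 - x0 (assumed straight path)."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: an oracle field (hypothetical stand-in for a trained estimator)
# carries the noise sample onto the data sample.
rng = random.Random(0)
data = [0.5, -0.25, 0.75, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in data]


def oracle(x, t):
    return target_vector_field(noise, data)


generated = euler_sample(oracle, noise, num_steps=4)
```

In practice the oracle would be replaced by a learned estimator conditioned on the Mel-spectrogram, and the number of solver steps sets the trade-off between quality and inference speed.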

Key Contributions and Methodology

The authors introduce a period-aware flow matching estimator that captures the periodic structure of the waveform when estimating vector fields. The estimator works with multiple periods, chosen as prime numbers so that they do not overlap, and each period captures a distinct set of periodic features. Because adding periods improves quality but raises computational cost, the authors also propose a single period-conditional universal estimator that processes all periods in parallel through period-wise batch inference.
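The multi-period idea can be pictured as folding the same 1D waveform into several 2D views, one per period, so that samples spaced one period apart line up in a column. The sketch below assumes this folding interpretation, zero padding on the right, and the specific prime periods (2, 3, 5, 7); the function name and these values are illustrative and not taken from the paper.

```python
def reshape_by_period(waveform, period, pad_value=0.0):
    """Fold a 1D waveform into rows of length `period`, right-padding if needed."""
    remainder = len(waveform) % period
    padded = list(waveform)
    if remainder:
        padded += [pad_value] * (period - remainder)
    return [padded[i:i + period] for i in range(0, len(padded), period)]


# Assumed prime periods, for illustration only.
PERIODS = (2, 3, 5, 7)

waveform = [float(n) for n in range(12)]
views = {p: reshape_by_period(waveform, p) for p in PERIODS}

# Column k of the period-p view holds samples k, k + p, k + 2p, ...
column0 = {p: [row[0] for row in views[p]] for p in PERIODS}
```

Because the periods share no common factor, no view's sampling grid is a subsample of another's, which is the sense in which prime periods avoid overlap.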

The model also uses the discrete wavelet transform to separate a waveform into frequency bands. Because the DWT is lossless, the bands can be modeled separately and recombined without discarding frequency information, which helps with the high-frequency detail that diffusion models often struggle to reproduce. The authors additionally apply FreeU to reduce high-frequency noise in the generated waveform.
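The lossless property of the DWT can be checked directly. The following sketch uses a one-level Haar wavelet, chosen here only because it is the simplest case; the summary does not say which wavelet or how many levels the authors use, so both are assumptions, as are the function names.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT: split into (low, high) bands, each half the length."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse one-level Haar DWT: interleave the recovered sample pairs."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out


signal = [0.0, 1.0, 4.0, 2.0, -3.0, 5.0, 0.5, 0.5]
low, high = haar_dwt(signal)
reconstructed = haar_idwt(low, high)

# Largest reconstruction error; only floating-point rounding remains.
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

The low band carries the smooth content and the high band the sample-to-sample differences, so a generator can treat the two bands separately and still recover the full-rate waveform.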

Experimental Evaluation

The paper evaluates PeriodWave against established baselines on two tasks: Mel-spectrogram reconstruction and text-to-speech synthesis. PeriodWave outperforms the previous models on both, with significant improvements in metrics related to pitch accuracy and periodicity, and it does so with markedly reduced training time. Its results on both speech and out-of-distribution samples further illustrate its versatility and robustness.

A particularly bold claim accompanying these findings concerns training efficiency: the proposed model requires only three days of training, compared to several weeks for GAN counterparts, while still outperforming them on critical performance metrics. If the claim holds, the design offers a substantial reduction in training cost.

Implications and Future Directions

The advancements described in this paper hold significant implications for the broader field of AI-driven audio synthesis. PeriodWave's architectural considerations and methodical enhancements indicate a valuable direction for future research into computationally efficient yet flexible waveform generators. Prospective studies could explore adaptations of this model to more varied and complex audio generation tasks or assess its integration within real-time systems.

Furthermore, future work may look into reducing synthesis time further while maintaining, or even enhancing, output fidelity. With the release of code and model checkpoints, PeriodWave positions itself as a valuable asset for further exploration and application in audio synthesis.

Overall, this paper provides a detailed and technical exposition of a model poised to impact universal vocoder approaches in the landscape of AI-generated audio, paving the way for more innovative solutions in waveform modeling and synthesis.