- The paper introduces PeriodWave-Turbo, which uses adversarial flow matching optimization to drastically accelerate CFM-based waveform generation.
- It achieves superior fidelity and efficiency, with PESQ scores up to 4.454 and significant real-time inference speed improvements on LJSpeech and LibriTTS.
- The model’s design, featuring fixed-step generation and multi-scale Mel-spectrogram reconstruction loss, outperforms existing GAN and CFM-based approaches.
The paper "Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization" introduces PeriodWave-Turbo, a model that aims to improve both the efficiency and the quality of waveform generation through adversarial flow matching optimization.
Introduction
Conditional flow matching (CFM) generative models have recently gained traction for waveform generation, primarily because of their high-fidelity output. They are trained with a single vector field estimation objective. Their major drawback is the large number of ODE steps required at inference, especially compared with GAN-based models, which generate output in a single step. In addition, the generated samples often lack high-frequency information, which can be attributed to noisy vector field estimation. The paper addresses these limitations by enhancing pre-trained CFM-based generative models through a fixed-step generator modification, reconstruction losses, and adversarial feedback.
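For orientation, the vector field estimation objective mentioned above is commonly written in the optimal-transport form of conditional flow matching shown below. This is a sketch of the standard formulation, not necessarily the paper's exact notation: the symbols $x_0$ (Gaussian noise), $x_1$ (target waveform), $c$ (Mel-spectrogram condition), $\sigma_{\min}$, and $v_\theta$ are assumptions introduced here for illustration.

```latex
\begin{aligned}
x_t &= \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,x_0 + t\,x_1,
  \qquad t \sim \mathcal{U}[0, 1],\; x_0 \sim \mathcal{N}(0, I), \\
u_t &= x_1 - (1 - \sigma_{\min})\,x_0, \\
\mathcal{L}_{\mathrm{CFM}}(\theta) &=
  \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\lVert v_\theta(x_t, t, c) - u_t \bigr\rVert_2^2 .
\end{aligned}
```

At inference, a sample is obtained by integrating $dx/dt = v_\theta(x, t, c)$ from $t = 0$ to $t = 1$, which is why the number of ODE steps directly determines generation cost.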
Methodology
The core proposal of the paper is PeriodWave-Turbo, an ODE-based waveform generator. Its two main enhancements are adversarial flow matching optimization and a shift to a fixed-step generator, which together dramatically accelerate waveform generation and improve fidelity. The key contributions of the work can be summarized as:
- Development of PeriodWave-Turbo, achieving state-of-the-art performance in waveform generation.
- Acceleration of CFM-based models through adversarial flow matching optimization.
- Demonstration of superior performance on two-stage TTS pipelines compared to other GAN-based models and pre-trained CFM generators.
- Effectiveness shown over various model sizes, with improved performance achieved by scaling up model parameters.
- Efficient fine-tuning requiring only 1,000 steps to achieve superior results.
Experimentation and Results
Dataset and Training
The proposed models were trained and validated on the LJSpeech and LibriTTS datasets, two widely used benchmarks for waveform generation. Pre-training ran for 1M steps with the AdamW optimizer, and the models were then fine-tuned with adversarial flow matching optimization for far fewer steps.
Objective Evaluation
Performance was evaluated with M-STFT, PESQ, V/UV accuracy, pitch error, and UTMOS. The reported results show that PeriodWave-Turbo consistently outperformed existing models such as HiFi-GAN, BigVGAN, and PriorGrad. Specifically, PeriodWave-Turbo-B achieved a PESQ score of 4.422 and PeriodWave-Turbo-L achieved a PESQ score of 4.454 on LibriTTS.
Subjective Evaluation and Inference Speed
Subjective evaluation showed that PeriodWave-Turbo models delivered superior MOS scores while maintaining high fidelity and robustness in out-of-distribution (OOD) scenarios. Inference speed also improved significantly: with four-step Euler sampling, the models combined better quality with lower cost. For instance, PeriodWave-Turbo-B and PeriodWave-Turbo-L both improved their xRT figures, making them suitable for real-time applications.
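The four-step Euler sampling mentioned above can be illustrated with a minimal sketch of a fixed-step ODE sampler. Everything here is an assumption made for illustration: `euler_sample` and `toy_field` are hypothetical names, and the constant toy field stands in for the trained network, which in the real model would also take the Mel-spectrogram as a condition.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    vector_field: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 4,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a fixed number of Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = 1.0 / num_steps
    x = list(x0)
    for i in range(num_steps):
        t = i * dt  # time at the start of the current step
        v = vector_field(x, t)  # one network evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained network (hypothetical): a constant field that
# points straight from the noise sample to the target.
noise = [0.5, -1.0, 0.25]
target = [0.0, 1.0, -0.5]


def toy_field(x: Vector, t: float) -> Vector:
    return [b - a for a, b in zip(noise, target)]


sample = euler_sample(toy_field, noise, num_steps=4)
```

Because the step count is fixed, the cost of generation is exactly `num_steps` evaluations of the vector field, which is what makes the sampler behave like a small, fixed-depth generator that can be fine-tuned end to end.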
Analysis and Ablation Study
An extensive ablation study confirmed the importance of the multi-scale Mel-spectrogram loss and adversarial feedback in optimizing the model. The study compared different reconstruction losses and distillation methods, and concluded that multi-scale Mel-spectrogram reconstruction loss combined with adversarial feedback gives the most robust, highest-quality waveform generation.
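To make the idea of a multi-scale spectral reconstruction loss concrete, here is a deliberately simplified sketch in plain Python. It is not the paper's loss: it compares log-magnitude spectra at several window sizes but omits the Mel filterbank, and the function names, window sizes, hop ratio, and `eps` value are all assumptions chosen for illustration.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_frames(signal: Sequence[float], win: int, hop: int) -> List[List[float]]:
    """Naive DFT magnitudes of Hann-windowed frames (O(win^2) per frame, toy use only)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(win)]
        mags = []
        for k in range(win // 2 + 1):  # keep non-negative frequency bins only
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / win) for n in range(win)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def multi_scale_spectral_loss(
    pred: Sequence[float],
    target: Sequence[float],
    window_sizes: Sequence[int] = (8, 16, 32),
    eps: float = 1e-5,
) -> float:
    """Sum over window sizes of the mean L1 distance between log-magnitude spectra."""
    total = 0.0
    for win in window_sizes:
        hop = win // 4
        p = magnitude_frames(pred, win, hop)
        q = magnitude_frames(target, win, hop)
        diffs = [
            abs(math.log(a + eps) - math.log(b + eps))
            for fp, fq in zip(p, q)
            for a, b in zip(fp, fq)
        ]
        if not diffs:
            raise ValueError("signals are shorter than the largest window")
        total += sum(diffs) / len(diffs)
    return total


# Toy signals: a sine wave and an attenuated copy of it.
reference = [math.sin(2 * math.pi * 4 * n / 64) for n in range(64)]
attenuated = [0.9 * s for s in reference]

loss_same = multi_scale_spectral_loss(reference, reference)
loss_diff = multi_scale_spectral_loss(attenuated, reference)
```

Using several window sizes lets the loss penalize errors at both fine time resolution (short windows) and fine frequency resolution (long windows). A Mel-spectrogram version would additionally project each magnitude frame through a Mel filterbank before taking the logarithm.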
Theoretical and Practical Implications
The research offers both theoretical advancements and practical contributions to the field of waveform generation:
- Theoretical Implications: The integration of adversarial feedback with flow matching optimization presents a novel approach to enhancing generative models. This method can be extended to other domains requiring high-fidelity signal generation.
- Practical Implications: The significant reduction in training steps and improved inference speed make PeriodWave-Turbo a practical solution for real-time applications, such as speech synthesis in TTS systems and other multimedia applications.
Future Work
Future research could focus on further optimizing the inference speed by integrating various down-sampling methods and adapting the proposed model for end-to-end TTS and text-to-audio generation tasks. The potential for such models in broader applications highlights the need for continual optimization and adaptation.
In conclusion, this paper provides crucial insights and innovations in waveform generation, showcasing the impactful combination of adversarial feedback and flow matching optimization. The robust performance metrics and practical applications underscore the value of this research in advancing the capabilities of generative models.