Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization (2408.08019v1)

Published 15 Aug 2024 in cs.SD, cs.AI, cs.LG, eess.AS, and eess.SP

Abstract: This paper introduces PeriodWave-Turbo, a high-fidelity and high-efficient waveform generation model via adversarial flow matching optimization. Recently, conditional flow matching (CFM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps compared to GAN-based models, which only need a single generation step. Additionally, the generated samples often lack high-frequency information due to noisy vector field estimation, which fails to ensure high-frequency reproduction. To address this limitation, we enhance pre-trained CFM-based generative models by incorporating a fixed-step generator modification. We utilized reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. Through adversarial flow matching optimization, it only requires 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics. Moreover, we significantly reduce inference speed from 16 steps to 2 or 4 steps. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset. Audio samples, source code and checkpoints will be available at https://github.com/sh-lee-prml/PeriodWave.

Summary

The paper introduces PeriodWave-Turbo, which uses adversarial flow matching optimization to accelerate CFM-based waveform generation.
It reports a PESQ score of up to 4.454 on LibriTTS and cuts sampling from 16 ODE steps to 2 or 4, with experiments on LJSpeech and LibriTTS.
Its design, which combines fixed-step generation with a multi-scale Mel-spectrogram reconstruction loss and adversarial feedback, is reported to outperform existing GAN-based and CFM-based approaches.

The paper "Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization" introduces PeriodWave-Turbo, a waveform generation model that uses adversarial flow matching optimization to improve both the speed and the quality of CFM-based synthesis.

Introduction

Conditional flow matching (CFM) generative models have recently gained traction for waveform generation because of their high-fidelity output. They are trained with a single vector field estimation objective. Their main drawback is the large number of ODE steps required at inference, whereas GAN-based models generate output in a single step. The generated samples also often lack high-frequency information, which the paper attributes to noisy vector field estimation. The paper addresses both limitations by turning a pre-trained CFM-based generator into a fixed-step generator and fine-tuning it with reconstruction losses and adversarial feedback.
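To make the cost of ODE steps concrete, here is a minimal sketch of fixed-step Euler sampling. It is not the paper's implementation: `euler_sample` and `CountingField` are hypothetical names, and the toy vector field stands in for a trained network. The point is only that each step costs one evaluation of the vector field, so the step count sets the inference cost.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(vector_field: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with equal-size Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    h = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * h
        v = vector_field(x, t)  # one model evaluation per step
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


class CountingField:
    """Toy stand-in for a trained vector field (assumption, not a real model).

    It points from the current sample toward a fixed target along a straight
    path and counts how often it is evaluated.
    """

    def __init__(self, target: Vector) -> None:
        self.target = target
        self.calls = 0

    def __call__(self, x: Vector, t: float) -> Vector:
        self.calls += 1
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(self.target, x)]


if __name__ == "__main__":
    for steps in (16, 4, 2):
        field = CountingField(target=[0.5, -0.25])
        sample = euler_sample(field, x0=[1.0, 2.0], num_steps=steps)
        print(steps, "steps ->", field.calls, "evaluations, sample", sample)
```

With this idealized field every step count reaches the target, which a learned, noisy vector field does not guarantee. That gap is what fine-tuning a few-step sampler has to close.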

Methodology

The core proposal is PeriodWave-Turbo, an ODE-based waveform generator. It modifies a pre-trained CFM model in two ways: the sampler is fixed to a small number of ODE steps, and the resulting fixed-step generator is fine-tuned with adversarial flow matching optimization. According to the paper, this accelerates waveform generation and improves fidelity. The key contributions can be summarized as:

Development of PeriodWave-Turbo, achieving state-of-the-art performance in waveform generation.
Acceleration of CFM-based models through adversarial flow matching optimization.
Demonstration of superior performance on two-stage TTS pipelines compared to other GAN-based models and pre-trained CFM generators.
Effectiveness shown over various model sizes, with improved performance achieved by scaling up model parameters.
Efficient fine-tuning requiring only 1,000 steps to achieve superior results.
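One way to write down the fixed-step generator and its fine-tuning objective is sketched below. The notation is an assumption made for illustration and is not taken from the paper: an Euler solver with N steps, a Gaussian noise prior, a conditioning input c (for example a Mel-spectrogram), and a generic weight \lambda_{\mathrm{rec}} whose value is not specified here.

```latex
% Fixed-step generator: unroll N Euler steps of the pre-trained vector field v_\theta
\begin{aligned}
x_{t_0} &\sim \mathcal{N}(0, I), \qquad t_k = \tfrac{k}{N}, \\
x_{t_{k+1}} &= x_{t_k} + \tfrac{1}{N}\, v_\theta\!\left(x_{t_k}, t_k, c\right), \qquad k = 0, \dots, N-1, \\
G_\theta(x_{t_0}, c) &= x_{t_N}.
\end{aligned}

% Fine-tuning objective: adversarial feedback plus a reconstruction term
% against the ground-truth waveform x (generic form, placeholder weight)
\mathcal{L}_G = \mathcal{L}_{\mathrm{adv}}\!\left(G_\theta(x_{t_0}, c)\right)
  + \lambda_{\mathrm{rec}}\, \mathcal{L}_{\mathrm{rec}}\!\left(G_\theta(x_{t_0}, c),\, x\right)
```

Read this way, the whole N-step sampler is treated as a single generator and trained like a GAN generator, which is why the number of steps can be small and fixed in advance.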

Experimentation and Results

Dataset and Training

The models were trained and evaluated on the LJSpeech and LibriTTS datasets, two widely used benchmarks for waveform generation. Pre-training ran for 1M steps with the AdamW optimizer; fine-tuning with adversarial flow matching optimization needed far fewer steps (the abstract reports 1,000).

Objective Evaluation

Performance was evaluated with M-STFT, PESQ, V/UV accuracy, pitch error, and UTMOS. The reported results show PeriodWave-Turbo consistently outperforming existing models such as HiFi-GAN, BigVGAN, and PriorGrad. Specifically, PeriodWave-Turbo-B achieved a PESQ score of 4.422, and PeriodWave-Turbo-L achieved a PESQ of 4.454 on LibriTTS, which the authors describe as unprecedented.

Subjective Evaluation and Inference Speed

In subjective evaluation, PeriodWave-Turbo models obtained superior MOS scores while maintaining high fidelity and robustness in out-of-distribution (OOD) scenarios. Inference was also significantly faster, with four-step Euler sampling improving both quality and efficiency. PeriodWave-Turbo-B and PeriodWave-Turbo-L both improved in xRT (synthesis speed relative to real time), which makes them suitable for real-time applications.
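As a small illustration of the speed figures, the sketch below assumes that xRT means seconds of audio produced per second of compute. The function names are hypothetical and the numbers in the usage lines are placeholders, not results from the paper.

```python
def times_real_time(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio generated per second of compute (assumed meaning of xRT)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def evaluation_ratio(steps_before: int, steps_after: int) -> float:
    """How many times fewer vector-field evaluations the shorter sampler needs."""
    if steps_before < 1 or steps_after < 1:
        raise ValueError("step counts must be >= 1")
    return steps_before / steps_after


if __name__ == "__main__":
    # Placeholder timing: 10 s of audio synthesized in 0.5 s of compute.
    print(times_real_time(10.0, 0.5))
    # Step counts from the abstract: 16 steps reduced to 4 or 2.
    print(evaluation_ratio(16, 4), evaluation_ratio(16, 2))
```

The evaluation ratio is only a proxy for wall-clock speed-up, since each step may have different overhead and the fixed costs do not shrink with the step count.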

Analysis and Ablation Study

An extensive ablation study confirmed the importance of the multi-scale Mel-spectrogram loss and adversarial feedback in optimizing the model. The study compared different reconstruction losses and distillation methods and concluded that multi-scale Mel-spectrogram reconstruction loss combined with adversarial feedback gives the most robust, highest-quality waveform generation.
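The following NumPy sketch shows what a multi-scale Mel-spectrogram reconstruction loss can look like. It is a minimal illustration, not the paper's code: the three (FFT size, hop, Mel bins) scales, the Hann window, the L1 distance on log-Mel magnitudes, and all function names are assumptions chosen for brevity.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel(wave, sr, n_fft, hop, n_mels):
    """Log-Mel magnitude spectrogram, shape (frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.clip(mel, 1e-5, None))


def multi_scale_mel_loss(pred, target, sr=22050,
                         scales=((256, 64, 20), (512, 128, 40), (1024, 256, 80))):
    """Mean L1 distance between log-Mel spectrograms, averaged over scales.

    Each scale is (n_fft, hop, n_mels); the values here are illustrative.
    """
    losses = [np.mean(np.abs(log_mel(pred, sr, *s) - log_mel(target, sr, *s)))
              for s in scales]
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(4096)
    noisy = clean + 0.1 * rng.standard_normal(4096)
    print(multi_scale_mel_loss(clean, clean), multi_scale_mel_loss(noisy, clean))
```

Short windows resolve fine temporal detail and long windows resolve fine frequency detail, so averaging over several scales penalizes errors that a single spectrogram resolution could miss.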

Theoretical and Practical Implications

The research offers both theoretical advancements and practical contributions to the field of waveform generation:

Theoretical Implications: The integration of adversarial feedback with flow matching optimization presents a novel approach to enhancing generative models. This method can be extended to other domains requiring high-fidelity signal generation.
Practical Implications: The significant reduction in training steps and improved inference speed make PeriodWave-Turbo a practical solution for real-time applications, such as speech synthesis in TTS systems and other multimedia applications.

Future Work

Future research could further reduce inference cost by integrating down-sampling methods, and could adapt the proposed model to end-to-end TTS and text-to-audio generation.

In conclusion, the paper shows that combining adversarial feedback with flow matching optimization turns a pre-trained CFM model into a waveform generator that is both faster and higher in fidelity, as reflected in the reported objective metrics and the reduction from 16 sampling steps to 2 or 4.