- The paper introduces Avocodo, a GAN-based vocoder that suppresses aliasing and imaging artifacts in speech synthesis through two new discriminator designs.
- It employs a collaborative multi-band discriminator and a sub-band discriminator with PQMF to enhance spectral accuracy and mitigate artifacts.
- Experimental results demonstrate superior MOS, reduced F0 RMSE, and efficient inference compared to state-of-the-art models, highlighting its practical impact.
Avocodo: Generative Adversarial Network for Artifact-Free Vocoder
The paper presents Avocodo, a Generative Adversarial Network (GAN) based vocoder designed to synthesize high-fidelity speech without the artifacts that plague its predecessors. GAN-based vocoders are favored for fast inference and high-quality output, but they suffer from two related distortions: imaging artifacts produced by transposed-convolution upsampling, and aliasing introduced by the downsampling used in conventional multi-scale discriminators, which also biases training feedback toward low-frequency bands. The authors identify these causes through preliminary experiments and propose Avocodo, which counters them with two new discriminative structures.
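To make the imaging artifact concrete, here is a toy NumPy demonstration: zero-insertion upsampling, the operation inside a stride-2 transposed convolution before its learned filtering is applied, turns a pure tone into the tone plus a mirrored spectral image that the network's filters must then suppress. The sample rate and tone frequency are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

sr = 8000                            # original sample rate (illustrative)
t = np.arange(sr) / sr               # 1 second of audio
x = np.sin(2 * np.pi * 400 * t)      # a 400 Hz tone

# Upsample x2 by zero insertion -- what a stride-2 transposed convolution
# does before its learned filtering is applied.
up = np.zeros(2 * len(x))
up[::2] = x

spec = np.abs(np.fft.rfft(up))
freqs = np.fft.rfftfreq(len(up), d=1 / (2 * sr))
top2 = sorted(freqs[np.argsort(spec)[-2:]])
print(top2)  # [400.0, 7600.0] -- a mirrored image appears at 8000 - 400 Hz
```

If the learned upsampling filter fails to remove that 7600 Hz image, it is audible as a metallic artifact; this is the failure mode Avocodo's discriminators are built to penalize.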
Key Innovations
The authors introduce two main innovations in Avocodo:
- Collaborative Multi-Band Discriminator (CoMBD):
- This discriminator combines multi-scale and hierarchical evaluation in one collaborative structure: the same sub-discriminators score both the full-resolution waveform and the generator's intermediate (lower-resolution) outputs, helping the generator manage spectral content across resolutions and suppress upsampling artifacts.
- The multi-scale path emphasizes the low-frequency bands that dominate perceived speech quality, while the hierarchical path uses intermediate outputs to teach the generator to balance sample-rate expansion against filtering, attacking the artifacts at their source (see the sketch after this list).
- Sub-Band Discriminator (SBD):
- SBD operates on sub-band signals obtained by decomposing the waveform with a pseudo quadrature mirror filter bank (PQMF). Its sub-discriminators inspect features along both the time and frequency axes, giving the generator feedback over the full speech spectrum rather than the low frequencies alone.
- Because PQMF analysis is nearly alias-free, it avoids the aliasing introduced by plain downsampling, which the authors' preliminary experiments identify as a cause of harmonic distortion and fundamental frequency (F0) errors (a PQMF sketch also follows this list).
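The collaborative idea behind CoMBD can be sketched compactly. Below is a minimal PyTorch illustration: one sub-discriminator per resolution, with its weights shared between the generator's intermediate output at that resolution and the final output downsampled to it. The layer sizes are illustrative, and average pooling stands in for the downsampling step; the paper itself uses PQMF-based downsampling precisely to avoid pooling-induced aliasing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubDiscriminator(nn.Module):
    """Simple 1-D conv stack scoring a waveform at one resolution."""
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(1, 16, 15, stride=2, padding=7),
            nn.Conv1d(16, 32, 15, stride=2, padding=7),
            nn.Conv1d(32, 1, 3, padding=1),
        ])

    def forward(self, x):
        feats = []
        for conv in self.convs[:-1]:
            x = F.leaky_relu(conv(x), 0.1)
            feats.append(x)          # feature maps for feature-matching loss
        return self.convs[-1](x), feats

class CoMBD(nn.Module):
    def __init__(self, factors=(1, 2, 4)):
        super().__init__()
        self.factors = factors       # downsampling factor per resolution
        self.discs = nn.ModuleList(SubDiscriminator() for _ in factors)

    def forward(self, full_res, intermediates):
        # intermediates[i]: generator output at 1/factors[i] resolution
        # (for factor 1 it is the final output itself).
        scores = []
        for factor, disc, inter in zip(self.factors, self.discs, intermediates):
            # Multi-scale path: final waveform downsampled (avg-pool is a
            # placeholder for the alias-free PQMF analysis in the paper).
            down = F.avg_pool1d(full_res, factor) if factor > 1 else full_res
            scores.append(disc(down))   # shared weights ...
            scores.append(disc(inter))  # ... score both paths collaboratively
        return scores
```

The key design choice is the weight sharing: because the same discriminator judges both the downsampled final output and the intermediate output, the generator's early upsampling stages receive a direct training signal rather than only an end-to-end one.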
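The PQMF decomposition used by SBD is standard signal processing: a prototype lowpass filter is cosine-modulated into K analysis filters, and each filtered signal is critically downsampled by K. The sketch below implements this in NumPy/SciPy; the tap count, cutoff, and Kaiser beta follow defaults common in open-source vocoder implementations and are assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import firwin

def pqmf_analysis(x, num_bands=4, taps=62, cutoff=0.15, beta=9.0):
    """Split a mono signal x into (num_bands, len(x)//num_bands) sub-bands."""
    # Prototype lowpass filter (Kaiser-windowed linear-phase FIR).
    proto = firwin(taps + 1, cutoff, window=("kaiser", beta))
    n = np.arange(taps + 1)
    # Cosine-modulate the prototype into num_bands analysis filters.
    filters = np.stack([
        2 * proto * np.cos(
            (2 * k + 1) * np.pi / (2 * num_bands) * (n - taps / 2)
            + (-1) ** k * np.pi / 4
        )
        for k in range(num_bands)
    ])
    # Filter, then critically downsample by num_bands.
    subbands = np.stack([np.convolve(x, f, mode="same") for f in filters])
    return subbands[:, ::num_bands]

# Example: a 1 kHz tone at 16 kHz lands almost entirely in the lowest band.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
bands = pqmf_analysis(x)
print([float(np.abs(b).mean().round(3)) for b in bands])
```

Because each analysis filter's passband is matched to the decimation factor, adjacent-band aliasing largely cancels in a matched synthesis bank, which is why SBD can inspect critically sampled sub-bands without the aliasing that plain strided downsampling would introduce.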
Experimental Insights
The paper compares Avocodo with state-of-the-art models such as HiFi-GAN, MelGAN, and VocGAN on both single-speaker and unseen-speaker evaluations. Avocodo outperforms them subjectively and objectively. Notably:
- MOS Analysis: Avocodo achieves higher mean opinion scores (MOS) in both the single-speaker and unseen-speaker settings, indicating consistent synthesis quality and robustness to speaker variability.
- Artifact Mitigation: Objective metrics such as F0 RMSE and PESQ show that Avocodo synthesizes frequency components more accurately and with less noise, substantiating its artifact-suppression claims (a computation sketch for F0 RMSE follows this list).
- Efficiency: Because the additional discriminators are used only during training, Avocodo's inference speed and parameter count remain comparable to competing models, which matters for practical deployment.
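For readers reproducing the objective evaluation, here is a minimal sketch of computing F0 RMSE with librosa's pyin extractor. The paper does not specify its F0 toolchain, so the extractor, pitch range, and voiced-frame handling below are assumptions; some evaluations also compute the error on log-F0 or in cents rather than in Hz.

```python
import numpy as np
import librosa

def f0_rmse(ref, syn, sr=22050):
    """RMSE (in Hz) between F0 tracks of reference and synthesized audio."""
    kwargs = dict(fmin=librosa.note_to_hz("C2"),   # ~65 Hz, assumed range
                  fmax=librosa.note_to_hz("C6"),   # ~1047 Hz
                  sr=sr)
    f0_ref, _, _ = librosa.pyin(ref, **kwargs)     # NaN on unvoiced frames
    f0_syn, _, _ = librosa.pyin(syn, **kwargs)
    voiced = ~np.isnan(f0_ref) & ~np.isnan(f0_syn)  # frames voiced in both
    return np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2))
```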
Implications and Future Directions
Avocodo is a meaningful step toward artifact-free speech synthesis with GAN-based vocoders. By pairing purpose-built discriminator architectures with the PQMF filter bank, it targets the longstanding aliasing and imaging problems directly rather than treating them as unavoidable byproducts of upsampling.
The broader impact of this research lies in its potential applications across diverse speech synthesis domains, such as virtual assistants, communication aids, and language translation systems. As AI technology advances, the principles and methodologies developed in Avocodo could inspire further innovations, enhancing real-time processing capabilities and adaptive learning in neural vocoders.
Future research could build on Avocodo's framework by exploring alternative GAN architectures and refining discriminator designs for robustness across diverse acoustic conditions. Integrating these methods with emerging neural network paradigms may further improve the efficiency and quality of speech technology.