Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2206.13404v3)

Published 27 Jun 2022 in eess.AS, cs.AI, and cs.SD

Abstract: Neural vocoders based on the generative adversarial neural network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we discovered that the multi-scale analysis which focuses on the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the synthesized speech waveform quality. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based vocoders and propose a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate speech waveforms in various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band speech waveforms while avoiding aliasing. According to experimental results, Avocodo outperforms baseline GAN-based vocoders, both objectively and subjectively, while reproducing speech with fewer artifacts.

Authors (6)
  1. Taejun Bak
  2. Junmo Lee
  3. Hanbin Bae
  4. Jinhyeok Yang
  5. Jae-Sung Bae
  6. Young-Sun Joo
Citations (25)

Summary

  • The paper introduces Avocodo, a GAN-based vocoder that suppresses aliasing and imaging artifacts in speech synthesis through new discriminator designs.
  • It employs a collaborative multi-band discriminator and a sub-band discriminator with PQMF to enhance spectral accuracy and mitigate artifacts.
  • Experimental results demonstrate superior MOS, reduced F0 RMSE, and efficient inference compared to state-of-the-art models, highlighting its practical impact.

Avocodo: Generative Adversarial Network for Artifact-Free Vocoder

The paper presents Avocodo, a vocoder built on generative adversarial networks (GANs) that synthesizes high-fidelity speech with far fewer artifacts. GAN-based vocoders are favored for their fast inference and high-quality outputs, but the multi-scale analysis they rely on, which concentrates on low-frequency bands, introduces aliasing and imaging distortions. The authors identify these issues in preliminary experiments and address them in Avocodo with two purpose-built discriminators.
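
To make the imaging problem concrete, here is a minimal, self-contained demonstration (not code from the paper) of how naive upsampling by zero-insertion mirrors a tone into a spurious high-frequency image, the kind of artifact Avocodo's discriminators are designed to penalize:

```python
# Toy demonstration of an imaging artifact from naive upsampling
# (illustrative only; not the paper's code).
import numpy as np

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone, one second at 8 kHz

# 2x upsampling by zero-insertion with no interpolation filter:
# the original spectrum repeats, creating a mirror image.
up = np.zeros(2 * len(x))
up[::2] = x

spectrum = np.abs(np.fft.rfft(up))
freqs = np.fft.rfftfreq(len(up), d=1 / (2 * sr))
peaks = sorted(freqs[np.argsort(spectrum)[-2:]])
print(peaks)  # [1000.0, 7000.0] -- the 7 kHz peak is the imaging artifact
```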

Key Innovations

The authors introduce two main innovations in Avocodo:

  1. Collaborative Multi-Band Discriminator (CoMBD):
    • This discriminator combines multi-scale and hierarchical evaluation in a single collaborative structure: it assesses both the full-resolution waveform and the generator's intermediate outputs, guiding the generator to capture spectral features at multiple resolutions and to suppress upsampling artifacts.
    • The multi-scale evaluations concentrate on the low-frequency bands that matter most for perceptual speech quality, while the hierarchical structure uses the intermediate outputs to help the generator balance waveform expansion against filtering, the trade-off at the root of the artifacts.
  2. Sub-Band Discriminator (SBD):
    • SBD operates on multiple sub-band signals obtained by decomposing the waveform with a pseudo quadrature mirror filter bank (PQMF). Evaluating these sub-bands along both the time and frequency axes lets the generator attend to the full speech spectrum rather than the low bands alone.
    • Using PQMF avoids aliasing during downsampling, which had been observed to cause harmonic distortion and inaccurate fundamental frequency (F0) reproduction; a minimal PQMF sketch follows this list.
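
To make the sub-band decomposition concrete, here is a minimal PQMF analysis sketch. It assumes a Kaiser-windowed prototype lowpass designed with scipy.signal.firwin and the standard cosine-modulation formula; the band count, tap length, cutoff, and beta are illustrative defaults, not the paper's configuration.

```python
# Minimal PQMF analysis sketch (assumptions: 4 bands, Kaiser-windowed
# prototype; the paper's exact filter design may differ).
import numpy as np
from scipy.signal import firwin

def pqmf_analysis(x, n_bands=4, taps=62, cutoff=0.15, beta=9.0):
    """Split waveform x into n_bands critically sampled sub-band signals."""
    # Prototype lowpass filter (Kaiser window design).
    h = firwin(taps + 1, cutoff, window=("kaiser", beta))
    n = np.arange(taps + 1)
    subbands = []
    for k in range(n_bands):
        # Cosine-modulate the prototype to center it on band k.
        phase = (-1) ** k * np.pi / 4
        hk = 2 * h * np.cos((2 * k + 1) * np.pi / (2 * n_bands)
                            * (n - taps / 2) + phase)
        # Filter, then decimate by the band count (critical sampling).
        subbands.append(np.convolve(x, hk, mode="same")[::n_bands])
    return np.stack(subbands)  # shape: (n_bands, len(x) // n_bands)

x = np.random.randn(16000).astype(np.float32)  # stand-in for a speech waveform
print(pqmf_analysis(x).shape)  # (4, 4000)
```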

Experimental Insights

The paper provides a thorough comparison between Avocodo and state-of-the-art baselines such as HiFi-GAN, MelGAN, and VocGAN on both single-speaker and unseen-speaker datasets. Avocodo outperforms them both subjectively and objectively. Notably:

  • MOS Analysis: Avocodo achieves higher mean opinion scores (MOS) in both single-speaker and unseen-speaker settings, demonstrating consistent synthesis quality and robustness to speaker variability.
  • Artifact Mitigation: Objective metrics such as F0 RMSE and PESQ show that Avocodo reproduces frequency components accurately and with less noise, substantiating its artifact-suppression capabilities (a sketch of the F0 RMSE computation follows this list).
  • Efficiency: Despite the added discriminators during training, Avocodo keeps inference speed and parameter count comparable to other models, an advantage for practical deployment.
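
For reference, the sketch below shows one common way to compute F0 RMSE, evaluated only on frames voiced in both tracks. The function name, the zero-means-unvoiced convention, and the Hz units are assumptions for illustration; the paper's exact evaluation settings are not reproduced here.

```python
# Hedged sketch of an F0 RMSE metric on precomputed F0 contours.
import numpy as np

def f0_rmse_hz(f0_ref, f0_syn):
    """RMSE between two F0 tracks, computed only on frames voiced in both."""
    f0_ref, f0_syn = np.asarray(f0_ref), np.asarray(f0_syn)
    voiced = (f0_ref > 0) & (f0_syn > 0)  # 0 marks an unvoiced frame
    diff = f0_ref[voiced] - f0_syn[voiced]
    return float(np.sqrt(np.mean(diff ** 2)))

ref = np.array([0.0, 120.0, 122.0, 0.0, 119.0])
syn = np.array([0.0, 121.0, 125.0, 118.0, 117.0])
print(f0_rmse_hz(ref, syn))  # RMSE over the three co-voiced frames, ~2.16 Hz
```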

Implications and Future Directions

Avocodo's development marks a significant step toward artifact-free speech synthesis with GAN-based vocoders. By combining purpose-built discriminator architectures with filter banks such as PQMF, the model tackles the longstanding problems of aliasing and imaging distortion directly.

The broader impact of this research lies in its potential applications across diverse speech synthesis domains, such as virtual assistants, communication aids, and language translation systems. As AI technology advances, the principles and methodologies developed in Avocodo could inspire further innovations, enhancing real-time processing capabilities and adaptive learning in neural vocoders.

Future research could build on Avocodo's framework by exploring alternative GAN architectures and further refining discriminator designs for robustness across diverse acoustic conditions. Integrating these methods with emerging neural network paradigms may unlock further gains in efficiency and quality in speech technology.