
EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (2402.00892v1)

Published 31 Jan 2024 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns. Despite these advancements, the exploration of scaling, especially in the audio generation domain, remains limited: previous efforts did not extend into the high-fidelity (HiFi) 44.1kHz domain, suffered from both spectral discontinuities and blurriness in the high-frequency domain, and lacked robustness against out-of-domain data. These limitations restrict the applicability of models to diverse use cases, including music and singing generation. Our work introduces Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state-of-the-art in spectral and high-frequency reconstruction and in robustness on out-of-domain data, enabling the generation of HiFi audio. EVA-GAN employs an extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, and a Human-In-The-Loop artifact measurement toolkit, and expands the model to approximately 200 million parameters. Demonstrations of our work are available at https://double-blind-eva-gan.cc.


Summary

  • The paper introduces EVA-GAN, a novel scalable GAN architecture that significantly advances audio generation using an extensive 36,000-hour high-fidelity dataset.
  • The study employs a context-aware module and innovative training protocols, including loss balancing and longer context windows, to address spectral discontinuities and high-frequency blurriness.
  • Evaluations on benchmarks like LibriTTS demonstrate EVA-GAN's superior performance with marked improvements in metrics such as M-STFT and PESQ.

Overview of EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks

This paper presents EVA-GAN, a scalable architecture utilizing Generative Adversarial Networks (GANs) to enhance audio generation. The primary advances target high-fidelity (HiFi) audio and address prevalent challenges in existing GAN-based vocoders, such as spectral discontinuities and high-frequency blurriness. The authors employ a substantial dataset of 36,000 hours of 44.1kHz audio and scale the model to approximately 200 million parameters, providing robustness against out-of-domain data and elevating audio synthesis quality.
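To make the "context with minimal computational expense" idea concrete, here is one plausible shape for such a module: a dilated residual convolution stack that widens the receptive field over the time axis at small parameter cost. This is purely an illustration; the paper's actual Context Aware Module design may differ.

```python
# Illustrative sketch (not the paper's exact architecture): a lightweight
# context module built from dilated residual 1-D convolutions. Increasing
# dilation widens the receptive field without adding many parameters.
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    def __init__(self, channels=64, dilations=(1, 3, 9)):
        super().__init__()
        # padding = dilation keeps the time dimension unchanged for kernel_size=3
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        )
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        for conv in self.convs:
            x = x + self.act(conv(x))  # residual: adds context, preserves shape
        return x

cam = ContextModule()
frames = torch.randn(1, 64, 256)  # (batch, channels, time)
out = cam(frames)                 # same shape as the input
```

Because each block is residual and shape-preserving, a module like this can be dropped into an existing generator between upsampling stages without changing the rest of the pipeline.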

Key Contributions

  1. Data Scaling: The research introduces the largest known compilation of high-fidelity audio data, consisting of both HiFi music and diverse broadcast audio sources, allowing for robust training across varied audio domains.
  2. Model Architecture: EVA-GAN integrates a Context Aware Module (CAM), enhancing the model's capacity and efficiency in capturing context with minimal computational expense.
  3. Training Innovations: The authors propose a training protocol that combines longer context windows, loss balancing techniques, and gradient checkpointing to stabilize training and manage the large model size effectively.
  4. Evaluation: The researchers devised a Human-In-The-Loop artifact measurement framework, incorporating a SMOS evaluation toolkit, aligning automated assessments with human perceptual standards.
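Two of the training techniques above, gradient checkpointing and loss balancing, can be sketched in a few lines of PyTorch. The layer sizes and loss weights below are illustrative, not taken from the paper.

```python
# Hedged sketch: gradient checkpointing trades compute for memory by
# recomputing activations during the backward pass, which is what makes
# very large generators trainable. Weights and shapes are illustrative.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedGenerator(nn.Module):
    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=7, padding=3),
                nn.LeakyReLU(0.1),
            )
            for _ in range(n_blocks)
        )

    def forward(self, x):
        for block in self.blocks:
            # Do not store this block's activations; recompute them on backward.
            x = checkpoint(block, x, use_reentrant=False)
        return x

gen = CheckpointedGenerator()
x = torch.randn(2, 64, 1024, requires_grad=True)
y = gen(x)

# Simple static loss balancing: weight each term so no single loss dominates.
# The 45:1 ratio here is a common vocoder convention, not the paper's value.
mel_loss = y.abs().mean()
adv_loss = (1 - y).pow(2).mean()
total = 45.0 * mel_loss + 1.0 * adv_loss
total.backward()
```

More sophisticated balancing schemes rescale the weights dynamically from gradient norms, but the static form above already conveys the mechanism.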

Numerical and Performance Highlights

The numerical results indicate EVA-GAN's superiority on benchmarks such as LibriTTS and DSD-100 across objective and subjective evaluations. For instance, on the LibriTTS dataset, EVA-GAN demonstrated marked improvements in metrics such as M-STFT and PESQ over existing models like HiFi-GAN and BigVGAN. The extended dataset and improved training pipeline contributed significantly to these outcomes.
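For readers unfamiliar with M-STFT, the metric compares magnitude spectrograms of generated and reference audio at several STFT resolutions. The sketch below uses a generic formulation (spectral convergence plus log-magnitude error); the resolutions shown are common choices and not necessarily those used in the paper's evaluation.

```python
# Hedged sketch of a multi-resolution STFT (M-STFT) distance. Lower is better;
# identical signals score zero. Resolutions are illustrative defaults.
import torch

def stft_mag(x, n_fft, hop):
    window = torch.hann_window(n_fft)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs()

def multi_res_stft_distance(pred, target,
                            resolutions=((512, 128), (1024, 256), (2048, 512))):
    total = 0.0
    for n_fft, hop in resolutions:
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        # Spectral convergence: relative Frobenius error of the magnitudes.
        sc = torch.norm(t - p) / torch.norm(t).clamp(min=1e-8)
        # Log-magnitude error: emphasizes low-energy (e.g. high-frequency) bins.
        mag = (torch.log(t.clamp(min=1e-7)) - torch.log(p.clamp(min=1e-7))).abs().mean()
        total = total + sc + mag
    return total / len(resolutions)

ref = torch.randn(1, 22050)           # one second at 22.05 kHz
zero_dist = multi_res_stft_distance(ref, ref)
noisy_dist = multi_res_stft_distance(torch.randn(1, 22050), ref)
```

Using several window sizes is what gives the metric sensitivity to both the spectral discontinuities and the high-frequency blurriness the paper targets: short windows catch transient artifacts, long windows catch fine frequency structure.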

Implications and Future Directions

EVA-GAN sets a new standard in high-fidelity audio generation, expanding the potential applications in fields such as music synthesis, virtual reality, and entertainment. This research elucidates the importance of data diversity and model scalability in neural vocoders.

The implications extend beyond the immediate results; the design choices and methodologies provide a blueprint for future GAN-based vocoders. The insights gained from this model can likely contribute to improvements in adjacent fields, including speech synthesis, voice conversion, and even broader AI challenges involving creative content generation.

Continued exploration could involve refining the discriminator's design, further optimizing computational efficiency, and developing more nuanced evaluation metrics that align even more closely with human auditory perception. The robust framework EVA-GAN introduces appears poised to influence audio generation methodologies profoundly, pushing the limits of machine-generated sound closer to indistinguishable human-level quality.