- The paper introduces WaveGAN, a GAN framework for unsupervised synthesis of one-second raw audio waveforms, demonstrating its ability to produce intelligible speech and diverse sounds.
- It adapts a DCGAN architecture with one-dimensional filters and a phase shuffle operation to effectively process temporal audio features without engineered representations.
- Comparative evaluations against SpecGAN, a spectrogram-domain baseline, show that WaveGAN generates coherent and varied audio, and its feed-forward generation is far faster than autoregressive models, indicating strong potential for real-time applications.
Adversarial Audio Synthesis: An Overview
The paper "Adversarial Audio Synthesis" introduces WaveGAN, a Generative Adversarial Network (GAN) framework designed for unsupervised synthesis of raw waveform audio. This work represents a novel application of GANs, previously successful in image generation, to the domain of audio synthesis. By producing globally coherent audio slices, WaveGAN finds potential in sound effect generation and other audio synthesis applications.
Core Contributions
WaveGAN generates roughly one-second slices of raw audio (16,384 samples at 16 kHz). Through experiments, the authors demonstrate its ability to produce intelligible speech and to synthesize sounds from diverse domains such as drums, bird vocalizations, and piano. Trained without labels, the model must learn audio structure across widely varying timescales, from the individual periods of a waveform up to word-level patterns.
The authors compare WaveGAN with SpecGAN, a model using GANs on spectrogram representations of audio. Both approaches are evaluated for their efficacy in producing coherent audio samples. The findings indicate that WaveGAN, by directly operating on raw waveforms, holds substantial promise in audio synthesis without the need for engineered feature representations.
Methodology and Architecture
WaveGAN is fundamentally an adaptation of the DCGAN architecture to audio data. The generator replaces DCGAN's two-dimensional 5x5 filters with length-25 one-dimensional filters and uses a larger stride, so each layer upsamples along time rather than across two spatial axes. A critical innovation is the phase shuffle operation in the discriminator, which randomly perturbs the phase of each layer's activations; this prevents the discriminator from winning trivially by detecting the periodic artifacts that transposed convolutions leave in generated waveforms, and encourages phase-invariant features.
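For concreteness, here is a minimal sketch of these two ideas in PyTorch (an assumption on my part; the paper's reference implementation is in TensorFlow, and the channel counts below are illustrative rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One generator upsampling stage: DCGAN's 5x5 2D transposed convolutions
# become length-25 1D ones with stride 4, so each stage quadruples the
# sequence length along the time axis.
upconv = nn.ConvTranspose1d(in_channels=128, out_channels=64,
                            kernel_size=25, stride=4, padding=11,
                            output_padding=1)

def phase_shuffle(x: torch.Tensor, n: int = 2) -> torch.Tensor:
    """Shift activations along time by a random integer in [-n, n],
    filling the exposed edge by reflection, per WaveGAN's description
    of phase shuffle. x has shape (batch, channels, time)."""
    shift = int(torch.randint(-n, n + 1, (1,)))
    if shift == 0:
        return x
    if shift > 0:
        # pad on the left, then crop back to the original length
        return F.pad(x, (shift, 0), mode="reflect")[..., : x.shape[-1]]
    # negative shift: pad on the right, keep the trailing samples
    return F.pad(x, (0, -shift), mode="reflect")[..., -x.shape[-1]:]
```

In the discriminator, `phase_shuffle` would be applied to the activations of the convolutional layers (except the last), so that subsequent layers cannot rely on exact phase alignment of periodic artifacts.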
SpecGAN, by contrast, operates on spectrograms processed to be approximately invertible, allowing GANs designed for images to function within the audio domain: phase is discarded during analysis and later estimated, e.g. with the Griffin-Lim algorithm, when converting generated spectrograms back into waveforms. The paper examines the relative strengths of the waveform and spectrogram approaches.
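As a rough illustration of that pipeline (a sketch with assumed parameters; SpecGAN's exact STFT settings, normalization, and filename here are placeholders, not the paper's configuration), a log-magnitude spectrogram can be computed and then approximately inverted with Griffin-Lim:

```python
import numpy as np
import librosa

# Forward: log-magnitude spectrogram of a one-second, 16 kHz clip.
y, sr = librosa.load("clip.wav", sr=16000, duration=1.0)
mag = np.abs(librosa.stft(y, n_fft=256, hop_length=128))
log_mag = np.log(mag + 1e-6)  # a GAN would train on (normalized) log_mag

# Backward: phase was discarded, so invert approximately by
# estimating it with Griffin-Lim from the magnitude alone.
y_hat = librosa.griffinlim(np.exp(log_mag), n_iter=16,
                           n_fft=256, hop_length=128)
```

The inversion is only approximate, which is one source of the qualitative differences between spectrogram-based and waveform-based outputs discussed in the paper.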
Numerical Results
WaveGAN synthesizes spoken digits that human listeners can recognize, achieving a solid inception score while maintaining diversity in its outputs. Because generation is a single feed-forward pass, it is far faster than autoregressive models like WaveNet, which favors real-time applications. Human evaluation confirms the intelligibility of the synthesized audio, with only minor qualitative differences from the spectrogram-based SpecGAN.
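For reference, the inception score used in this evaluation is exp(E_x[KL(p(y|x) || p(y))]), computed with a digit classifier trained on the spoken-digit dataset rather than the original Inception network. A minimal sketch, assuming an array of class posteriors produced by such a classifier:

```python
import numpy as np

def inception_score(p_yx: np.ndarray) -> float:
    """Inception score from class posteriors p(y|x), shape (N, classes):
    exp( mean_x KL( p(y|x) || p(y) ) ). Higher is better."""
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal label distribution
    kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score rewards samples that are individually classified with confidence (low-entropy p(y|x)) while being diverse across classes (high-entropy marginal p(y)).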
Implications and Future Directions
This work marks a significant step toward effective unsupervised audio synthesis with neural networks. Though the generated audio improves on earlier unsupervised methods, the authors acknowledge the need for continued progress, particularly in output fidelity and model scalability.
The ability to generate coherent sound effects with GANs opens exciting possibilities in creative industries such as music production and film sound design. Future work could extend the model to longer audio sequences and introduce conditioning mechanisms that guide synthesis toward specific audio attributes or styles.
Overall, "Adversarial Audio Synthesis" sets a foundation for future research into GAN-based frameworks for a broader array of audio synthesis tasks, potentially transforming practical applications in audio generation and manipulation.