DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks (2008.12073v2)

Published 27 Aug 2020 in eess.AS and cs.SD

Abstract: Synthetic creation of drum sounds (e.g., in drum machines) is commonly performed using analog or digital synthesis, allowing a musician to sculpt the desired timbre modifying various parameters. Typically, such parameters control low-level features of the sound and often have no musical meaning or perceptual correspondence. With the rise of Deep Learning, data-driven processing of audio emerges as an alternative to traditional signal processing. This new paradigm allows controlling the synthesis process through learned high-level features or by conditioning a model on musically relevant information. In this paper, we apply a Generative Adversarial Network to the task of audio synthesis of drum sounds. By conditioning the model on perceptual features computed with a publicly available feature-extractor, intuitive control is gained over the generation process. The experiments are carried out on a large collection of kick, snare, and cymbal sounds. We show that, compared to a specific prior work based on a U-Net architecture, our approach considerably improves the quality of the generated drum samples, and that the conditional input indeed shapes the perceptual characteristics of the sounds. Also, we provide audio examples and release the code used in our experiments.

Authors (3)

J. Nistal
S. Lattner
G. Richard

Summary

Analysis of "DrumGAN: Synthesis of Drum Sounds with Timbral Feature Conditioning Using Generative Adversarial Networks"

This paper, by Javier Nistal, Stefan Lattner, and Gaël Richard, introduces DrumGAN, a Generative Adversarial Network (GAN) designed for drum sound synthesis. It uses deep learning to improve on traditional drum synthesis, in which the musician adjusts low-level parameters that often have no musical meaning or perceptual correspondence.

The paper employs a Progressive Growing Wasserstein GAN (PGAN) conditioned on high-level timbral features. These features, computed with publicly available models from the Audio Commons project, give musicians a more intuitive way to shape drum sounds. Conditioning on perceptual features such as brightness, roughness, and boominess offers a kind of control that the low-level parameters of traditional synthesis do not expose directly.
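To make the idea of feature conditioning concrete, here is a minimal sketch of how a conditional generator input could be assembled: a random latent vector concatenated with a vector of timbral feature values. This is an illustration under stated assumptions, not the authors' implementation. The function name `make_generator_input`, the three-feature list, the `[0, 1]` value range, and the latent size are all hypothetical choices made for the example.

```python
import random

# Hypothetical feature set: the three perceptual features named in the summary.
# The actual model may use a different or larger set.
FEATURE_NAMES = ("brightness", "roughness", "boominess")


def make_generator_input(features, latent_dim=8, rng=None):
    """Concatenate Gaussian noise with timbral feature values.

    `features` maps each name in FEATURE_NAMES to a value assumed to be
    normalized to [0, 1] (an assumption made for this sketch).
    """
    rng = rng or random.Random()
    cond = []
    for name in FEATURE_NAMES:
        if name not in features:
            raise KeyError(f"missing timbral feature: {name}")
        value = float(features[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
        cond.append(value)
    # The noise part provides variation; the conditioning part steers the timbre.
    noise = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return noise + cond


# Usage: same noise, different brightness, so only the conditioning tail changes.
dull = make_generator_input(
    {"brightness": 0.2, "roughness": 0.5, "boominess": 0.5},
    latent_dim=4, rng=random.Random(0))
bright = make_generator_input(
    {"brightness": 0.9, "roughness": 0.5, "boominess": 0.5},
    latent_dim=4, rng=random.Random(0))
```

In a setup like this, a musician would hold the noise fixed and vary one feature value to hear how that perceptual attribute changes the generated sound.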

Key Aspects and Experiments

Architecture and Model Design: The DrumGAN architecture uses a PGAN trained on approximately 300,000 drum samples covering kick, snare, and cymbal sounds. The generator is conditioned on timbral features so that it can produce sounds with the desired perceptual characteristics. A distinctive element of DrumGAN is an auxiliary feature regression task added to the discriminator, which encourages the generator to follow the conditioning inputs closely.
Comparison of Models: DrumGAN is compared against a U-Net baseline that is also conditioned on continuous timbral features. The GAN approach showed clear improvements in the fidelity and perceptual quality of the synthesized sounds. DrumGAN outperformed the U-Net on both Kernel Inception Distance (KID) and Fréchet Audio Distance (FAD), where lower values are better, indicating a closer match to the target distribution and higher-quality output.
Evaluation Metrics: The model was evaluated using the Inception Score (IS), KID, and FAD. DrumGAN's IS was comparable to that of real audio samples, suggesting that the synthesized sounds are clearly classifiable by drum type and that the conditioning information is handled robustly. In addition, feature coherence tests indicated that changing a conditioning feature changes the corresponding feature of the generated sound in the expected direction, so the controls remain interpretable during synthesis.
Qualitative Observations: DrumGAN not only generates high-quality individual samples but also covers a broader portion of the drum dataset's distribution than the U-Net. Subjectively, the conditional DrumGAN produced higher-quality sounds than its unconditioned versions, suggesting its applicability in real-world music production environments.
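As an illustration of one of the metrics mentioned above, the following sketch computes KID in its usual form: an unbiased estimate of the squared Maximum Mean Discrepancy between two sets of embeddings, using the cubic polynomial kernel k(x, y) = (x·y / d + 1)^3. It assumes the embeddings have already been extracted by some classifier; the toy two-dimensional vectors below are made up for the example and do not come from the paper.

```python
def poly_kernel(x, y):
    """Cubic polynomial kernel commonly used for KID: (x.y / d + 1) ** 3."""
    d = len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (dot / d + 1.0) ** 3


def kid(real, fake):
    """Unbiased squared-MMD estimate between two lists of embedding vectors.

    Lower values mean the two sets are closer in distribution. Because the
    estimator is unbiased, it can come out slightly negative for similar sets.
    """
    m, n = len(real), len(fake)
    if m < 2 or n < 2:
        raise ValueError("each set needs at least two embeddings")
    # Within-set terms exclude the diagonal (i == j) to keep the estimate unbiased.
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    # The cross term uses every real/fake pair.
    k_rf = sum(poly_kernel(x, y) for x in real for y in fake) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf


# Toy embeddings (hypothetical): one generated set close to the real set, one far away.
real_emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near_emb = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.9], [1.0, 1.1]]
far_emb = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]

kid_near = kid(real_emb, near_emb)
kid_far = kid(real_emb, far_emb)
```

In practice KID is typically averaged over several random subsets of much larger embedding sets; this sketch shows only the core estimator.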

Implications and Future Directions

DrumGAN's ability to generate high-fidelity drum sounds under perceptual control makes it a notable advance in GAN-based audio synthesis. By letting musicians generate sounds that match specific musical intentions, DrumGAN offers both a practical tool for audio production and evidence of how GANs can be applied in creative domains such as music synthesis.

The researchers propose future work to extend DrumGAN to support higher sample rates and stereo sound, in line with contemporary audio production standards. They also plan to develop implementations that integrate with existing Digital Audio Workstations (DAWs), positioning DrumGAN as a practical tool for music producers and engineers.

This paper contributes to a growing body of evidence that GAN-based methodologies can offer substantial improvements in the domain of sound synthesis, setting the stage for further explorations into other types of audio and creative signal generation tasks.
