Differentiable Digital Signal Processing: Enhancing Audio Synthesis
The paper "DDSP: Differentiable Digital Signal Processing" introduces the Differentiable Digital Signal Processing (DDSP) library, aiming to integrate classical signal processing techniques with deep learning methodologies. This approach is pertinent to audio synthesis, enabling efficient and high-quality generative modeling without the reliance on large models or complex loss functions.
Core Concept
Traditional audio generative models operate directly on time-domain waveforms or frequency-domain spectrograms, representations that are general but fail to leverage domain-specific knowledge about how sound is produced and perceived. DDSP addresses this by implementing signal processing elements such as oscillators, filters, and envelopes as differentiable modules inside neural networks, taking advantage of frameworks like TensorFlow for automatic differentiation.
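A minimal sketch (ordinary TensorFlow ops, not the DDSP library API) illustrates the key point: an oscillator built from differentiable tensor operations lets gradients of an audio loss flow back into its synthesis parameters.

```python
import math
import tensorflow as tf

sample_rate = 16000
t = tf.range(sample_rate, dtype=tf.float32) / sample_rate  # one second of time steps

freq = tf.Variable(440.0)  # learnable oscillator frequency (Hz)
amp = tf.Variable(0.5)     # learnable amplitude

# Target audio to match (another sinusoid, purely for illustration).
target = 0.8 * tf.sin(2.0 * math.pi * 330.0 * t)

with tf.GradientTape() as tape:
    audio = amp * tf.sin(2.0 * math.pi * freq * t)
    loss = tf.reduce_mean(tf.square(audio - target))

# Because the oscillator is differentiable, the synthesis parameters
# receive gradients directly from an audio-domain loss.
d_freq, d_amp = tape.gradient(loss, [freq, amp])
```

In practice the paper trains with a multi-scale spectral loss rather than a raw waveform error, which behaves far better for audio, but the differentiability argument is the same.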
Methodology
The DDSP framework combines interpretable signal processing modules with neural networks, giving direct control over specific aspects of audio synthesis. This integration enables independent pitch and loudness control, extrapolation to unseen pitches, room acoustic modeling, and timbre transformation, all of which are difficult to achieve with the end-to-end neural models traditionally used for synthesis.
Key Components
- Harmonic Oscillator / Additive Synthesizer: The core of the DDSP architecture, generating audio as a sum of sinusoids at integer multiples of a fundamental frequency. Its frequency and amplitude parameters can be manipulated independently to control different characteristics of the sound (see the first sketch after this list).
- Filtered Noise Synthesizer: Applies FIR filters, whose frequency responses are predicted by the network, to a stream of noise, modeling the stochastic (non-harmonic) components of audio (second sketch below).
- Reverb and Acoustic Factors: Room reverberation is applied by convolution, implemented efficiently as multiplication in the frequency domain. Modeling room acoustics explicitly and separately from the source audio improves interpretability and control (third sketch below).
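The additive synthesizer can be illustrated with a minimal NumPy sketch (an illustrative simplification, not the library's implementation): given a per-sample fundamental frequency and per-harmonic amplitude envelopes, the output is a sum of sinusoids at integer multiples of f0.

```python
import numpy as np

def harmonic_synth(f0, amplitudes, sample_rate=16000):
    """Additive synthesis: a sum of sinusoids at integer multiples of f0.

    f0:          fundamental frequency per sample, shape (n_samples,)
    amplitudes:  per-harmonic amplitude per sample, shape (n_samples, n_harmonics)
    """
    n_samples, n_harmonics = amplitudes.shape
    harmonic_numbers = np.arange(1, n_harmonics + 1)
    # Instantaneous frequency of each harmonic, integrated to phase.
    freqs = f0[:, None] * harmonic_numbers[None, :]            # (n_samples, K)
    phases = 2.0 * np.pi * np.cumsum(freqs / sample_rate, axis=0)
    # Silence any harmonic above the Nyquist frequency to avoid aliasing.
    amplitudes = np.where(freqs < sample_rate / 2.0, amplitudes, 0.0)
    return np.sum(amplitudes * np.sin(phases), axis=1)
```

In the paper, the network predicts the time-varying harmonic amplitudes (and receives or predicts f0), so every parameter above is learned end to end.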
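The filtered noise synthesizer can be sketched similarly. Here a single FIR filter is designed from a desired magnitude response by frequency sampling; this is a time-invariant simplification of the paper's frame-wise, time-varying filters.

```python
import numpy as np

def filtered_noise(magnitudes, n_samples, seed=0):
    """Shape white noise with an FIR filter built from a magnitude response.

    magnitudes: desired filter magnitude at linearly spaced frequencies,
                shape (n_freqs,).
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-1.0, 1.0, n_samples)
    # Frequency-sampling design: the inverse FFT of a real magnitude
    # response is a zero-phase impulse response; shift and window it
    # to obtain a compact linear-phase FIR filter.
    impulse = np.fft.fftshift(np.fft.irfft(magnitudes))
    impulse *= np.hanning(len(impulse))
    return np.convolve(noise, impulse, mode="same")
```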
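Finally, the reverb module reduces to a convolution with a room impulse response, computed efficiently in the frequency domain; in DDSP the impulse response itself can be a learned parameter. A sketch, again not the library API:

```python
import numpy as np

def convolve_reverb(dry, impulse_response):
    """Apply room reverb: multiply spectra, i.e. FFT-based convolution."""
    n = len(dry) + len(impulse_response) - 1
    wet = np.fft.irfft(np.fft.rfft(dry, n) * np.fft.rfft(impulse_response, n), n)
    return wet[: len(dry)]  # trim the reverb tail to the input length
```

Because the reverb is a separate, explicit module, it can simply be bypassed at synthesis time, which is what makes dereverberation and acoustic transfer possible.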
Experimental Validation
The DDSP models were evaluated on the NSynth dataset and on a collection of solo violin performances. Two variants were explored: supervised and unsupervised DDSP autoencoders. The supervised variant conditioned the decoder on fundamental frequency, extracted with a pretrained CREPE pitch model, and on a loudness feature computed directly from the audio, simplifying the synthesis task for the neural network. The results showed that DDSP models produce high-fidelity audio, outperforming state-of-the-art baselines on measures such as pitch accuracy while using substantially smaller models.
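As an illustration of the conditioning features, a frame-wise loudness signal can be computed directly from audio as an A-weighted log magnitude spectrum. The following NumPy sketch is a simplified version of such a feature (the pitch feature comes from the pretrained CREPE model and is omitted here):

```python
import numpy as np

def a_weighting_db(freqs):
    """Standard A-weighting curve, in dB, for frequencies in Hz."""
    f2 = np.maximum(freqs, 1e-6) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * np.log10(ra) + 2.0

def loudness(audio, sample_rate=16000, n_fft=2048, hop=512):
    """Frame-wise A-weighted log magnitude: a simplified loudness feature."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag_db = 20.0 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) + 1e-7)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sample_rate)
    return (mag_db + a_weighting_db(freqs)[None, :]).mean(axis=1)
```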
Implications and Future Directions
The DDSP framework represents a meaningful shift in audio synthesis, marrying traditional signal processing with modern deep learning to improve both efficiency and control over the synthesis process. Because audio is disentangled into interpretable factors, characteristics such as pitch and loudness can be manipulated independently, enabling tasks such as dereverberation and timbre transfer.
The potential of DDSP extends beyond audio synthesis: any generative task that can benefit from domain-specific structural knowledge in its architecture is a candidate. Future research could expand the library to additional signal processing elements and explore cross-domain applications, enriching both the theoretical understanding and the practical capabilities of AI-driven signal processing.
The paper's central contribution is to show that building structured inductive biases, in the form of signal processing components, into neural networks yields models that are both interpretable and as expressive as conventional deep networks. It invites further exploration and adaptation by researchers and practitioners across domains, and points to a promising trajectory for future work at the intersection of deep learning and digital signal processing.