- The paper introduces autoregressive discrete autoencoders to capture long-range dependencies in raw audio for music generation.
- Hierarchical modeling extends the receptive field to nearly 25 seconds of 16 kHz audio, enhancing global musical structure.
- The proposed argmax autoencoder (AMAE) trains more stably than traditional VQ-VAE methods on challenging datasets.
An Expert Analysis of "The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale"
The paper examines the complexities of generating realistic music from raw audio. It applies autoregressive models directly to raw waveforms, a domain traditionally dominated by symbolic approaches such as MIDI or music scores, which abstract away nuances, such as timbre and expressive timing, that are essential for realistic output. The discussion acknowledges that while autoregressive models excel at generating raw audio, especially speech, in their current form they struggle to capture the long-range structural dependencies inherent in music. This shortfall is attributed to the models spending their capacity on local signal structure at the expense of global coherence.
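Concretely, autoregressive models such as WaveNet factorise the joint distribution over waveform samples into a product of per-sample conditionals:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

At 16 kHz, a single second of audio already contributes 16,000 terms to this product, so musical structure spanning tens of seconds corresponds to dependencies over hundreds of thousands of steps, far beyond the receptive field of a standard WaveNet.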
Modelling Challenges and Approach
The research shifts toward autoregressive discrete autoencoders (ADAs) as a means of capturing long-range dependencies in raw audio waveforms. This approach enables unconditional generation of piano music directly from raw audio, with stylistic consistency sustained over tens of seconds. A primary contribution of the paper is the argmax autoencoder (AMAE), proposed as a more stable alternative to the vector quantised variational autoencoder (VQ-VAE), particularly under challenging dataset conditions.
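As a concrete illustration of the ADA idea, the sketch below shows the three moving parts: a downsampling encoder, a quantiser mapping each frame to a discrete code, and an autoregressive decoder that reconstructs audio conditioned on the codes. All names, sizes, and the GRU decoder are illustrative assumptions, not the authors' architecture, which uses WaveNet-style components; a second autoregressive model over the code sequence (not shown) then captures structure at the coarser time scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteAutoencoderSketch(nn.Module):
    """Minimal ADA sketch: encoder -> discrete codes -> AR decoder.

    Hyperparameters and module choices are illustrative; the paper
    uses WaveNet-style encoders/decoders rather than a GRU.
    """

    def __init__(self, num_codes=256, hidden=128, hop=64):
        super().__init__()
        self.hop = hop
        # Strided conv compresses T raw samples to T/hop frames.
        self.encoder = nn.Conv1d(1, hidden, kernel_size=hop, stride=hop)
        self.codebook = nn.Embedding(num_codes, hidden)
        # AR decoder sees the previous sample plus upsampled code vectors.
        self.decoder = nn.GRU(hidden + 1, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 256)  # logits over 8-bit bins

    def quantise(self, z):
        # Nearest-codebook assignment (VQ-VAE-style quantisation).
        dists = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        codes = dists.argmin(dim=-1)              # (B, T/hop) int codes
        zq = self.codebook(codes)
        # Straight-through estimator: gradients skip the argmin.
        return codes, z + (zq - z).detach()

    def forward(self, x):
        # x: (B, 1, T) raw audio in [-1, 1].
        z = self.encoder(x).transpose(1, 2)       # (B, T/hop, hidden)
        codes, zq = self.quantise(z)
        cond = zq.repeat_interleave(self.hop, dim=1)        # back to length T
        prev = F.pad(x, (1, 0))[..., :-1].transpose(1, 2)   # x_{t-1} inputs
        h, _ = self.decoder(torch.cat([prev, cond], dim=-1))
        return self.readout(h), codes             # per-sample logits, codes

model = DiscreteAutoencoderSketch()
logits, codes = model(torch.randn(2, 1, 1024))    # 2 clips of 1024 samples
print(logits.shape, codes.shape)                  # (2, 1024, 256), (2, 16)
```

Training would pair a cross-entropy reconstruction loss on the logits with the usual commitment terms, and a second autoregressive prior over `codes` then models the coarse time scale.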
Key Contributions and Findings
- Autoregressive Discrete Autoencoders: The paper presents ADAs as a viable route to long-range temporal coherence. The authors demonstrate that by structuring the learning task hierarchically (as in the sketch above), they can markedly expand the receptive fields of autoregressive models without prohibitive computational cost.
- Empirical Demonstration: The paper empirically supports its claims, demonstrating models whose receptive fields correspond to approximately 25 seconds of audio sampled at 16 kHz. This is achieved by hierarchically decomposing the signal into learnable discrete representations at progressively coarser time scales; a worked calculation follows this list.
- Introduction of AMAE: An autoencoder variant aimed at improving training stability over traditional VQ-VAEs is introduced. The AMAE is highlighted for converging reliably on datasets where VQ-VAE training can fail, making it a promising alternative to the conventional methods in this domain; a quantisation sketch also follows this list.
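To see where the roughly-25-second figure comes from, a back-of-the-envelope calculation helps. The stage ratios and top-level receptive field below are assumed round numbers chosen for illustration, not the paper's exact configuration:

```python
sample_rate = 16_000       # Hz, as in the paper
stage_ratios = [8, 8]      # per-stage downsampling factors (assumed)
top_level_rf = 6_000       # top model's receptive field in code steps (assumed)

total_ratio = 1
for r in stage_ratios:
    total_ratio *= r       # 64x overall compression of the code sequence

rf_samples = top_level_rf * total_ratio     # 384,000 raw audio samples
rf_seconds = rf_samples / sample_rate       # 24.0 seconds, i.e. ~25 s
print(f"effective receptive field: {rf_samples:,} samples = {rf_seconds:.1f} s")
```

The point is that each compression stage multiplies the effective receptive field of everything above it, which is how a fixed-size autoregressive model reaches tens of seconds.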
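The core AMAE idea can be sketched in a few lines: quantise by taking the argmax over K output channels, use a straight-through estimator so gradients can still flow, and add a penalty that keeps all codes in use. The diversity term below is a stand-in assumption for the paper's stabilising losses, not their exact formulation:

```python
import torch
import torch.nn.functional as F

def amae_quantise(z, diversity_weight=0.1):
    """Argmax quantisation sketch; z is (B, T, K) encoder output."""
    codes = z.argmax(dim=-1)                         # (B, T) discrete indices
    one_hot = F.one_hot(codes, z.size(-1)).float()   # hard one-hot codes
    soft = z.softmax(dim=-1)
    # Straight-through: forward pass uses one_hot, backward uses soft.
    quantised = soft + (one_hot - soft).detach()
    # KL(mean usage || uniform) keeps all K channels in play, guarding
    # against the code collapse that can destabilise VQ-VAE training.
    usage = soft.mean(dim=(0, 1))                    # (K,) average usage
    diversity = (usage * (usage * z.size(-1) + 1e-8).log()).sum()
    return codes, quantised, diversity_weight * diversity

codes, q, penalty = amae_quantise(torch.randn(2, 16, 32))
print(codes.shape, q.shape, penalty.item())          # (2, 16), (2, 16, 32), scalar
```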
Implications and Future Directions
The implications of this research extend beyond music generation. Raw audio modeling of the kind explored here applies to a broader spectrum of long-sequence tasks and could reshape approaches to audio synthesis across industries. The methodology encourages further exploration of hierarchical and autoregressive techniques for tasks with inherently complex time-dependent structure.
Future research could tackle the trade-off, highlighted by the authors, between modeling local signal detail and global musical structure. Beyond simply enlarging receptive fields, integrating high-level conditioning information, such as composer or genre, may give models a more directed handle on music generation. Such advances could pave the way for more refined and contextually aware generative models.
In conclusion, this paper marks a notable advance in generative modeling of music by directly tackling the challenges of modeling raw audio at scale. It presents methodologies that promise to improve both the quality and applicability of generated music, and it lays a foundation for future work on multi-instrumental music and more diverse genres using similar models.