- The paper introduces a novel system that employs compact invertible audio representations to dramatically improve inference speeds.
- The paper leverages a GAN framework to generate stylistically coherent, infinite-length waveforms while avoiding costly autoregressive methods.
- The paper demonstrates effective style conditioning and user control, validated through superior FAD scores in both piano and techno music domains.
An Overview of "Musika: Fast Infinite Waveform Music Generation"
Musika proposes an innovative approach to music generation, addressing key limitations of existing methods, such as high computational demands and slow inference speeds. Developed by Pasini and Schlüter at Johannes Kepler University, this system facilitates both unconditional and conditional audio generation of arbitrary length, functioning efficiently on consumer-grade hardware. The architecture leverages adversarial autoencoders and GANs to generate high-quality music that maintains stylistic coherence over extended periods.
Key Contributions of Musika
The paper outlines several novel contributions that distinguish Musika from its predecessors:
- Compact Invertible Representations: The system utilizes a raw audio autoencoder to encode samples into low-dimensional representations, substantially improving inference speed. This encoding focuses on spectrogram magnitudes and phases, addressing the challenges of high temporal resolution waveforms.
- Generative Adversarial Network (GAN) Application: Musika employs a GAN to model these encoded representations. This approach circumvents the computational inefficiencies of autoregressive models, enabling fast and parallelized generation.
- Latent Coordinate System: Introducing a latent coordinate system allows Musika to generate infinite-length audio while maintaining coherence in style, ensuring patches can be concatenated seamlessly along the temporal axis.
- Style Conditioning and User Control: The system supports both unconditional and conditional generation models, allowing users to influence outputs with various conditioning signals like note density and tempo. This user control is demonstrated in the generation of piano and techno music.
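The latent coordinate idea above can be illustrated with a toy sketch: a generator receives one shared style vector for the whole piece plus per-patch coordinates that advance continuously across patch boundaries, so consecutive patches concatenate seamlessly along the time axis. This is a minimal illustration of the concatenation mechanism, not the paper's architecture; the generator here is a stand-in function, and the dimensions (`LATENT_DIM`, `PATCH_LEN`) are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 32  # assumed latent channel count (illustrative, not the paper's)
PATCH_LEN = 16   # assumed latent timesteps per generated patch

def toy_generator(style: np.ndarray, coords: np.ndarray) -> np.ndarray:
    """Stand-in for a trained GAN generator: maps a shared style vector
    plus per-timestep latent coordinates to a (PATCH_LEN, LATENT_DIM) patch.
    A real generator would be a trained network; this is a deterministic toy."""
    # Each timestep's output varies smoothly with its coordinate,
    # while the style vector keeps all patches stylistically consistent.
    return np.tanh(style[None, :] + coords[:, None])

def generate_sequence(n_patches: int) -> np.ndarray:
    style = rng.normal(size=LATENT_DIM)  # one style vector for the whole piece
    patches = []
    for i in range(n_patches):
        # Coordinates advance continuously across patch boundaries, so the
        # step from one patch's last timestep to the next patch's first
        # timestep equals the step between timesteps within a patch.
        coords = np.linspace(i, i + 1, PATCH_LEN, endpoint=False)
        patches.append(toy_generator(style, coords))
    # Concatenating along the temporal axis yields an arbitrarily long,
    # boundary-free latent sequence, which an autoencoder decoder would
    # then invert back to a waveform.
    return np.concatenate(patches, axis=0)

seq = generate_sequence(4)  # shape: (4 * PATCH_LEN, LATENT_DIM)
```

Because coherence is carried by the shared style vector and the continuous coordinates rather than by autoregressive dependence on previous samples, every patch can be generated in parallel.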
Evaluation and Results
The efficacy of Musika is demonstrated through extensive testing. Notably, the paper reports generation speeds hundreds of times faster than real time on both GPU and CPU. Quantitative evaluations using the Fréchet Audio Distance (FAD) show superior quality compared to other state-of-the-art systems such as UNAGAN, particularly for piano music generation.
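For context, the FAD metric fits a multivariate Gaussian to embeddings of real and generated audio and measures the Fréchet distance between the two. A minimal sketch, assuming precomputed embedding matrices (the original FAD formulation uses VGGish embeddings; the function name here is illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_a, emb_b: (n_samples, dim) arrays of audio embeddings from a
    fixed pretrained embedder applied to real and generated audio.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # FAD = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 * sqrtm(cov_a @ cov_b))
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce tiny imaginary components.
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical distributions score near zero; lower FAD against a real-audio reference set indicates generated audio whose embedding statistics better match real recordings.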
Additionally, Musika showcases its adaptability across different music domains. By training on datasets such as MAESTRO and techno samples from platforms like Jamendo, the system generates stylistically coherent and musically complex compositions. The conditional models adeptly modulate compositions in response to user input, exemplifying the system's practical applications in interactive music creation.
Implications and Future Directions
Musika's contributions have significant implications for real-time music composition and performance technology. The ability to generate high-quality, stylistically consistent music at practical speeds opens avenues for applications in entertainment, virtual reality, and human-computer interaction. Furthermore, the system provides an accessible platform for future research into controllable music generation and AI creativity.
Potential areas for future exploration include adapting the system to a wider range of musical genres and refining the conditioning signals for finer-grained user control. Integrating richer models of musical structure and dynamics could further improve output quality and deepen AI-assisted music co-creation.
In summary, Musika represents a substantial step forward in the field of music generation, addressing fundamental challenges and paving the way for more interactive, real-time applications in AI-driven creativity.