MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment (1709.06298v2)

Published 19 Sep 2017 in eess.AS, cs.AI, cs.LG, cs.SD, and stat.ML

Abstract: Generating music has a few notable differences from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks with their own temporal dynamics, but collectively they unfold over time interdependently. Lastly, musical notes are often grouped into chords, arpeggios or melodies in polyphonic music, and thereby introducing a chronological ordering of notes is not naturally suitable. In this paper, we propose three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs). The three models, which differ in the underlying assumptions and accordingly the network architectures, are referred to as the jamming model, the composer model and the hybrid model. We trained the proposed models on a dataset of over one hundred thousand bars of rock music and applied them to generate piano-rolls of five tracks: bass, drums, guitar, piano and strings. A few intra-track and inter-track objective metrics are also proposed to evaluate the generative results, in addition to a subjective user study. We show that our models can generate coherent music of four bars right from scratch (i.e. without human inputs). We also extend our models to human-AI cooperative music generation: given a specific track composed by human, we can generate four additional tracks to accompany it. All code, the dataset and the rendered audio samples are available at https://salu133445.github.io/musegan/ .

Citations (517)

Summary

  • The paper introduces MuseGAN, a framework using GANs (WGAN-GP) to generate multi-track, polyphonic symbolic music from scratch or as accompaniment.
  • The paper details three model architectures—jamming, composer, and hybrid—balancing independent track generation with coordinated harmonic output.
  • The study employs both objective metrics and user studies to evaluate musical quality, demonstrating practical implications for AI-assisted music composition.

Overview of MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment

This paper introduces MuseGAN, a framework employing generative adversarial networks (GANs) tailored to the generation of symbolic multi-track music. The paper delineates three distinct models within this framework: the jamming model, the composer model, and the hybrid model, each embodying a different assumption about how the individual tracks should be coordinated.

Key Features of MuseGAN

  • Generative Framework: MuseGAN leverages GANs, specifically the Wasserstein GAN with Gradient Penalty (WGAN-GP), for stability and robustness during training; a minimal sketch of the gradient-penalty term follows this list.
  • Multi-track Music Generation: The models are designed to handle polyphonic music across multiple tracks simultaneously, generating music from scratch or adding accompaniment to existing tracks.
  • Data Representation: The system utilizes a multi-track piano-roll representation, considering bars as the fundamental unit, which supports complex harmonic and rhythmic interdependencies between tracks.
  • Objective Metrics: The paper proposes intra-track and inter-track metrics to evaluate generative quality, such as the empty-bar ratio, the number of used pitch classes per bar, and tonal distances between tracks; a sketch of two of these metrics also follows this list.
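
The summary names WGAN-GP but does not spell out the penalty term. Below is a minimal sketch of the standard WGAN-GP gradient penalty, written in PyTorch for illustration; the critic itself and the `lambda_gp` weight are placeholders, not the paper's actual settings.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    on random interpolations between real and fake samples."""
    # One interpolation coefficient per sample, broadcast over remaining dims.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```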
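
Two of the proposed intra-track metrics are easy to state precisely. The numpy sketch below computes the empty-bar ratio and the average number of used pitch classes per bar from a binary piano-roll tensor; the tensor layout and the MIDI offset of pitch index 0 are assumptions made for illustration, not taken from the paper's code.

```python
import numpy as np

def empty_bar_ratio(pianoroll):
    """Fraction of bars containing no note at all.
    `pianoroll`: binary array of shape (n_bars, timesteps, n_pitches)."""
    return float(np.mean(~pianoroll.any(axis=(1, 2))))

def used_pitch_classes_per_bar(pianoroll, lowest_midi=24):
    """Average number of distinct pitch classes (0-11) per bar.
    Mapping pitch index 0 to MIDI note 24 (C1) is an assumption."""
    n_bars, _, n_pitches = pianoroll.shape
    pitch_classes = (np.arange(n_pitches) + lowest_midi) % 12
    counts = []
    for bar in pianoroll:
        used = bar.any(axis=0)  # which pitch indices sound in this bar
        counts.append(len(set(pitch_classes[used])))
    return float(np.mean(counts))
```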

Model Architectures

  1. Jamming Model: One independent generator (and critic) per track. This is the simplest design, but with no shared input the tracks cannot coordinate, which can weaken inter-track harmony.
  2. Composer Model: A single generator produces all tracks jointly from one shared latent vector, favoring inter-track coordination at the cost of per-track flexibility.
  3. Hybrid Model: Combines the two, giving each track its own generator whose input concatenates a shared latent with a track-private one, balancing harmony and diversity (see the input-wiring sketch after this list).
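
The three architectures differ mainly in how latent vectors are wired into the per-track generators. The sketch below illustrates only that wiring, with a hypothetical latent size and a stub linear generator standing in for the paper's transposed-convolution networks; it is not the paper's actual model.

```python
import torch
import torch.nn as nn

N_TRACKS, Z_DIM = 5, 64  # five tracks; latent size is a placeholder

class BarGenerator(nn.Module):
    """Stub for the paper's bar generator: latent -> one bar of one track."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, 96 * 84)  # one bar: 96 steps x 84 pitches

    def forward(self, z):
        return self.net(z).view(-1, 96, 84)

# Jamming: one private latent per track, no coordination between tracks.
jam_gens = [BarGenerator(Z_DIM) for _ in range(N_TRACKS)]
z_private = [torch.randn(1, Z_DIM) for _ in range(N_TRACKS)]
jam_bars = [g(z) for g, z in zip(jam_gens, z_private)]

# Composer: one shared latent, one generator emits all tracks jointly.
composer = nn.Linear(Z_DIM, N_TRACKS * 96 * 84)
comp_bars = composer(torch.randn(1, Z_DIM)).view(-1, N_TRACKS, 96, 84)

# Hybrid: each track generator sees a shared latent plus its own.
z_shared = torch.randn(1, Z_DIM)
hyb_gens = [BarGenerator(2 * Z_DIM) for _ in range(N_TRACKS)]
hyb_bars = [g(torch.cat([z_shared, z], dim=1)) for g, z in zip(hyb_gens, z_private)]
```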

Temporal Structure Handling

The models are extended to incorporate temporal dynamics using two approaches:

  • Generation from Scratch: A temporal generator maps a phrase-level latent to a sequence of per-bar latents, establishing sequential coherence across bars (a minimal sketch follows this list).
  • Track-Conditional Generation: An encoder maps a human-provided track, bar by bar, into the latent space so that complementary tracks can be generated to accompany it.
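
The from-scratch variant factorizes the latent space: one noise vector is expanded into a latent per bar, and the bar generator decodes the bars in turn. The sketch below shows only this factorization; the 4-bar phrase length follows the paper's setup, while the layers and latent size are placeholders.

```python
import torch
import torch.nn as nn

N_BARS, Z_DIM = 4, 64  # 4-bar phrases as in the paper; Z_DIM is a placeholder

class TemporalGenerator(nn.Module):
    """Maps one phrase-level latent to a latent per bar, so that
    consecutive bars share sequential structure."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(Z_DIM, N_BARS * Z_DIM)

    def forward(self, z):
        return self.net(z).view(-1, N_BARS, Z_DIM)

g_temp = TemporalGenerator()
bar_latents = g_temp(torch.randn(1, Z_DIM))  # shape (1, 4, Z_DIM)
# Each per-bar latent would then feed the bar generator in turn:
# bars = [bar_generator(bar_latents[:, i]) for i in range(N_BARS)]
```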

Dataset and Evaluation

The research employs the Lakh Pianoroll Dataset (LPD) for training, derived from the Lakh MIDI Dataset with preprocessing to ensure data consistency. The evaluation combines the objective metrics above with a subjective user study involving 144 listeners, yielding insights into the perceptual quality of the generated music.
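
The LPD stores each song as a multi-track piano-roll; the paper binarizes the rolls and slices them into 4-bar phrases of 5 tracks, 96 time steps per bar, and 84 pitches. A minimal numpy sketch of that slicing, assuming the roll is already aligned to bar boundaries (the input layout here is an assumption, not the dataset's on-disk format):

```python
import numpy as np

STEPS_PER_BAR, N_PITCHES, N_TRACKS, PHRASE_BARS = 96, 84, 5, 4

def to_phrases(roll):
    """Binarize a velocity piano-roll and cut it into 4-bar phrases.
    `roll`: array of shape (total_timesteps, N_PITCHES, N_TRACKS),
    assumed already trimmed to whole bars."""
    binary = roll > 0  # drop velocities, keep note on/off
    n_bars = binary.shape[0] // STEPS_PER_BAR
    bars = binary[: n_bars * STEPS_PER_BAR].reshape(
        n_bars, STEPS_PER_BAR, N_PITCHES, N_TRACKS)
    n_phrases = n_bars // PHRASE_BARS
    return bars[: n_phrases * PHRASE_BARS].reshape(
        n_phrases, PHRASE_BARS, STEPS_PER_BAR, N_PITCHES, N_TRACKS)
```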

Implications

The work on MuseGAN indicates significant possibilities for AI in multi-track music generation:

  • Practical Implications: MuseGAN provides a robust framework for music creation tools and could be integrated into digital audio workstations to aid musicians in composing and arranging music.
  • Theoretical Implications: This research advances the use of GANs in sequential domains beyond images and video, highlighting their adaptability to time-dependent, discrete data such as symbolic music.

Future Directions

Future research could improve the binarization of the generated piano-rolls (a minimal post-hoc sketch follows) and further refine the objective evaluation metrics. Additionally, expanding the model to more diverse and larger datasets could further enhance its generative capabilities.
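
On the binarization point: the generator outputs real-valued activations, which must be turned into on/off notes before rendering. One simple post-hoc option, sketched below as an illustration rather than the paper's method, is a hard threshold or Bernoulli sampling:

```python
import numpy as np

def binarize(activations, threshold=0.5, sample=False, rng=None):
    """Convert generator activations in [0, 1] to a binary piano-roll.
    Hard thresholding is the simplest option; Bernoulli sampling keeps
    some of the model's uncertainty. Both are post-hoc heuristics."""
    if sample:
        rng = rng or np.random.default_rng()
        return rng.random(activations.shape) < activations
    return activations >= threshold
```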

In conclusion, MuseGAN represents a substantial contribution to the field of symbolic music generation, showcasing the potential of GANs in generating complex, polyphonic, and multi-instrument music.