PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network (1811.04357v1)

Published 11 Nov 2018 in cs.SD, cs.MM, and eess.AS

Abstract: Music creation is typically composed of two parts: composing the musical score, and then performing the score with instruments to make sounds. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns in an end-to-end manner the score-to-audio mapping between a symbolic representation of music called the piano rolls and an audio representation of music called the spectrograms. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between piano rolls and spectrograms and to give an initial result; and the TextureNet, which further uses a multi-band residual network to refine the result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, with a dataset of moderate size. We also present the result of a user study that shows our model achieves higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two commercial sound libraries. We open our source code at https://github.com/bwang514/PerformanceNet

Citations (36)

Summary

  • The paper introduces an end-to-end deep learning model that directly converts pianoroll representations into lifelike audio performances.
  • The architecture features two subnets—ContourNet for precise score-to-spectrogram translation and TextureNet for enhancing spectral details.
  • User studies show that PerformanceNet outperforms conventional synthesizers in naturalness and emotional expressivity.

PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

The paper "PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network" by Bryan Wang and Yi-Hsuan Yang presents a novel approach to automatic music performance generation. Unlike traditional approaches that focus primarily on symbolic music generation, it offers an end-to-end deep learning framework that translates musical scores directly into audio. With a convolutional network architecture, the authors aim to render scores with authentic performance quality, complete with nuanced timing and dynamics.

Methodological Framework

The proposed system, PerformanceNet, utilizes a deep convolutional neural network to learn the score-to-audio mapping by transforming pianoroll representations into spectrograms, which are then converted into audio. This mapping process consists of two primary subnets:

  1. ContourNet: This U-Net-based subnet addresses the initial score-to-spectrogram translation, capturing the fundamental pitch and timing information from the input pianorolls. The ContourNet incorporates an onset and offset encoder to refine the model’s ability to discern the beginnings and endings of notes, crucial for realistic music performance synthesis.
  2. TextureNet: Complementing the ContourNet, the TextureNet employs a multi-band residual design to refine the spectrogram, progressively enhancing spectral detail across frequency bands. The idea resembles image super-resolution, but tailored to audio spectral textures; a minimal architectural sketch of both subnets follows this list.
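
To make the two-stage design concrete, the sketch below (PyTorch) shows how a U-Net-style ContourNet could map a pianoroll to a coarse magnitude spectrogram and how a multi-band TextureNet could apply per-band residual refinement. This is an illustrative sketch, not the authors' implementation: layer sizes, band counts, the ContourNetSketch/TextureNetSketch names, and the Griffin-Lim comment are assumptions; the official code at https://github.com/bwang514/PerformanceNet is the authoritative reference.

```python
# Minimal sketch of the two-stage pianoroll-to-spectrogram idea (PyTorch).
# Shapes, channel counts, and layer choices are illustrative assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn

class ContourNetSketch(nn.Module):
    """U-Net-style encoder/decoder: pianoroll (B, 128, T) -> coarse spectrogram (B, F, T)."""
    def __init__(self, pitch_bins=128, freq_bins=1025, hidden=256):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(pitch_bins, hidden, 5, padding=2), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(hidden, hidden, 5, stride=2, padding=2), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv1d(hidden * 2, freq_bins, 5, padding=2)  # after skip-connection concat

    def forward(self, roll):
        e1 = self.enc1(roll)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2)
        d1 = d1[..., : e1.shape[-1]]                 # align time axis after upsampling
        return self.out(torch.cat([d1, e1], dim=1))  # coarse magnitude spectrogram

class TextureNetSketch(nn.Module):
    """Multi-band residual refinement: each frequency band of the coarse spectrogram
    receives its own residual correction, sharpening spectral texture (overtones, timbre)."""
    def __init__(self, freq_bins=1025, n_bands=4, hidden=128):
        super().__init__()
        band = freq_bins // n_bands
        self.band_edges = [i * band for i in range(n_bands)] + [freq_bins]
        self.refiners = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(self.band_edges[i + 1] - self.band_edges[i], hidden, 3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, self.band_edges[i + 1] - self.band_edges[i], 3, padding=1),
            )
            for i in range(n_bands)
        ])

    def forward(self, coarse_spec):
        bands = []
        for i, refiner in enumerate(self.refiners):
            lo, hi = self.band_edges[i], self.band_edges[i + 1]
            band = coarse_spec[:, lo:hi, :]
            bands.append(band + refiner(band))       # residual refinement per band
        return torch.cat(bands, dim=1)

if __name__ == "__main__":
    roll = torch.rand(1, 128, 400)                   # (batch, MIDI pitches, time steps)
    coarse = ContourNetSketch()(roll)
    refined = TextureNetSketch()(coarse)
    print(refined.shape)                             # torch.Size([1, 1025, 400])
    # A waveform would then be recovered from the predicted magnitude spectrogram,
    # e.g. via Griffin-Lim phase estimation, as is common for spectrogram-based models.
```

Splitting the refinement across frequency bands lets each band learn corrections at its own scale, which is the intuition behind the paper's multi-band residual design.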

Results and Observations

The paper's empirical evaluation centers on user studies that benchmark the proposed system against two commercial sound libraries and a WaveNet-based model. The findings show that PerformanceNet consistently outperforms these baselines in perceived naturalness and emotional expressivity: participants' ratings yield higher mean opinion scores (MOS) for PerformanceNet on cello, violin, and flute, suggesting a closer approximation to human performance than conventional synthesis.
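
For readers unfamiliar with the metric, MOS is simply the arithmetic mean of listeners' 1-to-5 ratings for a given system and condition. The snippet below is a minimal sketch of that computation with a normal-approximation confidence interval; the ratings are made-up placeholders, not data from the paper's study.

```python
# Illustrative MOS computation from listener ratings (placeholder data, not from the paper).
import math

def mean_opinion_score(ratings):
    """Return the MOS (mean of 1-5 ratings) and a normal-approximation 95% CI."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mos, (mos - half_width, mos + half_width)

# Hypothetical ratings for one system/instrument pair on a 1-5 naturalness scale.
example_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```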

Implications and Future Prospects

The implications of this work are substantial for both practical applications and theoretical advances in AI-driven music synthesis. By surpassing existing synthesizers in expressive capability, PerformanceNet could influence automated composition, real-time accompaniment in digital systems, and virtual instrument development. Moreover, its data-efficient model structure and its ability to render nuanced musical performance make it a promising foundation for AI models that imbue digital scores with performer-specific style and emotional undertones.

Looking ahead, potential research directions include improving spectral timbre quality with methods such as GANs, or refining latent-space representations to better capture performer-specific attributes. Extending the framework to polyphonic, multi-instrument compositions could significantly broaden its applicability, and integrating psychoacoustic modeling for greater audio realism or exploring cross-modal generative tasks presents further opportunities for AI music research.

Overall, "PerformanceNet" marks a commendable step in the music AI field, shifting attention towards not just generating music structurally, but performing it with human-like musicality.