Insights from "This Time with Feeling: Learning Expressive Musical Performance"
The paper "This Time with Feeling: Learning Expressive Musical Performance" focuses on the complex task of generating music through machine learning, specifically targeting the nuanced area of performance. The research delineates the intricacies involved when moving beyond mere composition to the simultaneous generation of music and its expressive attributes, such as timing and dynamics.
Key Contributions
The authors propose a shift from generating musical scores, or interpretations of scores, to generating performances directly, with expressive nuance built in. They apply RNN-based models to this task and clearly delineate the qualities a dataset must have for the modeling to succeed. Crucially, the work treats fine-grained timing and dynamics as essential dimensions of a musical performance, dimensions that listeners perceive immediately.
Data Characteristics
The paper leverages the International Piano-e-Competition dataset, comprising approximately 1,400 professional piano performances. This choice exemplifies the need for homogeneous, high-quality, expert-level data when training models to generate musically rich output. The dataset's strength lies in its consistency: solo classical piano performances captured on MIDI-recording concert pianos, so the recordings preserve human expressive detail (note timings and velocities) in a precise digital form.
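As a concrete illustration of what such a recording contains, the minimal sketch below loads one performance and extracts per-note timing and velocity. It assumes the third-party pretty_midi library and a hypothetical local file name; it is not code from the paper.

```python
# Minimal sketch (not from the paper): inspect one captured performance.
import pretty_midi

pm = pretty_midi.PrettyMIDI("performance.mid")  # hypothetical file name
piano = pm.instruments[0]                       # solo piano: a single instrument track

# Each note carries the expressive detail the models learn from:
# onset/offset times in seconds and a MIDI velocity (loudness) in 0-127.
notes = [(n.pitch, n.velocity, n.start, n.end) for n in piano.notes]
print(f"{len(notes)} notes, {pm.get_end_time():.1f} s of music")
```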
Methodological Approach
The research uses a recurrent neural network with Long Short-Term Memory (LSTM) units to model temporal dependencies in music. Performances are represented as sequences of MIDI-like events: note-on and note-off events, time-shift events that advance the clock in small increments, and velocity events that set the loudness of subsequent notes. This representation lets the model capture micro-timing and dynamic shading that score-based representations cannot express, so the generated sequences exhibit expressive features akin to human performances rather than the static feel of score-derived output.
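The sketch below shows one way to encode a performance into such an event sequence, roughly following the vocabulary described in the paper (note-on/off for 128 pitches, time shifts in 10 ms steps up to 1 s, and 32 velocity bins). The function name, the (pitch, velocity, start, end) note format, and the tuple-based tokens are assumptions made for this example, not the authors' code.

```python
# Illustrative sketch (not the authors' implementation) of the event-based
# performance representation: NOTE_ON/NOTE_OFF, TIME_SHIFT, and VELOCITY events.

NUM_PITCHES = 128
NUM_TIME_SHIFTS = 100   # 10 ms .. 1 s, in 10 ms increments
NUM_VELOCITIES = 32     # 128 MIDI velocities quantized into 32 bins

def encode_performance(notes):
    """Convert [(pitch, velocity, start_sec, end_sec), ...] into event tokens."""
    # Split each note into an on and an off event, then sort by time.
    raw = []
    for pitch, velocity, start, end in notes:
        raw.append((start, "on", pitch, velocity))
        raw.append((end, "off", pitch, None))
    raw.sort(key=lambda e: e[0])

    events, current_time, current_velocity = [], 0.0, None
    for time, kind, pitch, velocity in raw:
        # Emit TIME_SHIFT events (each at most 1 s) to advance the clock.
        gap = time - current_time
        while gap > 1e-4:
            step = min(gap, 1.0)
            steps_10ms = max(1, round(step * 100))
            events.append(("TIME_SHIFT", steps_10ms))
            gap -= steps_10ms / 100.0
        current_time = time

        if kind == "on":
            bin_ = velocity * NUM_VELOCITIES // 128
            if bin_ != current_velocity:       # only emit when the dynamic level changes
                events.append(("VELOCITY", bin_))
                current_velocity = bin_
            events.append(("NOTE_ON", pitch))
        else:
            events.append(("NOTE_OFF", pitch))
    return events

# Example: two overlapping notes played at different dynamic levels.
print(encode_performance([(60, 80, 0.00, 0.50), (64, 96, 0.25, 0.75)]))
```

Capping each time shift at one second keeps the vocabulary small while still allowing arbitrarily long pauses, which are simply expressed as runs of consecutive TIME_SHIFT events.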
Results and Implications
Subjective evaluations highlight the model's ability to produce compelling, human-like piano performances. Feedback from professional composers and musicians indicates that while long-term compositional coherence remains a challenge, the system excels at producing natural local timing and dynamics. Such feedback underscores the approach's potential for practical music-generation applications and its alignment with human aesthetic judgments.
Conclusion and Future Directions
This paper makes the case for generating performances directly and for treating expressiveness as a criterion of evaluation. The authors acknowledge that robust long-term compositional structure is still missing from musical AI systems, and this work is a substantive step toward closing that gap. Future research may develop models that maintain structural elements over longer time spans, eventually yielding systems that perform with emotional richness within the coherent large-scale structures characteristic of expert human composition.