Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation (2407.20955v1)

Published 30 Jul 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Managing the emotional aspect remains a challenge in automatic music generation. Prior works aim to learn various emotions at once, leading to inadequate modeling. This paper explores the disentanglement of emotions in piano performance generation through a two-stage framework. The first stage focuses on valence modeling of lead sheet, and the second stage addresses arousal modeling by introducing performance-level attributes. To further capture features that shape valence, an aspect less explored by previous approaches, we introduce a novel functional representation of symbolic music. This representation aims to capture the emotional impact of major-minor tonality, as well as the interactions among notes, chords, and key signatures. Objective and subjective experiments validate the effectiveness of our framework in both emotional valence and arousal modeling. We further leverage our framework in a novel application of emotional controls, showing a broad potential in emotion-driven music generation.

Authors (3)
  1. Jingyue Huang (7 papers)
  2. Ke Chen (241 papers)
  3. Yi-Hsuan Yang (89 papers)
Citations (1)

Summary

Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation

The paper "Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation" addresses key challenges in the domain of automatic music generation, particularly focusing on the emotional aspect of piano music. The authors propose a two-stage framework to disentangle and model emotional valence and arousal independently, laying a foundation for more nuanced and expressive music generation systems.

Framework and Methodology

The proposed methodology follows a two-stage process. The first stage centers on valence modeling, carried out through lead sheet composition: the system captures the positivity or negativity of the emotion by generating the harmonic and melodic skeleton of the music, first predicting a key event and then generating the lead sheet sequence under the given valence condition.
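
To make the conditioning concrete, the sketch below shows one plausible shape for Stage 1: an emotion (valence) token is prepended to the sequence, a key event is predicted first, and the remaining lead-sheet events are sampled autoregressively. The token names and the `sample_next` interface are illustrative assumptions, not the paper's exact vocabulary or architecture.

```python
# Minimal sketch of valence-conditioned lead-sheet generation (Stage 1).
# Token names and the sampling interface are assumptions for illustration.

def generate_lead_sheet(model, valence, max_tokens=512):
    """Autoregressively sample a valence-conditioned lead-sheet event sequence."""
    # Prepend an emotion-condition token (CTRL-style conditioning).
    tokens = [f"Emotion_{'Positive' if valence == 'high' else 'Negative'}"]
    while len(tokens) < max_tokens:
        next_token = model.sample_next(tokens)   # hypothetical sampling call
        tokens.append(next_token)
        if next_token == "EOS":
            break
    return tokens

# An illustrative output might look like:
# ["Emotion_Positive", "Key_Major", "Bar", "Chord_I_Major", "Note_Degree_1", ...]
```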

The second stage addresses arousal modeling at the performance level, introducing attributes such as tempo, dynamics, and articulation. These attributes govern the energy, or activation, of the music, yielding a more expressive rendering of the emotional content laid out by the lead sheet.
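
The sketch below illustrates how the two stages could chain: the Stage-1 lead sheet becomes the conditioning context for Stage 2, with an arousal token steering performance-level events such as tempo and dynamics. Again, the token names and sampling interface are assumptions, not the paper's exact design.

```python
# Minimal sketch of arousal-conditioned performance rendering (Stage 2).
# Token names and the sampling interface are assumptions for illustration.

def render_performance(stage2_model, lead_sheet_tokens, arousal, max_tokens=2048):
    """Expand a Stage-1 lead sheet into performance events under an arousal condition."""
    # The arousal token plus the lead sheet form the conditioning prompt.
    prompt = [f"Arousal_{'High' if arousal == 'high' else 'Low'}"] + lead_sheet_tokens
    output = list(prompt)
    while len(output) < max_tokens:
        next_token = stage2_model.sample_next(output)   # hypothetical sampling call
        output.append(next_token)
        if next_token == "EOS":
            break
    # The result interleaves note events with performance-level events
    # such as "Tempo_Fast" or "Velocity_96", realizing the target arousal.
    return output
```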

A novel functional representation of symbolic music supports this two-stage framework. The representation captures the interactions among notes, chords, and the key signature, which are critical for modeling tonality, a fundamental element linked with emotional valence. Encoding chords as Roman numerals relative to the key makes the representation adapt across key signatures, enhancing the model's ability to align musical structure with the intended emotional outcome.
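
The following sketch conveys the general idea of a key-relative (functional) chord encoding, assuming a major key and one common Roman-numeral convention; it is not the paper's exact tokenization.

```python
# Minimal sketch of a key-relative chord encoding: an absolute chord root is
# re-expressed as a scale degree (Roman numeral) of the key, so the same
# progression maps to the same tokens in any key.

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]                  # scale-degree offsets in semitones
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII"]

def chord_to_roman(chord_root_pc, chord_quality, key_tonic_pc):
    """Encode a chord as a Roman numeral relative to a major-key tonic (pitch classes 0-11)."""
    interval = (chord_root_pc - key_tonic_pc) % 12
    if interval in MAJOR_SCALE:
        numeral = ROMAN[MAJOR_SCALE.index(interval)]
    else:
        # Chromatic root: write it as a flattened upper neighbor, one common convention.
        numeral = "b" + ROMAN[MAJOR_SCALE.index((interval + 1) % 12)]
    # Lowercase conventionally marks minor or diminished chord qualities.
    return numeral.lower() if chord_quality in ("min", "dim") else numeral

# In C major (tonic 0): an A-minor chord (root 9) -> "vi"; a G-major chord (root 7) -> "V".
# The same progression in G major yields identical tokens, making the encoding key-invariant.
```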

Key Findings and Results

The paper reports both objective and subjective evaluations to gauge the efficacy of the approach. Key consistency metrics objectively measure how well the generated music matches the conditioned key signatures; this consistency is vital for musical coherence, and the proposed representation showed significant improvements over previous representations such as REMI.
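
A key-consistency score of this kind could be computed as sketched below: detect the key of each generated piece with a Krumhansl-style profile match and compare it to the key the model was conditioned on. This is an assumed reconstruction for illustration, not the paper's evaluation code.

```python
# Sketch of a key-consistency metric using a Krumhansl-Kessler profile match
# (major keys only, unnormalized dot-product score).

from collections import Counter

# Krumhansl-Kessler major-key profile weights, indexed by pitch class relative to the tonic.
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]

def detect_major_key(note_pitches):
    """Return the best-matching major-key tonic (0-11) for a list of MIDI pitches."""
    hist = Counter(p % 12 for p in note_pitches)
    counts = [hist.get(pc, 0) for pc in range(12)]
    scores = []
    for tonic in range(12):
        rotated = counts[tonic:] + counts[:tonic]     # histogram relative to candidate tonic
        scores.append(sum(w * c for w, c in zip(KK_MAJOR, rotated)))
    return max(range(12), key=lambda t: scores[t])

def key_consistency(pieces, conditioned_tonics):
    """Fraction of generated pieces whose detected key matches the conditioning key."""
    hits = sum(detect_major_key(p) == t for p, t in zip(pieces, conditioned_tonics))
    return hits / len(pieces)
```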

Subjectively, the authors conducted listener tests to evaluate the emotional quality of the generated music across the four quadrants of the valence-arousal space. The proposed functional representation and two-stage framework outperformed existing models, notably improving the separation and clarity of valence-driven and arousal-driven emotional cues in the music.

Implications and Future Directions

Practically, this research marks a significant step toward generating music that is both musically coherent and emotionally compelling. Such advances have potential applications in music therapy, AI-driven soundtrack composition, and interactive media where emotional nuance is critical.

Theoretically, the paper underscores the importance of considering functional harmony and key-dependent musical relationships in generative models. This can spur future research aimed at exploring the emotional depth in other musical forms and traditions, potentially expanding the applicability of these methods across genres and cultural contexts.

Future work could further explore the flexibility of emotion-driven music generation, striving for more diverse emotional expressions within any given key. Training on additional large-scale datasets and extending the framework toward real-time applications would also be valuable extensions of this research.

In conclusion, this paper lays out a compelling framework for disentangled emotional modeling in music generation. Its innovative use of functional representation paired with a structured, two-stage process addresses previous limitations and sets a robust foundation for future advancements in emotion-aware AI music systems.
