
StemGen: A music generation model that listens (2312.08723v2)

Published 14 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture, and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models while also exhibiting strong musical coherence with its context.
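The non-autoregressive decoding the abstract refers to can be pictured as iterative masked-token sampling over discrete audio tokens, in the style of related masked acoustic token models. The sketch below is a toy illustration under that assumption, not the paper's actual method: the "model" is a random stand-in for the transformer (which in practice would attend jointly over the context and target tokens), and all names and the schedule are illustrative.

```python
import numpy as np

MASK = -1  # sentinel for a not-yet-committed token position

def iterative_unmask(context_tokens, target_len, vocab_size, num_steps=4, seed=0):
    """Toy non-autoregressive decoding loop: start fully masked, then over
    several steps commit the most confident predictions and re-mask the rest,
    with the masked fraction shrinking each step. The context tokens here only
    seed the stand-in RNG; a real model would condition on them via attention."""
    rng = np.random.default_rng(int(np.sum(context_tokens)) + seed)
    target = np.full(target_len, MASK)
    for step in range(num_steps):
        # number of positions to leave masked after this step (linear schedule)
        keep = int(np.ceil(target_len * (1 - (step + 1) / num_steps)))
        # stand-in predictions and confidences for every position
        preds = rng.integers(0, vocab_size, size=target_len)
        conf = rng.random(target_len)
        conf[target != MASK] = np.inf  # already-committed tokens stay fixed
        # commit predictions at masked positions, then re-mask the
        # `keep` least confident positions for the next refinement step
        target = np.where(target == MASK, preds, target)
        target[np.argsort(conf)[:keep]] = MASK
    return target
```

After the final step the masked fraction reaches zero, so the returned sequence contains only committed tokens; in a real system these would be decoded back to audio by a neural codec.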

Authors (9)
  1. Julian D. Parker (6 papers)
  2. Janne Spijkervet (5 papers)
  3. Katerina Kosta (2 papers)
  4. Furkan Yesiler (6 papers)
  5. Boris Kuznetsov (3 papers)
  6. Ju-Chiang Wang (24 papers)
  7. Matt Avent (2 papers)
  8. Jitong Chen (15 papers)
  9. Duc Le (46 papers)
Citations (18)

