
Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion (2301.11757v3)

Published 27 Jan 2023 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Recent years have seen the rapid development of large generative models for text; however, much less research has explored the connection between text and another "language" of communication -- music. Music, much like text, can convey emotions, stories, and ideas, and has its own unique structure and syntax. In our work, we bridge text and music via a text-to-music generation model that is highly efficient, expressive, and can handle long-term structure. Specifically, we develop Moûsai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions. Moreover, our model features high efficiency, which enables real-time inference on a single consumer GPU at a reasonable speed. Through experiments and property analyses, we show our model's competence over a variety of criteria compared with existing music generation models. Lastly, to promote an open-source culture, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following: code: https://github.com/archinetai/audio-diffusion-pytorch; music samples for this paper: http://bit.ly/44ozWDH; all music samples for all models: https://bit.ly/audio-diffusion.
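The cascaded two-stage design described in the abstract can be pictured as: a text-conditioned diffusion model first samples a compressed latent, and a second diffusion model then decodes that latent to the 48 kHz stereo waveform. The sketch below illustrates only the shape of that sampling loop; every name here (`denoise_step`, `sample`, the dummy model) is hypothetical and is not the actual audio-diffusion-pytorch API.

```python
import random

def denoise_step(x, sigma, sigma_next, model):
    """One simplified ancestral-sampler step: nudge x toward the
    clean signal as the noise level shrinks from sigma to sigma_next."""
    pred_noise = model(x, sigma)
    return [xi - (sigma - sigma_next) * ni for xi, ni in zip(x, pred_noise)]

def sample(model, dim, steps):
    """Run a diffusion sampling loop from pure noise with a linear schedule."""
    random.seed(0)
    x = [random.gauss(0.0, 1.0) for _ in range(dim)]        # start from noise
    sigmas = [1.0 - t / steps for t in range(steps + 1)]    # 1.0 -> 0.0
    for s, s_next in zip(sigmas, sigmas[1:]):
        x = denoise_step(x, s, s_next, model)
    return x

# Stand-in "model": treats the current x as the noise estimate, so each
# step simply shrinks x; a real model would be text-conditioned.
dummy = lambda x, sigma: x

# Stage 1: diffusion in a compressed latent space (small dim, cheap).
latent = sample(dummy, dim=8, steps=10)

# Stage 2: in Moûsai a second diffusion model decodes the latent to a
# 48 kHz stereo waveform; here an identity "decoder" stands in.
waveform = latent
print(len(waveform))
```

The point of the cascade is that stage 1 operates on a latent far smaller than raw audio, which is what makes multi-minute generation and real-time inference on one consumer GPU feasible.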

Overview of the "Responsible NLP Research Checklist" Paper

The paper "Responsible NLP Research Checklist" serves as a detailed framework for ensuring ethical and methodological rigor in the field of NLP. The authors introduce a comprehensive checklist aligned with the Association for Computational Linguistics (ACL) code of ethics, aiming to foster responsible research practices. The checklist is an integral part of the ACL Rolling Review (ARR) process and seeks to guide researchers in acknowledging and addressing issues related to research ethics, societal impacts, and replicability.

Key Components of the Checklist

The checklist is structured into several sections, each targeting critical aspects of responsible research:

  1. General Submission Requirements: This section underscores the necessity for researchers to discuss limitations, potential risks, and the alignment of the abstract and introduction with the main claims of the paper. It encourages transparency in disclosing any possible shortcomings or hazards associated with the research.
  2. Use or Creation of Scientific Artifacts: Researchers are prompted to detail their use or creation of scientific artifacts. This includes citation of creators, discussion of licensing terms, and consideration of whether the use aligns with intended purposes. Furthermore, the checklist stresses the importance of documenting how data was sourced, anonymized, and protected.
  3. Computational Experiments: For experiments, the checklist demands thorough reporting on model parameters, computational resources, and the specifics of the experimental setup, including hyperparameter tuning. Transparent presentation of descriptive statistics and detailing the use of any existing software packages are also highlighted.
  4. Human Subjects and Annotators: When human participants or annotators are involved, the checklist mandates full disclosure of recruitment methods, consent procedures, compensation, and demographic data. Ethical approval from review boards is required to ensure compliance with ethical research standards.

Implications and Future Directions

The implementation of this checklist has significant implications for both the practical and theoretical landscapes of NLP research. Practically, it offers a standardized protocol that minimizes ethical oversights and enhances the reproducibility of research findings. Theoretically, it encourages attention to broader societal impacts, prompting researchers to examine how their work's deployment might affect various stakeholder groups.

Looking forward, the checklist could serve as a template for other branches of AI research, fostering a cross-disciplinary culture of transparency and responsibility. Moreover, as AI technologies evolve, the checklist may undergo revisions to accommodate new ethical challenges and technological advancements. This ongoing adaptability will be crucial in maintaining the checklist's relevance and efficacy.

Conclusion

The "Responsible NLP Research Checklist" represents a structured approach to ensuring that NLP research adheres to high ethical and methodological standards. By addressing various facets of the research process—from artifact creation to human subject involvement—the checklist provides a valuable resource for researchers committed to responsible innovation. Its integration into the ACL's review process underscores the continued importance of ethical considerations in the rapidly advancing domain of AI.

Authors (4)
  1. Flavio Schneider (2 papers)
  2. Ojasv Kamal (5 papers)
  3. Zhijing Jin (68 papers)
  4. Bernhard Schölkopf (412 papers)
Citations (72)