
Music Consistency Models (2404.13358v1)

Published 20 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. It has proven to be advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with a mere four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time application.

Efficient Synthesis of High-Quality Music Clips Using Music Consistency Models (MusicCM)

Introduction

MusicCM applies consistency models, which have primarily been used for image and video generation, to the domain of music synthesis. In contrast to traditional diffusion models, which are sampling-intensive and computationally demanding, MusicCM efficiently generates high-quality music clips from text prompts with significantly fewer sampling steps. Using consistency distillation together with adversarial discriminator training, MusicCM reduces the typical 50-step sampling procedure to roughly 4 to 6 steps, demonstrating its potential for real-time applications.
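
To make the few-step sampling concrete, here is a minimal sketch, assuming a text-conditioned consistency network exposed as `consistency_fn`; the four-level noise schedule, the spectrogram shape, and the function names are illustrative assumptions rather than the paper's actual interface.

```python
import torch

def sample_mel(consistency_fn, text_emb, shape=(1, 80, 1024),
               sigmas=(80.0, 24.0, 5.0, 0.5), sigma_min=0.002):
    """Few-step sampling: each step predicts the clean mel directly,
    then re-noises it at the next (smaller) noise level."""
    x = torch.randn(shape) * sigmas[0]               # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0 = consistency_fn(x, sigma, text_emb)      # one network call per step
        if i + 1 < len(sigmas):
            next_sigma = sigmas[i + 1]
            noise_scale = (next_sigma ** 2 - sigma_min ** 2) ** 0.5
            x = x0 + noise_scale * torch.randn_like(x0)
        else:
            x = x0
    return x                                         # mel-spectrogram for a vocoder
```

Each additional step trades a small amount of extra compute for fidelity, which is why a budget of roughly four to six steps is attractive for real-time use.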

Methods and Technical Innovations

MusicCM builds upon the theoretical and practical foundations laid by existing diffusion models in text-to-music synthesis, such as Noise2Music and MusicLDM, by incorporating the principles of consistency models. The primary innovations and methodological advancements of MusicCM include:

  • Consistency Distillation: This process involves training a student model (MusicCM) to mimic a teacher diffusion model, allowing the system to generate music in fewer steps while retaining the quality and characteristics achieved by the original model.
  • Adversarial Discriminator Training: In conjunction with consistency distillation, an adversarial training component compels the model to produce outputs indistinguishable from real music compositions, enhancing the naturalness and fidelity of the generated music (a training-step sketch combining both losses follows this list).
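
The training-step sketch referenced above combines the two objectives. It assumes a frozen teacher ODE-solver step `teacher_step`, a student network `student`, an exponential-moving-average copy `student_ema`, and a mel-spectrogram discriminator `disc`; these names, the non-saturating adversarial loss, and the loss weighting are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, student_ema, teacher_step, disc,
                      mel, text_emb, sigmas, adv_weight=0.1):
    """One combined consistency-distillation + adversarial training step."""
    # pick a pair of adjacent noise levels; `sigmas` is assumed sorted ascending
    n = torch.randint(0, len(sigmas) - 1, (1,)).item()
    sigma_lo, sigma_hi = sigmas[n], sigmas[n + 1]

    noise = torch.randn_like(mel)
    x_hi = mel + sigma_hi * noise                        # noisy sample at the higher level
    with torch.no_grad():
        x_lo = teacher_step(x_hi, sigma_hi, sigma_lo, text_emb)   # teacher solves one step back
        target = student_ema(x_lo, sigma_lo, text_emb)            # EMA target at the lower level

    pred = student(x_hi, sigma_hi, text_emb)             # student prediction at the higher level

    # consistency loss: predictions along the same trajectory should agree
    loss_cd = F.mse_loss(pred, target)
    # adversarial loss: discriminator should score the student's output as real
    loss_adv = F.softplus(-disc(pred, text_emb)).mean()
    return loss_cd + adv_weight * loss_adv
```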

Key Advantages and Performance

  1. Computational Efficiency: By reducing the number of sampling steps required, MusicCM delivers a significant gain in computational efficiency, enabling faster music generation without compromising quality. It achieves seamless music synthesis with a substantial reduction in computation, requiring only about one second per minute of generated music.
  2. High Fidelity and Naturalness: The integration of adversarial training ensures that the generated music clips maintain high fidelity and exhibit natural musical qualities, as demonstrated by competitive scores on metrics such as Fréchet Distance and Inception Score against other state-of-the-art models.
  3. Long Music Coherence: Through a shared-constraint process for long music generation, MusicCM addresses the challenge of maintaining coherence and quality over longer music sequences. This is achieved by blending multiple diffusion processes that share constraints on overlapping regions, enhancing the cohesiveness of the final output (see the windowed-blending sketch after this list).
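
The windowed-blending sketch referenced above illustrates the shared-constraint idea: overlapping windows of a long mel-spectrogram are denoised with the same model and prompt and averaged where they overlap, forcing neighboring segments to agree. The window length, hop size, and simple averaging rule are assumptions for illustration; the paper's exact constraint mechanism may differ.

```python
import torch

def blend_windows(window_outputs, starts, total_len, win_len):
    """Average per-window predictions into one long mel, so overlaps share one value."""
    n_mels = window_outputs[0].shape[0]
    acc = torch.zeros(n_mels, total_len)
    weight = torch.zeros(1, total_len)
    for out, s in zip(window_outputs, starts):
        acc[:, s:s + win_len] += out
        weight[:, s:s + win_len] += 1.0
    return acc / weight.clamp(min=1.0)

def generate_long(consistency_fn, text_emb, total_len=4096, win_len=1024, hop=768,
                  sigmas=(80.0, 24.0, 5.0, 0.5), sigma_min=0.002, n_mels=80):
    """Few-step sampling over overlapping windows with shared (blended) predictions."""
    starts = list(range(0, total_len - win_len + 1, hop))
    x = torch.randn(n_mels, total_len) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        # denoise every window independently with the same model and prompt ...
        outs = [consistency_fn(x[:, s:s + win_len], sigma, text_emb) for s in starts]
        # ... then fuse them so overlapping regions hold a single consistent prediction
        x0 = blend_windows(outs, starts, total_len, win_len)
        if i + 1 < len(sigmas):
            noise_scale = (sigmas[i + 1] ** 2 - sigma_min ** 2) ** 0.5
            x = x0 + noise_scale * torch.randn_like(x0)
        else:
            x = x0
    return x
```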

Future Directions and Speculations

Given its performance and efficiency, MusicCM has the potential to revolutionize real-time music synthesis applications. However, there are several avenues for future research:

  • Further exploration into optimizing the balance between the number of sampling steps and the quality of generated music.
  • Expansion of the adversarial training methods to incorporate newer and more robust discriminator models for improved fidelity in generated music.
  • Exploration of MusicCM’s application scope beyond text-to-music generation, potentially applying its principles to other areas of audio synthesis or even cross-modal generative tasks.

Conclusion

MusicCM represents a significant advancement in the domain of text-to-music generation, primarily through its innovative use of consistency models adapted from image synthesis. By significantly reducing the need for extensive sampling while ensuring high-quality output, MusicCM not only improves computational efficiency but also opens new possibilities for real-time music generation applications. As this field continues to evolve, MusicCM provides a strong foundation for future research and development in efficient and high-fidelity music generation technologies.

References (61)
  1. MusicLM: Generating music from text. arXiv preprint:2301.11325, 2023.
  2. Multidiffusion: Fusing diffusion paths for controlled image generation. 2023.
  3. The mtg-jamendo dataset for automatic music tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States, 2019.
  4. Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. arXiv preprint arXiv:2308.01546, 2023.
  5. Progressive text-to-image generation. arXiv preprint arXiv:2210.02291, 2022.
  6. Deecap: Dynamic early exiting for efficient image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12216–12226, 2022.
  7. A-jepa: Joint-embedding predictive architecture can listen. arXiv preprint arXiv:2311.15830, 2023.
  8. Gradient-free textual inversion. In Proceedings of the 31st ACM International Conference on Multimedia, pages 1364–1373, 2023.
  9. Scalable diffusion models with state space backbone. arXiv preprint arXiv:2402.05608, 2024.
  10. Diffusion-rwkv: Scaling rwkv-like architectures for diffusion models. arXiv preprint arXiv:2404.04478, 2024.
  11. Zheng-cong Fei. Fast image caption generation with position alignment. arXiv preprint arXiv:1912.06365, 2019.
  12. Zhengcong Fei. Actor-critic sequence generation for relative difference captioning. In Proceedings of the 2020 International Conference on Multimedia Retrieval, pages 100–107, 2020.
  13. Zhengcong Fei. Memory-augmented image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1317–1324, 2021.
  14. Zhengcong Fei. Partially non-autoregressive image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1309–1316, 2021.
  15. Riffusion - Stable diffusion for real-time music generation. 2022.
  16. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022.
  17. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  18. Cnn architectures for large-scale audio classification. pages 131–135. IEEE, 2017.
  19. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  20. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  21. Noise2music: Text-conditioned music generation with diffusion models. arXiv preprint:2302.03917, 2023.
  22. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
  23. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint:2301.12661, 2023.
  24. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023.
  25. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
  26. HifiGAN: Generative adversarial networks for efficient and high fidelity speech synthesis. 33:17022–17033, 2020.
  27. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. 2020.
  28. AudioGen: Textually guided audio generation. 2022.
  29. Jen-1: Text-guided universal music generation with omnidirectional diffusion models. arXiv preprint arXiv:2308.04729, 2023.
  30. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  31. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.
  32. AudioLDM: Text-to-audio generation with latent diffusion models. 2023.
  33. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023.
  34. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
  35. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023.
  36. Mustango: Toward controllable text-to-music generation. arXiv preprint arXiv:2311.08355, 2023.
  37. Which training methods for gans do actually converge? In International conference on machine learning, pages 3481–3490. PMLR, 2018.
  38. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  39. MubertAI. Mubert: A simple notebook demonstrating prompt-based music generation.
  40. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  41. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  42. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  43. High-resolution image synthesis with latent diffusion models. pages 10684–10695, 2022.
  44. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  45. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
  46. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
  47. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023.
  48. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  49. Closed-form factorization of latent semantics in gans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1532–1540, 2021.
  50. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
  51. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  52. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  53. Videolcm: Video latent consistency model. arXiv preprint arXiv:2312.09109, 2023.
  54. Unlimited-size diffusion restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1160–1167, 2023.
  55. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. pages 1–5. IEEE, 2023.
  56. Ccm: Adding conditional controls to text-to-image consistency models. arXiv preprint arXiv:2312.06971, 2023.
  57. Semi-autoregressive image captioning. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2708–2716, 2021.
  58. DiffSound: Discrete diffusion model for text-to-sound generation. 2023.
  59. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
  60. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  61. A survey of ai music generation tools and models. arXiv preprint arXiv:2308.12982, 2023.
Authors (3)
  1. Zhengcong Fei
  2. Mingyuan Fan
  3. Junshi Huang
Citations (5)