Music Consistency Models (2404.13358v1)
Abstract: Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. This has proven advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models to music generation remains largely unexplored. To address this gap, we present Music Consistency Models (\texttt{MusicCM}), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the \texttt{MusicCM} model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results demonstrate the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, \texttt{MusicCM} achieves seamless music synthesis with a mere four sampling steps, i.e., only one second per minute of generated music, showcasing its potential for real-time application.
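To make the few-step sampling idea concrete, the snippet below is a minimal sketch, assuming a distilled consistency model that maps a noisy mel-spectrogram latent directly to a clean estimate at any noise level. All names here (ConsistencyNet, consistency_sample, the sigma schedule, the latent shape, and the placeholder text embedding) are illustrative assumptions rather than the authors' implementation; in the paper, the model is obtained by consistency distillation from a text-to-music diffusion teacher and further trained with an adversarial discriminator.

```python
# Hypothetical sketch of four-step consistency sampling for mel-spectrogram latents.
# Names and shapes are placeholders, not the MusicCM codebase.
import torch


class ConsistencyNet(torch.nn.Module):
    """Stand-in for the distilled consistency function f_theta(x_t, sigma, text)."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.proj = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t, sigma, text_emb):
        # A real model would condition on the noise level sigma and the text
        # embedding; this stub only returns a tensor of the same shape.
        return self.proj(x_t)


@torch.no_grad()
def consistency_sample(model, text_emb, shape, sigmas):
    """Few-step sampling: each step jumps from a noisy latent straight to a
    clean estimate, then re-noises it to the next (lower) noise level."""
    x = torch.randn(shape) * sigmas[0]            # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0 = model(x, sigma, text_emb)            # map to the trajectory origin
        if i + 1 < len(sigmas):
            x = x0 + torch.randn_like(x0) * sigmas[i + 1]  # re-noise for next step
        else:
            x = x0
    return x                                      # predicted clean mel latent


model = ConsistencyNet()
text_emb = torch.randn(1, 77, 512)                # placeholder text embedding
latent = consistency_sample(model, text_emb,
                            shape=(1, 8, 16, 256),
                            sigmas=[80.0, 24.0, 6.0, 1.5])  # four sampling steps
print(latent.shape)
```

For extended clips, the shared-constraint idea mentioned in the abstract would amount to running such a sampler over overlapping windows of a long mel latent and averaging the overlapping predictions at each step, so that neighboring segments remain mutually coherent.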
Authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang