SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation (2405.18503v2)
Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitions between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability for controllable sound generation in a training-free manner. Our code, pretrained models, and audio samples are available at https://github.com/sony/soundctm.
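The abstract's description of training conditional and unconditional student models and interpolating between them at inference follows the classifier-free-guidance recipe applied to the distilled student. The sketch below is not the official SoundCTM implementation; the function names (`student_cond`, `student_uncond`, `guided_student_jump`) and argument shapes are hypothetical placeholders, and it only illustrates what such an interpolation could look like for a CTM-style student that jumps from noise level t to s:

```python
# Minimal sketch, not the official SoundCTM code: classifier-free-guidance-style
# interpolation between a conditional and an unconditional student prediction
# at inference time. All names (student_cond, student_uncond, text_emb) are
# hypothetical placeholders.
import torch

def guided_student_jump(
    student_cond,               # conditional student: f(x_t, t, s, text_emb)
    student_uncond,             # unconditional student: f(x_t, t, s)
    x_t: torch.Tensor,          # noisy latent at noise level t
    t: torch.Tensor,            # current noise level
    s: torch.Tensor,            # target noise level of the CTM-style jump (s <= t)
    text_emb: torch.Tensor,     # text-conditioning embedding
    guidance_scale: float = 3.0,
) -> torch.Tensor:
    """Return the guided jump x_s by extrapolating from the unconditional
    prediction toward the conditional one (scale = 1 is purely conditional)."""
    x_s_cond = student_cond(x_t, t, s, text_emb)
    x_s_uncond = student_uncond(x_t, t, s)
    return x_s_uncond + guidance_scale * (x_s_cond - x_s_uncond)
```

Under these assumptions, a single call with s set to zero would correspond to the 1-step generation described above, while chaining several jumps with decreasing s would give the multi-step refinement.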
Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji