SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation (2405.18503v2)

Published 28 May 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner. Our codes, pretrained models, and audio samples are available at https://github.com/sony/soundctm.

Authors (7)
  1. Koichi Saito (33 papers)
  2. Dongjun Kim (24 papers)
  3. Takashi Shibuya (32 papers)
  4. Chieh-Hsin Lai (32 papers)
  5. Zhi Zhong (14 papers)
  6. Yuhta Takida (32 papers)
  7. Yuki Mitsufuji (127 papers)
Citations (3)

Summary

Sound Consistency Trajectory Models (SoundCTM)

The paper introduces Sound Consistency Trajectory Models (SoundCTM), a novel approach for text-to-sound (T2S) generation aimed at addressing the high inference latency typically associated with diffusion-based sound generation models. SoundCTM enables flexible transitioning between high-quality one-step sound generation and superior multi-step sound generation, providing creators with an efficient and versatile tool for real-time sound synthesis.
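
To make the one-step/multi-step trade-off concrete, here is a minimal sketch of how a single anytime-to-anytime jump function can serve both regimes. The function name G and the purely deterministic chaining are illustrative assumptions; the sampler inherited from CTM also allows re-injecting noise between jumps.

```python
def sample(G, x_T, times, text_emb):
    """Chain anytime-to-anytime jumps along a decreasing time schedule.

    With times = [T, 0] this is one-step generation; a longer schedule
    (e.g., 16 steps) trades speed for quality. Deterministic chaining is a
    simplification of the sampler described in the paper.
    """
    x = x_T
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x = G(x, t_cur, t_next, text_emb)  # jump from time t_cur to time t_next
    return x
```

In this view, a creator can audition the two-entry schedule's one-step sample immediately and switch to, say, a 16-step schedule once the prompt and seed are settled.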

Background and Challenges

Recent advancements in diffusion-based models have demonstrated significant promise in generating high-quality sounds for multimedia applications. However, the iterative sampling process inherent in these models results in slow inference speeds. This latency is particularly burdensome for sound creators who require rapid feedback to refine and align sounds with their artistic intentions. Addressing the slow inference problem is crucial for making these models more practical and appealing to sound creators.

SoundCTM: A Novel Framework

SoundCTM offers a solution by allowing flexible switching between one-step high-quality sound generation and higher-quality multi-step generation. The framework introduces several innovations:

  1. Feature Distance from the Teacher's Network: Instead of relying on an additional pretrained feature extractor or an adversarial loss, both of which are expensive to train and not always available in other domains, SoundCTM reuses the teacher network to define a novel feature distance for the distillation loss. Since the teacher is already present during distillation, no extra network needs to be trained or held in memory (a sketch of this idea follows the list).
  2. Classifier-Free Guided Trajectories: The framework distills classifier-free guided, text-conditional trajectories while training the conditional and unconditional student models simultaneously.
  3. Interpolation During Inference: During sampling, SoundCTM applies a new scaling term to interpolate between the text-conditional and unconditional neural jumps, enhancing the flexibility and quality of the generated sounds (see the second sketch below).
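
A minimal PyTorch-style sketch of the first innovation, assuming a frozen teacher U-Net that can expose its intermediate activations; the helper name teacher_unet and the return_features flag are hypothetical, not the authors' API.

```python
import torch
import torch.nn.functional as F

def teacher_feature_distance(x_student, x_target, t, cond, teacher_unet):
    """Hypothetical distillation distance computed in the teacher's feature space.

    Instead of an external pretrained feature extractor or an adversarial
    discriminator, the frozen teacher U-Net itself provides intermediate
    activations, and the distance is the summed per-layer L2 error.
    """
    with torch.no_grad():
        target_feats = teacher_unet(x_target, t, cond, return_features=True)
    student_feats = teacher_unet(x_student, t, cond, return_features=True)
    return sum(F.mse_loss(fs, ft) for fs, ft in zip(student_feats, target_feats))
```

The third innovation behaves much like classifier-free guidance applied to the student's jump outputs. In the sketch below, G_cond and G_uncond stand for the conditional and unconditional student jump functions and nu for the scaling term; all names are assumptions.

```python
def guided_jump(x_t, t, s, text_emb, G_cond, G_uncond, nu=1.5):
    """Jump from time t to time s by blending text-conditional and
    unconditional student predictions with a guidance weight nu;
    nu = 1.0 recovers the purely conditional jump."""
    jump_c = G_cond(x_t, t, s, text_emb)   # text-conditional anytime-to-anytime jump
    jump_u = G_uncond(x_t, t, s)           # unconditional jump
    return nu * jump_c + (1.0 - nu) * jump_u
```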

Experimental Results

The paper reports comprehensive experiments demonstrating SoundCTM's effectiveness on metrics such as Fréchet Audio Distance (FAD), Inception Score (IS), and CLAP score. Key findings include:

  • High-Quality One-Step Generation: SoundCTM's one-step generation achieves a FAD of 2.17, outperforming other models like ConsistencyTTA.
  • Flexible Multi-Step Generation: With 16-step sampling, SoundCTM achieves superior performance, showcasing FAD improvements and real-time generation capabilities on both GPU and CPU platforms.
  • Training-Free Controllable Generation: SoundCTM supports training-free controllable sound generation, leveraging its anytime-to-anytime jump capability to optimize the initial noise efficiently (a sketch of this idea follows the list).
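
A hedged sketch of how such training-free control can be realized with the anytime-to-anytime jump: treat the initial noise as a learnable tensor, map it to a clean sample with a deterministic jump, and backpropagate a user-supplied control loss into the noise. The helpers jump_to_data and control_loss are placeholders, not the paper's API.

```python
import torch

def optimize_initial_noise(x_T, jump_to_data, control_loss, steps=50, lr=1e-2):
    """Training-free control sketch: refine the starting noise so that the
    generated sample minimizes a user-supplied differentiable loss
    (e.g., distance to a reference audio feature)."""
    x_T = x_T.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(steps):
        x_0 = jump_to_data(x_T)   # deterministic jump from noise to data
        loss = control_loss(x_0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_T.detach()
```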

Implications and Future Developments

The introduction of SoundCTM holds several implications for both practical applications and theoretical developments in sound generation:

  1. Real-Time Sound Synthesis: By addressing the issue of slow inference, SoundCTM can significantly enhance the efficiency of sound creation workflows, making it a valuable tool for Foley artists and multimedia content creators.
  2. Versatility Across Modalities: The domain-agnostic nature of the proposed framework suggests potential applicability to other modalities beyond sound, paving the way for broader adoption in multimedia generation tasks.
  3. Dynamic Sound Generation: The ability to achieve real-time dynamic sound generation opens new possibilities for live performances, interactive exhibitions, and immersive video game experiences.

Future research could further explore the integration of SoundCTM with other state-of-the-art models and techniques, as well as potential applications beyond the current scope. Enhancing the interpretability of the generated sounds and improving the robustness of the framework in diverse environments are also promising directions.

In conclusion, SoundCTM presents a significant step forward in the evolution of sound generation models, offering a blend of flexibility, efficiency, and high-quality output. The paper provides valuable insights and practical solutions that address key challenges in the field, making it a notable contribution to the ongoing development of advanced sound synthesis technologies.