SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation (2405.18503v2)

Published 28 May 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner. Our codes, pretrained models, and audio samples are available at https://github.com/sony/soundctm.

Authors (7)
  1. Koichi Saito (33 papers)
  2. Dongjun Kim (24 papers)
  3. Takashi Shibuya (32 papers)
  4. Chieh-Hsin Lai (32 papers)
  5. Zhi Zhong (14 papers)
  6. Yuhta Takida (32 papers)
  7. Yuki Mitsufuji (127 papers)
Citations (3)

Summary

Sound Consistency Trajectory Models (SoundCTM)

The paper introduces Sound Consistency Trajectory Models (SoundCTM), a novel approach for text-to-sound (T2S) generation aimed at addressing the high inference latency typically associated with diffusion-based sound generation models. SoundCTM enables flexible transitioning between high-quality one-step sound generation and superior multi-step sound generation, providing creators with an efficient and versatile tool for real-time sound synthesis.
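
To make the one-step/multi-step trade-off concrete, here is a minimal sketch of how a single anytime-to-anytime jump function can serve both regimes. The function name G and the purely deterministic chaining are illustrative assumptions; the sampler inherited from CTM also allows re-injecting noise between jumps.

```python
def sample(G, x_T, times, text_emb):
    """Chain anytime-to-anytime jumps along a decreasing time schedule.

    With times = [T, 0] this is one-step generation; a longer schedule
    (e.g., 16 steps) trades speed for quality. Deterministic chaining is a
    simplification of the sampler described in the paper.
    """
    x = x_T
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x = G(x, t_cur, t_next, text_emb)  # jump from time t_cur to time t_next
    return x
```

In this view, a creator can audition the two-entry schedule's one-step sample immediately and switch to, say, a 16-step schedule once the prompt and seed are settled.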

Background and Challenges

Recent advancements in diffusion-based models have demonstrated significant promise in generating high-quality sounds for multimedia applications. However, the iterative sampling process inherent in these models results in slow inference speeds. This latency is particularly burdensome for sound creators who require rapid feedback to refine and align sounds with their artistic intentions. Addressing the slow inference problem is crucial for making these models more practical and appealing to sound creators.

SoundCTM: A Novel Framework

SoundCTM offers a solution by allowing flexible switching between one-step high-quality sound generation and higher-quality multi-step generation. The framework introduces several innovations:

  1. Feature Distance from the Teacher's Network: Instead of relying on an additional pretrained feature extractor or an adversarial loss, both of which are expensive to train and not always available in other domains, SoundCTM reuses the teacher network to define a novel feature distance for the distillation loss. Since the teacher is already present during distillation, no extra network needs to be trained or held in memory (a sketch of this idea follows the list).
  2. Classifier-Free Guided Trajectories: The framework distills classifier-free guided, text-conditional trajectories while training the conditional and unconditional student models simultaneously.
  3. Interpolation During Inference: During sampling, SoundCTM applies a new scaling term to interpolate between the text-conditional and unconditional neural jumps, enhancing the flexibility and quality of the generated sounds (see the second sketch below).
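
A minimal PyTorch-style sketch of the first innovation, assuming a frozen teacher U-Net that can expose its intermediate activations; the helper name teacher_unet and the return_features flag are hypothetical, not the authors' API.

```python
import torch
import torch.nn.functional as F

def teacher_feature_distance(x_student, x_target, t, cond, teacher_unet):
    """Hypothetical distillation distance computed in the teacher's feature space.

    Instead of an external pretrained feature extractor or an adversarial
    discriminator, the frozen teacher U-Net itself provides intermediate
    activations, and the distance is the summed per-layer L2 error.
    """
    with torch.no_grad():
        target_feats = teacher_unet(x_target, t, cond, return_features=True)
    student_feats = teacher_unet(x_student, t, cond, return_features=True)
    return sum(F.mse_loss(fs, ft) for fs, ft in zip(student_feats, target_feats))
```

The third innovation behaves much like classifier-free guidance applied to the student's jump outputs. In the sketch below, G_cond and G_uncond stand for the conditional and unconditional student jump functions and nu for the scaling term; all names are assumptions.

```python
def guided_jump(x_t, t, s, text_emb, G_cond, G_uncond, nu=1.5):
    """Jump from time t to time s by blending text-conditional and
    unconditional student predictions with a guidance weight nu;
    nu = 1.0 recovers the purely conditional jump."""
    jump_c = G_cond(x_t, t, s, text_emb)   # text-conditional anytime-to-anytime jump
    jump_u = G_uncond(x_t, t, s)           # unconditional jump
    return nu * jump_c + (1.0 - nu) * jump_u
```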

Experimental Results

The paper reports comprehensive experiments demonstrating SoundCTM's effectiveness on metrics such as Fréchet Audio Distance (FAD), Inception Score (IS), and CLAP score. Key findings include:

  • High-Quality One-Step Generation: SoundCTM's one-step generation achieves a FAD of 2.17, outperforming other models like ConsistencyTTA.
  • Flexible Multi-Step Generation: With 16-step sampling, SoundCTM achieves superior performance, showcasing FAD improvements and real-time generation capabilities on both GPU and CPU platforms.
  • Training-Free Controllable Generation: SoundCTM supports training-free controllable sound generation, leveraging its anytime-to-anytime jump capability to optimize the initial noise efficiently (a sketch of this idea follows the list).
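
A hedged sketch of how such training-free control can be realized with the anytime-to-anytime jump: treat the initial noise as a learnable tensor, map it to a clean sample with a deterministic jump, and backpropagate a user-supplied control loss into the noise. The helpers jump_to_data and control_loss are placeholders, not the paper's API.

```python
import torch

def optimize_initial_noise(x_T, jump_to_data, control_loss, steps=50, lr=1e-2):
    """Training-free control sketch: refine the starting noise so that the
    generated sample minimizes a user-supplied differentiable loss
    (e.g., distance to a reference audio feature)."""
    x_T = x_T.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(steps):
        x_0 = jump_to_data(x_T)   # deterministic jump from noise to data
        loss = control_loss(x_0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_T.detach()
```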

Implications and Future Developments

The introduction of SoundCTM holds several implications for both practical applications and theoretical developments in sound generation:

  1. Real-Time Sound Synthesis: By addressing the issue of slow inference, SoundCTM can significantly enhance the efficiency of sound creation workflows, making it a valuable tool for Foley artists and multimedia content creators.
  2. Versatility Across Modalities: The domain-agnostic nature of the proposed framework suggests potential applicability to other modalities beyond sound, paving the way for broader adoption in multimedia generation tasks.
  3. Dynamic Sound Generation: The ability to achieve real-time dynamic sound generation opens new possibilities for live performances, interactive exhibitions, and immersive video game experiences.

Future research could further explore the integration of SoundCTM with other state-of-the-art models and techniques, as well as potential applications beyond the current scope. Enhancing the interpretability of the generated sounds and improving the robustness of the framework in diverse environments are also promising directions.

In conclusion, SoundCTM presents a significant step forward in the evolution of sound generation models, offering a blend of flexibility, efficiency, and high-quality output. The paper provides valuable insights and practical solutions that address key challenges in the field, making it a notable contribution to the ongoing development of advanced sound synthesis technologies.