EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer (2409.10819v1)

Published 17 Sep 2024 in eess.AS and cs.SD

Abstract: Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and using an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which enhances convergence speed, training stability, and memory usage, making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-LLMs for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that simplifies EzAudio by achieving strong prompt alignment while preserving great audio quality when using larger CFG scores, eliminating the need to struggle with finding the optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.

The paper "EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer" introduces an innovative approach to text-to-audio (T2A) generation by presenting a model named EzAudio, which addresses several long-standing challenges in the field, such as generation quality, computational cost, diffusion sampling, and data preparation.

Key Innovations

Waveform VAE Latent Space

EzAudio operates in the latent space of a 1D waveform Variational Autoencoder (VAE). This allows the model to avoid the complexities and computational burdens associated with 2D spectrogram representations. Eliminating the separate neural vocoder further simplifies the architecture, making the T2A pipeline more efficient and easier to manage.
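To make the shape of this representation concrete, here is a minimal, hypothetical sketch of a 1D waveform VAE encoder: the class, layer sizes, and downsampling factor are illustrative assumptions, not the authors' implementation. The point is that raw audio is compressed directly into a 1D sequence of latent vectors, which the diffusion model denoises, with no mel-spectrogram or vocoder stage.

```python
import torch
import torch.nn as nn

class WaveformVAEEncoder(nn.Module):
    """Illustrative 1D waveform VAE encoder (hypothetical layer sizes)."""

    def __init__(self, latent_dim: int = 128, downsample: int = 512):
        super().__init__()
        # A real encoder stacks many strided convolutional blocks; this single
        # layer stands in for the full downsampling path.
        self.encode = nn.Conv1d(1, 2 * latent_dim,
                                kernel_size=downsample, stride=downsample)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, num_samples) -> latents: (batch, latent_dim, T)
        mean, logvar = self.encode(wav).chunk(2, dim=1)
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)

wav = torch.randn(2, 1, 16000 * 10)      # 10 s of 16 kHz audio
latents = WaveformVAEEncoder()(wav)      # (2, 128, 312): a 1D latent sequence
```

Because the latent is a 1D sequence rather than a 2D time-frequency grid, the diffusion backbone can treat it exactly like a token sequence, which is what the transformer design below exploits.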

Optimized Diffusion Transformer Architecture

The researchers developed an optimized diffusion transformer architecture specifically tailored for audio latent representations. This design improves convergence speed, training stability, and memory usage. Key features of this architecture include adaptive layer normalization (AdaLN), long-skip connections, Rotary Positional Encodings (RoPE), and QK-Norm, all of which contribute to enhanced performance during training and inference.
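The sketch below shows how these pieces typically fit together in a single diffusion transformer block; it is an assumed, simplified layout (the paper's exact block wiring, including the long-skip connections that link shallow and deep blocks across the network, is not reproduced here). AdaLN injects the timestep embedding as per-block scale, shift, and gate values; QK-Norm normalizes queries and keys before attention; RoPE encodes position by rotating query/key features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    # Minimal rotary positional encoding; x: (batch, heads, seq, head_dim).
    b, h, t, d = x.shape
    pos = torch.arange(t, device=x.device, dtype=x.dtype)
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, device=x.device, dtype=x.dtype) / d))
    angles = torch.outer(pos, freqs)                 # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class DiTBlock(nn.Module):
    """One diffusion transformer block with AdaLN, QK-Norm, and RoPE (sketch)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)    # QK-Norm stabilizes attention logits
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # AdaLN: the timestep embedding predicts scale/shift/gate for both sub-layers.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        shift1, scale1, gate1, shift2, scale2, gate2 = \
            self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (z.view(*z.shape[:2], self.heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        q, k = apply_rope(self.q_norm(q)), apply_rope(self.k_norm(k))
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + gate1 * self.proj(attn.transpose(1, 2).reshape(x.shape))
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)
```

The combination of conditioning through normalization (rather than extra cross-attention layers), normalized attention logits, and relative position encoding is what the paper credits for faster convergence and more stable, memory-friendly training.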

Data-Efficient Training Strategy

To combat data scarcity, the authors propose a three-stage training strategy:

  1. Leveraging unlabeled data for learning acoustic dependencies.
  2. Utilizing audio caption data annotated by audio-LLMs for text-to-audio alignment learning.
  3. Fine-tuning with human-labeled data.

This integrated approach significantly improves generation quality and prompt alignment, making the model robust even with limited labeled data.
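A minimal sketch of such a curriculum is shown below. The stage names, corpus labels, and the train() helper are placeholders for illustration, not the authors' training code; the structure simply mirrors the three stages described above, with text conditioning disabled in the first stage and a lower learning rate for the final fine-tuning pass.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str                 # which corpus drives this stage (placeholder names)
    text_conditioning: bool
    learning_rate: float

stages = [
    # 1) Unlabeled audio: learn acoustic structure without any text.
    Stage("acoustic_pretrain", "unlabeled_audio", text_conditioning=False, learning_rate=1e-4),
    # 2) Captions produced by audio-language models: learn text-to-audio alignment.
    Stage("alignment", "auto_captioned_audio", text_conditioning=True, learning_rate=1e-4),
    # 3) Human-labeled captions (e.g. AudioCaps): final fine-tuning.
    Stage("finetune", "human_labeled_audio", text_conditioning=True, learning_rate=1e-5),
]

def train(model, stage: Stage) -> None:
    """Placeholder: run diffusion training on the stage's corpus.

    When text_conditioning is False, the text embedding can be replaced by a
    learned null embedding so the same network is reused in later stages.
    """
    print(f"stage {stage.name}: data={stage.data}, lr={stage.learning_rate}")

for stage in stages:
    train(model=None, stage=stage)
```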

Classifier-Free Guidance (CFG) Rescaling

EzAudio introduces a novel Classifier-Free Guidance (CFG) rescaling method. This technique achieves strong prompt alignment while preserving high audio quality, even at larger CFG scores. It mitigates the traditional trade-off between guidance strength and generative quality, removing the need to search for an optimal CFG score that balances the two.
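The sketch below shows one common formulation of CFG rescaling as applied at each sampling step; the exact variant used by EzAudio may differ in detail, and the rescale factor here is an assumed value. The idea is that a large guidance scale shifts the statistics of the guided prediction away from those of the conditional prediction, so the guided output is rescaled to match the conditional prediction's standard deviation and then blended back in.

```python
import torch

def cfg_with_rescale(eps_cond: torch.Tensor,
                     eps_uncond: torch.Tensor,
                     guidance_scale: float = 5.0,
                     rescale: float = 0.7) -> torch.Tensor:
    """Classifier-free guidance with output rescaling (illustrative formulation)."""
    guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    # Match the guided prediction's per-sample std to the conditional prediction's.
    dims = list(range(1, eps_cond.dim()))
    std_cond = eps_cond.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / (std_guided + 1e-8))
    # Blend the rescaled and raw guided predictions.
    return rescale * rescaled + (1.0 - rescale) * guided

# Example: two model passes per sampling step (conditional and unconditional).
eps_cond = torch.randn(2, 128, 312)
eps_uncond = torch.randn(2, 128, 312)
eps = cfg_with_rescale(eps_cond, eps_uncond, guidance_scale=5.0)
```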

Experimental Results and Metrics

The experimental validation shows that EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations. On the AudioCaps dataset, the EzAudio-XL model achieved the strongest results among the compared open-source models, with a Fréchet Distance (FD) of 14.98, Kullback-Leibler (KL) divergence of 1.29, Inception Score (IS) of 11.38, and a CLAP score of 0.314. These scores indicate the model's ability to generate realistic, high-quality audio closely aligned with text prompts.

Implications and Future Developments

The practical implications of this research are significant. EzAudio's streamlined architecture and efficient training strategy make it highly accessible for further development and deployment. By releasing the code, data, and pre-trained models, the authors facilitate advancements in high-quality T2A systems, potentially benefiting both academic researchers and industry practitioners.

From a theoretical perspective, the adoption of waveform latent spaces and the refined diffusion transformer model could influence similar implementations in related domains such as video-to-audio synthesis and voice generation. The advanced CFG rescaling technique provides a robust framework for balancing prompt alignment and output quality, serving as a potential cornerstone for future research in generative modeling.

Conclusion

The paper presents a comprehensive study of advancing T2A generation through an efficient diffusion transformer model. The innovations in architectural design and the data-efficient training strategy set a new benchmark in the field by delivering highly realistic audio outputs with strong text prompt alignment. Future research may expand on this foundation to explore broader applications in audio generation, such as integrating ControlNet components and creative use cases in music and speech synthesis. The release of code, data, and pre-trained models is poised to foster significant advancements in T2A generation and beyond.

Authors (7)
  1. Jiarui Hai (10 papers)
  2. Yong Xu (432 papers)
  3. Hao Zhang (947 papers)
  4. Chenxing Li (33 papers)
  5. Helin Wang (35 papers)
  6. Mounya Elhilali (7 papers)
  7. Dong Yu (328 papers)