TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Published 30 Dec 2024 in cs.SD, cs.AI, cs.CL, and eess.AS | (2412.21037v2)

Abstract: We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for LLMs. To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.


Authors (9)

Summary

The paper introduces TangoFlux, a 515M-parameter model that generates up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU.
It combines flow matching with a novel CLAP-Ranked Preference Optimization (CRPO) framework to enhance audio-text alignment and generation fidelity.
Empirical evaluations show superior audio quality and relevance through both objective metrics and human assessments compared to competing TTA systems.

An Overview of "Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization"

Introduction

Text-to-audio (TTA) generation aims to create audio content automatically from textual inputs. The paper under examination introduces TangoFlux, a TTA model that targets two challenges in such systems: generation speed and alignment. Using a novel framework named CLAP-Ranked Preference Optimization (CRPO), the authors iteratively generate and optimize preference data, and report improved alignment between generated audio and text prompts.

TangoFlux: Model Architecture and Features

TangoFlux is an efficient TTA model with 515 million parameters. A standout feature is its speed: it generates up to 30 seconds of 44.1kHz audio in 3.7 seconds on a single A40 GPU. The model combines Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) blocks and is pre-trained and then fine-tuned to ensure audio quality and prompt relevance. It is trained with flow matching along rectified (straight-line) paths, which simplifies training dynamics and reduces computational cost while preserving audio fidelity.
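The idea of flow matching along straight paths can be illustrated with a small, dependency-free sketch. This is not the TangoFlux implementation: the time convention (noise at t=0, data at t=1), the function names, and the use of plain Python lists in place of audio latents are all assumptions made for illustration, and `predict_velocity` stands in for the DiT/MMDiT network.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; it does not depend on t."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, data, t)
    v_pred = predict_velocity(x_t, t)  # hypothetical model call
    v_true = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the target velocity is constant along a straight path, a well-trained model can in principle be integrated with few Euler steps, which is one intuition for why this family of methods supports fast sampling.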

CRPO: Enhancing Alignment and Optimization

A central challenge in aligning TTA models is constructing preference pairs: unlike LLMs, TTA lacks structured mechanisms such as verifiable rewards or gold-standard answers. CRPO addresses this by using a CLAP model as a proxy reward to rank generated audio. The process runs in iterative cycles of data sampling, preference pair generation, and optimization, so it functions as a self-improving loop. Empirical tests show that CRPO-generated datasets outperform existing alternatives such as BATON and Audio-Alpaca, improving how faithfully the model's outputs follow text prompts.
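The sample, rank, and optimize cycle described above can be sketched roughly as follows. Everything here is hypothetical scaffolding: `generate`, `generate_with`, `score`, and `optimize` are placeholder callables (with `score` standing in for a CLAP text-audio similarity), and pairing the highest-scored candidate against the lowest-scored one is an assumed pairing rule, not a detail confirmed by this summary.

```python
def build_preference_pairs(prompts, generate, score, num_candidates=5):
    """For each prompt, sample candidates and pair the best against the worst.

    generate(prompt) -> audio            (hypothetical sampler)
    score(prompt, audio) -> float        (hypothetical CLAP-style proxy reward)
    Returns a list of (prompt, preferred_audio, rejected_audio) tuples.
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_candidates)]
        ranked = sorted(candidates, key=lambda audio: score(prompt, audio), reverse=True)
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs


def crpo_loop(model, prompts, generate_with, score, optimize, num_iterations):
    """Repeat: sample from the current model, rank, then optimize on the pairs.

    generate_with(model, prompt) -> audio   (hypothetical)
    optimize(model, pairs) -> updated model (hypothetical preference-optimization step)
    """
    for _ in range(num_iterations):
        pairs = build_preference_pairs(
            prompts, lambda prompt: generate_with(model, prompt), score
        )
        model = optimize(model, pairs)
    return model
```

Each iteration samples from the most recent model, so the preference data tracks the model's current behavior rather than a fixed offline dataset.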

Evaluation Metrics and Results

TangoFlux's effectiveness is validated through rigorous objective and subjective metrics across benchmark datasets, including AudioCaps. Notably, TangoFlux achieves superior performance in metrics such as Fréchet Distance (FD$_{\text{openl3}}$), Kullback-Leibler divergence (KL$_{\text{passt}}$), CLAP score, and Inception Score (IS). Furthermore, the model showcases remarkable efficiency by maintaining high performance even at reduced sampling steps, demonstrating an excellent balance between computational needs and output quality.
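Two of these metrics reduce to simple formulas once a pretrained model has produced embeddings or class probabilities. The sketch below assumes the CLAP score is a cosine similarity between a text embedding and an audio embedding, and that the KL metric compares class-probability distributions for generated and reference audio. Plain lists stand in for real model outputs, and the direction of the divergence and the smoothing constant `eps` are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

A higher cosine similarity indicates closer text-audio agreement, while a lower KL value indicates that the two distributions are more alike.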

The subjective evaluation reinforces TangoFlux's utility, as human evaluators consistently rated its outputs higher in both overall audio quality and text relevance compared to competing models. The model's ability to handle complex, multi-event prompts is particularly highlighted, suggesting its practical applicability in diverse multimedia applications.

Implications and Future Directions

The development of TangoFlux illustrates significant strides in TTA technology, potentially transforming audio content creation in creative and commercial sectors by streamlining production workflows and enhancing creative flexibility. The framework's open-source release aims to stimulate further exploration and refinement in TTA research, encouraging innovations that could extend its applications to more nuanced auditory and linguistic contexts.

Future research could explore integrating additional contextual information so that the model adapts to a broader range of audio styles and environmental noises. The iterative preference optimization method could also be extended to real-time feedback systems, allowing the model to learn from a wider variety of interaction scenarios.

In summary, the paper highlights the technical and societal impact of advanced TTA models and sets the stage for further work on converting textual cues into vivid, immersive audio. Through its methodology and reported results, TangoFlux contributes a solid foundation to this evolving domain.
